CS520
Fall 2013
Program 1
Due Sunday September 8


Write a C program called utf_convert that reads a file encoded in either UTF-32 or UTF-16 and writes the same text to a second file in the other encoding. Characters should be written to the output file in little-endian byte order.
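
As an illustration of little-endian output, here is a minimal sketch of two helpers that write a code unit one byte at a time, least significant byte first, so the result does not depend on the byte order of the machine running the program. The names put_u16_le and put_u32_le are hypothetical; nothing in this assignment requires them.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write one 16-bit code unit, low byte first.
 * Returns 0 on success, -1 if the write failed. */
int put_u16_le(FILE *out, uint16_t unit)
{
    unsigned char bytes[2];

    bytes[0] = (unsigned char)(unit & 0xFF);
    bytes[1] = (unsigned char)((unit >> 8) & 0xFF);
    return fwrite(bytes, 1, sizeof bytes, out) == sizeof bytes ? 0 : -1;
}

/* Hypothetical helper: write one 32-bit value, low byte first.
 * Returns 0 on success, -1 if the write failed. */
int put_u32_le(FILE *out, uint32_t value)
{
    unsigned char bytes[4];

    bytes[0] = (unsigned char)(value & 0xFF);
    bytes[1] = (unsigned char)((value >> 8) & 0xFF);
    bytes[2] = (unsigned char)((value >> 16) & 0xFF);
    bytes[3] = (unsigned char)((value >> 24) & 0xFF);
    return fwrite(bytes, 1, sizeof bytes, out) == sizeof bytes ? 0 : -1;
}
```

Writing individual bytes rather than calling fwrite on the integer itself is a design choice: the latter would emit whatever byte order the host uses.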

Command Line Arguments

The program takes up to three arguments; the third is optional.
  1. Name of the input file
  2. Name of the output file
  3. Optional Format of input file (32 or 16)
If there is no third argument, the program is to figure out the most likely format for the first file, and translate the file into the other format. It should also write, to standard output, a brief explanation as to why it made the selection it did. For example, if the file is not a valid UTF-16 file, but is a valid UTF-32 file, it should indicate that it inferred that the file was formatted as UTF-32 because it was not a valid UTF-16 file.
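
One possible way to make this inference is to test the whole file against both formats and compare the results. The sketch below assumes the file has already been read into memory and that the input is little-endian; the function names are hypothetical, and the tie-breaking rule (prefer UTF-32 when the data is valid as both) is an assumption of this sketch, not a requirement of the assignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if buf is well-formed little-endian UTF-32, otherwise 0. */
int is_valid_utf32le(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len % 4 != 0)
        return 0;
    for (i = 0; i < len; i += 4) {
        uint32_t cp = (uint32_t)buf[i]
                    | ((uint32_t)buf[i + 1] << 8)
                    | ((uint32_t)buf[i + 2] << 16)
                    | ((uint32_t)buf[i + 3] << 24);
        /* Reject values past U+10FFFF and the surrogate range. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}

/* Returns 1 if buf is well-formed little-endian UTF-16, otherwise 0. */
int is_valid_utf16le(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    if (len % 2 != 0)
        return 0;
    while (i < len) {
        unsigned unit = buf[i] | ((unsigned)buf[i + 1] << 8);
        i += 2;
        if (unit >= 0xDC00 && unit <= 0xDFFF)
            return 0;                    /* low surrogate with no high */
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            unsigned low;
            if (i >= len)
                return 0;                /* pair cut off by end of data */
            low = buf[i] | ((unsigned)buf[i + 1] << 8);
            if (low < 0xDC00 || low > 0xDFFF)
                return 0;                /* high not followed by low */
            i += 2;
        }
    }
    return 1;
}

/* Returns 32, 16, or 0 (valid as neither); *reason gets a short explanation. */
int guess_format(const unsigned char *buf, size_t len, const char **reason)
{
    int ok32 = is_valid_utf32le(buf, len);
    int ok16 = is_valid_utf16le(buf, len);

    if (ok32 && !ok16) {
        *reason = "inferred UTF-32: the file is not valid UTF-16";
        return 32;
    }
    if (ok16 && !ok32) {
        *reason = "inferred UTF-16: the file is not valid UTF-32";
        return 16;
    }
    if (ok32 && ok16) {
        /* Assumption: UTF-16 text rarely happens to be valid UTF-32. */
        *reason = "inferred UTF-32: valid as both, UTF-32 judged more likely";
        return 32;
    }
    *reason = "the file is valid as neither UTF-16 nor UTF-32";
    return 0;
}
```

The tie-break matters in practice because a UTF-32 file containing only ordinary characters is usually also well-formed UTF-16 (each character looks like the character followed by U+0000).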

For example:

./utf_convert input32 output16 32
reads the contents of the file input32, which is formatted in UTF-32, and writes them to output16 in UTF-16.
./utf_convert input16 output32 16
reads the contents of the file input16, which is formatted in UTF-16, and writes them to output32 in UTF-32.
./utf_convert input output
reads the contents of input, determines whether input is more likely to be in UTF-16 or UTF-32 format, and writes a translated version to output in whichever format input is not.

Note that you should validate the command line arguments. This means making sure that there are the correct number of arguments, and making sure that the arguments are all valid. For example, attempting to read from a file that does not exist should not crash your program, but should instead prompt a graceful exit with a helpful error message. Likewise, specifying an invalid format should also produce a graceful exit with a helpful error message.
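
The checks described above might be organized along these lines. This is only a sketch: the function names parse_format and check_args, the convention that a format of 0 means "infer it", and the wording of the messages are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Returns 16 or 32 for a recognized format argument, -1 otherwise. */
int parse_format(const char *arg)
{
    if (strcmp(arg, "16") == 0)
        return 16;
    if (strcmp(arg, "32") == 0)
        return 32;
    return -1;
}

/* Validates the command line. On success stores the input format in
 * *format (0 means "not given, infer it") and returns 0. On failure
 * prints a message to stderr and returns -1. */
int check_args(int argc, char *argv[], int *format)
{
    FILE *in;

    if (argc != 3 && argc != 4) {
        fprintf(stderr, "usage: %s input output [32|16]\n",
                argc > 0 ? argv[0] : "utf_convert");
        return -1;
    }

    *format = 0;
    if (argc == 4) {
        *format = parse_format(argv[3]);
        if (*format < 0) {
            fprintf(stderr, "%s: format must be 32 or 16, got \"%s\"\n",
                    argv[0], argv[3]);
            return -1;
        }
    }

    /* Confirm the input file can be opened before doing any work. */
    in = fopen(argv[1], "rb");
    if (in == NULL) {
        perror(argv[1]);
        return -1;
    }
    fclose(in);
    return 0;
}
```

A main function would call check_args first and exit with a failure status if it returns -1, before opening the output file.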

Error Handling

An empty input file is okay. You should simply produce an empty output file.

If an error is detected, print an appropriate error message to stderr that includes the offset in the file of the first byte of the sequence that is in error. In the case of an unexpected trailing code unit (the UTF-16 analogue of a continuation byte: a low surrogate with no preceding high surrogate), print the offset of that code unit.

You may exit the program after reporting the first error.
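
To make the offset rule concrete, here is a sketch of a scanner that finds the first malformed sequence in little-endian UTF-16 data held in memory and reports the byte offset where that sequence starts. The function names, the in-memory buffer, and the message text are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns the byte offset of the first malformed UTF-16LE sequence,
 * or -1 if the data is well formed. *what describes the problem. */
long find_utf16le_error(const unsigned char *buf, size_t len,
                        const char **what)
{
    size_t i = 0;

    while (i < len) {
        unsigned unit;

        if (len - i < 2) {
            *what = "truncated code unit";
            return (long)i;
        }
        unit = buf[i] | ((unsigned)buf[i + 1] << 8);

        if (unit >= 0xDC00 && unit <= 0xDFFF) {
            /* Report the offset of the stray unit itself. */
            *what = "unexpected low surrogate";
            return (long)i;
        }
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            unsigned low;

            if (len - i < 4) {
                *what = "high surrogate at end of file";
                return (long)i;
            }
            low = buf[i + 2] | ((unsigned)buf[i + 3] << 8);
            if (low < 0xDC00 || low > 0xDFFF) {
                /* Report where the broken pair starts. */
                *what = "high surrogate not followed by low surrogate";
                return (long)i;
            }
            i += 4;
        } else {
            i += 2;
        }
    }
    return -1;
}

/* Prints one error line to stderr, including the byte offset. */
void report_error(const char *what, long offset)
{
    fprintf(stderr, "utf_convert: %s at byte offset %ld\n", what, offset);
}
```

Since the program may exit after the first error, returning a single offset is enough; there is no need to collect every problem in the file.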

Return Status

The program will return a status of -1 if it detects that the file is not valid; otherwise it will return a status of 0.

Source File Name

Put all your source code in the file utf_convert.c. The program will be compiled automatically during grading, and the grading process assumes that your source file is named utf_convert.c.

Algorithm Details

The RFC for UTF-16 (RFC 2781) describes the UTF-16 format in detail. Sections 2.1 and 2.2 describe how to encode and decode UTF-16, and I recommend reading those sections.
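
The arithmetic in those two sections can be sketched as follows. The function names are hypothetical, and the decoder assumes the caller has already checked that its arguments are a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF).

```c
#include <stdint.h>

/* Encodes one code point as UTF-16 code units.
 * Returns the number of units stored in units[] (1 or 2),
 * or 0 if cp is not a valid character value. */
int utf16_encode(uint32_t cp, uint16_t units[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        units[0] = (uint16_t)cp;       /* fits in a single unit */
        return 1;
    }

    cp -= 0x10000;                     /* leaves a 20-bit value */
    units[0] = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits */
    units[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* bottom 10 bits */
    return 2;
}

/* Combines a high and a low surrogate into one code point.
 * Assumes both arguments were already validated by the caller. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    uint32_t top = (uint32_t)(high & 0x3FF) << 10;
    uint32_t bottom = (uint32_t)(low & 0x3FF);

    return 0x10000 + (top | bottom);
}
```

For example, U+1F600 encodes to the pair 0xD83D 0xDE00, and decoding that pair gives back 0x1F600.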

Testing

You should write other programs to create interesting test cases. This will be partially covered in Lab 2, but you may want to do some of this on your own as well.
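
As a starting point, a test-file generator could look something like the sketch below, which writes an array of 32-bit values to a file as little-endian UTF-32. The function name write_utf32le_file is hypothetical. The values are deliberately not validated, so the same function can also produce invalid files (for example, one containing 0x110000 or 0xD800) for exercising the error handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes count 32-bit values to path, low byte first, with no
 * validation of the values. Returns 0 on success, -1 on failure. */
int write_utf32le_file(const char *path, const uint32_t *values,
                       size_t count)
{
    size_t i;
    FILE *out = fopen(path, "wb");

    if (out == NULL)
        return -1;

    for (i = 0; i < count; i++) {
        unsigned char b[4];

        b[0] = (unsigned char)(values[i] & 0xFF);
        b[1] = (unsigned char)((values[i] >> 8) & 0xFF);
        b[2] = (unsigned char)((values[i] >> 16) & 0xFF);
        b[3] = (unsigned char)((values[i] >> 24) & 0xFF);
        if (fwrite(b, 1, sizeof b, out) != sizeof b) {
            fclose(out);
            return -1;
        }
    }
    return fclose(out) == 0 ? 0 : -1;
}
```

A small main that calls this with a few chosen arrays (plain ASCII, characters above U+FFFF, out-of-range values, an empty array) gives a set of input files with known contents.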


Grading

Your program will be graded primarily by testing it for correct functionality.
  1. 60 points will be awarded for properly handling files containing only characters less than 10 bits.
  2. 20 additional points will be awarded for also handling characters with more than 10 bits.
  3. 10 additional points will be awarded for properly detecting errors in the input file.
  4. 10 additional points will be awarded for figuring out if the input file is in UTF-32 or UTF-16, and translating accordingly. If the program supports this option, it will be able to run with only the input and the output file specified.

Remember that you may also lose points if your program is not properly structured or adequately documented. Coding guidelines are given on the course overview webpage.

Your programs will be graded on agate.cs.unh.edu, so be sure to test in that environment.

Remember: as always you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!

Submission

Your programs should be submitted for grading from agate.cs.unh.edu. To turn in this program, type:
% ~cs520/bin/DoSubmission.py prog1 utf_convert.c

This submission script is new. It has passed the testing I have done so far, but it may still have issues. If there are any problems, please contact me via email and I will do my best to assist you. If I cannot be reached, please send me a copy of your assignment via email, and we will deal with the submission script later.

Due Date

This assignment is due Sunday, September 8. The standard policy on late submissions will be in effect.