int ch;
unsigned char uch;
while ((ch = getchar()) != EOF)
{
    uch = (unsigned char) ch;
    printf("%d\n", uch);
}
Alternatively, if we are opening an external file (let's suppose it's called myfile) and want to print the byte values, we would use this code:
int ch;
unsigned char uch;
FILE *fp;
fp = fopen("myfile", "r");
if (fp == NULL) {                              /* always check that the open succeeded */
    fprintf(stderr, "can't open myfile\n");
    exit(1);                                   /* exit is declared in <stdlib.h> */
}
while ((ch = fgetc(fp)) != EOF)
{
    uch = (unsigned char) ch;
    printf("%d\n", uch);
}
fclose(fp);
(If you don't understand this next part, that's fine---just use the code above as a template for all the work you do reading files.) What's the deal with the int and unsigned char types? Why can't you just use char throughout? The data type char in C also occupies a single byte, but (on most machines) treats the byte values as numbers in the range -128 to 127. If you were just reading from an ordinary text file, where all the byte values are in the range 0 to 127, it really wouldn't matter. But compressed versions of text files, image files, etc., will contain the full range of byte values, and using char in place of unsigned char can get you in trouble. Here's why: the functions getchar and fgetc read the next byte from a file as an unsigned char, and then convert that value to the corresponding int. If no bytes remain to be read, then fgetc and getchar return a special end-of-file value, the int -1. Now if you just wrote
char ch;
while ((ch = getchar()) != EOF)
...
and you read the byte 11111111 from the input file, the byte value would
be treated as -1 and the while loop would terminate prematurely. If, instead,
you wrote the same code with ch declared as an unsigned char,
then ch would never have the value -1, and the while loop
would never terminate. So to get things to work right, you need to
play this odd game with types.
No free lunch with WinZip. Here's another proof of the
No Free Lunch principle: If the compression algorithm compressed
every file, then we could start with a long file, compress it, then compress
the compressed file, then compress the newly compressed file....until we
reduced the file down to nothing!
Let's see what actually happens in practice: Your Windows machine undoubtedly has some sort of software utility for compressing files in .zip format. Take a long text document, and compress it. Note the exact sizes (in bytes) of the two files. Now take the compressed file and rename it (so that it no longer has the .zip suffix) and compress it again. Once more, note the exact size of the new file compared to the one that preceded it. Continue this process. What do you find?
(You might try the same thing on a UNIX system, using the compress
command, but compress recognizes when its argument is a .Z file, so it can't
be tricked into trying to compress the file again.)
The Entropy of Bleak House. Download the file bleakhouse.txt.
(The file is about 2MB, so make sure you're working somewhere with
a fast download connection. This is the entire text of the novel
Bleak House, by Charles Dickens. By the way, these electronic versions
of uncopyrighted material come from Project Gutenberg---you can read about
the project in the prefatory material at the beginning of the file. You
should erase this material for the experiments in these problems, but if
you pass the file along to someone else, you're supposed to include it.)
(a) (7 points) Write a program to calculate the entropy of a text file
read from the standard input, and apply it to bleakhouse.txt.
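Here is a minimal sketch of this kind of entropy calculation, reading bytes from the standard input as in the template above (the variable names are my own):

#include <stdio.h>
#include <math.h>

int main(void)
{
    long count[256] = {0};    /* count[b] = number of occurrences of byte value b */
    long total = 0;
    int ch;

    while ((ch = getchar()) != EOF) {
        count[(unsigned char) ch]++;
        total++;
    }
    if (total == 0)
        return 0;             /* empty input: nothing to do */

    double entropy = 0.0;
    for (int i = 0; i < 256; i++)
        if (count[i] > 0) {
            double p = (double) count[i] / total;
            entropy -= p * log(p) / log(2.0);    /* log base 2 */
        }
    printf("entropy: %f bits per symbol\n", entropy);
    return 0;
}

Compile it with something like cc entropy.c -lm -o entropy (the -lm links in the math library) and run entropy < bleakhouse.txt.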
(b) (3 points) Try your program out on some other files---for
example, C source code, another book downloaded from Project Gutenberg,
an image file, an executable file. Record and discuss your results.
A Simple Text-Compression Scheme. This scheme only works
for text files and doesn't compress all that much, but studying the implementation
will teach you something important about how to manipulate irregular-sized
bit fields in C. Ordinary ASCII text encodes each character as a
single byte whose most significant bit is 0. This being the case,
you can compress the text file with no loss of information by getting rid
of the leading 0's and transmitting seven bits for each character.
For example, "cat", which is ordinarily represented by the 24 bits
011000110110000101110100
would be written as
110001111000011110100
The problem is, how do you carry this out in practice?
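One way to carry it out (this is only a sketch of the packing idea, not the compression program provided below): keep a small bit buffer, append the low seven bits of each character, and write out a byte whenever eight or more bits have accumulated.

#include <stdio.h>

int main(void)
{
    unsigned int buffer = 0;   /* the bits waiting to be written, in the low-order positions */
    int nbits = 0;             /* how many of those bits are valid */
    int ch;

    while ((ch = getchar()) != EOF) {
        buffer = (buffer << 7) | (ch & 0x7f);     /* append the 7 significant bits */
        nbits += 7;
        while (nbits >= 8) {                      /* emit each complete byte */
            putchar((buffer >> (nbits - 8)) & 0xff);
            nbits -= 8;
        }
    }
    if (nbits > 0)                                /* pad the final partial byte with 0 bits */
        putchar((buffer << (8 - nbits)) & 0xff);
    return 0;
}

How to pad the final byte is one of the details the real program has to settle, so don't assume the provided code makes the same choice as this sketch.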
(a) (0 points) Study the code for the compression program. We'll discuss all the bit operations in the class laboratory session. Make sure you understand the code. Compile the program and run it on a few text files. Verify (using the UNIX command ls -l) that the target file is indeed 7/8 as long as the source file. Verify (using the UNIX command od) that it correctly implements this compression algorithm. (We'll discuss od in the lab session.)
(b) (12 points) Write a decompressor. That is, write a program that reads from the standard input the compressed file that this algorithm produces, and writes the original file to the standard output. The C code for the decompressor looks an awful lot like the code for the compressor.
(c) (3 points) What would happen if you tried to encode and then
decode a non-text file using this algorithm? Discuss this in the context
of the "No Free Lunch" section of the notes.
The Second-Order Entropy of Bleak House. (a) (9 points) Write a program to calculate the second-order entropy of a text file, and use it to calculate the second-order entropy of bleakhouse.txt. You'll need, of course, a table that stores the frequencies of all pairs of ASCII characters. The most straightforward implementation is to use a two-dimensional array whose dimensions are 128 x 128. That should work all right. An ordinary entropy computation gives its result in bits of information per symbol, but here each symbol corresponds to two characters, so you need to divide by 2 to get the number of bits per character. Compare this to the first-order entropy.
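The counting step might look something like the following sketch (the array and variable names are mine; I'm also counting overlapping pairs of consecutive characters and assuming plain ASCII input, so check both choices against the definition in the notes):

#include <stdio.h>
#include <math.h>

int main(void)
{
    static long count[128][128];   /* count[x][y] = number of times character y follows x */
    long total = 0;
    int ch, prev = -1;

    while ((ch = getchar()) != EOF) {
        ch &= 0x7f;                /* assume ASCII text: byte values 0..127 */
        if (prev != -1) {
            count[prev][ch]++;
            total++;
        }
        prev = ch;
    }
    if (total == 0)
        return 0;

    double h = 0.0;
    for (int x = 0; x < 128; x++)
        for (int y = 0; y < 128; y++)
            if (count[x][y] > 0) {
                double p = (double) count[x][y] / total;
                h -= p * log(p) / log(2.0);    /* log base 2 */
            }
    printf("second-order entropy: %f bits per character\n", h / 2.0);
    return 0;
}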
(b) (20 points) Write a program to compute the third-order entropy of
a text file, and apply it to Bleak House. This time there is no question
of using a three-dimensional array implementation, so you will have to
figure out some other representation of the frequency information. (If
you're interested in doing this problem, I'll discuss some suggestions
with you.)
Synthetic English. This problem (which goes back to Shannon's
original paper on information theory) has an amusing result, although it has
little to do with data compression. Let's suppose that our English texts
consist only of the 26 letters and the space character. We can convert
an ordinary text file into such a text by replacing each lower-case letter
by its upper-case equivalent, and each sequence of non-letters by a single
space character. For example, the following passage:
Historians have argued that Charles Babbage was unable to build his vast mechanical computers because his conception exceeded the capacity of 19th-century engineering. The construction in 1991 of a working, three-ton calculating engine proves that his designs were well within the realm of possibility.
would be replaced by
HISTORIANS HAVE ARGUED THAT CHARLES BABBAGE WAS UNABLE TO BUILD HIS VAST MECHANICAL COMPUTERS BECAUSE HIS CONCEPTION EXCEEDED THE CAPACITY OF TH CENTURY ENGINEERING THE CONSTRUCTION IN OF A WORKING THREE TON CALCULATING ENGINE PROVES THAT HIS DESIGNS WERE WELL WITHIN THE REALM OF POSSIBILITY
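A sketch of this conversion (essentially what part (a) below asks for), using the character-classification routines in ctype.h; each maximal run of non-letters becomes a single space:

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    int ch;
    int in_gap = 0;              /* 1 while we're inside a run of non-letters */

    while ((ch = getchar()) != EOF) {
        if (isalpha(ch)) {
            if (in_gap)
                putchar(' ');    /* the whole run of non-letters becomes one space */
            putchar(toupper(ch));
            in_gap = 0;
        } else {
            in_gap = 1;
        }
    }
    putchar('\n');               /* a final newline, for convenience */
    return 0;
}

Whether a leading or trailing run of non-letters should produce a space at all is a detail you can decide for yourself; this sketch simply drops them.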
(a) (5 points) Write a program that reads a text file from the standard input and writes the replacement text to the standard output.
(b) (8 points) Based on this 27-symbol alphabet, write programs that tabulate (i) the probability of occurrence of each of these symbols in Bleak House; (ii) the probability, given that a particular symbol is x, that the next symbol is y (for example, if x is Q and y is U, this probability will almost surely be 1); (iii) the probability, given that two consecutive symbols are xy, that the next symbol is z (for example, if x is T, y is H, and z is K, the probability will surely be 0).
(c) (10 points) Write a program that produces a random sequence of 100
symbols drawn from this 27-symbol alphabet, with the same distribution
of individual symbols that you found in (b-i). Write a program that
produces a random sequence of 100 symbols with the same conditional probabilities
on the occurrences of pairs of consecutive symbols that you found in (b-ii).
Do the same for (b-iii). The first of these three programs
will produce incoherent jumbles of letters, but the word lengths
and the most common letters will look quite plausible; the
third will produce meaningless texts that look eerily like real English.
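For the first of these programs, the core step is drawing one symbol at random according to a table of probabilities. A sketch of that step (prob is a hypothetical array holding the 27 probabilities you measured in (b-i)):

#include <stdlib.h>

/* Return an index in 0..26 drawn according to prob[0..26], which should sum to 1. */
int draw_symbol(const double prob[27])
{
    double r = (double) rand() / ((double) RAND_MAX + 1.0);   /* uniform in [0,1) */
    double cumulative = 0.0;
    for (int i = 0; i < 27; i++) {
        cumulative += prob[i];
        if (r < cumulative)
            return i;
    }
    return 26;    /* guard against roundoff: fall back to the last symbol */
}

Call it 100 times for the first program; for the second, use as prob the row of your conditional-probability table selected by the previous symbol, and similarly for the third.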
Exercise 5. Some mathematical questions.
(8 points per part---if you do these, you have to explain your proof to me and be prepared to answer questions about it).
(a) Prove that the second-order entropy of a message (measured in bits per symbol) is always less than or equal to the first-order entropy.
(b) Prove the inequality 2^(-l_1) + 2^(-l_2) + ... + 2^(-l_n) <= 1 (the Kraft-McMillan inequality)
for uniquely decipherable codes, where l_1, ..., l_n are the lengths of the code words.
(c) Prove that the average number of bits per symbol in a message encoded
by means of a uniquely decipherable code is always at least as large as
the entropy of the message.
Exercise 6. Some theoretical problems on Huffman coding.
(a) (4 points) Show, on paper, the steps in the construction of the Huffman code tree for the symbols A, E, I, O, U with respective probabilities 0.1, 0.2, 0.3, 0.15, 0.35. Then show the resulting prefix code. What is the average number of bits per symbol when we encode a message over this five-letter alphabet using the resulting prefix code? How does this compare with the entropy?
(b) (8 points) Suppose I use Huffman coding to encode a file one million bytes long, considering each byte as a symbol. What is the longest possible code word that can result? (HINT: What matters here is the "million", not the "byte".)
(c) (10 points each) Prove the two properties of the Huffman coding
algorithm stated in the notes---the optimality among all prefix codes,
and the relation to entropy.
Exercise 7. I've provided a quick-and-dirty
implementation of Huffman coding. The three phases of the algorithm
are construction of the code tree, construction of the code table from
the tree, and encoding of the source file. In addition to writing
output onto the destination file, the program writes an awful lot of information
to the standard output; in particular, a representation of both the tree
and the code table, and some statistics on how much compression took place.
(a) (6 points) Try to understand this code! (I'll answer all your questions!!) You should observe that the program writes a representation of the encoding tree at the beginning of the destination file. Compile the program and run it on some ordinary text files, some executable binaries, image files and audio files. What kind of compression rates do you get? Try it out on some small files with a limited alphabet and study both the diagnostic output and the destination file (using od). Verify that it's doing the right thing or inform me at once if you discover that it isn't.
(b) (20 points) Write a Huffman decoder for use with this encoding program.
If you decide to do this, I'll be happy to discuss the plan with you.
Exercise 8. Problems on LZW compression.
(a) (5 points) Show the steps of LZW compression and subsequent decompression applied to the message abacaaadcaabacaabadc over the 4-letter alphabet {a,b,c,d}. (For small messages like this, there may not be any appreciable compression, or indeed, any compression at all.)
(b) (20 points) Write an LZW compressor. The hard thing is figuring out how to implement the dictionary in a manner that is reasonably space-efficient but also reasonably quick to search (and, for our classroom purposes, not too hard to implement). As a slightly simpler problem, write the "compressor" so that it just keeps track of how many times a dictionary entry's code would be written to the output, without actually producing the compressed file. This still lets you see how much compression is accomplished.
If you are interested in doing this project, I can advise on how to implement the dictionary, and some related issues (how big should the dictionary be? what should you do when the dictionary gets full?)
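One representation to consider, offered only as a suggestion: store the dictionary as a trie in which each node carries its code number and an array of children, so that extending the current match by one input byte is a single array lookup. A sketch of the data structure (all the names and sizes here are my own choices):

#define ALPHABET 256      /* one child slot per possible byte value */
#define MAX_CODES 4096    /* a 12-bit code space; what to do when it fills up is your decision */

struct trie_node {
    int code;                 /* the code number assigned to the string ending at this node */
    int child[ALPHABET];      /* index of the child reached by each next byte, or -1 */
};

struct trie_node dictionary[MAX_CODES];
int next_code;                /* the next unused code, which is also the next free node */

void dictionary_init(void)
{
    /* codes 0..255 stand for the single bytes; nodes for longer strings
       are set up as they are created during compression */
    for (int i = 0; i < ALPHABET; i++) {
        dictionary[i].code = i;
        for (int j = 0; j < ALPHABET; j++)
            dictionary[i].child[j] = -1;
    }
    next_code = ALPHABET;
}

This is quick to search but not especially compact (about 4 MB at these sizes); the other standard choice is a hash table keyed on the pair (prefix code, next byte), which is smaller but a little more work to write.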
(c) (20 points) Write an LZW decompressor to go with the compressor
of part (b).
Exercise 9. Some simple experiments with .WAV files.
The .WAV format is Microsoft's standard for storing audio files. If you are working on a Windows machine, then you almost surely already have software for playing .WAV files. If, by chance, you don't, then the Graphics Viewer that I have for you to download in the next exercise will also play these sounds. Even if you're not working on a Windows machine, you can probably find software to play .WAV files on your system.
The full description of the standard for this file format is incredibly complicated, because there are so many different options. In our experiments we will only use a certain kind of .WAV file: Mono (as opposed to stereo), uncompressed, 11025 samples/second, 8 bits per sample. Files conforming to this restriction have the following format:
A 12-byte header chunk, containing the characters "RIFF", followed by 4 bytes giving the number of bytes remaining in the file, followed by the characters "WAVE".
A 24-byte format chunk, containing the characters "fmt ", followed by 20 bytes that specify the sample rate, number of channels, and number of bits per sample.
The data chunk, containing the characters "data", followed by 4 bytes giving the number of bytes remaining in the file, followed by the sampled data themselves. Each sample is a single byte, representing an integer in the range 0..255.
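As a concrete picture of that layout, here is a sketch that skips over the 44 header bytes of such a file and prints the raw sample bytes, one per line (unpackwave.c is the real authority---it checks each field instead of trusting the offsets blindly):

#include <stdio.h>

int main(void)
{
    unsigned char header[44];   /* 12-byte RIFF chunk + 24-byte format chunk + 8-byte "data" header */

    if (fread(header, 1, 44, stdin) != 44)
        return 1;
    /* header[0..3] = "RIFF", header[8..11] = "WAVE",
       header[12..15] = "fmt ", header[36..39] = "data" */

    int ch;
    while ((ch = getchar()) != EOF)     /* everything that remains is sample data, one byte per sample */
        printf("%d\n", ch);
    return 0;
}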
For your convenience, I've supplied two programs, unpackwave.c and repackwave.c, for processing these files. The first of these reads the file, verifies that the header and format chunks have the right format, and then writes the data to a text file, converting integers in the range 0..255 to doubles in the range -1..1. The data are written in two columns, the first containing the numbers from 0 to one less than the number of samples, and the second containing the converted sample values themselves. The second program reverses the process, taking a two-column text file with the described format and converting it back to a .WAV file.
I've also supplied two .WAV files, flute.wav and trumpet.wav, which are brief (about 5 seconds in length) snippets of recorded music.
(a) (0 points) Compile the two programs and name the executables unpackwave and repackwave, respectively. Listen to the .WAV files. Then run the first program by typing
unpackwave < flute.wav > flute.data
and inspect the file flute.data (it's a plain-old text file!).
Now run the second program by typing
repackwave < flute.data > flute2.wav
and listen to the new .WAV file.
(b) (5 points) Make your own .WAV files. In particular, create two .WAV files, each about 5 seconds in duration, one containing a sine wave signal at 200 Hz and the other a square wave signal at 200 Hz. You need to write a program to produce the two-column-format text file, and then use repackwave.
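A sketch of the sine-wave half of this; I'm assuming the two-column format is one sample index and one value per line, separated by a space, which you should confirm by looking at flute.data:

#include <stdio.h>
#include <math.h>

#define RATE 11025                /* samples per second */
#define SECONDS 5
#define FREQ 200.0                /* frequency in Hz */
#define PI 3.14159265358979323846

int main(void)
{
    int n = RATE * SECONDS;
    for (int i = 0; i < n; i++) {
        double t = (double) i / RATE;
        double sample = sin(2.0 * PI * FREQ * t);
        /* for the square wave, use: sample = (sample >= 0.0) ? 1.0 : -1.0; */
        printf("%d %f\n", i, sample);
    }
    return 0;
}

Pipe its output into repackwave to get the .WAV file.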
(c) (5 points) Write a program to convert a .WAV file to another one that plays at twice the speed, or at half the speed. (The pitch will also change by an octave. I have no idea how to change the tempo without changing the pitch.)
(d) (8 points) Experiment with the effects of introducing additional quantization error. Specifically, modify repackwave.c so that it first converts the double sample value to an integer in the range 0..15 (i.e., a four-bit integer) or 0..31, and then multiplies by 16 or 8 to get the value back into the range 0..255. You can then convert a .WAV file by first running unpackwave, and then the modified repackwave. Compare the sound of the resulting .WAV file to the original. Incidentally, to modify repackwave.c, you don't have to wade through all of that code---most of it is just devoted to assembling the header and format chunks, and these will be unchanged. It's only at the very end that you need to change one (that's one) line of code, and add one more.
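The modified conversion might look like the function below. It assumes the existing code maps a sample x in -1..1 to a byte roughly via (x+1)/2*255; check the last few lines of repackwave.c to see what it actually does before grafting this in.

/* Quantize a sample x in -1..1 to one of 16 levels and return a byte value 0..255. */
unsigned char quantize16(double x)
{
    int level = (int) ((x + 1.0) / 2.0 * 16.0);   /* 0..16; the value 16 occurs only when x is exactly 1 */
    if (level > 15)
        level = 15;                                /* clamp to a four-bit value */
    if (level < 0)
        level = 0;
    return (unsigned char) (level * 16);           /* spread the 16 levels back over 0..255 */
}

For the 32-level version, replace 16.0 by 32.0, 15 by 31, and the final 16 by 8.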
Exercise 10. Some simple experiments with .PPM files.
PPM stands for Portable Pixel Map and is a very simple format indeed. The first three lines of a PPM file are text giving information about the format, and the remainder of the file consists of the sampled data. All the files we use will begin with lines
P5
256 256
255
which identify the file as a grayscale image with 256 rows and 256 columns, with each sample stored as a single byte representing a value in the range 0 to 255.
I've supplied a sample file for you to experiment with, and I'll add some others (and tell you how to make your own).
Of course you'll need something to view these files with. I found a nice freeware graphics viewer. It displays files in a zillion different image formats, and it even plays .WAV files!
(a) (2 points) So is 0 white and 255 black? Or is it the other way around? Tell me how you figured this out.
(b) (8 points) Write a program that changes the resolution of a .PPM file by dividing the image into 2X2 blocks, and replacing each block of four numbers by their average. For example, if the upper left-hand corner of the image contains the values
180 182
170 178
then this block would be replaced by
178 178
178 178
(The actual average is 177.5.) Repeat this experiment with 4X4 blocks, 8X8 blocks, etc.
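A sketch of the 2X2 averaging, assuming the three-line header shown above and a 256 x 256 image (run it as something like average < sample.ppm > sample2.ppm):

#include <stdio.h>

#define N 256    /* the images described above are 256 x 256 */

int main(void)
{
    static unsigned char image[N][N];
    char line[64];

    for (int i = 0; i < 3; i++) {                  /* copy the three text header lines unchanged */
        if (fgets(line, sizeof line, stdin) == NULL)
            return 1;
        fputs(line, stdout);
    }
    if (fread(image, 1, N * N, stdin) != N * N)    /* read the 65536 sample bytes */
        return 1;

    for (int r = 0; r < N; r += 2)
        for (int c = 0; c < N; c += 2) {
            int avg = (image[r][c] + image[r][c+1]
                     + image[r+1][c] + image[r+1][c+1] + 2) / 4;   /* rounded average */
            image[r][c] = image[r][c+1] = image[r+1][c] = image[r+1][c+1] = (unsigned char) avg;
        }

    fwrite(image, 1, N * N, stdout);
    return 0;
}

For 4X4 or 8X8 blocks, change the step size and the averaging loop accordingly.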
(c) (8 points) Write a program that introduces additional quantization
error, analogous to 9(d) above. The extreme case is to use a single
bit per pixel, representing values above 127 by 255, and values 127 or
below by 0.
Exercise 11. Download the following C programs:
complexdefs.h
complex_arithmetic.c
fft.c
ftwavedata.c
iftwavedata.c
These programs contain code defining a complex number data type, performing the fast Fourier transform, and using the FFT to compute the transform and inverse transform of the data sets obtained from our wave files. You never need to look at the code! To use these programs, first compile them by typing:
cc ftwavedata.c fft.c complex_arithmetic.c -o ftwavedata
cc iftwavedata.c fft.c complex_arithmetic.c -o iftwavedata
Now suppose x.wav is a wave file containing at least 2^16 = 65536 but fewer than 110250 samples. If you type
unpackwave < x.wav > x.data
then the resulting text file x.data will display the sample values in the two-column format, with the values scaled to lie in the range between -1 and 1.
Typing
ftwavedata < x.data > x.transform.data
computes the DFT of the first 65536 samples of x.data and displays the result in three-column format: the first column consists of the integers 0, ..., 65535, the second column contains the real part of each transform value, and the third column the imaginary part of each transform value.
Typing
iftwavedata < x.transform.data > x2.data
computes the inverse transform. The program iftwavedata assumes that the transform data were produced by transforming real sample values, and thus only displays the real parts of the inverse transform values. In principle, x2.data should be identical to the first 65536 samples of x.data. In practice, there will be some roundoff error.
If you now type
repackwave < x2.data > x2.wav
the resulting file should sound very close to the original, except that it will end abruptly after about 6 seconds. The roundoff error should have no audible effect on the sound.
(a) (0 points) Try this out. Inspect the intermediate files that are generated along the way.
(b) (6 points) Write a program that reads one of the three-column transform data files and produces as output a listing of the moduli (absolute values) of the transform values, in wraparound order, suitable for producing plots like the spectrum you see in the notes. Try this out on several wave files and plot the results; then interpret the results physically (musically?).
Parts (c) and (d) constitute a genuine compression experiment.
(c) (12 points) Produce the transform file as outlined above, then quantize the data by selecting 256 levels between the minimum and maximum real parts in the file and 256 levels between the minimum and maximum imaginary parts, and replacing each value by the lowest value at its level. (For example, suppose we tried this with 16 levels instead of 256, and suppose the minimum real part was -2 and the maximum real part was 6. The 16 levels would then be -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5. A transform value whose real part was 1.4 would be replaced by one whose real part was 1. A good question is what to do with the maximum value 6 itself---my recommendation is to replace it by 5.5.) Now inverse transform, and repack. What does the resulting wave file sound like? (I could hear no difference, but obviously this is subjective.) Try varying the number of levels---for example, what if you use only 16 levels? Try setting the DC component aside first, as suggested in the notes.
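The per-value quantization step might look like this sketch (vmin, vmax, and nlevels are whatever you extracted from the transform file; you apply the function separately to the real parts and to the imaginary parts):

/* Replace v by the lowest value of its level, with nlevels equal levels spanning vmin..vmax.
   With vmin = -2, vmax = 6, nlevels = 16, the step is 0.5, v = 1.4 maps to 1.0,
   and v = 6 itself is pushed down into the top level and maps to 5.5. */
double quantize_level(double v, double vmin, double vmax, int nlevels)
{
    double step = (vmax - vmin) / nlevels;
    int k = (int) ((v - vmin) / step);
    if (k >= nlevels)      /* v == vmax lands here */
        k = nlevels - 1;
    if (k < 0)
        k = 0;
    return vmin + k * step;
}

For part (d), the level number k is exactly the byte value (0..255) you would write out.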
(d) (12 points) Repeat (c), but instead of writing out the new transform values, write the byte value (0-255) as an unsigned char to the output. You only have to write these out for half of the transform values, since the others can be computed by conjugate symmetry. The resulting file will have 32768 bytes. Compute its entropy, and compare this with the entropy of the original wave file. Compress it using the UNIX compress utility (which employs some version of Lempel-Ziv encoding). How much compression do you achieve?
(e) (10 points) If the program from (d) also outputs the extreme values used for the quantization, then the transform can be recovered (with quantization error, of course) from the output file. Use this to create a simple compression-decompression system for wave files (with the peculiar restriction that they be exactly 65536 samples long). The steps in compression are
unpack
transform
quantize
UNIX compress
The steps in decompression are
UNIX decompress
reconstruct transform file
inverse transform
repack
Exercise 12. (20 points) Repeat (c) and (d) above, but before quantization, set the 50% (75%, 90%, 95%) of transform values with the lowest magnitudes to 0. How much compression do you achieve? What kind of sound quality do you get?
Exercises 13 and 14 constitute a project problem.
Exercise 15. Some simple exercises on Haar wavelets.
(a) (5 points) Write out the basis vectors for 4-dimensional space.
For 16-dimensional space.
(b) (5 points) Compute the transform of (0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0)
two different ways: Using inner product, and using the fast algorithm.
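For reference, one common form of the fast algorithm replaces each pair of entries by its average and half-difference, keeping the averages in the front half of the array and the details in the back half, and then repeats on the front half. The notes may normalize differently (for instance, dividing by sqrt(2) instead of 2), so treat this only as a sketch of the structure:

/* Fast Haar transform of x[0..n-1], n a power of 2 and at most 1024. */
void haar_transform(double x[], int n)
{
    double tmp[1024];                                       /* scratch space */
    for (int len = n; len >= 2; len /= 2) {
        for (int i = 0; i < len / 2; i++) {
            tmp[i]         = (x[2*i] + x[2*i + 1]) / 2.0;   /* average */
            tmp[len/2 + i] = (x[2*i] - x[2*i + 1]) / 2.0;   /* detail  */
        }
        for (int i = 0; i < len; i++)
            x[i] = tmp[i];
    }
}

The total number of pair operations is n/2 + n/4 + ... + 1 = n - 1, which is the sort of count part (d) asks you to establish.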
(c) (10 points) Prove that reversing the steps of the fast algorithm
for computing the Haar wavelet transform actually does invert the transform.
(d) (10 points) Prove that the number of steps in the fast algorithm
is proportional to n.
(e) (5 points) Compute the DFT of the vector in (b) and compare the
result.