Compression

July 12, 2002

By Karen Kenworthy

IN THIS ISSUE

Compression
Hashes and Fingerprints
Collisions

I'm excited! For the last few months I've been working with a fascinating data compression technique. Called "Message Digest 5," or MD5, this mathematical formula (what computer folks call an algorithm) can compress any amount of data into a short sequence of just 16 bytes!

Think of it ... every book ever published, every photograph ever taken, every song ever written, all compressed into just 128 ones and zeros! What will these people think of next?

I know, I know. This sounds like a hoax. But it's true. The inventor of MD5, Ronald L. Rivest, is a respected mathematician. He teaches at the Massachusetts Institute of Technology (MIT), and is one of the founders of the respected computer security firm, RSA Security, Inc. (he's the "R" in "RSA").

Want more proof? Every web browser, many applications, and Windows itself, contain this algorithm. The largest corporations in the world, and the most powerful government agencies, depend on this algorithm every day.

So, what's the catch? Why haven't hard drive prices dropped to near zero? Why do computers need more than a few bytes of RAM? The answer might surprise you ...

It's easy to see why computers want to squeeze our information into the smallest spaces possible. Turning large blocks of data into small blocks allows disk drives to store more. It also makes data more portable, allowing more data to fit on a diskette, tape, or CD-RW disc.

Smaller blocks of data travel down wires and cables in less time, making network connections, including those that make up the Internet, more efficient. Compressing data also makes computer communication more reliable, since fewer bits transmitted means fewer chances for data to become garbled.

But how can computers perform this miracle? Try as you might, you can't put ten pounds of sugar into a five-pound sack. So how can computers accomplish the binary equivalent, putting 10 megabytes of data into 5-megabyte sacks, err, files?

True to their sneaky nature, our computers resort to several tricks, including one they borrowed from us. When you and I write text, we often use shortcuts, such as abbreviations and contractions. For example, I might have written the following sentence, if paper grew on trees:

I will visit doctor Smith at ten, ante meridiem.

This version requires a total of 48 characters.

But since paper and ink do cost money, I'm much more likely to convey exactly the same information this way:

I'll visit Dr. Smith at 10 a.m.

This version requires only 31 characters, a savings of over thirty-five percent. Err, make that 35%. :)

The computer version of this trick doesn't involve words. Instead, our binary buddies replace often-repeated long sequences of 1s and 0s with shorter placeholders. Later, when the original data must be recovered, a process called decompression, the placeholders are replaced by the original sequences.
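
The placeholder idea can be sketched in a few lines of Python. This is only a toy, not how any real compressor is written: the phrase table, the one-character placeholders, and the function names are all made up for illustration, and the sketch assumes the placeholders never appear in the original text.

```python
# Toy placeholder substitution. This table is hypothetical; real
# compressors build their tables automatically from the data itself.
PLACEHOLDERS = {
    "the quick brown fox": "\x01",
    "jumps over": "\x02",
}

def toy_compress(text):
    """Swap each long, repeated phrase for its one-character placeholder."""
    if any(token in text for token in PLACEHOLDERS.values()):
        raise ValueError("a placeholder already occurs in the data")
    for phrase, token in PLACEHOLDERS.items():
        text = text.replace(phrase, token)
    return text

def toy_decompress(packed):
    """Put the original phrases back where the placeholders are."""
    for phrase, token in PLACEHOLDERS.items():
        packed = packed.replace(token, phrase)
    return packed

original = "the quick brown fox jumps over the dog; " * 3
packed = toy_compress(original)
restored = toy_decompress(packed)

print(len(original), "characters before,", len(packed), "after")
print("recovered exactly:", restored == original)
```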

Many popular data compression techniques are referred to as "loss-less." This means no data is lost during the compression/decompression cycle. Compress a file using one of these techniques, then decompress the file, and you'll recover an exact, bit-for-bit duplicate of the original.

These techniques are very effective, often compressing data to one-half its original size. And, as you can imagine, these techniques are popular when compressing programs, spreadsheets, databases, and other information where a single changed bit could be disastrous.
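
To see a loss-less round trip in action, here is a minimal sketch using zlib, the compression module that ships with Python. The sample text is invented, and because it is far more repetitive than a typical file, it shrinks much more than the one-half figure mentioned above.

```python
import zlib

# Invented, highly repetitive sample data.
original = b"Paid invoice 1001. Paid invoice 1002. Paid invoice 1003. " * 50

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("bit-for-bit identical:", restored == original)
```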

But other data isn't so picky. For example, if one pixel of a picture is a few shades bluer than it ought to be, who will notice? Not many people. The same is true for music and other audio data. Suppose the amplitude of one millisecond of audio is played a few percent louder or softer? Would you be able to hear the difference? Probably not.

Our computers take advantage of these human limitations. When compressing some data, they first alter it, removing small differences between pixels, and moments of sound. This increases the number of repeated sequences in our data, making compression more effective. At the expense of lost fidelity, these "lossy" techniques can often compress a file to one-fifth its original size. And compression ratios of 10:1 are not uncommon.
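
Here is a rough sketch of why throwing away small differences helps. It is not how JPEG or MP3 actually work. It simply assumes a made-up row of "pixels" (a smooth brightness ramp with a little noise), rounds each value down to a multiple of 8, and lets zlib compress both versions.

```python
import random
import zlib

rng = random.Random(2002)  # fixed seed so the sketch is repeatable

def noisy_pixel(i):
    """Hypothetical brightness value: a slow ramp plus a little noise."""
    return min(255, max(0, i // 40 + rng.randint(-3, 3)))

pixels = bytes(noisy_pixel(i) for i in range(10_000))

# The lossy step: snap every value down to a multiple of 8,
# discarding differences too small for most eyes to notice.
smoothed = bytes((p // 8) * 8 for p in pixels)

exact_size = len(zlib.compress(pixels))
lossy_size = len(zlib.compress(smoothed))
worst_error = max(abs(a - b) for a, b in zip(pixels, smoothed))

print("compressed as-is:      ", exact_size, "bytes")
print("compressed after lossy:", lossy_size, "bytes")
print("largest change to any pixel:", worst_error)
```

The smoothed data can never be turned back into the original pixels. That is the price paid for the smaller file.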

Hashes and Fingerprints

So, where does the fabulous MD5 algorithm fit? Is it loss-less? Or is it lossy?

Actually, it's neither. Instead, it belongs to a group of compression algorithms known as "hashes." But they might as well be called "total loss" algorithms. That's because data, once compressed using one of these algorithms, cannot be recovered. That's right. These hashes have no corresponding decompression algorithm!

Yipes! Clearly, hashes aren't the best ways to compress your favorite picture or song. And they're a really poor choice when you need to save space storing your valuable programs, documents, or accounting data. So, what good are they? After all, if you wanted to lose data, you could just buy a shredder or eraser.

Before you decide that hashes are useless, let's take a closer look at how they work. Like all hashes, MD5 compresses files into a fixed size, regardless of the file's original size. You can think of a hash as a machine that converts a file of any size into a binary number of a fixed size.
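
You can watch the fixed-size behavior with Python's standard hashlib module. In this small sketch the sample inputs are arbitrary: nothing at all, three letters, and a million bytes. Each one comes out as exactly 16 bytes, shown here as 32 hexadecimal digits.

```python
import hashlib

samples = [b"", b"abc", b"x" * 1_000_000]

for data in samples:
    digest = hashlib.md5(data).digest()
    print(len(data), "bytes in ->", len(digest), "bytes out:", digest.hex())
```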

Now here's the magic. This number can be thought of as the file's "digital fingerprint," a number that uniquely identifies the original file, without containing any of the file's data!

Now, fingerprints are wonderful things. No two people have the same fingerprints. And fingerprint copies are small enough to be kept on file. These two facts allow forensic scientists to positively identify a person, or tell two people apart, just by viewing their fingerprints.

If files had fingerprints, that would come in handy too. No need to compare, bit-by-bit, two million-byte files to see if they contain the same information. Instead, just compare their much smaller fingerprints. If the fingerprints are identical, the files are too. If the fingerprints differ, by just one bit, the files must differ too.

Compare a file's fingerprint today, with one computed hours, days, or even years ago. If the fingerprints are the same, the file has not been changed. Want to know if a friend's copy of your file is intact? Send him your file's fingerprint. If it matches the fingerprint of his copy of the file, all's well. If the two fingerprints differ, something's gone awry.
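
The fingerprint check described above might look something like this in Python. It is a sketch, not the code inside any particular program: the function names and the 64 KB chunk size are my own choices, and the file is read a piece at a time so even very large files never need to fit in memory.

```python
import hashlib

def md5_fingerprint(path, chunk_size=65536):
    """Return a file's MD5 hash as 32 hex digits, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def same_fingerprint(path, expected_hex):
    """True if the file's fingerprint matches one recorded earlier."""
    return md5_fingerprint(path) == expected_hex.strip().lower()
```

Record the result of md5_fingerprint() today, and at any later time same_fingerprint() will report whether the file still matches.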

There's no doubt about it. Good digital fingerprints are useful. But how good is the MD5 digital fingerprint? Is each file's MD5 hash value really unique? Or might two files have the same hash value, what computer scientists call a "collision?"

Collisions

No hash algorithm is collision-free. It's always possible, at least in theory, to find two different files with exactly the same digital fingerprint. But MD5 is what's known as a secure, or cryptographic, hash. In the words of the experts, it's "strongly collision-free." In words you and I might use, it's *very* unlikely two files will ever have the same MD5 hash value.

How unlikely? Let's perform an experiment. We'll have each of the world's 6 billion people sit at their computers, and begin creating disk files. Feel free to type anything you like into each file. Just be sure that each file is unique -- unlike any other file ever created.

To complete the experiment in a reasonable time, let's have each person make 1,000 different disk files every second. They'll need to compute each file's MD5 hash value too. There's no time for sleep. We'll work 24 hours each day, 365 days a year. To obtain accurate results, let's run the experiment for, say, 1 million years ...

[One million years later]
All done? Whew! I'm tired! And I'll bet you're tired too. But it's been worth it. All our hard work has paid off, and 189,216,000,000,000,000,000,000,000 unique files have been born! And 189,216,000,000,000,000,000,000,000 MD5 hash values have been computed.

This looks like a very big number, doesn't it? And it is too, in some circles. But there are 340,282,366,920,938,463,463,374,607,431,768,211,456 possible MD5 hash values! That's because there are that many different 128-bit binary numbers.

So even after 1 million years of frantic, round-the-clock file making, less than 1 in 1,798,000,000,000 of the possible MD5 hash values have been used. Put another way, the number of MD5 hashes we've computed is less than 0.00000000006% of the total available.
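
If you'd like to check the arithmetic yourself, Python's integers never overflow, so the whole experiment fits in a few lines. This sketch assumes the inputs exactly as stated: 6 billion people, 1,000 files per person per second, 365-day years, and 1 million years. It also assumes no two files happened to share a hash value. The variable names are just for illustration.

```python
people = 6_000_000_000
files_per_second = 1_000
seconds_per_year = 60 * 60 * 24 * 365
years = 1_000_000

files_made = people * files_per_second * seconds_per_year * years
possible_hashes = 2 ** 128  # every different 128-bit binary number

print(f"{files_made:,} files made")
print(f"{possible_hashes:,} possible hash values")
print(f"at most 1 in {possible_hashes // files_made:,} values used")
print(f"that is {100 * files_made / possible_hashes:.1e} percent")
```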

It's possible, in theory, that one of those files will have the same hash value as a file you care about. But as you can see from the numbers, it's very unlikely. The odds are better that everyone on earth will be hit by lightning, on the same day -- the day you win the Irish Sweepstakes. It's not perfect. But in the real world, a file's MD5 hash makes an excellent digital fingerprint.

There's a lot more to say about MD5 and digital fingerprints. But unfortunately, that will have to wait until our next get-together. In the meantime, if you'd like to learn more about the MD5 algorithm, check out the Internet standards document RFC (Request for Comments) 1321. It's available online at:

http://www.ietf.org/rfc/rfc1321.txt

And if you'd like to see MD5 in action, try the new version of my popular Directory Printer. Now, in addition to file information such as size, attributes, date of last modification, and more, it can also print each file's MD5 hash value! The new Directory Printer also sports a new Printer Setup button, allowing you to select the printer where the report will appear, and choose your paper orientation (landscape or portrait).

To give the new Directory Printer v3.4 a try, visit its home page at:

https://www.karenware.com/powertools/ptdirprn

I've also created a brand new Power Tool, called Karen's Hasher, that lets you put MD5 to work today. It can compute the MD5 hash value for any text string, any file, and even for a group of files. You can also provide it a previously computed MD5 hash value, and the program will tell you if it is still valid, allowing you to detect file alterations.

To download your free copy of the new Hasher v1.0, visit its home page at:

https://www.karenware.com/powertools/pthasher

And for fans of the popular Replicator, I've prepared a new version of this Power Tool too! Look for Replicator v1.8.7 at:

https://www.karenware.com/powertools/ptreplicator

This version fixes a nagging problem that could cause the program to crash when asked to delete a read-only folder. And it now handles the "<dow>" (Day of Week) destination tag correctly (thanks to several readers who helped locate these two bugs).

And as always, if you prefer the convenience of a CD, or want to support Karen's Power Tools, visit my CD home page at:

https://www.karenware.com/licenseme

There you can order your own copy of Karen's CD, complete with the latest Directory Printer, Hasher, and Replicator. Your CD will also include the most recent versions of every other Power Tool, plus three bonus Power Tools programs not available anywhere else (one automatically downloads and installs updates to the Power Tools programs!). The CD even has all the back issues of my newsletters, and a special license that lets you use all the Power Tools at work!

Until we meet again, I'm going to try to find some way to compress monthly bills, and waistlines. <grin> And don't forget, if you see me on the 'net, be sure to wave and say "Hi!"
