All Bits Are Not Created Equal

The difference between a bit of data and a bit of information is far reaching when it comes to speech, video, or any form of data compression.

Most of us think of a bit, or binary digit, as a binary number that can represent one of two states, zero or one, off or on, false or true. But when we discuss a bit in the areas of information theory and compression, the bit takes on new meaning.

Let’s start with an example in which we wish to store the state of a home’s front door over time. We want to know the state of the door to within one second. At any given time, the door is either open or shut. We connect a sensor to a recording device and sample the sensor state once per second. We therefore store 1 bit per second, or 86,400 bits per day (the number of seconds in a day).

Figure 1: speech producing “variables” change at a relatively slow rate compared to the sampling rate.

Waste Not, Want Not

But isn’t this wasteful? We know that a home’s front door is typically open or closed for long periods of time. It doesn’t open and shut very often. Instead, we could store only the time of day at each door open or door shut event. There are 86,400 seconds per day, thus each time stamp would require 17 bits: 2^16 is 65,536 (too few), while 2^17 is 131,072 (greater than 86,400). We then add a single bit per event to indicate open vs. closed, for a total of 18 bits per event. If, in a given day, the door is opened or closed 10 times, we require 180 bits per day to store the events rather than 86,400 bits.
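The arithmetic above can be checked with a short Python sketch. The function names are illustrative only, and the count of 10 events per day is the figure used in the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def bits_for_values(n):
    """Smallest number of bits that can label n distinct values."""
    return (n - 1).bit_length()


def event_log_bits(events_per_day):
    """Bits per day when storing a time stamp plus one state bit per event."""
    timestamp_bits = bits_for_values(SECONDS_PER_DAY)
    state_bits = 1  # open vs. closed
    return events_per_day * (timestamp_bits + state_bits)


sampled_bits = SECONDS_PER_DAY * 1  # one bit stored every second
event_bits = event_log_bits(10)

print("bits per time stamp:", bits_for_values(SECONDS_PER_DAY))
print("sampled once per second:", sampled_bits, "bits per day")
print("event log, 10 events:", event_bits, "bits per day")
```

The event log only wins while events are rare. At 18 bits per event, it would take 4,800 events per day before the event log cost as much as sampling once per second.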

This example highlights the difference between bits of information and bits of data. If we store one bit per second, we know that there is an extremely high correlation between successive bits. Hence each data bit carries very little new information on average. The number of bits of data decreases and approaches the number of bits of information as the correlation between the set of bits approaches zero. And as this happens, the set of bits appears more random, and in the lingo of Information Theory, we approach maximum entropy.
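One way to put a number on "bits of information" is Shannon's binary entropy function. The sketch below applies it to the door under a simplifying assumption that is not part of the example itself: the 10 daily events are modeled as independent, each equally likely to fall in any given second.

```python
import math

SECONDS_PER_DAY = 86_400


def binary_entropy(p):
    """Shannon entropy, in bits, of a source that emits 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Assumption: a state change occurs in any given second with probability
# 10 / 86,400, independently of every other second.
p_change = 10 / SECONDS_PER_DAY
door_bits_per_day = SECONDS_PER_DAY * binary_entropy(p_change)

# A fair coin toss, for comparison, carries one full bit.
coin_bits_per_toss = binary_entropy(0.5)

print("door information per day: %.1f bits" % door_bits_per_day)
print("coin information per toss: %.1f bit" % coin_bits_per_toss)
```

Under this model the door carries roughly 145 bits of information per day, which is in the same range as the 180 bits of the event log and far below the 86,400 bits of raw samples.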

What happens if we replace the front door example with a once-per-second coin toss? In this case, there is no correlation between the outcomes of the coin tosses, and we must use one bit of data per coin toss to represent the result. There is inherently more information in the coin toss result than the state of the front door. The coin toss result is purely random, and we represent each bit of information with exactly one data bit. No compression is possible.
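The contrast can be tried out with a general-purpose compressor. The following sketch builds one day of simulated door samples and one day of simulated coin tosses, packs each into bytes, and runs both through Python's zlib module. The seed, the 10 toggle times, and the choice of zlib are assumptions made for illustration.

```python
import random
import zlib

SECONDS_PER_DAY = 86_400


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, eight bits per byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Coin: one independent fair toss per second.
coin_bits = [rng.randint(0, 1) for _ in range(SECONDS_PER_DAY)]

# Door: starts shut (0) and toggles at 10 randomly chosen seconds.
toggle_times = set(rng.sample(range(SECONDS_PER_DAY), 10))
door_bits = []
state = 0
for second in range(SECONDS_PER_DAY):
    if second in toggle_times:
        state ^= 1
    door_bits.append(state)

coin_packed = pack_bits(coin_bits)
door_packed = pack_bits(door_bits)
coin_compressed = zlib.compress(coin_packed, 9)
door_compressed = zlib.compress(door_packed, 9)

print("coin:", len(coin_packed), "bytes ->", len(coin_compressed), "bytes")
print("door:", len(door_packed), "bytes ->", len(door_compressed), "bytes")
```

The door data should shrink to a small fraction of its packed size, while the coin data should not shrink at all and may even grow by a few bytes of container overhead.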

We can extend this discussion into the field of speech compression. At the simplest, we can sample an audio waveform at a given sampling rate and store the bits. At a sampling rate of 8 kHz using 16-bit samples, we need 128 kbps. But is there truly that much information present? If three words per second are spoken, encoding the words in their alphanumeric representation may require 300 bits per second. Of course each person’s voice sounds different. There are differences in pitch, inflection, accent, and so on. But do these nuances contain enough information to take us from 300 bps to 128,000 bps?
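The two bit rates can be compared with a few lines of Python. The word length and character size below are hypothetical (five letters plus a space per word, 8 bits per character) and serve only to show that plain text fits comfortably within the 300 bps figure.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed audio samples, in bits per second."""
    return sample_rate_hz * bits_per_sample


def text_bit_rate(words_per_second, chars_per_word, bits_per_char):
    """Bit rate of the spoken words stored as plain text, in bits per second."""
    return words_per_second * chars_per_word * bits_per_char


pcm_bps = pcm_bit_rate(8_000, 16)

# Hypothetical figures: 5 letters plus a space per word, 8 bits per character.
text_bps = text_bit_rate(3, 6, 8)

print("sampled audio:", pcm_bps, "bps")
print("words as text:", text_bps, "bps")
print("ratio: about %d to 1" % (pcm_bps // text_bps))
```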

Insights and Efficiencies

If we consider how speech is produced, we gain insight into how to represent the speech signal more efficiently, that is, with fewer bits per second. Speech is formed when air passes through the vocal cords and then through the vocal tract as shown in Figure 1. The characteristics of these speech-producing “variables” change at a relatively slow rate compared to the sampling rate. Most speech compression techniques model the vocal tract at this slower rate (perhaps once every 10 or 20 milliseconds), using as few bits as are needed to reconstruct the speech from the model with a minimum of distortion. These bits are stored or transmitted to an endpoint where the encoded speech model is used to reconstruct the speech.
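To see why updating a model once per frame saves so much, compare the bits in one frame of raw samples with the bits spent describing the model for that frame. In the sketch below the 20 ms frame comes from the range mentioned above, while the 160 model bits per frame is a purely hypothetical budget chosen for illustration.

```python
SAMPLE_RATE_HZ = 8_000
BITS_PER_SAMPLE = 16


def raw_bits_per_frame(frame_ms):
    """Bits of raw samples that fall inside one frame."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    return samples * BITS_PER_SAMPLE


def model_bit_rate(frame_ms, model_bits_per_frame):
    """Bits per second when the model is sent once per frame."""
    return model_bits_per_frame * 1000 // frame_ms


FRAME_MS = 20
MODEL_BITS_PER_FRAME = 160  # hypothetical budget, not taken from any codec

raw_frame_bits = raw_bits_per_frame(FRAME_MS)
model_bps = model_bit_rate(FRAME_MS, MODEL_BITS_PER_FRAME)
pcm_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

print("raw samples per frame:", raw_frame_bits, "bits")
print("model parameters per frame:", MODEL_BITS_PER_FRAME, "bits")
print("model bit rate:", model_bps, "bps versus", pcm_bps, "bps sampled")
```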

The Nyquist-Shannon sampling theorem teaches us that a sequence of samples can accurately represent an arbitrary (band-limited) sound. But speech is anything but arbitrary. We would never confuse the sound of a chainsaw with the sound of a person speaking. By exploiting what we know about the characteristics of speech production, we can compress the samples into a time-varying vocal tract model that requires far fewer bits to represent the speech signal. The bit rate of our vocal tract model more closely represents the amount of information contained in human speech than does the bit rate of a sequence of samples.

If You Give a Child a Hammer…

On the other hand, what happens if we feed an arbitrary sound into a very efficient speech compression model? Anybody who has listened to music while on hold using a cell phone knows that low-bit-rate speech codecs do not represent music very well. This reminds me of the adage “To a child with a hammer, everything looks like a nail.” An efficient speech compression algorithm tries to model any signal using a single tool—the human vocal tract model. The hammer is a great tool for driving a nail into a piece of wood, but not so good at cutting wood. A low-bit-rate speech codec is good at compressing speech, but not necessarily good at compressing other sounds.

Video compression provides us with yet another example. Video consists of a sequence of still pictures (frames) sampled at a uniform frame rate, perhaps 30 frames per second. We could quantize the color and brightness of each pixel in each frame using a given number of bits per pixel and store or transmit those bits. The resulting bit rate would be enormous.
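A quick calculation shows how enormous. The frame size and pixel depth below are assumptions chosen for illustration (1920 by 1080 pixels at 24 bits per pixel); only the 30 frames per second comes from the text.

```python
def raw_video_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Bit rate of uncompressed video, in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


# Assumed frame format: 1920 x 1080 pixels, 24 bits per pixel.
bps = raw_video_bit_rate(1920, 1080, 24, 30)

print("uncompressed video: %.2f Gbps" % (bps / 1e9))
```

That is nearly 1.5 billion bits per second for a single uncompressed stream under these assumptions.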

If we observe any single frame, we expect to see patterns of pixels rather than random pixels. A given pixel is very likely to be similar in color and intensity to its surrounding pixels. Thus, we have a high degree of correlation, and compression can be done on a spatial basis.
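Spatial correlation can be illustrated by comparing the pixel values themselves with the differences between horizontally adjacent pixels. The sketch below uses a synthetic frame (a smooth brightness ramp with a little noise), which is an assumption standing in for real image data, and measures the empirical entropy of each representation.

```python
import math
import random
from collections import Counter


def entropy_bits(values):
    """Empirical entropy, in bits per symbol, of a list of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


WIDTH, HEIGHT = 64, 64
rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic frame (not real image data): a smooth brightness ramp plus a
# little noise, clamped to the 8-bit range 0..255.
image = [
    [min(255, max(0, 2 * x + y + rng.randint(-2, 2))) for x in range(WIDTH)]
    for y in range(HEIGHT)
]

pixels = [p for row in image for p in row]

# Difference between each pixel and its left-hand neighbor, row by row.
deltas = [row[x] - row[x - 1] for row in image for x in range(1, WIDTH)]

raw_entropy = entropy_bits(pixels)
delta_entropy = entropy_bits(deltas)

print("pixel values: %.2f bits per pixel" % raw_entropy)
print("neighbor differences: %.2f bits per pixel" % delta_entropy)
```

Because neighboring pixels are so similar, the differences cluster around a handful of small values and need far fewer bits each than the pixel values do. Practical image and video coders use more elaborate tools, but they exploit the same redundancy.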

Furthermore, if we look at a single frame and compare it to the frames that have preceded it, we see a high degree of correlation, giving us an opportunity to perform compression on a temporal basis.
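Temporal correlation can be sketched in the same spirit. Below, two synthetic frames differ only in a small 8 by 8 block, and the difference between the frames is compressed instead of the second frame itself. The noisy frame content, the block position, and the use of zlib are all assumptions for illustration. Real video codecs are far more sophisticated.

```python
import random
import zlib

WIDTH, HEIGHT = 64, 64
rng = random.Random(0)  # fixed seed so the run is repeatable

# Frame 1: a noisy texture, chosen so that compressing a frame on its own
# gains very little.
frame1 = bytearray(rng.randrange(256) for _ in range(WIDTH * HEIGHT))

# Frame 2: identical to frame 1 except for an 8x8 block that changes.
frame2 = bytearray(frame1)
for y in range(8, 16):
    for x in range(8, 16):
        frame2[y * WIDTH + x] = rng.randrange(256)

# Temporal residual: per-pixel difference, modulo 256 so it fits in a byte.
residual = bytes((b - a) % 256 for a, b in zip(frame1, frame2))

# A decoder holding frame 1 can rebuild frame 2 from the residual alone.
rebuilt = bytes((a + r) % 256 for a, r in zip(frame1, residual))

frame2_size = len(zlib.compress(bytes(frame2), 9))
residual_size = len(zlib.compress(residual, 9))

print("frame 2 compressed on its own:", frame2_size, "bytes")
print("residual against frame 1:", residual_size, "bytes")
print("frame 2 rebuilt exactly:", rebuilt == bytes(frame2))
```

The residual is almost entirely zeros, so it compresses to a small fraction of what the second frame costs when coded without reference to the first.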

The result: digital television with hundreds of channels of programming that arguably contain disappointingly little information of interest. But that is a topic for another day.


Scott Kurtz is the founder and president of DSP Soundware, a company dedicated to improving voice quality through the use of digital signal processing. Scott has over thirty years of experience in the fields of telecommunications and digital signal processing in both the commercial and defense industries. Scott earned his bachelor’s degree in electrical engineering from Lehigh University and his master’s degree in electrical engineering from Drexel University.