Floating point is the wrong choice for digital audio

30 October 2022

Audio can be represented digitally in a number of ways, but the most common method today is Pulse Code Modulation, or PCM. It involves taking snapshots of the amplitude of the sound wave (called "samples") at regular intervals, then encoding that amplitude as a numeric value. There are two parameters that control the quality of PCM sound - the sample rate and the bit depth. Sample rate refers to how frequently the amplitude of the sound wave is recorded. This is the most important parameter as it determines the range of possible frequencies captured by the recording.
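The sampling process can be sketched in a few lines of Python (the tone, duration, and 16-bit depth here are arbitrary choices for illustration):

```python
import math

SAMPLE_RATE = 44100                  # samples per second
MAX_AMP = 2 ** 15 - 1                # largest 16-bit signed sample value

def sine_pcm(freq_hz: float, seconds: float) -> list[int]:
    """Snapshot a sine wave's amplitude at regular intervals,
    encoding each snapshot as a 16-bit signed integer."""
    count = int(SAMPLE_RATE * seconds)
    return [round(MAX_AMP * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
            for n in range(count)]

samples = sine_pcm(440.0, 0.01)      # 10 ms of a 440 Hz tone: 441 samples
```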

The Nyquist-Shannon sampling theorem states that the maximum frequency that can be accurately captured by a sampled wave is half of the sample rate. The most commonly-used sample rate is 44.1 kHz (notably the rate used by CDs), which can accurately reproduce frequencies up to 22.05 kHz (just over the limit of human hearing). A common step up is 96 kHz, and music at this rate can be sourced more readily than you might expect via the streaming service TIDAL. Mid-range audiophile-grade equipment can frequently reach 192 kHz, but finding a recording with a sample rate that high is difficult. The now-defunct DVD-Audio format can theoretically store up to this sample rate, but it's unlikely that any commercial albums were released this way. Another option for music with a high sample rate is vinyl, but this is fraught with issues. The output of a vinyl recording depends heavily on the condition of the equipment, the vinyl, and the environment, making a perfect copy impossible. There's also the issue of the mastering process - the final record is limited by the quality of the recordings used to produce it. Any extra "detail" captured by a recording of a vinyl record is produced by limitations inherent to records; it's impossible to recover the information lost when the track was mastered.

The other parameter, bit depth, is less important. It refers to the number of bits used to represent the amplitude at each sample. For integer samples, the most common bit depth is 16 bits (again used by CDs), but 24 bits isn't uncommon either. Many tools support integers up to at least 32 bits, but finding music at this depth is rare. Much more common is the 32-bit IEEE 754 floating point number, with amplitudes mapped onto the range -1 to 1 and silence at 0. When choosing between 24-bit integers and 32-bit floats, the latter is the better option. The mantissa of a float is 24 bits wide (counting the implicit leading bit), giving it at least equivalent precision. The exponent can extend this precision, but with two caveats: nonlinearity and subnormal numbers.
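The "at least equivalent precision" claim can be checked by round-tripping integers through a 32-bit float (a quick sketch using Python's struct module; the helper name is mine):

```python
import struct

def roundtrips_as_f32(n: int) -> bool:
    """True if n survives a trip through a 32-bit float unchanged."""
    f = struct.unpack("<f", struct.pack("<f", float(n)))[0]
    return int(f) == n

# Every value a signed 24-bit integer can hold fits in the 24-bit mantissa:
print(all(roundtrips_as_f32(n) for n in (0, 1, -(2 ** 23), 2 ** 23 - 1)))  # True
# One bit wider and exactness is no longer guaranteed:
print(roundtrips_as_f32(2 ** 24 + 1))  # False
```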

Nonlinearity

An IEEE float has three sections - the sign, exponent, and mantissa. In a 32-bit float these sections are 1, 8, and 23 bits wide respectively. The sign bit distinguishes between positive and negative numbers. The exponent serves to position the "floating" point within the mantissa, effectively designating some run of bits as above the point (representing 1, 2, 4, 8, etc.) and some as below the point (representing 1/2, 1/4, 1/8, etc.). The exponent can also place the point outside the range of the mantissa's bits, inserting virtual zeros to represent very large or very small numbers. Because audio uses numbers in the range -1 to 1, there will never actually be any bits above the point (except for 1 and -1 themselves). One of the properties of floating point numbers is nonlinear precision - small numbers can be represented much more accurately than large ones. For a number greater than 1/2 (and less than 1), the point can only ever be in one position. The exponent can't refine the value any further, so the only bits that contribute to it are the 23 mantissa bits. For a number less than 1/2, the exponent can be put to use, giving a higher effective precision. This repeats as you divide the range in half: numbers under 1/4 are represented more precisely than numbers above 1/4, numbers under 1/8 more precisely than numbers above 1/8, and so on.
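This halving of the gap between representable values can be observed directly by nudging a float32's bit pattern (a sketch; the helper name is mine):

```python
import struct

def f32_ulp(x: float) -> float:
    """Gap between x (as a float32) and the next float32 above it."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    nxt = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return nxt - x

print(f32_ulp(0.75))    # 2**-24: in [1/2, 1), only the 23 mantissa bits refine the value
print(f32_ulp(0.375))   # 2**-25: one exponent step down, twice the precision
print(f32_ulp(0.1875))  # 2**-26: and again below 1/4
```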

This doesn't make a lot of sense for PCM samples. Larger amplitudes correspond to a higher volume, so all the extra precision floats give us is spent on quiet details we can't even perceive. This has interesting implications for the field of computer graphics as well - floating-point coordinates get less precise as they get larger. Screen coordinates are less precise towards the bottom right corner of the screen, and far-away vertices are positioned less precisely. Because a conventional depth buffer maps distant objects to values near 1.0, where precision is lowest, distant objects can experience z-fighting. Programmers solve this by reversing the depth buffer, using 0 to represent values far away and 1 for values immediately in front of the camera. 1
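A small Python sketch of the reversed-depth idea (the depth values are invented for illustration): two surfaces near the far plane collapse to the same float32 depth under the conventional mapping, but stay distinct when the mapping is flipped so the far plane sits at 0.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Conventional depth: near plane -> 0.0, far plane -> 1.0.
a, b = 0.99999990, 0.99999991        # two distinct, distant depths
print(f32(a) == f32(b))              # True: indistinguishable near 1.0
# Reversed: far plane -> 0.0, where float32 precision is highest.
print(f32(1.0 - a) == f32(1.0 - b))  # False: still distinct near 0.0
```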

Subnormal Numbers

Floating point numbers have a small trick to save space - aside from 0, every number has at least one bit set. While multiple different exponent-mantissa pairs could represent the same number, floats choose the value of the exponent such that the first set bit is just beyond the mantissa. This way, it doesn't need to be stored at all; its position is implied by the exponent. This raises a problem though - what about numbers where the position of this virtual 1 bit is closer to 0 than the exponent can represent? Numbers below this threshold (~1.2e-38 for 32-bit floats) are called subnormal numbers, and they have special handling: when the exponent is 0, the virtual 1 bit is removed, allowing smaller numbers to be represented with leading zeros in the mantissa. Because subnormal numbers are thought to be uncommon and they require special handling, not as much time has been spent optimizing calculations involving them, which can lead to a significant performance impact 2. An amplitude this close to zero is impossible to distinguish from silence (both by humans and hardware), so any stray subnormals produced by accumulated rounding error would go unnoticed in a finished track - but they could still cause performance drops in your decoder, resampler, or other software as you listen to or edit the music.
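The change in encoding is visible in the raw bits (a sketch; the example values follow from float32's minimum exponent of -126):

```python
import struct

def f32_bits(x: float) -> str:
    """Hex bit pattern of x encoded as a 32-bit float."""
    return f"{struct.unpack('<I', struct.pack('<f', x))[0]:08x}"

print(f32_bits(2.0 ** -126))  # 00800000: smallest normal number, exponent field 1
print(f32_bits(2.0 ** -130))  # 00080000: subnormal, exponent 0, leading zeros in mantissa
print(f32_bits(2.0 ** -149))  # 00000001: the smallest positive float32
```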

DAC Hardware

Floating point numbers map poorly onto digital-to-analog conversion hardware. The Schiit Yggdrasil, a $2,199 DAC, uses the TI DAC11001A chip. This is a 20-bit R-2R DAC, meaning it consists of a resistor ladder directly driven by a 20-bit input. Each bit corresponds to an input on the resistor ladder, and those inputs are weighted and summed such that each bit contributes half as much to the final signal as the one before it (i.e. the input is a 20-bit integer). Any floating point data needs to be converted back to an integer before it can be rendered into an analog signal. Even if you could spin floating point's nonlinearity as a benefit, all of that extra precision is lost in the conversion. Given the extra complexity and space requirements of 32-bit floating point, it's hard to justify choosing it over 24-bit integers. This does, however, bring up something more important...
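Feeding float samples to a chip like this means quantizing them back to integers. A minimal sketch of that conversion (the function name and the clip-then-scale policy are my own choices):

```python
def float_to_pcm20(sample: float) -> int:
    """Quantize a float sample in [-1.0, 1.0] to a signed 20-bit integer."""
    sample = max(-1.0, min(1.0, sample))   # clip out-of-range values
    return round(sample * (2 ** 19 - 1))   # scale to [-524287, 524287]

print(float_to_pcm20(1.0))   # 524287
print(float_to_pcm20(-1.0))  # -524287
print(float_to_pcm20(0.5))   # 262144 (rounded from 262143.5)
```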

Overkill

A 32-bit floating point number is precise well beyond the limit of human perception, even if we only consider the mantissa. It can perfectly represent a 24-bit integer, which is itself more than enough 3. The only reason to consider having music in a format like this is principle. Personally, I like my media as close to the original form as possible. All the music I own I've ripped as FLACs from CDs, so there's no information loss. I have a collection of Blu-rays that I remux into 20+ GB files (even though they could be made much smaller with no perceptible difference) because I like the idea of having a copy of something without degradation. It's because I care about my media in this strange way that this even matters to me, that I even bring it up. But if you feel the same way, maybe I've given you a new thing to worry about - that somebody converted your 32-bit integer samples to 32-bit float and threw away 8 perfectly good bits of precision.

1

This trick might work for audio as well - floats can represent signed zero, so we could map 1 and -1 to the center and 0 and -0 to the outer edge. This representation isn't supported by any hardware or software, though.

2

This seems to be true for more recent processors too, but it's possible to round subnormals to zero in hardware and avoid the penalty. Processor floating point flags and -ffast-math are their own can of worms, though.

3

A very good explanation of how bit depth actually affects the output signal and listening experience.