On the perception of sound and music. Sound Perception and Compression
Simple compression methods
Traditional lossless compression methods (Huffman, LZW, etc.) usually perform poorly on audio data, for the same reasons they perform poorly on visual data.
Some lossy compression methods are listed below:
Compression of silence (pauses) – detects periods of “silence” and encodes them compactly, much like run-length coding.
ADPCM – Adaptive Differential Pulse Code Modulation (the term adaptive delta pulse-code modulation is used in Russian literature). For example, the CCITT G.721 standard operates at 32 kbit/s (its successor, G.726, covers 16 to 40 kbit/s):
The difference between two or more consecutive samples is encoded and then quantized, so part of the information is lost. The quantization is adaptive (its parameters change with the signal), so a better SNR is achieved with fewer bits. The encoder has to predict how the sound will change, which is difficult.
Apple developed a proprietary system called ACE / MACE: lossy compression that tries to predict the value of the next sample. The compression ratio is on the order of 2:1.
Linear Predictive Coding (LPC) – tries to describe the signal with a “speech model” and transmits the parameters of the model. The result sounds like computer-synthesized speech, at 2.4 kbit/s.
Code Excited Linear Predictor (CELP) – the same as LPC, but it additionally transmits the error of the model (using a predefined set of “code words”), which gives telephone quality at 4.8 kbit/s.
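The difference-coding idea behind ADPCM can be illustrated with a minimal sketch. This is a toy coder, not the G.721 algorithm: the predictor is simply the previous reconstructed sample, and the step-size rule in `_adapt` (grow by 1.5, shrink by 0.9) and all function names are assumptions chosen for illustration.

```python
def _adapt(step, code, levels):
    """Toy adaptation rule (an assumption, not the G.721 rule):
    grow the step after large codes, shrink it after small ones."""
    if abs(code) >= levels // 2:
        return step * 1.5
    return max(1.0, step * 0.9)


def adpcm_encode(samples, bits=4, step=16.0):
    """Encode each sample as a small integer: the quantized difference
    between the sample and the previous reconstructed value."""
    levels = 2 ** (bits - 1)          # codes lie in -levels .. levels - 1
    predicted = 0.0
    codes = []
    for sample in samples:
        diff = sample - predicted
        code = max(-levels, min(levels - 1, round(diff / step)))
        codes.append(code)
        predicted += code * step      # track what the decoder will reconstruct
        step = _adapt(step, code, levels)
    return codes


def adpcm_decode(codes, bits=4, step=16.0):
    """Mirror of the encoder: rebuild samples from the codes alone."""
    levels = 2 ** (bits - 1)
    predicted = 0.0
    out = []
    for code in codes:
        predicted += code * step
        out.append(predicted)
        step = _adapt(step, code, levels)
    return out


if __name__ == "__main__":
    ramp = list(range(100))
    decoded = adpcm_decode(adpcm_encode(ramp))
    print(max(abs(a - b) for a, b in zip(ramp, decoded)))
```

The encoder and decoder update the step size from the transmitted codes only, so no side information is needed. The information loss happens in `round(diff / step)` and in the clamping to the code range.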
Psychoacoustic Based Compression Techniques
Representatives: MPEG Layer 2, MPEG Layer 3 (MP3), AAC (Advanced Audio Coding), TwinVQ, Ogg Vorbis, etc.
A codec algorithm using psychoacoustics usually consists of the following steps:
Calculation of the psychoacoustic model (masking).
Division of the signal into frequency subbands (FFT, DCT / MDCT, filter banks, etc.).
Quantization of the signal in the subbands according to the results of the psychoacoustic model. One quantization level can be used for several input values at once (vector quantization), as in TwinVQ.
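The last step above can be sketched as a simple bit-allocation rule. The sketch assumes the first two steps are already done, so that each subband has a signal level and a masking threshold in dB. It relies on the rule of thumb that each bit of a uniform quantizer adds about 6 dB of SNR. The band names, the levels, and the function `bits_for_band` are hypothetical; real codecs use more elaborate allocation loops.

```python
import math

DB_PER_BIT = 6.02  # approximate SNR gain per bit of a uniform quantizer


def bits_for_band(signal_db, mask_db):
    """Bits needed to push quantization noise below the masking threshold."""
    smr = signal_db - mask_db          # signal-to-mask ratio, dB
    if smr <= 0:
        return 0                       # the band is fully masked: send nothing
    return math.ceil(smr / DB_PER_BIT)


# Hypothetical (band, signal level dB, masking threshold dB) triples.
bands = [("low", 70.0, 40.0), ("mid", 60.0, 55.0), ("high", 30.0, 45.0)]
allocation = {name: bits_for_band(sig, mask) for name, sig, mask in bands}
print(allocation)
```

Bands whose signal lies below the masking threshold receive no bits at all, which is where most of the saving comes from.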
Some facts about sound perception
The frequency range perceived by a person is approximately 20 Hz to 20 kHz, with the highest sensitivity between 2 and 4 kHz.
The dynamic range (from the quietest perceived sounds to the loudest) is about 96 dB (an amplitude ratio of roughly 1 to 63,000 on a linear scale).
A person can distinguish a frequency change of about 0.3% at frequencies around 1 kHz.
If two signals differ by less than 1 dB in amplitude, they are difficult to tell apart. Amplitude resolution depends on frequency and is highest in the range from 2 to 4 kHz.
Spatial resolution (ability to localize the sound source) – up to 1 degree.
Sounds of different frequencies travel through the air at different speeds. As a result, the high-frequency part of the spectrum from the source located at a distance from the listener is somewhat delayed.
A person cannot notice a sudden dropout of high frequencies if it lasts no longer than about 2 ms.
Some studies show that a person can sense frequencies above 20 kHz. The frequency range narrows with age.
The frequency range carrying information in human speech is from 500 Hz to 2 kHz. Low frequencies carry bass and vowels.
The best speech compression is achieved with parametric encoders (LPC, CELP, etc.), which try to represent speech as a set of parameters of some speech model. General-purpose codecs (MPEG, etc.) tend to compress speech less efficiently.
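The decibel and percentage figures in the list above can be converted to linear quantities with a few lines of code. This is only a sketch, assuming the usual amplitude convention of 20·log10(ratio); the function names are illustrative.

```python
import math


def db_to_amplitude_ratio(db):
    """Amplitude ratio corresponding to a level difference in dB."""
    return 10 ** (db / 20)


def amplitude_ratio_to_db(ratio):
    """Level difference in dB corresponding to an amplitude ratio."""
    return 20 * math.log10(ratio)


# A 96 dB dynamic range as an amplitude ratio (roughly 63,000 to 1).
print(round(db_to_amplitude_ratio(96)))

# A 1 dB difference is an amplitude change of roughly 12%.
print(round(db_to_amplitude_ratio(1), 3))

# A 0.3% frequency change at 1 kHz is about 3 Hz.
print(0.003 * 1000)
```

For comparison, 16-bit samples span an amplitude ratio of 65,536 to 1, which `amplitude_ratio_to_db(2 ** 16)` puts at a little over 96 dB.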
On the perception of sound and music (perception and compression of sound)
In the general case, the ear is a non-linear system and cannot be described accurately using only linear elements (such as filters and delay lines). One by-product of this non-linearity is the following effect: when two tones with frequencies of 1000 and 1200 Hz are presented, a third tone at 800 Hz can also be heard. However, in the range of amplitudes of interest to us, the non-linearity is rather weak and is usually neglected.
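The 800 Hz tone corresponds to the combination frequency 2·1000 − 1200 Hz, which any weak cubic non-linearity produces. The sketch below is not a model of the ear: it passes two tones through the assumed distortion `y = x + k·x³` and measures the 800 Hz component with a single-bin DFT. The sampling rate, block length, and coefficient `k` are arbitrary choices for illustration.

```python
import cmath
import math

FS = 8000   # sampling rate in Hz (assumed)
N = 400     # 0.05 s, so every frequency used here completes whole cycles


def tone_amplitude(signal, freq_hz):
    """Amplitude of one frequency component, from a single-bin DFT."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / FS)
              for n, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)


def two_tones(f1, f2):
    """Sum of two unit-amplitude sine tones."""
    return [math.sin(2 * math.pi * f1 * n / FS) +
            math.sin(2 * math.pi * f2 * n / FS) for n in range(N)]


def weakly_nonlinear(signal, k=0.1):
    """Toy non-linearity: a linear term plus a small cubic term."""
    return [x + k * x ** 3 for x in signal]


linear = two_tones(1000, 1200)
distorted = weakly_nonlinear(linear)

print(tone_amplitude(linear, 800))     # essentially zero
print(tone_amplitude(distorted, 800))  # about 3/4 * k
```

Expanding the cubic term shows that the component at 2·f1 − f2 has amplitude 3k/4, so a small `k` gives a faint but measurable extra tone.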
The ear consists of three parts: the auricle (also called the outer ear), the middle ear and the inner ear – the cochlea. Passing through various parts of the ear, the sound undergoes a change.
One of the functions of the outer ear (auricle) is to improve the localization of the sound source in space. Because of its asymmetric shape, the auricle changes the frequency response differently for signals arriving from different points in space. The auricle can only affect signals whose wavelength is comparable to or smaller than the size of the ear (frequencies above about 3 kHz). The external ear canal resonates at a frequency of about 2 kHz, which gives increased sensitivity in this range.
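The link between frequency and wavelength here is simply wavelength = speed of sound / frequency. A short sketch, assuming a speed of sound of 343 m/s (air at about 20 °C), shows why only the upper part of the spectrum interacts with an object the size of the auricle:

```python
SPEED_OF_SOUND = 343.0  # m/s, air at about 20 °C (assumed value)


def wavelength_cm(freq_hz):
    """Wavelength in centimetres of a sound wave in air."""
    return 100 * SPEED_OF_SOUND / freq_hz


for f in (100, 1000, 3000, 10000):
    print(f, "Hz ->", round(wavelength_cm(f), 1), "cm")
```

At 100 Hz the wavelength is several metres, far larger than the ear, while at 3 kHz and above it drops to roughly ten centimetres or less.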
The middle ear acts as a hydraulic amplifier. Since the cochlea is filled with liquid and there is air outside, the acoustic impedances of the two media have to be matched. The middle ear also protects against low-frequency sounds of excessive amplitude.
The inner ear is the cochlea. Unrolled, it is a tube whose diameter gradually decreases toward one end.
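The need for matching can be estimated with the standard formula for the fraction of sound power transmitted across a flat boundary between two media at normal incidence. The sketch below treats the cochlear fluid as water and uses approximate textbook impedances. Both are assumptions, and the result is only an order-of-magnitude estimate.

```python
import math

Z_AIR = 415.0      # characteristic acoustic impedance of air, Pa·s/m (approximate)
Z_WATER = 1.48e6   # same for water, standing in for cochlear fluid (assumption)


def transmitted_fraction(z1, z2):
    """Fraction of incident sound power crossing a boundary (normal incidence)."""
    return 4 * z1 * z2 / (z1 + z2) ** 2


fraction = transmitted_fraction(Z_AIR, Z_WATER)
loss_db = -10 * math.log10(fraction)
print(fraction, loss_db)
```

Without any matching, only about a thousandth of the power would enter the liquid, a loss of roughly 30 dB.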