Audio encoding refers to the methods used to store and transmit audio data. The article below describes how these encodings work.
Note that this is a fairly involved topic: "audio encoding depth". This article defines the concept and presents it only as a general overview. Some of this reference material may be helpful in understanding how the Speech API works and how to format and process audio in your applications.
How to find the audio encoding depth
Audio format is not equivalent to audio encoding. For example, the popular WAV file format defines the header format of an audio file, but is not itself an audio encoding. WAV audio files often, but not always, use linear PCM encoding.
FLAC, in turn, is both a file format and an encoding, which sometimes leads to confusion. Within the Speech API, FLAC is the only encoding that requires the audio data to include a header; all other encodings specify headerless audio data. When we refer to FLAC in the Speech API, we always mean the codec. When we refer to the FLAC file format, we will write ".FLAC".

You are not required to specify the encoding and sample rate for WAV or FLAC files. If these parameters are omitted, the Cloud Speech API determines the encoding and sample rate automatically from the file header. If you specify an encoding or sample rate value that does not match the value in the file header, the Cloud Speech API returns an error.
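For illustration, here is a minimal sketch of such a request using the google-cloud-speech Python client, with the encoding and sample rate deliberately left out so they are read from the file header. The file name and language code are hypothetical, and the snippet assumes your credentials are already configured.

    from google.cloud import speech

    client = speech.SpeechClient()

    # Read a FLAC file whose header already declares encoding and sample rate.
    with open("recording.flac", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    # Encoding and sample_rate_hertz are omitted on purpose: for WAV/FLAC
    # the API can take both values from the file header.
    config = speech.RecognitionConfig(language_code="en-US")

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)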
Audio encoding depth - what is it?
Audio is made up of waveforms consisting of a superposition of waves of different frequencies and amplitudes. To represent these waveforms digitally, the signals must be sampled at a rate that can capture the highest-frequency sounds you want to reproduce, and they must be stored with enough bit depth to represent the correct amplitude (loudness and softness) of the waveforms across the audio sample.
The ability of an audio processing device to recreate frequencies is known as its frequency response, and the ability to reproduce proper loudness and softness is known as its dynamic range. Together, these terms are often referred to as the fidelity of an audio device. An audio encoding, at its simplest, is the means by which audio can be reconstructed using these two basic principles, along with a way to store and transmit such data efficiently.
Sampling rate
Sound exists as an analog waveform. Digital audio approximates this analog waveform by sampling its amplitude at a rate fast enough to mimic the wave's intrinsic frequencies. The sampling rate of a digital audio signal determines the number of samples taken from the original audio material per second. Higher sampling rates increase the ability of digital audio to faithfully represent high frequencies.

As a consequence of the Nyquist-Shannon theorem, it is generally necessary to sample at a rate of at least twice the frequency of any sound wave you need to capture digitally. For example, to represent sound within the range of human hearing (20-20,000 Hz), a digital audio format must sample at least 40,000 times per second (which is why CD audio uses a sampling rate of 44,100 Hz).
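A small sketch of this relationship, with illustrative frequencies only:

    # Nyquist: to capture a frequency f, sample at a rate of at least 2 * f.
    def min_sample_rate(max_frequency_hz: float) -> float:
        """Minimum sampling rate needed to represent max_frequency_hz."""
        return 2 * max_frequency_hz

    print(min_sample_rate(20_000))  # 40000.0 -> upper limit of human hearing
    print(min_sample_rate(8_000))   # 16000.0 -> typical speech bandwidth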
Bit Depth
Bit depth determines the dynamic range of a given audio sample. A higher bit depth allows more accurate amplitudes to be represented. If an audio sample contains both very loud and very soft sounds, you will need more bits to represent those sounds correctly.
Higher bit depths also improve the signal-to-noise ratio of audio samples. CD music audio uses an encoding depth of 16 bits. Some compression methods can compensate for smaller bit depths, but they are generally lossy. DVD Audio uses 24 bits of depth, while most telephones encode audio at 8 bits.
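As a rough rule of thumb, each bit of depth adds about 6 dB of dynamic range (20·log10(2) ≈ 6.02 dB). The short sketch below illustrates this for the bit depths mentioned above:

    import math

    def dynamic_range_db(bit_depth: int) -> float:
        """Approximate dynamic range of linear PCM at a given bit depth."""
        return 20 * math.log10(2 ** bit_depth)

    for bits in (8, 16, 24):
        print(f"{bits}-bit PCM: ~{dynamic_range_db(bits):.0f} dB")
    # 8-bit PCM: ~48 dB (telephony)
    # 16-bit PCM: ~96 dB (CD audio)
    # 24-bit PCM: ~144 dB (DVD Audio)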

Uncompressed audio
Most digital audio processing uses these two techniques (sampling rate and bit depth) to store audio data in a straightforward way. One of the most popular digital audio technologies (popularized by the CD) is known as Pulse Code Modulation (or PCM). Audio is sampled at set intervals, and the amplitude of the sampled wave at each point is stored as a digital value using the sample's bit depth.
Linear PCM (which indicates that the amplitude response is linearly uniform across the sample) is the standard used on CDs and in the Speech API's LINEAR16 encoding. Both encodings produce an uncompressed byte stream that corresponds directly to the audio data, and both use a depth of 16 bits. Linear PCM on CDs uses a sampling rate of 44,100 Hz, which is appropriate for reconstructing music; a sampling rate of 16,000 Hz, however, is more appropriate for reconstructing speech.
Linear PCM (LINEAR16) is an example of uncompressed audio because the digital data is stored exactly as described above. When reading a single-channel byte stream encoded with Linear PCM, you can read off every 16 bits (2 bytes) to get another amplitude value of the signal. Almost all devices can manipulate such digital data natively - you could even trim Linear PCM audio files with a text editor - but uncompressed audio is not the most efficient way to transport or store digital audio. For this reason, most audio uses digital compression methods.
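To make this concrete, here is a minimal sketch that walks a single-channel LINEAR16 byte stream two bytes at a time. It assumes headerless, little-endian, signed 16-bit samples, and the file name is hypothetical:

    import struct

    # Raw, headerless single-channel LINEAR16 data: one signed 16-bit
    # little-endian integer per sample.
    with open("speech.raw", "rb") as f:
        pcm_bytes = f.read()

    # Every 2 bytes is the next amplitude value of the signal.
    usable = len(pcm_bytes) - (len(pcm_bytes) % 2)  # ignore a trailing odd byte
    samples = [struct.unpack("<h", pcm_bytes[i:i + 2])[0]
               for i in range(0, usable, 2)]

    print(f"{len(samples)} samples, peak amplitude {max(map(abs, samples))}")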
Compressed audio
Audio data, like all data, is often compressed to make it easier to store and transmit. Compression in audio coding can be either lossless or lossy. Lossless compression can be decompressed to restore the digital data to its original form. Lossy compression necessarily removes some information during compression and decompression, and is parameterized to indicate how much data the compression technique is allowed to discard.

Lossless compression
Lossless compression shrinks digital audio data using complex rearrangements of the stored data without degrading the quality of the original digital sample. With lossless compression, no information is lost when the data is decompressed back into its original digital form.
So why do lossless compression methods sometimes have optimization options? These options usually trade file size against compression and decompression time. For example, FLAC uses a compression level setting from 0 (fastest) to 8 (smallest file size). Higher-level FLAC compression loses no information compared with lower-level compression; the compression algorithm simply has to expend more computational effort constructing or deconstructing the original digital audio.
The Speech API supports two lossless encodings: FLAC and LINEAR16. Technically, LINEAR16 is not "lossless compression" because no compression is involved in the first place. If file size or data transfer matters to you, choose FLAC as your audio encoding.
Lossy compression
Lossy compression, on the other hand, eliminates or reduces certain kinds of information when constructing the compressed data. The Speech API supports several lossy formats, although they should be avoided where possible because the data loss can affect recognition accuracy.

The popular MP3 codec is an example of a lossy encoding technique. All MP3 compression techniques remove sound from outside the normal human audio range and adjust the amount of compression by tuning the effective bit rate of the MP3 codec, i.e. the number of bits per second used to store the audio data.
For example, a stereo CD using 16-bit linear PCM has an effective bit rate of:
44,100 samples per second × 2 channels × 16 bits per sample = 1,411,200 bits per second (bps) = 1,411 kbps
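The same arithmetic as a short sketch:

    def pcm_bit_rate(sample_rate_hz: int, channels: int, bit_depth: int) -> int:
        """Effective bit rate of uncompressed linear PCM, in bits per second."""
        return sample_rate_hz * channels * bit_depth

    print(pcm_bit_rate(44_100, 2, 16))  # 1411200 bps (~1411 kbps), stereo CD
    print(pcm_bit_rate(16_000, 1, 16))  # 256000 bps, mono LINEAR16 speech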
MP3 compression, for example, discards such digital data to achieve bit rates such as 320 kbps, 128 kbps, or 96 kbps, with a corresponding degradation in sound quality. MP3 also supports variable bit rates, which can compress the audio further. Both approaches lose information and can affect quality; it is safe to say that most people can hear the difference between 96 kbps and 128 kbps encoded MP3 music.

Other forms of compression
MULAW is an 8-bit PCM encoding in which the sample amplitude is modulated logarithmically rather than linearly. As a result, uLaw reduces the effective dynamic range of the compressed audio. Although uLaw was introduced specifically to optimize the encoding of speech as opposed to other types of audio, 16-bit LINEAR16 (uncompressed PCM) is still far superior to 8-bit compressed uLaw audio.
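For illustration, here is a minimal sketch of the µ-law companding curve (µ = 255) applied to a normalized sample; it shows only the standard formula, not any particular library's implementation:

    import math

    MU = 255  # standard mu value for 8-bit uLaw

    def mulaw_compress(x: float) -> float:
        """Map a linear sample in [-1.0, 1.0] onto the logarithmic uLaw curve."""
        return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

    # Quiet sounds keep relatively more resolution than loud ones.
    for amplitude in (0.01, 0.1, 0.5, 1.0):
        print(f"{amplitude:4.2f} -> {mulaw_compress(amplitude):.3f}")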
AMR and AMR_WB modulate the encoded audio sample by introducing a variable bit rate into the original audio sample.

Although the Speech API supports several lossy formats, you should avoid them if you have control over the source audio. While removing such data through lossy compression may have no noticeable effect on the audio as heard by the human ear, losing that data can significantly degrade the accuracy of a speech recognition engine.