Audio function in network camera

The audio function has essentially become standard on network cameras. Network cameras with audio usually provide a built-in microphone/pickup or an audio input interface, so users can choose to connect an external microphone/pickup of a different type or higher quality.

Likewise, a network camera may have a built-in speaker or provide an audio output interface, and the user can choose to connect an external speaker.

Audio working mode

Depending on the application, one-way or two-way audio transmission may be required; two-way transmission can happen in both directions simultaneously or in one direction at a time.

There are three basic modes of audio communication:

  1. Simplex mode: audio can be sent in only one direction. In most cases the audio is sent from the camera, but it can also be sent from the user.
  2. Half-duplex mode: audio can be sent and received in both directions between the camera and the operator, but in only one direction at a time. This type of communication is similar to a walkie-talkie: to speak, the operator must press and hold the call button; releasing the button lets the operator receive audio from the camera. With half-duplex there is no risk of echo problems.
  3. Full-duplex mode: users can send and receive audio at the same time (listen and speak simultaneously), much like a telephone conversation. Full duplex requires the client PC to be able to handle full-duplex audio.

Audio coding

Sampling rate, sample size

Sound is an energy wave characterized by frequency and amplitude: frequency corresponds to the time axis, amplitude to the level axis. The waveform is infinitely smooth and can be seen as composed of countless points. To transmit or store sound digitally over a network, it must first be encoded, which means sampling the points of the waveform: the sampling process extracts the signal value at discrete points in time.

Obviously, the more points extracted per second, the richer the frequency information obtained. To reconstruct a waveform, there must be at least two sampling points per vibration cycle. The highest frequency the human ear can perceive is 20kHz, so to satisfy human hearing at least 40k samples per second are required, written as 40kHz; this 40kHz is the sampling rate. A common CD has a sampling rate of 44.1kHz, and 44.1kHz is also the default audio sampling rate of many security cameras.

Frequency information alone is not enough. We must also obtain the energy value at each sample and quantize it to express signal strength. The number of quantization levels is an integer power of 2; a common CD uses a 16-bit sample size, i.e. 2 to the 16th power levels. A simple example: suppose a wave is sampled 8 times and the energy values of the sample points are 1-8. With only a 2-bit sample size (4 levels), we can keep the values of 4 points exactly and must discard the other 4. With a 3-bit sample size, all 8 values are recorded. The larger the sampling rate and sample size, the closer the recorded waveform is to the original signal.
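The 8-point example above can be sketched in a few lines of code. This is purely illustrative (not a real codec): the sample values stand in for 3-bit levels 0-7, and quantizing to fewer bits simply discards the low-order information.

```python
# Illustrative sketch of quantization at different bit depths (not a real codec).
# The original samples use 3 bits (8 levels, values 0-7); quantizing to fewer
# bits discards the low-order information, merging neighboring levels.

def quantize(samples, bits, source_bits=3):
    """Keep only the top `bits` bits of each sample value."""
    shift = source_bits - bits
    return [(s >> shift) << shift for s in samples]

samples = list(range(8))           # the 8 sample points from the example
print(quantize(samples, 2))        # 2 bits: only 4 distinct levels remain
print(quantize(samples, 3))        # 3 bits: all 8 values are kept exactly
```

With 2 bits, neighboring values collapse onto the same level, which is exactly the information loss described above.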

Audio stream calculation

Audio stream = sampling rate × sample size × number of channels (bps).

For a WAV file with a 44.1KHz sampling rate, 16-bit sample size, and dual-channel PCM encoding, the stream is 44.1K × 16 × 2 = 1411.2 Kbps. The "128K MP3" we often speak of corresponds to exactly this 1411.2 Kbps WAV parameter. This parameter is also called data bandwidth, the same concept as bandwidth in ADSL. Dividing the bit rate by 8 gives the data rate of this WAV: 176.4 KB/s. That means one second of 44.1KHz, 16-bit, two-channel PCM-encoded audio requires 176.4KB of storage, and one minute about 10.34MB.
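The arithmetic above can be verified directly (the function name here is just for illustration):

```python
# Verify the PCM stream arithmetic: rate (kHz) x sample size (bits) x channels.

def pcm_bitrate_kbps(sample_rate_khz, sample_bits, channels):
    return sample_rate_khz * sample_bits * channels

kbps = pcm_bitrate_kbps(44.1, 16, 2)
kb_per_sec = kbps / 8                 # bit rate -> byte rate
mb_per_min = kb_per_sec * 60 / 1024   # one minute of audio

print(round(kbps, 1))        # 1411.2 kbit/s
print(round(kb_per_sec, 1))  # 176.4 KB/s
print(round(mb_per_min, 2))  # 10.34 MB
```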

The amount of data is very large. There are only two ways to reduce it: lowering the sampling parameters or compression. Lowering the parameters is not advisable, so compression coding is the only option.

Encoding algorithm

There are many audio compression coding methods, roughly divided into three categories: waveform coding, parameter coding, and hybrid coding. They will not be expanded on here; to learn more, see the reference materials after the article.

| Technology | Encoding algorithm | Standard | Bitrate (kbit/s) | Quality | Application |
|---|---|---|---|---|---|
| Waveform coding | PCM | G.711 | 64 | 4.8 | PSTN, ISDN |
| Parameter coding | LPC | | 2.4 | 2.5 | Secure voice |
| Hybrid coding | CELPC | | 4.8 | 3.2 | Civil aviation |
| Hybrid coding | VSELPC | GIA | 8 | 3.8 | Mobile communication, voice mail |

Comparison of common audio coding algorithms

Coding standard

Here we focus on common audio coding formats, especially those frequently used in security monitoring systems. Audio generally comes together with video, so as with video coding (see: H.265 video encoding), audio coding is mainly specified by two organizations: one is ITU-T, the other is ISO/IEC MPEG.

The audio codecs specified by ITU-T are mainly the G.7xx series, and those of ISO/IEC MPEG are the MPEG-1, -2, -4 series.


PCM

In computer applications, the highest-fidelity encoding is PCM, a standard formulated by ITU-T. It is widely used for master preservation and music appreciation: CDs, DVDs, and our common WAV files. By convention PCM is regarded as lossless encoding, because it represents the best fidelity level in digital audio, although strictly speaking PCM can only approximate the original analog signal arbitrarily closely. The bit rate of a two-channel PCM audio stream (44.1KHz sampling rate, 16-bit sample size) is a fixed value: 44.1K × 16 × 2 = 1411.2Kbps.


G.711

G.711 adopts the logarithmic pulse-code modulation sampling standard, using pulse code modulation to sample audio at 8000 samples per second, with a bit rate of 64kbps, a theoretical delay of 0.125ms, and a quality of MOS 4.10.

G.711 is the mainstream waveform audio codec. There are two compression algorithms under the G.711 standard: one is the u-law algorithm (also written mu-law or ulaw), i.e. G.711u, mainly used in North America and Japan; the other is the A-law algorithm, i.e. G.711a, mainly used in China, Europe and the rest of the world. The latter was specifically designed to be easier for computers to process.

The compression ratio of G.711 is fixed: 8/14 = 57% (G.711u), 8/13 = 62% (G.711a).
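As an aside, the μ-law companding curve behind G.711u can be sketched as below. This is the continuous formula with μ = 255, not the table-driven 8-bit codec actually specified in the standard:

```python
import math

MU = 255  # the mu value used by G.711u

def mu_law_compress(x):
    """Compand a normalized sample x in [-1, 1] with the mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Quiet samples are boosted before quantization and loud ones compressed,
# which is what lets 8 bits cover the dynamic range of 14-bit linear PCM.
x = 0.01
y = mu_law_compress(x)
print(round(y, 3))                        # a small input maps to a much larger value
assert abs(mu_law_expand(y) - x) < 1e-12  # the round trip recovers the sample
```

The logarithmic curve is why the quoted compression ratio is fixed: every 14-bit linear sample becomes one 8-bit companded sample regardless of content.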


G.726

PCM is uncompressed, so the amount of data is usually relatively large. ADPCM (Adaptive Differential Pulse Code Modulation) can compress audio data to reduce bandwidth and storage pressure. G.726 is an audio coding algorithm defined by ITU-T that is essentially an ADPCM. Built on the G.721 and G.723 standards, G.726 can convert a 64kbps PCM signal into a 40kbps, 32kbps, 24kbps, or 16kbps ADPCM signal.


G.722

G.722 is a wideband speech coding algorithm that supports bit rates of 64, 56 and 48kbps. In G.722, the voice signal is sampled at 16000 samples per second. Compared with narrowband speech coding limited to about 3.6kHz, G.722 can handle wideband audio signals with frequencies up to 7kHz. The G.722 encoder is based on sub-band adaptive differential pulse code modulation (SB-ADPCM): the signal is divided into two sub-bands, and ADPCM is used to encode the samples of each.


G.728

G.728 is a 16kbps compression standard based on the low-delay code-excited linear prediction (LD-CELP) principle, with an algorithmic coding delay of 0.625ms.


G.729

The G.729 coding scheme is a standard for coding telephone-bandwidth voice signals. The analog input is sampled at 8kHz and quantized as 16-bit linear PCM. G.729A is a simplified version of the ITU speech coding standard G.729. Unlike G.711, which is completely free to use, G.729 requires payment to use.


LPC

Linear predictive coding (LPC) is a tool used mainly in audio signal processing and speech processing to represent the spectral envelope of a digital speech signal in compressed form, based on a linear predictive model. It is one of the most effective speech analysis techniques and one of the most useful methods for high-quality speech encoding at low bit rates, providing very accurate predictions of speech parameters.

The bandwidth required for LPC is 2Kbps-4.8Kbps.


CELPC

CELPC (Code-Excited Linear Predictive Coding) belongs to the vocoder class. This type of encoder extracts important features from the time waveform and is best suited to low-bit-rate encoders. Standardized by the European Telecommunications Standards Institute (ETSI); required bandwidth: 4-16Kbps.

MPEG series

MPEG Audio is divided into two categories: MPEG-1 and MPEG-2. Each category can be divided into three layers, Layer1, Layer2 and Layer3.

Externally, the main difference between the layers of MPEG-1 audio is the compression ratio and the data rate required to play the media; internally, the algorithms also differ considerably, becoming more complex as the layer number increases.

Audio files encoded with Layer1 use the suffix MP1, and the other two layers use MP2 and MP3.

The new audio features of MPEG-2 are "low sampling frequency extension" and "multi-channel extension". The low sampling frequency extension serves very low bit rate applications with limited bandwidth: the new sampling frequencies are 16, 22.05 or 24kHz, and the bit rate extends down to 8kbps. The multi-channel extension targets surround sound systems with 5 main channels (left, right, center, left surround, and right surround); some surround systems add an extra low-frequency enhancement channel for low-frequency audio signals. For such systems, the multi-channel extension allows up to 7 channels.

The required bandwidth of MP1 is 384kbps (about 4× compression), MP2 is 256-192kbps (6-8×), and MP3 is 128-112kbps (10-12×).
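The compression factors quoted above can be sanity-checked against the 1411.2 Kbps PCM stream computed earlier (the quoted factors are rounded):

```python
# Compare each layer's bandwidth with the uncompressed 1411.2 kbit/s PCM stream.
pcm_kbps = 44.1 * 16 * 2  # CD-quality stereo PCM

for name, kbps in [("MP1", 384), ("MP2", 256), ("MP3", 128)]:
    print(name, round(pcm_kbps / kbps, 1))  # MP1 ~3.7x, MP2 ~5.5x, MP3 ~11.0x
```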

MPEG-4 audio coding mainly consists of AAC, AAC+, VQF and so on.


AAC (Advanced Audio Coding, sometimes mistyped "ACC") appeared in 1997, based on MPEG-2 audio coding technology. It was jointly developed by Fraunhofer IIS, Dolby Laboratories, AT&T, Sony and other companies to replace the MP3 format.

In 2000, after the MPEG-4 standard emerged, AAC reintegrated its features and added SBR (Spectral Band Replication) and PS (Parametric Stereo) technology. To distinguish it from traditional MPEG-2 AAC, it is also called MPEG-4 AAC.

AAC-LC: Low Complexity Advanced Audio Coding. A high-performance audio codec delivering high-quality audio at low bit rates. AAC-LC supports bit rates up to 256kbit/s per channel and sampling rates from 8 to 96kHz.

AAC-HE: High Efficiency Advanced Audio Coding, also known as AAC+. It focuses on low-bit-rate encoding, combining AAC with SBR technology, and is well suited to multi-channel files.

The key to SBR is to provide full-bandwidth encoding at low bit rates without generating redundant signals.

Audio in security cameras

In security cameras, Hikvision supports G.722.1, G.711ulaw, G.711alaw, MP2L2 (MPEG2-Layer2), G.726, AAC, PCM and MP3 encoding.

Uniview supports three formats: G.711U, G.711A and AAC-LC. G.711U and G.711A only support an 8K sampling rate, while AAC-LC supports 8K/16K/48K.

Input and output type

The camera's audio input type can generally be set to Line in or Mic in. Use Line in with an active pickup, and Mic in with a passive microphone.

When connecting the audio input of a network camera, a 3.5mm mono microphone plug is recommended; if a stereo microphone plug is used, make sure the effective signal goes to the left channel (L).

For audio output, a 3.5mm stereo headphone or speaker plug is recommended.

Audio Line

Audio lines generally use 4-core shielded cable (RVVP) or unshielded twisted-pair digital communication cable (UTP) with a larger conductor cross-section, such as 0.5mm². For cable runs around 100m, dedicated shielded audio cable is recommended. Common audio cables include RCA audio cables and ordinary coaxial cables.
