At the moment, few people think about how file compression works. Compared to the past, using a personal computer has become much easier. And almost every person working with the file system uses archives. But few people think about how they work and on what basis files are compressed. The very first version of this process was Huffman codes, and they are used to this day in various popular archivers. Many users do not even think about how easy it is to compress a file and how it works. In this article, we will look at how compression occurs, what nuances help speed up and simplify the encoding process, and also understand the principle of building an encoding tree.

## History of the algorithm

The very first algorithm for efficient coding of electronic information was the code proposed by Huffman back in the middle of the twentieth century, namely in 1952. It is he who is currently the main basic element of most programs designed to compress information. At the moment, some of the most popular sources using this code are ZIP, ARJ, RAR archives and many others.

This Huffman algorithm is also used to compress JPEG images and other graphic objects. Well, all modern faxes also use encoding, invented in 1952. Despite the fact that so much time has passed since the creation of the code, it is still used in the newest shells and on old and modern types of equipment.

## Efficient Coding Principle

The Huffman algorithm is based on a scheme that allows you to replace the most probable, most frequently occurring characters with binary system codes. And those that are less common are replaced by longer codes. The transition to long Huffman codes occurs only after the system uses all the minimum values. This technique allows you to minimize the length of the code for each character of the original message as a whole.

The important point is that at the beginning of coding the probabilities of the occurrence of letters must already be known. It is from them that the final message will be compiled. Based on these data, a Huffman coding tree is being constructed, on the basis of which the process of encoding letters in the archive will be carried out.

## Huffman code example

To illustrate the algorithm, let's take a graphical version of building a code tree. To use this method effectively, it is worth clarifying the definition of some values necessary for the concept of this method. The set of many arcs and nodes that are directed from node to node is called a graph. The tree itself isgraph with a set of specific properties:

- each node can contain no more than one of the arcs;
- one of the nodes must be the root of the tree, that is, it must not include arcs at all;
- if you start moving along the arcs from the root, this process should allow you to get completely into any of the nodes.

There is also such a concept included in Huffman codes as a tree leaf. It represents a node from which no arc should come out. If two nodes are connected by an arc, then one of them is a parent, the other is a child, depending on which node the arc leaves and enters. If two nodes have the same parent node, they are called sibling nodes. If, in addition to leaves, the nodes have several arcs, then this tree is called binary. This is exactly what the Huffman tree is. A feature of the nodes of this construction is that the weight of each parent is equal to the sum of the weights of all its node children.

## Huffman tree construction algorithm

The construction of the Huffman code is done from the letters of the input alphabet. A list of those nodes that are free in the future code tree is formed. The weight of each node in this list must be the same as the probability of occurrence of the message letter corresponding to that node. At the same time, among several free nodes of the future tree, the one that weighs the least is selected. Moreover, if the minimum indicators are observed in several nodes, then you can freely choose any of the pairs.

After that, the parent node is created, which should weigh the same as the sum of this pair of nodes. After that, the parent is sent to the list with free nodes, and the children are removed. In this case, the arcs receive the corresponding indicators, ones and zeros. This process is repeated just enough to leave only one node. After that, binary digits are written from top to bottom.

## Improve compression efficiency

To improve compression efficiency, it is necessary to use all the data regarding the probability of occurrence of letters in a particular file attached to the tree during the construction of the code tree, and not allow them to be scattered over a large number of text documents. If you first go through this file, you can immediately calculate the statistics of how often the letters from the object to be compressed occur.

## Speed up the compression process

To speed up the algorithm, the definition of letters should be carried out not by the probability of occurrence of a particular letter, but by the frequency of its occurrence. Thanks to this, the algorithm becomes simpler, and work with it is significantly accelerated. It also avoids floating point and division operations.

Besides, working in this mode, the dynamic Huffman code, or rather the algorithm itself, is not subject to any changes. This is mainly due to the fact that the probabilities are directly proportional to the frequencies. It is worth paying special attention tothat the final weight of the file or the so-called root node will be equal to the sum of the number of letters in the object to be processed.

## Conclusion

Huffman codes are a simple and long-established algorithm that is still used by many well-known programs and companies. Its simplicity and comprehensibility allow you to achieve effective compression results for files of any size and significantly reduce their storage space. In other words, the Huffman algorithm is a long-studied and developed scheme, the relevance of which has not decreased to this day.

And thanks to the ability to reduce the size of files, transferring them over the network or in other ways becomes easier, faster and more convenient. Working with the algorithm, you can compress absolutely any information without harm to its structure and quality, but with the maximum effect of reducing the file weight. In other words, Huffman encoding has been and remains the most popular and current file size compression method.