Algorithms – Huffman encoding

huffman

Hello everyone, let’s talk about the Huffman encoding.

The algorithm was taken from here  and slightly modified to be more clear.
As this algorithm is fairly simple and Internet is full of explanation regarding it, I will not go in too much detail.

The algorithm is compressing the number of bits that are used for representing a string by assigning each letter a unique binary code, the one being more frequent receiving the smallest binary representation of that ASCII character.

First of all, we need to create a map of frequencies.
This is done by counting the number of times a character appear in the sentence we want to encode.

The frequency looks like this, for the string ‘hello’ ->
‘h’  : 1
‘e’  : 1
‘l’   : 2
‘o’  : 1

 

Secondly, we will create a priority tree, where:

  • The leafs are the actual characters, with their assigned frequency
  • The parent nodes will contain the frequency sum of the two child leafs assigned to it.

The tree is created by keeping a priority queue, containing at start all the characters and their given frequency, retrieved for a specific string.
Until the queue contain only one element (the root), we will take the two most frequent leafs/nodes and we will create a parent node that will contain them.
Once we have the tree created, we can generate the codes for the used characters, by iterating through each node until all the leafs were visited.
Starting from the root, all the left nodes will receive the value 0, while all the right nodes will receive the value 1. (or opposite way, just be consistent!).

Let’s see some code.

Generate Tree function:

  • Receives the frequency map
  • Returns the root node

The code generator will receive the root node and will insert data into the HuffmanCodes structure
(which is a map with key = character, and value = it’s corresponding Huffman code)

The full code can be found in the github repository:

https://github.com/badearobert/Algorithms/tree/master/Huffman

Leave a comment

Your email address will not be published.