Huffman coding is a lossless data compression algorithm in which each character in the data is assigned a variable-length prefix code. The least frequent character gets the longest code and the most frequent one gets the shortest code. Encoding data with this technique is simple and efficient. Decoding the resulting bitstream is less convenient, however: a decoder (or decompressor) needs to know the encoding that was used in order to map the encoded data back to the original characters. Information about the encoding therefore has to be passed to the decoder along with the encoded data, as a table of characters and their corresponding codes (the codebook). In regular Huffman coding of large data, this table takes up a lot of memory, and if the data contains a large number of unique characters, the size of the compressed (encoded) data increases because of the codebook. Canonical Huffman codes were introduced to make decoding computationally efficient while still maintaining a good compression ratio.
In canonical Huffman coding, only the bit lengths of the standard Huffman codes generated for each symbol are used. The symbols are first sorted by bit length in non-decreasing order, and symbols with the same bit length are then sorted lexicographically. The first symbol gets a code consisting entirely of zeros, with the same length as its original bit length. For each subsequent symbol, if its bit length equals that of the previous symbol, the previous symbol's code is incremented by one and assigned to the current symbol. Otherwise, if its bit length is greater than that of the previous symbol, the previous symbol's code is incremented by one and zeros are then appended until the length equals the bit length of the current symbol; the result is assigned to the current symbol.
This process continues for the rest of the symbols.
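The assignment rule described above can be sketched directly from a table of bit lengths. This is a minimal sketch rather than a definitive implementation: the class name `CanonicalCodeSketch` and the methods `assign` and `toBits` are hypothetical names chosen for illustration, and the sketch assumes the bit lengths come from a valid Huffman code and are short enough to fit in an `int`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the name is illustrative only.
class CanonicalCodeSketch {

    // Assigns a canonical code to every symbol, given only its bit length.
    // Assumption: the lengths form a valid Huffman code and fit in an int.
    static Map<Character, String> assign(Map<Character, Integer> bitLengths) {
        // Sort by bit length first, then by symbol.
        List<Map.Entry<Character, Integer>> symbols =
            new ArrayList<>(bitLengths.entrySet());
        symbols.sort((x, y) -> {
            int byLength = Integer.compare(x.getValue(), y.getValue());
            return byLength != 0
                ? byLength
                : Character.compare(x.getKey(), y.getKey());
        });

        Map<Character, String> codes = new LinkedHashMap<>();
        int code = 0;
        int prevLength = 0;
        boolean first = true;
        for (Map.Entry<Character, Integer> e : symbols) {
            int length = e.getValue();
            if (!first) {
                // Increment the previous code, then append zeros
                // (left shift) if the bit length has grown.
                code = (code + 1) << (length - prevLength);
            }
            codes.put(e.getKey(), toBits(code, length));
            prevLength = length;
            first = false;
        }
        return codes;
    }

    // Binary string of the code, left-padded with zeros to the given length.
    static String toBits(int code, int length) {
        StringBuilder sb = new StringBuilder(Integer.toBinaryString(code));
        while (sb.length() < length) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Integer> lengths = new TreeMap<>();
        lengths.put('a', 2);
        lengths.put('b', 3);
        lengths.put('c', 1);
        lengths.put('d', 3);
        System.out.println(assign(lengths)); // {c=0, a=10, b=110, d=111}
    }
}
```

The first symbol keeps the all-zero code because the increment-and-shift step is skipped for it; every later symbol derives its code from the previous one alone, so no tree is needed at this stage.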
The following example illustrates the process:
Consider the following data:
Character | Frequency |
---|---|
a | 10 |
b | 1 |
c | 15 |
d | 7 |
Standard Huffman codes generated, with their bit lengths:
Character | Huffman Codes | Bit lengths |
---|---|---|
a | 11 | 2 |
b | 100 | 3 |
c | 0 | 1 |
d | 101 | 3 |
Step 1: Sort the data by bit length, and then sort the symbols lexicographically within each bit length.
Character | Bit lengths |
---|---|
c | 1 |
a | 2 |
b | 3 |
d | 3 |
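Step 2: Assign the first symbol a code consisting of as many '0's as its bit length.
Code for 'c': 0
The next symbol 'a' has bit length 2 > bit length 1 of the previous symbol 'c'. Increment the code of the previous symbol by 1, append (2-1)=1 zero, and assign the code to 'a'.
Code for 'a': 10
The next symbol 'b' has bit length 3 > bit length 2 of the previous symbol 'a'. Increment the code of the previous symbol by 1, append (3-2)=1 zero, and assign the code to 'b'.
Code for 'b': 110
The next symbol 'd' has bit length 3 = bit length 3 of the previous symbol 'b'. Increment the code of the previous symbol by 1 and assign it to 'd'.
Code for 'd': 111
Step 3: Final result.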
Character | Canonical Huffman Codes |
---|---|
c | 0 |
a | 10 |
b | 110 |
d | 111 |
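The basic advantage of this method is that the encoding information passed to the decoder can be made more compact and memory efficient. For example, one can simply pass the bit lengths of the characters or symbols to the decoder. The canonical codes can easily be regenerated from the lengths because they are sequential.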
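To illustrate the decoder side, the sketch below rebuilds the canonical codes from the bit lengths alone and uses them to decode a bit string. It is a sketch under stated assumptions, not a definitive decoder: the class name `CanonicalDecodeSketch` and the method `decode` are hypothetical, the bitstream is represented as a `String` of '0' and '1' characters purely for readability, and the lengths are assumed to come from a valid Huffman code. Because codes of the same length are consecutive, the decoder only needs, for each length, the first code and the number of symbols of that length.

```java
import java.util.Arrays;

// Hypothetical decoder sketch; names are illustrative only.
class CanonicalDecodeSketch {

    // symbols and lengths are parallel arrays; bits holds '0'/'1' characters.
    // Assumption: the lengths describe a valid Huffman code.
    static String decode(char[] symbols, int[] lengths, String bits) {
        int n = symbols.length;

        // Canonical order: by bit length, then by symbol.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (x, y) -> lengths[x] != lengths[y]
            ? Integer.compare(lengths[x], lengths[y])
            : Character.compare(symbols[x], symbols[y]));

        int maxLen = 0;
        for (int l : lengths) {
            maxLen = Math.max(maxLen, l);
        }

        // count[l] = number of symbols whose code has l bits.
        int[] count = new int[maxLen + 1];
        for (int l : lengths) {
            count[l]++;
        }

        // firstCode[l]  = canonical code of the first symbol with l bits.
        // firstIndex[l] = position of that symbol in the canonical order.
        int[] firstCode = new int[maxLen + 1];
        int[] firstIndex = new int[maxLen + 1];
        int code = 0;
        int index = 0;
        for (int l = 1; l <= maxLen; l++) {
            code = (code + count[l - 1]) << 1;
            firstCode[l] = code;
            firstIndex[l] = index;
            index += count[l];
        }

        StringBuilder out = new StringBuilder();
        int value = 0;
        int bitsRead = 0;
        for (int i = 0; i < bits.length(); i++) {
            value = (value << 1) | (bits.charAt(i) - '0');
            bitsRead++;
            if (bitsRead > maxLen) {
                throw new IllegalArgumentException("invalid bit sequence");
            }
            // Codes of one length are consecutive, so a range check suffices.
            int offset = value - firstCode[bitsRead];
            if (offset >= 0 && offset < count[bitsRead]) {
                out.append(symbols[order[firstIndex[bitsRead] + offset]]);
                value = 0;
                bitsRead = 0;
            }
        }
        // Trailing bits that do not complete a code are ignored.
        return out.toString();
    }

    public static void main(String[] args) {
        char[] symbols = { 'a', 'b', 'c', 'd' };
        int[] lengths = { 2, 3, 1, 3 };
        // 0 | 10 | 110 | 111 | 0
        System.out.println(decode(symbols, lengths, "0101101110")); // cabdc
    }
}
```

Only the two small arrays indexed by bit length are consulted while decoding, which is why transmitting the lengths is enough and no explicit codebook of bit patterns has to be stored.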
Generating Huffman codes using a Huffman tree is covered in the standard description of Huffman coding.
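Approach: A simple and efficient approach is to generate a Huffman tree for the data and use a data structure similar to Java's TreeMap to store the symbols and bit lengths, so that the information always stays sorted. The canonical codes can then be obtained using increment and bitwise left-shift operations.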
Java
// Java program for canonical Huffman encoding
import java.io.*;
import java.util.*;

// Node of the Huffman tree
class Node {
    int data; // frequency (sum of child frequencies for internal nodes)
    char c;   // symbol; '-' for internal nodes
    Node left;
    Node right;
}

// Comparator that orders nodes by their frequency (data)
class Pq_compare implements Comparator<Node> {
    public int compare(Node a, Node b)
    {
        return Integer.compare(a.data, b.data);
    }
}

class Canonical_Huffman {
    // TreeMap from code length (sorted keys) to the
    // sorted set of characters that have that length
    static TreeMap<Integer, TreeSet<Character>> data;

    // Constructor to initialize the TreeMap
    public Canonical_Huffman()
    {
        data = new TreeMap<Integer, TreeSet<Character>>();
    }

    // Recursive function to collect the code length
    // (leaf depth) of every symbol in the Huffman tree
    static void code_gen(Node root, int code_length)
    {
        if (root == null)
            return;

        // Base case: a node without children is a leaf
        if (root.left == null && root.right == null) {
            // Create the set for this length if it is not
            // present yet, then add the node's character
            data.putIfAbsent(code_length, new TreeSet<Character>());
            data.get(code_length).add(root.c);
            return;
        }

        // The depth grows by 1 when going left or right
        code_gen(root.left, code_length + 1);
        code_gen(root.right, code_length + 1);
    }

    static void testCanonicalHC(int n, char chararr[], int freq[])
    {
        // Min-priority queue (min-heap) ordered by frequency
        PriorityQueue<Node> q
            = new PriorityQueue<Node>(n, new Pq_compare());

        for (int i = 0; i < n; i++) {
            // Create a leaf node and add it to the queue
            Node node = new Node();
            node.c = chararr[i];
            node.data = freq[i];
            node.left = null;
            node.right = null;
            q.add(node);
        }

        // Root of the Huffman tree
        Node root = null;

        // Repeatedly extract the two nodes with minimum
        // frequency and merge them until one node remains
        while (q.size() > 1) {
            // First minimum
            Node x = q.poll();
            // Second minimum
            Node y = q.poll();

            // New internal node whose frequency is the
            // sum of the frequencies of the two nodes
            Node nodeobj = new Node();
            nodeobj.data = x.data + y.data;
            nodeobj.c = '-';
            // First extracted node becomes the left child
            nodeobj.left = x;
            // Second extracted node becomes the right child
            nodeobj.right = y;
            // The most recently merged node is the root
            root = nodeobj;
            // Add the merged node back to the queue
            q.add(nodeobj);
        }

        // Create a Canonical_Huffman object (initializes data)
        Canonical_Huffman obj = new Canonical_Huffman();

        // Collect the code lengths by traversing the tree
        code_gen(root, 0);

        // Array of the code lengths in increasing order
        Integer[] arr = data.keySet().toArray(new Integer[0]);

        // The first canonical code is 0
        int c_code = 0, curr_len = 0, next_len = 0;

        for (int i = 0; i < arr.length; i++) {
            Iterator<Character> it = data.get(arr[i]).iterator();
            // Code length of the current group of characters
            curr_len = arr[i];

            while (it.hasNext()) {
                // Display the canonical code, left-padded with
                // zeros to curr_len bits (toBinaryString alone
                // would drop leading zeros)
                String bits = Integer.toBinaryString(c_code);
                while (bits.length() < curr_len)
                    bits = "0" + bits;
                System.out.println(it.next() + ":" + bits);

                // If more characters of this length remain, or
                // this is the last length, the next code has the
                // same length; otherwise it has the next length
                if (it.hasNext() || i == arr.length - 1)
                    next_len = curr_len;
                else
                    next_len = arr[i + 1];

                // Generate the canonical code of the next
                // character: increment, then left-shift by
                // the difference in code lengths
                c_code = (c_code + 1) << (next_len - curr_len);
            }
        }
    }

    // Driver code
    public static void main(String args[]) throws IOException
    {
        int n = 4;
        char[] chararr = { 'a', 'b', 'c', 'd' };
        int[] freq = { 10, 1, 15, 7 };
        testCanonicalHC(n, chararr, freq);
    }
}
Output:
c:0
a:10
b:110
d:111