📅  Last modified: 2023-12-03 15:07:22.292000             🧑  Author: Mango
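Images, text, video, and other files are routinely compressed to save storage space and speed up transmission. Many algorithms can be used for file compression; this article introduces two popular ones: Huffman coding and Lempel-Ziv compression.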
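Huffman coding is a frequency-based compression algorithm: frequent characters get shorter codes and rare characters get longer ones. The core of the algorithm is building the Huffman tree. First, count how often each character occurs. Then repeatedly merge the two lowest-frequency nodes into a new parent whose frequency is their sum, until a single tree remains. The root of that tree stands for the whole character set, and each leaf is one character. Walking from the root to a leaf, with 0 for a left branch and 1 for a right branch, spells out that character's Huffman code. Since characters only sit at leaves, no code is a prefix of another, so the bit stream can be decoded unambiguously. And because frequent characters get shorter codes, the encoded output is usually smaller than a fixed-length encoding.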
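To make the size saving concrete, here is a minimal sketch that computes only the code lengths for a sample string and compares the total with a fixed 8 bits per character. The helper name `huffman_code_lengths` and the sample text are assumptions made for illustration; the sketch ignores the cost of storing the code table.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict:
    """Return {character: code length in bits} for a Huffman code of `text`."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {ch: 1 for ch in freq}
    # Heap entries: (weight, tie_breaker, {char: depth so far}).
    # The unique tie_breaker keeps heapq from ever comparing the dicts.
    heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "abracadabra"
freq = Counter(text)
lengths = huffman_code_lengths(text)
huffman_bits = sum(freq[ch] * lengths[ch] for ch in freq)
fixed_bits = 8 * len(text)
print(lengths)
print(f"Huffman: {huffman_bits} bits, fixed 8-bit: {fixed_bits} bits")
```

For this input the Huffman total comes to 23 bits against 88 bits for the fixed-length encoding, with the most frequent character `a` receiving a 1-bit code.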
Example code:
import heapq
from typing import Dict, Tuple, Union


def get_freq(text: str) -> Dict[str, int]:
    """
    Count how many times each character occurs in the text.
    """
    freq: Dict[str, int] = {}
    for ch in text:
        freq[ch] = freq.get(ch, 0) + 1
    return freq
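

def build_huffman_tree(freq: Dict[str, int]) -> Tuple[Union[str, Tuple], Dict[str, str]]:
    """
    Build the Huffman tree and return it together with each character's code.
    """
    if not freq:
        raise ValueError("cannot build a Huffman tree from an empty frequency table")
    # Each heap entry is (frequency, tie_breaker, node). The unique tie_breaker
    # stops heapq from comparing a str leaf with a tuple subtree when two
    # frequencies are equal, which would raise a TypeError.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lowest-frequency nodes into one parent node.
        freq1, _, left = heapq.heappop(heap)
        freq2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (freq1 + freq2, next_id, (left, right)))
        next_id += 1
    _, _, tree = heap[0]
    code_map: Dict[str, str] = {}

    def dfs(node, code):
        if isinstance(node, tuple):
            dfs(node[0], code + '0')  # left branch
            dfs(node[1], code + '1')  # right branch
        else:
            # With a single distinct character the root is a leaf;
            # give it the 1-bit code '0' instead of an empty code.
            code_map[node] = code or '0'

    dfs(tree, '')
    return tree, code_map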
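

def encode(text: str, code_map: Dict[str, str]) -> str:
    """
    Encode the text as a string of '0'/'1' characters using the Huffman codes.
    """
    return ''.join(code_map[ch] for ch in text)


def decode(encoded: str, tree: Union[str, Tuple]) -> str:
    """
    Decode a Huffman bit string by walking the tree.
    """
    text = []
    node = tree
    for bit in encoded:
        # Follow the branch for this bit (0 = left, 1 = right).
        if isinstance(node, tuple):
            node = node[int(bit)]
        # Reaching a leaf yields one character; restart from the root.
        if isinstance(node, str):
            text.append(node)
            node = tree
    return ''.join(text)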
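

def compress(text: str) -> Tuple[str, Dict[str, str], Union[str, Tuple]]:
    """
    Compress the text. The tree is returned as well, because decompress needs it.
    """
    freq = get_freq(text)
    tree, code_map = build_huffman_tree(freq)
    encoded = encode(text, code_map)
    return encoded, code_map, tree


def decompress(encoded: str, code_map: Dict[str, str], tree: Union[str, Tuple]) -> str:
    """
    Decompress previously compressed text.
    Only the tree is needed for decoding; code_map is kept in the signature
    for symmetry with compress.
    """
    return decode(encoded, tree)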
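The `encode` function above returns a Python string of '0' and '1' characters, which spends a whole character on every bit. To actually save space, the bits have to be packed into bytes. The following is one possible sketch; the helper names `pack_bits` and `unpack_bits` and the one-byte padding header are assumptions of this example, not part of the original code.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes.

    The first output byte records how many padding bits were appended
    to fill the last byte, so the exact bit string can be recovered.
    """
    padding = (8 - len(bits) % 8) % 8
    padded = bits + "0" * padding
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([padding]) + body


def unpack_bits(data: bytes) -> str:
    """Inverse of pack_bits: recover the original bit string."""
    padding, body = data[0], data[1:]
    bits = "".join(f"{byte:08b}" for byte in body)
    # Drop the padding bits that were added at the end.
    return bits[:len(bits) - padding] if padding else bits


packed = pack_bits("101")
print(packed, unpack_bits(packed))
```

The packed bytes, together with the tree or frequency table, are what would be written to disk; the bit string can be passed to `decode` after unpacking.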
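Lempel-Ziv compression is a family of dictionary-based algorithms. The text is split into phrases, phrases that have already appeared are stored in a dictionary, and each phrase is then represented by an index into that dictionary, which is where the compression comes from. There are two main variants: LZ77 and LZ78. LZ77 searches a sliding window over recently seen text for repeated phrases, while LZ78 builds an explicit dictionary of phrases and looks up matches in it directly.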
Example code:
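def lz_compress(text: str) -> str:
    """
    Compress text with an LZ78-style dictionary scheme.

    The output is a run of (index, character) pairs packed into a string:
    chr(index) is the dictionary index of the longest phrase already seen
    (0 means the empty phrase), followed by the character that extends it.
    Packing indices with chr() is for illustration only; it limits the
    dictionary to 0x10FFFF entries and does not by itself shrink the data.
    """
    dic = {}   # phrase -> index, indices start at 1
    out = []
    w = ''     # longest known phrase matched so far
    for ch in text:
        if w + ch in dic:
            w += ch
        else:
            out.append(chr(dic.get(w, 0)) + ch)
            dic[w + ch] = len(dic) + 1
            w = ''
    if w:
        # The input ended inside a known phrase: emit its index alone.
        out.append(chr(dic[w]))
    return ''.join(out)


def lz_decompress(compressed: str) -> str:
    """
    Decompress text produced by lz_compress.
    """
    dic = {0: ''}   # index -> phrase, rebuilt in the same order as the encoder
    parts = []
    for i in range(0, len(compressed), 2):
        prefix = dic[ord(compressed[i])]
        if i + 1 < len(compressed):
            phrase = prefix + compressed[i + 1]
            dic[len(dic)] = phrase
            parts.append(phrase)
        else:
            # Trailing index without a character (see lz_compress).
            parts.append(prefix)
    return ''.join(parts)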
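The example above takes the dictionary-based route. For comparison, the sliding-window idea behind LZ77 can be sketched as follows. This is a simplified illustration, not a production encoder: the function names, the default `window` size, the brute-force match search, and the list-of-triples output format are all assumptions of this sketch.

```python
from typing import List, Tuple


def lz77_compress(text: str, window: int = 32) -> List[Tuple[int, int, str]]:
    """Encode text as (offset, length, next_char) triples.

    offset/length point back into the last `window` characters;
    (0, 0, ch) means "no match, emit ch literally".
    """
    out = []
    i = 0
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one character early so a next_char is always available.
            while (i + length < len(text) - 1
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples: List[Tuple[int, int, str]]) -> str:
    """Rebuild the text from (offset, length, next_char) triples."""
    out: List[str] = []
    for off, length, ch in triples:
        start = len(out) - off
        # Copy one character at a time so overlapping matches work.
        for k in range(length):
            out.append(out[start + k])
        out.append(ch)
    return "".join(out)


triples = lz77_compress("abababab")
print(triples)
print(lz77_decompress(triples))
```

A match is allowed to overlap the position being encoded, which is why the decoder copies character by character; this lets a short repeating pattern be described by a single triple.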
That wraps up the examples of Huffman coding and Lempel-Ziv compression; readers can choose whichever suits their needs.