Entropy coding refers to a family of lossless compression schemes that attempt to achieve compression rates close to the entropy of the data. Entropy encoders are more efficient while encoding probable symbols at the expense of being less efficient when encoding unlikely symbols. Since no entropy coder can compress below the entropy of its model, the model used is to estimate symbol probabilities is of critical importance.
Rather than using a static model that makes a global estimate of probabilities, adaptive coders leverage a model that returns symbol distributions as a function of some window of nearby data. Adaptive coding can make entropy encoders more efficient in terms of both computation and compression rates. Moreover, as context dependent predictive models continue to improve, adaptive coding will too.
This is an implementation based project focused on lossless compression of natural language text documents. Embedded within is a presentation and analysis of core ideas in adaptive entropy encoding. Since it is particularly amenable to adaptive models, Arithmetic coding is the heart of the project. We implement an Arithmetic coder that treats the model as a black box and is capable of achieving compression rates very close to entropy. To plug into the encoder, we implement ngrams count based models as well as more powerful LSTMs using Keras.
Before moving into the Arithmetic coding discussion, it will be important to develop ideas and build models. For these reasons we begin with Huffman encoding, a much simpler alternative to Arithmetic encoding that can also be made adaptive. Our implementation of Huffman encoding also gives us a richer context within which to analyze the performance of our Arithmetic encoder.
The focus of this project is not to achieve the best compression rates or computational efficiency. Rather, the goal is to express fundamental ideas in python code and experiment with our understanding of those ideas. To this end I have sought to not reference others' implementations and instead prefered to explore by writing my own code whenever possible. In particular, I have not seen any other Huffman implementations and only referenced pseudocode for the base of my Arithmetic implementation. I have commented any occasions where I did reference code or pseudocode with credit to the source.
My hope is that anyone who has a basic understanding of how Huffman and Arithmetic coding work will be able to read this as a self contained work and deepen their knowledge of how to implement these algorithms. I also hope to convey in concrete terms how well these algorithms work in practice, and weigh the pros and cons.
import sys
from collections import defaultdict, Counter
import heapq
from bitstring import BitArray
from copy import copy
import pandas as pd
import time
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import random
from functools import lru_cache
import requests
import math
import itertools
# for plotting style
pd.set_option('display.max_rows', 10)
plt.rcParams["font.size"] = 22
plt.rcParams["figure.figsize"] = (12,6)
plt.rcParams["ytick.labelsize"] = 19
plt.rcParams["xtick.labelsize"] = 19
plt.rcParams["lines.linewidth"] = 2
plt.rcParams["axes.titlesize"] = 26
We download books from Project Gutenberg and use them as example data throughout the project. We also create CHARS
which will act as our symbol universe throughout.
def download_book(url):
return requests.get(url).content.decode("utf-8")
URLS = {
'a_christmas_carol': 'https://www.gutenberg.org/files/24022/24022-0.txt',
'great_expectations': 'https://www.gutenberg.org/files/1400/1400-0.txt',
'a_tale_of_two_cities': 'https://www.gutenberg.org/files/98/98-0.txt',
'leviathan': 'http://www.gutenberg.org/cache/epub/3207/pg3207.txt',
'war_and_peace': 'https://www.gutenberg.org/files/2600/2600-0.txt',
'walden': 'https://www.gutenberg.org/files/205/205-0.txt',
'peter_pan': 'https://www.gutenberg.org/files/16/16-0.txt',
'the_war_of_the_worlds': 'https://www.gutenberg.org/files/36/36-0.txt'
}
# a dictionary holding all the above texts, keyed by title
TEXTS = dict((title, download_book(url)) for title,url in URLS.items())
# the universe of symbols
CHARS = set().union(*[set(text) for text in TEXTS.values()])
print('Length of CHARS: {}'.format(len(CHARS)))
Here we will get a sense of how long the books are. Our compression algorithms will work at the character level, so we use character as the unit of length. Also relevant is the entropy of each book under some baseline model. The model we use with which to do this computes the probability of a character as its percent frequency in the text.
Note that with about 128 characters, the most naive encoding would assign a codeword of length $\log(128)=7$ to each character. By later comparing our models and compression rates to this count based entropy, we can get a less naive comparison point.
# for showing data in pandas table
def show_table(names, rows, d=3):
Round = lambda x,d: round(x,d) if type(x)!=str else x
data = {names[i]:[Round(row[i],d) for row in rows] for i in range(len(names))}
display(pd.DataFrame(data).set_index(names[0]))
# get the entropy of a text assuming a naive count based probability model
# this will be a useful point of comparison to evaluate our compression rates
def count_entropy(text):
counts = Counter(text)
probs = dict((ch,cnt/len(text)) for ch,cnt in counts.items())
return -sum([np.log2(probs[ch]) for ch in text])
names = ['title', 'length', 'count entropy / char', 'probability of \'e\'']
rows = [[title, len(text), count_entropy(text)/len(text), text.count('e')/len(text)]
for title,text in TEXTS.items()]
show_table(names, rows)
We see that the frequency based entropy is already much lower than that of a uniform model. This motivates us to start with a count based model when we begin compressing these books.
# distribution of top symbols
to_show = 35
all_texts = ''.join([text for text in TEXTS.values()])
freqs = sorted([(ch, all_texts.count(ch) / len(all_texts)) for ch in CHARS],
reverse=1, key=lambda x: x[1])
top_freqs = dict(freqs[:to_show])
fix, ax = plt.subplots(1,1, figsize=(14,4))
ax.bar(*zip(*top_freqs.items()), color='b', width=.45, alpha=.5, label='encoding')
ax.set_ylabel('Frequency', fontsize=19)
ax.set_title('Frequency of Characters in All Books');
As expected ' '
is the most frequenct character, followed by e
. Characters become quite infrequent after the top 35 or so. We will be able to take advantage of such a distribution while compressing.
Designed by David Huffman while a student at MIT, Huffman encoding is among the simplest of entropy encoders. By beginning with Huffman, we will have the opportunity to experiment with some fundamental ideas and models before progressing to trickier Arithmetic algorithms.
Huffman encoding assigns to each symbol a binary code which is not the prefix of any other code. Because the codes are prefix-free, decoding a concatenated string of codewords is unambiguous. Thus the code for the whole message can just be this concatenated string. Moreover, the expected length of a Huffman codeword over the symbol distribution is optimal among all schemes that map symbols to distinct codewords. As we'll later discuss, Huffman encoding can get very close to entropy, but the discrete nature of it does have a potentially problematic cost if probabilities cannot be well approximated with powers of 2.
A set of prefix-free codewords can be mapped to a binary tree in which each path from root to leaf maps to a distinct codeword and associated symbol. Huffman's innovation was to build this tree bottom up. The algorithm is to initialize each symbol and corresponding probability as a node. We then take the two nodes with smallest probabilities and combine them into a single node with probability equal to the sum of probabilities. Inside of this node is now a tree with branches for each component symbol; however, from the global scope of the algorithm, this interior structure is abstracted away. We can then recurse. On each iteration we decrease the number of nodes by 1 until we've combined all the nodes into a single tree.
# get codewords for each character present in the counter
def get_codewords(counts):
# codewords will be a bit string for each character
# we start with each codeword an empty string and build according to Huffman's scheme
codewords = defaultdict(str)
# sort counts into heap for easy access of the two smallest
# initially each of these counts represents a leaf in the huffman tree
counts = sorted([(cnt, [ch]) for ch,cnt in counts.items()])
# until all nodes are merged, grab two smallest count nodes and merge their counts and chars
while len(counts) > 1:
cnt1, chars1 = heapq.heappop(counts)
cnt2, chars2 = heapq.heappop(counts)
# after merging these char sets all future bits added to their
# strings will be the same. The following creates the fork in the tree.
for ch in chars1: codewords[ch] += '0'
for ch in chars2: codewords[ch] += '1'
heapq.heappush(counts, (cnt1+cnt2, chars1+chars2))
# we reverse all codewords since we started from the leaves and worked backwards
codewords = dict((ch, codeword[::-1]) for ch,codeword in codewords.items())
return codewords
The above code follows the exact plan we stated above, except that we use counts as a substitute for probabilities. As mentioned, our first model is count based. Because of the sorting and heappops we get codewords in $n\log n$ time, where $n$ is the number of symbols.
text = TEXTS['a_christmas_carol']
# we want codewords for all characters in our universe so that we can potentially
# encode any of our books with the same codewords.
# we append the characters 'A Christmas Carol' is missing to it.
missing_chars = CHARS - set(text)
counts = Counter(text + ''.join(missing_chars))
codewords = get_codewords(counts)
# show codewords
cols = ['character','count','codeword']
rows = [(repr(ch), cnt, codewords[ch]) for ch,cnt in sorted(list(counts.items()), reverse=1, key=lambda x:x[1])]
show_table(cols, rows)
As planned, we can see that the most frequent characters have the shortest codewords. Also note that importantly no codeword is the prefix of any other.
With codewords in hand, we'll trying encoding our first text, A Christmas Carol. The encoding now only needs to build the code by concatenating the codewords for each character in the text. This runs in time linear in the length of the text.
# encoding is now straightforward and lightning fast
def encode(text, codewords):
return ''.join([codewords[ch] for ch in text])
code = encode(text, codewords)
print('Code: {}...'.format(code[:50]))
print('Compressed text length: {} bits, or {:.3} bits/character'
.format(len(code), len(code)/len(text_train)))
print('Entropy of text under count based model is {:.3} bits/character'
.format(count_entropy(text)/len(text)))
Considering UTF-8 encodes at a rate of at least 8 bits per character, we're doing OK even with this context oblivious counts model. We can also see that we've managed to get very close to the entropy of our counts model, as Huffman is expected to do.
The decoding is slightly more involved. The simplest idea is perhaps to invert the character to codewords dictionary. Then, as we read each bit of the code, we can query this dictionary as to whether we have yet read in a codeword that has a match. Once we do, we can append the corresponding character to our output and start fresh reading the next bit. This would be very easy to code and runs in linear time, although there is the expense of doing a hash read at each bit.
Alternatively, and perhaps more instructive and faster for decoding, we opt to actually make the Huffman tree. Now decoding becomes an act of walking down the tree as we read in the code bits. When we reach a leaf, we can append the character at that leaf to the output, and start fresh again at the top of the tree. This saves us from having to do any hashing while decoding, and as we can see in the code, each step of the decoder is computationally very cheap. The downside is that to build the tree takes time $n\log n$ where $n$ is the number of characters.
Our tree building algorithm is recursive. We first separate all the codewords by whether they start with 0 or 1. Then we recursively build the left tree from those starting with 0 and the right tree from those starting with 1. Finally, we output a root with those trees as its left and right children.
# define node for the huffman tree
class node:
def __init__(self, ch=None, left=None, right=None):
self.ch = ch
self.left = left
self.right = right
# build huffman tree recursively from codewords
# the huffman tree let's us efficiently decode
# we can walk down the tree to reach chars at the leaves
def get_huffman_tree(codewords):
# if there's only 1 codeword, we've reached a leaf and just return that node
if len(codewords) == 1:
return node(codewords.popitem()[0])
# otherwise the tree is a root with two huffman tree children
# the left child's leaves will be chars whose codewords start with a 0
# the right child's leaves will be chars whose codewords start with a 1
left = get_huffman_tree({k:v[1:] for k,v in codewords.items() if v[0]=='0'})
right = get_huffman_tree({k:v[1:] for k,v in codewords.items() if v[0]=='1'})
return node(left=left, right=right)
# to decode, we just keep walking down the huffman tree as we read the code
def decode(code, tree):
output, cur = '', tree
for bit in code:
# walk down the current tree path
cur = cur.left if bit=='0' else cur.right
# if we reach a leaf, save that char to output and restart at the root
if cur.ch:
output += cur.ch
cur = tree
return output
tree = get_huffman_tree(codewords)
output = decode(code, tree)
print('First 50 chars of text:', text[:50])
print('First 50 chars of output:', output[:50])
print('Decode matches original text: {}'.format(text == output))
And we have successfully decoded A Christmas Carol! It may still be somewhat unpleasing that our code was a Python string, where each bit really takes up much more than 1 bit of space. Thus we provide a quick demo of how to use the Python bitstring
library to actually output a code file that takes up only a number of bits equal to the length of the code.
# file write demo, just to show we can actually write a file with that many bits
def write(code, filename):
compressed = BitArray(bin=code)
with open(filename, 'wb') as f:
compressed.tofile(f)
def read(file):
return BitArray(filename=file)
write(code, 'text_compressed')
output = decode(str(read('text_compressed').bin)[:len(code)], tree)
print('Decode matches original text: {}'.format(text_train == output))
Since we cannot expect Huffman encoding to beat the entropy of the underlying model, to get a more efficient encoding we need a better model. Thus, we will now develop an adaptive model that will serve us for both Huffman and Arithmetic encoding.
One of the most natural ideas is to use an Ngram based model, and this is in fact what we will do. More concretely, we pick a context length $n$. Then to compute the probability of a character ch
, we will look at the $n$ characters before ch
and call this string context
or ngram
. The probability is then the number of times ch
appears after ngram
in the text divded by the number of times ngram
appears. To ensure that no model probabilities ever go to 0, we add 1 to all counts. We also need to take care to still be able to predict in the case of unseen context. For this we use a default mapping of probabilities to characters based on global frequencies.
# create an ngrams model that returns a probability of next char for a given context
def counts2probs(counts):
total = sum(counts.values())
return dict((ch, cnt/total) for ch,cnt in counts.items())
# base class to be used for Huffman and Arithmetic encoding
class Ngrams_Model:
def __init__(self, n, chars):
self.n = n # context length
self.chars = chars
self.probs = None
self.default_probs = None
def train(self, text):
n = self.n
# get count for each next char for each observed context
# start each count at 1 so as not to have any zero probabilities
context_cnts = defaultdict(lambda:{ch:1 for ch in self.chars})
for i in range(n, len(text)):
context_cnts[text[i-n:i]][text[i]] += 1
# rescale counts to probabilities
# this will generalize the model so we can also apply it to arithmetic coding
context_probs = dict((ngram, counts2probs(cnts))
for ngram, cnts in context_cnts.items())
self.probs = context_probs
# we use default probs in case of unseen context
self.default_probs = dict((ch, (text.count(ch)+1)/len(text))
for ch in self.chars)
def predict(self, context):
return self.probs.get(context) or self.default_probs
ngrams_model = Ngrams_Model(n=3, chars=CHARS)
ngrams_model.train(TEXTS['a_christmas_carol'])
We can now check to see how our model is behaving.
def get_top_preds(model, context, k):
preds = [(ch,p) for ch,p in model.predict(context).items()]
preds.sort(reverse=1, key=lambda x:x[1])
return preds[:k]
col_names = ['next char pred', 'prob']
top_preds = get_top_preds(ngrams_model, 'the', 3)
show_table(col_names, [(repr(ch),p) for ch,p in top_preds])
Indeed our model seems to give us reasonable predictions for this context. Let's evaluate it further by looking at its accuracy, confidence and entropy.
def evaluate(model, text):
n = model.n
relax = 3
correctness, relax_correctness, confidence = [], [], []
for start in range(len(text) - n - 1):
context = text[start:start + n]
top_preds = get_top_preds(model, context, relax)
confidence.append(top_preds[0][1])
correctness.append(text[start+n] == top_preds[0][0])
relax_correctness.append(text[start+n] in [pred[0] for pred in top_preds])
return np.mean(correctness), np.mean(relax_correctness), np.mean(confidence)
def adaptive_entropy(model, text):
Sum = 0
for i,ch in enumerate(text):
context = text[i-model.n:i] if i>=model.n else ''
p = model.predict(context)[ch]
Sum += -np.log2(p)
return Sum/len(text)
test_title = 'a_christmas_carol'
col_names = ['test title', 'accuracy', 'relaxed accuracy', 'mean confidence', 'entropy / char']
metrics = evaluate(ngrams_model, TEXTS[test_title])
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(ngrams_model, TEXTS[test_title])]])
test_title = 'walden'
metrics = evaluate(ngrams_model, TEXTS[test_title])
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(ngrams_model, TEXTS[test_title])]])
As we can see above, our model has about a 50% chance of predicting the next character. For comparison, our global frequency model would always predict ' '
as the most likely character, and would only be correct around 17% of the time. As expected, accuracy
is better for A Christmas Carol
because this is the text we trained the model on.
Note that relaxed accuracy
is the probability that one of the model's top 3 predictions for the next character is correct, and mean confidence
refers to the probability the model assigns to its top next character prediction.
The most revealing dimension of these results is the entropy / char
. We've managed to push these numbers down by about 1 bit from what they were under the global count model. This will surely result in better compression.
To conclude our ngrams model inspection, let's use the model to generate language. To do this we start with a prompt, sample a character from the model's character distribution, append this character to the prompt and keep iterating.
# For sampling from model preds. If we generate language with
# the top pred we always would get the same repetitive result
def sample(preds):
chars = list(preds.keys())
probs = [preds[ch] for ch in chars]
probs = [v/sum(probs) for v in probs] # make sure probs are normalized
# take multinomial sample
sampled_idx = np.argmax(np.random.multinomial(1, probs, 1))
return chars[sampled_idx]
# For using a model to generate language
def generate_text(model, prompt):
n = model.n
# just keep generating the next char from the current context
for _ in range(100):
preds = model.predict(prompt[-n:])
prompt += sample(preds)
prompt = prompt.replace('\n', ' ').replace('\r', ' ')
return 'Generated language: ' + prompt
prompt = 'The queen is'
print(generate_text(ngrams_model, prompt))
By using models to generate language, we can get a more intuitive sense of how well our models understand text. Apparently our ngrams model is still relatively weak. It's predictions are often far from what a human would anticipate. This is at least in part due to the fact that we set $n=3$. Perhaps with a greater context length our model would be stronger, and we will certainly try $n>3$.
Recall that for Huffman, we need not just character probabilities, but also codewords, and in our implementation, trees. Here we add methods to the ngrams base class to provide these features. Now to prepare a model for use, we must train it and call these methods to create the underlying data structures.
# to write this code, we can rely on the functions we wrote earlier
# namely get_codewords() and get_huffman_trees()
class Ngrams_Huffman_Model(Ngrams_Model):
def make_context_codewords(self):
if not self.probs or not self.default_probs:
print('Error: model must be trained before making context codewords')
self.context_codewords = dict((ngram, get_codewords(probs))
for ngram,probs in self.probs.items())
self.default_codewords = get_codewords(self.default_probs)
def make_context_trees(self):
if not self.context_codewords or not self.default_codewords:
print('Error: model must make context codewords before trees')
self.context_trees = dict((ngram, get_huffman_tree(codewords))
for ngram,codewords in self.context_codewords.items())
self.default_tree = get_huffman_tree(self.default_codewords)
# all_dickens will just be a concatenation of the Dicken's texts
TEXTS['all_dickens'] = TEXTS['great_expectations'] + \
TEXTS['a_christmas_carol'] + \
TEXTS['a_tale_of_two_cities']
text_train = TEXTS['all_dickens']
max_n = 6
# we record training times and space required
ngram_huffman_models = dict()
training_times = dict()
model_sizes = dict()
for n in range(1, max_n + 1):
start = time.perf_counter()
model = Ngrams_Huffman_Model(n, CHARS)
model.train(text_train)
model.make_context_codewords()
model.make_context_trees()
training_times[n] = time.perf_counter() - start
sizes[n] = sys.getsizeof(model.context_codewords) + sys.getsizeof(model.context_trees)
sizes[n] /= 1000
ngram_huffman_models[n] = model
def plot_dict(data, title, xlabel, ylabel, ax=None):
if not ax:
fig, ax = plt.subplots(1,1)
ax.plot(*zip(*data.items()), color='k')
ax.set_ylabel(ylabel)
ax.set_xlabel(xlabel)
ax.set_title(title)
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(training_times, 'Time to Build Huffman Models', 'context length', 'second', ax[0])
plot_dict(sizes, 'Model Space Requirements', 'context length', 'kilobytes', ax[1])
As we increase the context length, training gets more expensive in terms of both time and space. This may or may not be acceptable. If we intend to train 1 model and use it to compress many texts, then the training resources become negligable. However, if we wish to imporve compression rates by training models for each text we compress greater context lengths become prohibitive. In fact, even just the model space can exceed the space of the original text, defeating our whole purpose.
We must not make slight modifications to our Huffman encoding and decoding algorithms. For each, we need to get the underlying datastructures based on the current context. Importantly, because the context is only based on previous characters, the decoder will see all the same exact contexts that the encoder did, making faithful decoding possible.
Note that we need not worry about the context before the first character because of our default codewords and trees.
def encode(text, model):
code = ''
for i,ch in enumerate(text):
# get context if there is enough of it to get
context = text[i-model.n:i] if i>=model.n else ''
# get codewords for this context
codewords = model.context_codewords.get(context) or model.default_codewords
code += codewords[ch]
return code
def decode(code, model):
output = ''
# start with default tree
cur = model.default_tree
for bit in code:
cur = cur.left if bit=='0' else cur.right
if cur.ch:
output += cur.ch
context = output[-model.n:] if i>=model.n else ''
# get tree for this context
cur = model.context_trees.get(context) or model.default_tree
return output
We now test adaptive Huffman encoding for all the ngrams models we've trained (n from 1 to 6). For each test we time encoding and decoding, as well as collect other performance metrics. All the heavy lifting is done by functions we've already created.
def test_adaptive_huffman(test_text, model):
# code
start = time.perf_counter()
code = encode(test_text, model)
code_time = time.perf_counter() - start
bits_per_char = len(code) / len(test_text)
# decode
start = time.perf_counter()
decoded = decode(code, model)
decode_time = time.perf_counter() - start
# see if output matches the original
decode_success = 'yes' if decoded == test_text else 'no'
entropy = adaptive_entropy(model, test_text)
return (model.n, bits_per_char, entropy, code_time, decode_time, decode_success)
ADAPTIVE_HUFFMAN_RESULTS = dict()
def run_huffman_tests(test_title, models, show=True):
test_text = TEXTS[test_title]
max_n = max(models.keys())
rows = [test_adaptive_huffman(test_text, models[n]) for n in range(1, max_n + 1)]
col_names = ['context_length', 'coding bits / char', 'entropy / char',
'code time', 'decode time', 'decoded matches original']
ADAPTIVE_HUFFMAN_RESULTS[test_title] = {'col_names':col_names, 'rows':rows}
if show: show_table(col_names, rows)
run_huffman_tests(test_title='all_dickens', models=ngram_huffman_models)
run_huffman_tests(test_title='leviathan', models=ngram_huffman_models)
run_huffman_tests(test_title='war_and_peace', models=ngram_huffman_models)
run_huffman_tests(test_title='peter_pan', models=ngram_huffman_models)
We have successfuly encoded and decoded all the texts we tested, even War and Peace
, which is more than 3 million characters long. Our code lengths are all very close to the entropy of the models, and we see substantial improvement compared to the non-adaptive Huffman encoding. Coding and decoding time increases with $n$ because the our data structures are growing quickly with $n$. For $n<4$ we have reasonable coding and decoding times, each at most a few seconds even for millions of characters.
Finally, we will run two more tests without tables, and then look at the code lengths as a function of $n$ for each text.
run_huffman_tests(test_title='the_war_of_the_worlds', models=ngram_huffman_models, show=False)
run_huffman_tests(test_title='walden', models=ngram_huffman_models, show=False)
fix, ax = plt.subplots(1,1, figsize=(14,8))
for title, data in ADAPTIVE_HUFFMAN_RESULTS.items():
x = [row[0] for row in data['rows']]
y = [row[1] for row in data['rows']]
ax.plot(x,y,label=title.replace('_',' '), lw=3, alpha=.8)
i += 1
ax.set_xlabel('context length')
ax.set_ylabel('code bits / char')
ax.set_title('Adaptive Huffman Results for Ngram Models')
ax.legend();
As expected performance is best for the all Dicken's concatenation because this is what we trained the models on. For all texts $n=3$ appears to be ideal. Beyond that our model starts to overfit and we need to rely on the default codewords too often because we will be faced with unseen contexts.
We do not pay too much of a price for using a model trained on one text for another. In fact, Dicken's seems to be a pretty good proxy for all of these other books. This is encourging because using a single model means we do not have to train and store a model for each encoding job.
To improve Huffman's performance, we could consider a technique called 'blocking', which considers symbols in chunks. For example, if we used a word level model, our encoder could theoretically code a very likely next word as a single bit. By contrast working at the character level forced us to encode each word with a number of bits at least equal to its length. However, since these texts are not just words each separated by a single space, a word level model would introduce new problems.
This concludes our work with Huffman encoding. With an understanding of the basics and a working ngrams model base class, we now turn to attempting to improve compression rates and overhead with Arithmetic encoding.
An Arithmetic encoder will code a document as a number between 0 and 1. We will see exactly how this works with the code below, but at a high level consider the following toy example:
Suppose we have just 10 characters, a
through j
and a uniform distribution model. If we want to encode the word 'hi' we can deal with the characters in turn. For the h
, since it's the 8th of our characters, we restrict our final output to be in $[.7, .8)$. This way, no matter what else we do, at decoding time we will see that our final code is in the 70-80th percentile interval between 0 and 1, and hence we can infer that the first character of the message was h
. Now to encode our next character i
, we note that i
is our 9th character. Therefor we pick a final code in $[.78, .79)$, say .785. When decoding, after recognizing the first character as h
, we'll see that the code .785 is in the 80-90th percentile of the $[.7, .8)$ interval that h
restricted us to. This is precisely the information we need to realize the encoder must have seen i
as the second character. Assuming infinite precision arithmetic, we can continue like this for any message length.
One can imagine how easily this can extend to arbitrary probability distributions. The characters that are very likely will 'own' large intervals that will require only few bits to represent, which is what we want. By contrast, the less probable characters will 'own' smaller intervals, which will in turn require more bits of precision to represent in the final output code, as we should expect.
The beautiful thing about Arithmetic coding is that we no longer have codewords or trees to create. For this reason, Arithmetic coding lends itself much more naturally to adaptive models than Huffman coding does. Also, as a consequence of not using codewords, we are not forced to assign integer bit lengths to each character. Consider the benefit we get from this additional freedom. For example, if we had characters a
,b
and c
with equal probabilities, with Huffman the best we could do is length 1
,2
,2
codewords. Intuitively, breaking the symmetry seems like a waste and we should be better off if only we could assign each of these characters a codeword of length between 1 and 2. By comparison, Arithmetic encoding would have no trouble assigning each of these characters a 1/3 interval.
Also, as we mentioned earlier, Huffman must assign a codeword of at least 1 bit to each character. For an example of how wasteful this can be, suppose that we only have two characters, a
and b
with probabilities .999 and .001. Huffman must resign to assigning the codewords 0 and 1 and reaps no benefit from the very low binary entropy of the distribution. In our context, to encode a word, no matter how probable, a character level Huffman encoder must use a number of bits at least the length of the word. Arithmetic encoding does not have this wasteful restriction.
In fact, with Arithmetic encoding we are able to approach within a few bits of the entropy of the entire message, no matter how long the message is. We will further discuss this fact in chapter 6. Also, as we will see, we can encode and decode in time linear in the message length. These facts together make a very strong case for Arithmetic coding.
Assuming for a moment we have infinite precision, programming an Arithmetic encoder is quite easy. We start by giving each character in our alphabet an interval of length proportional to its probability.
During encoding, for each character we reduce the scope of our current interval as discussed above. Finally, we just output a code in the middle of our final interval.
To decode, to determine each output character, we iterate through all possible candidates and see which one is consistent with the current code. Then we rescale the code to be its percentile in the last interval we've determined we're in. We use an end of file symbol so that the decoder can determine when to stop.
See the references section for a more complete discussion of Arithmetic coding fundamentals.
# End of file symbol. Importantly this does not appear in any of our texts
EOF = '•'
chars = 'abcdefghijklmnopqrstuvwxyz!' + EOF # simple char set for this example
# create a model with a uniform distribution over these chars
width = 1/len(chars)
# the intervals are really all the model needs to provide
intervals = dict((ch, (width*i, width*(i+1))) for i,ch in enumerate(chars))
# example text to try encoding and decoding
text = 'peekaboo!' + EOF
# low and high tracks our current interval
# initially the text can encode to anywhere on the 0 to 1 interval
# encoding is an iterative process of narrowing this interval
def encode(text, chars):
low, high = 0, 1
for ch in text:
l,h = intervals[ch]
width = high-low
low, high = low + width*l, low + width*h
# output any number in the interval for simplicity
return (low + high)/2
# this function naively iterates over all chars to find the one that matches
# the interval found in the code for the current iteration
def decode_step(code, intervals):
for ch, interval in intervals.items():
if interval[0] <= code < interval[1]:
return ch, interval
# get the char that matches the current interval, then rescale the code
def decode(code, intervals):
output = ''
while True:
ch, interval = decode_step(code, intervals)
output += ch
width = interval[1] - interval[0]
code = (code - interval[0])/width
if ch == EOF: break
return output
code = encode(text, intervals)
output = decode(code, intervals)
print('Original text: {}\nCode: {}\nDecoded text: {}'.format(text, code, output))
We can see that our encoding and decoding worked; however, none of this was practical. Clearly if our original text was longer we would run out of precision with which to do the arithmetic and output the final code. Also note that our code is longer than the original text! We should be more careful as to what exactly we choose as the final output value. Moreover, we'll also need the code to be in binary.
Before moving on, let's be sure to add the end of file char we introduced to our global CHAR
set.
CHARS |= {EOF}
The first step to extending these ideas to finite precision is to note how binary numbers work after the decimal point. We can think of each bit after the decimal point as specifying which half of a possible interval we are in. Just as $.1$ in decimal is $1/10$ and $.01$ is $1/10^2$, $.1$ in binary is $1/2$ and $.01$ is $1/2^2$. So .101 in binary tells us that we're in the second half of 0 to 1, then the first half of .5 to 1, and then the second half of that, or .625.
For the sake of practicality, to express numbers between 0 and 1 in binary with finite precision, we'll drop the decimal point from all of our calculations. Instead, we'll pick an amount of precision $p$ and refer to $2**p$ as whole
. We then simply think of whole
as being 1 in our units. If we choose $p=64$, with 64 bits we can express any number between 0 and $2^{64}-1$, which can in turn be mapped to numbers between 0 and 1 with a precision on the order of $1/2^{64}$.
The core to our finite precision encoder is the rescale
function below. Note that as we encode, we narrow the interval we're in. To continue to calculate at narrower and narrower intervals, we don't actually need to be concerned with how we got to the current range; that's already encoded in our code output so far. Instead, all we need to do is to keep adding bits to the code to specify which half of the current interval we need to go to next. We then rescale that half interval to the new whole.
The tricky part arises when the next interval straddles 1/2 and the interval is narrow enough that we must rescale to get more precision. For example, if the current character during encoding is telling us, after all rescaling, that our next interval is from .4 to .6, we cannot yet know whether our next code bit should be a 1 or a 0. In fact, the next code bit can only be determined from looking at more characters of the text. In this case, we actually wait to output more code bits, and instead start appending bits to saved_bits
. Rather than scale one or another half of the interval to the new whole, we rescale .25 to .75 as the new whole. Recall that our interval of interest, .4 to .6, is entirely contained in this middle half. In a pathological case, which is sure to arise when decoding a text even just thousands of characters long, we will have to rescale the middle multiple times before we can safely determine that our interval of interest lies entirely on one side or the other. In these cases, the length of saved_bits
will effectively track how many of these middle rescalings we do.
For a more thorough explanation of finite precision Arithmetic coding, see Mark Nelson's Data Compression with Arithmetic Coding (2014) url: https://marknelson.us/posts/.
Although the python code below and modifications are my own, I credit the pseudocode used as a base to Mathematicalmonk's YouTube Arithmetic Coding playlist: https://youtu.be/ouYV3rBtrTI. This is also a great resource for a more detailed explanation than is in the scope of this project.
# a and b represent the ends of the current interval
# pseudocode from reference (5)
def rescale(a, b, saved_bits, code, half, quarter):
while b < half or a > half:
# we can rescale
if b < half:
# interval of interest is in first half, output 0 and saved bits, then rescale
code += '0' + '1'*saved_bits
saved_bits, a, b = 0, 2*a, 2*b
# interval of interest is in second half, output 1 and saved bits, then rescale
elif a > half:
code += '1' + '0'*saved_bits
saved_bits, a, b = 0, 2*(a-half), 2*(b-half)
# interval of interest straddles the middle, blow up the middle and save bits
while a > quarter and b < 3*quarter:
saved_bits, a, b = saved_bits+1, 2*(a-quarter), 2*(b-quarter)
return a , b, saved_bits, code
# given the current interval [a,b) and the new interval to narrow to [c/r,d/r)
# this function returns the new a and b after narrowing
# r represents 1 whole in the units of precision we use for c and d, and will be further discussed
def narrow(a,b,width,c,d,r):
b = a + (width*d)//r
a = a + (width*c)//r
return a,b
For the encoder we progress straight to assuming an adaptive model. We treat the model as a black box, but it must have a property n
which tells us how much previous context to feed it. The model must also support the method get_intervals
, which takes as input a context and returns intervals. Each interval corresponds to the probability of the character that interval is assigned to. Since we wish to do all arithmetic with integers, the space over which we divide these character intervals is over the integer r
. For example, if a character has a probability of 1/10 and r is 500, we may assign that character the interval $[100,150)$. It is the responsibility of the model to produce these intervals from its underlying probability distribution and to decide on an appropriate precision for r.
# we move straight to adaptive models for this Arithmetic encoder
# hence the first step on each character is to get the previous context
# pseudocode from reference (5)
def encode(text, model, precision):
whole, half, quarter = 2**precision, 2**(precision-1), 2**(precision-2)
code = ''
a, b, saved_bits = 0, whole, 0
for i,ch in enumerate(text):
# get current context
context = text[i-model.n:i] if i >= model.n else ''
# get model prediction
intervals, r = model.get_intervals(context)
# get interval for current character being encoded
c, d = intervals[model.char2idx[ch]]
# narrow and rescale
a, b = narrow(a, b, b-a, c, d, r)
a, b, saved_bits, code = rescale(a, b, saved_bits, code, half, quarter)
# finish
saved_bits += 1
return code + ('0' + '1'*saved_bits if a <= quarter else '1' + '0'*saved_bits)
Assuming some fixed model prediction time, it should be fairly clear that our encoder runs in time linear in the length of the text. We now breifly discuss why the expected length of the encoding will be very close to the entropy of the text under the model. We will keep this discussion to a very high level as it is mostly out of our scope here.
Suppose that we're encoding a text with probability $p$. Roughly speaking, by the time we rescale until the interval that represents the text is at least half the size of our entire interval of interest, we can output a number within the interval of the text with just 1 or 2 more bits. Now, note that every time we rescale, the interval representing the text blows up by a factor of 2. Thus the number of rescalings we will need will be $\log(1/p)$. Since our code is within a few bits of the number of rescalings, our code is of length roughly ceil($\log(1/p)$). Taken over all possible texts $t$ we could have seen, our expected code length becomes $\sum_t p_t \log(1/p_t)$, which is just the entropy of the text.
Of critical importance, the decoder does the same arithmetic that the encoder did. It rescales in exactly the same way, and it rounds in the narrowing function in exactly the same way. This control we have when working with integer arithmetic is the reason why these algorithms work for arbitrarily long texts.
These helper functions are analogous to what we had for the encoder. z
is the current chunk of code we're trying to decode, within the precision we're working in. The binary search is an addition helper function that I introduced to speed up decoding. Recall from the infinite precision example that the decoder must search through the characters to find the one that is associated with the current interval specified by the code. With our 126 characters, binary search takes this process from 126 steps down to about 7 steps. The net effect from the binary search is to speed up the whole decoder by a factor of about 3.
# pseudocode from reference (5)
def decoder_half_rescale(a,b,z,i,code,half):
if b < half:
a*=2; b*=2; z*=2
elif a > half:
a,b,z = 2*(a-half), 2*(b-half), 2*(z-half)
i += 1
if i < len(code) and code[i] == '1':
z += 1
return a, b, z, i
# pseudocode from reference (5)
def decoder_quarter_rescale(a, b, z, i, code, quarter):
a,b,z = 2*(a-quarter), 2*(b-quarter), 2*(z-quarter)
i += 1
if i < len(code) and code[i] == '1':
z += 1
return a, b, z, i
# to speed up the search for the character that fits the current interval
# specified by z, the chunk of code currently considered
def binary_search(a, b, z, intervals, r):
low, high = 0, len(intervals)-1
while low != high:
mid = (low + high) // 2
c,d = intervals[mid]
A, B = narrow(a,b,b-a,c,d,r)
if z < B: # must be lower than mid
high = mid
else: # must be at least mid
low = mid+1
return low
# decoding is made more straightforward with the helper functions
# importantly we must use the same precision we used for the encoding
# to output each character, we feed the model the current context to get intervals,
# binary search for the character/ interval that fits the code
# and then rescale a and b in exactly the same way we did while coding
# when we hit EOF, we return
# pseudocode from reference (5)
def decode(code, model, precision):
whole, half, quarter = 2**precision, 2**(precision-1), 2**(precision-2)
i = precision-1
z = int(code[:i+1].ljust(precision, '0'),2)
output = ''
a, b = 0, whole
while 1:
context = output[-model.n:] if len(output)>=model.n else ''
intervals, r = model.get_intervals(context)
I = binary_search(a, b, z, intervals, r)
ch = model.chars[I]
c,d = intervals[I]
a,b = narrow(a,b,b-a,c,d,r)
output += ch
if ch == EOF: return output
while b < half or a > half:
a, b, z, i = decoder_half_rescale(a,b,z,i,code, half)
while a > quarter and b < 3*quarter:
a, b, z, i = decoder_quarter_rescale(a, b, z, i, code, quarter)
Note that assuming a constant number of characters and a fixed time for the model to compute the intervals, the decoder runs in time linear in the output.
Recall that our arithmetic encoder treats the model as a black box; however, the model must support the get_intervals
method. Given a context, get_intervals
returns the interval it assigns to each character. These intervals are put into a list, where the interval at each index corresponds to the character at that same index in the model.chars list. To implement this functionality below, we take the probabilities our ngrams base model already has, and essentially blow up the space by a factor of r
. Each unnormalized probability is then estimated as the rounded r
times probs[ch]
, or 1 if this is zero.
For example, if we had probabilities .2, .5 and .3 and if r
=100, then we would output model intervals as the list of tuples [(0,20), (20,70), (70,100)]. We also are sure to make the model return default intervals in the case of a context unseen during training. To produce the default intervals, we use the base class' default_probs
.
class Ngrams_Arithmetic_Model(Ngrams_Model):
def _make_intervals(self, probs, r):
unnormalized_probs = [(int(r*probs[ch]) or 1) for ch in self.chars]
r_true = sum(unnormalized_probs)
intervals, start = [], 0
for prob in unnormalized_probs:
intervals.append((start, start + prob))
start += prob
return intervals, r_true
def make_context_intervals(self, r):
if not self.probs or not self.default_probs:
print('Error: model must be trained before making context intervals')
self.chars = sorted(list(self.chars))
self.char2idx = dict((ch,i) for i,ch in enumerate(self.chars))
default_intervals, default_r = self._make_intervals(self.default_probs, r)
self.intervals = defaultdict(lambda:default_intervals)
self.r = defaultdict(lambda:default_r)
for context, probs in self.probs.items():
intervals, r_true = self._make_intervals(probs, r)
self.intervals[context] = intervals
self.r[context] = r_true
def get_intervals(self, context):
return self.intervals[context], self.r[context]
Let's train a simple model and see how the choice of r
influences the length of an encoding. For this example, we will set the context length to 3 and encode A Christmas Carol
. We are not yet concerned with decoding, but will soon rigorously test our algorithms.
Recall that r
controls how precisely the model can express probabilities. If r
=500 for example, the model must estimate all probabilities to the closest 1/500th.
model = Ngrams_Arithmetic_Model(n=3, chars=CHARS)
model.train(TEXTS['a_christmas_carol'])
bits_per_char = dict()
for r in [0,50,100,200,400,800]:
model.make_context_intervals(r=r)
# note that we must now allows append EOF to our texts
# here we use an encoder precision of 64 bits, going to 32 or 128 makes little difference
code = encode(TEXTS['a_christmas_carol'] + EOF, model, 64)
bits_per_char[r] = len(code)/len(TEXTS['a_christmas_carol'] + EOF)
plot_dict(bits_per_char, 'The Effect of R', 'r', 'bits per character')
Note that the r
we input into make_context_intervals
is not exactly the true r
the model will have. The true r
will take account of rounding the model needed to do in making the intervals, including rounding all unnormalized probabilities less than 1 up to 1. Thus, when r
=0, we essentially force the model to give uniform intervals, and with 126 characters, an encoding of 7 bits per character is expected.
As we increase r
our compression rate generally improves. Once r
is at least 400, the effect of more precision drops off. As per this analysis, we set the default r
in our models to be 500 and leave it at that.
Similar to what we did for Huffman encoding, we will train ngram Arithmetic models for a context length from 1 to 6. If we intend to train a model everytime we encode a text, the training performance is of great significance, so we analyze it and compare to what we saw with the Huffman models. Again we train on the concatenation of all the Dicken's texts.
max_n = 6
text_train = TEXTS['all_dickens'] + EOF
ngram_arithmetic_models = dict()
training_times = dict()
model_sizes = dict()
for n in range(1, max_n + 1):
start = time.perf_counter()
model = Ngrams_Arithmetic_Model(n=n, chars=CHARS)
model.train(text_train)
model.make_context_intervals(r=500)
training_times[n] = time.perf_counter() - start
sizes[n] = sys.getsizeof(model.intervals)
sizes[n] /= 1000
ngram_arithmetic_models[n] = model
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(training_times, 'Time to Build Arithmetic Models', 'context length', 'second', ax[0])
plot_dict(sizes, 'Model Space Requirements', 'context length', 'kilobytes', ax[1])
For point of comparison, the Huffman ngrams model for n=6 took about 600 seconds to train. For high $n$ we are now training about 20 times faster. The great savings in training cost comes because we don't need to make codewords or trees, instead, now we must only make intervals, which are cheap operations. This is also reflected in the model space for n=6 coming down by a factor of 2.
Note that for short contexts, training a model for each encoding becomes practical. This is potentially useful as we've already seen that ngram models provide lower entropies for texts they were trained on.
It's now time to test all the Arithmetic encoding tools we've developed. We will choose a number of texts to encode and decode from our books set. We will time the encoding and decoding computations, and also check the entropies of the models and that the decoding worked faithfully.
# simple test script to time encoding, decoding and check correctness
def test_arithmetic(test_text, model):
start = time.perf_counter()
code = encode(test_text, model, precision=64)
code_time = time.perf_counter() - start
bits_per_char = len(code) / len(test_text)
start = time.perf_counter()
output = decode(code, model, precision=64)
decode_time = time.perf_counter() - start
entropy = adaptive_entropy(model, test_text)
success = 'yes' if test_text == output else 'no'
return model.n, bits_per_char, entropy, code_time, decode_time, success
# function to run test on a dictionary of models
# here we'll use our ngrams models that we just trained
def run_arithmetic_tests(test_title, models, results_dict, show=True):
test_text = TEXTS[test_title] + EOF
max_n = max(models.keys())
min_n = min(models.keys())
rows = [test_arithmetic(test_text, models[n]) for n in range(min_n, max_n + 1)]
col_names = ['context_length', 'coding bits / char', 'entropy / char',
'code time', 'decode time', 'decoded matches original']
results_dict[test_title] = {'col_names':col_names, 'rows':rows}
if show: show_table(col_names, rows)
# place to store results
results_bucket = dict()
run_arithmetic_tests('a_christmas_carol', ngram_arithmetic_models, results_bucket)
run_arithmetic_tests('walden', ngram_arithmetic_models, results_bucket)
run_arithmetic_tests('peter_pan', ngram_arithmetic_models, results_bucket)
run_arithmetic_tests('the_war_of_the_worlds', ngram_arithmetic_models, results_bucket)
Unfortunately our compression rates are about the same as they were for Huffman encoding with the ngrams models. However, both encoders are compressing at rates very close to entropy, so there is little room for improvement. In fact, we can see that our Arithmetic encoder even occasionally compresses below entropy. The reason for this is that whereas we compute the entropy with the model's probabilities, we compute the compression with the model's intervals. Recall that when we create the intervals we round off probabilities. Also, when we compute the compression, we further round when dividing intervals by r
. This rounding occasionally has enough of a serendipitous effect to push the compression rate below the raw model probability entropy.
One serious issue with our implementation is the code and decode times. Although we can train models much faster for the Arithmetic encoder, the encoding and decoding times are far slower than for Huffman encoding. Recall that for our Huffman implementation, once the model is trained, encoding is just a process of hashing to get the codeword for each character. By comparison, although it also runs in linear time, we have far more overhead for our Arithmetic encoder. Recall that our Arithmetic encoder must narrow and rescale the interval for each character in the text. Even though these are still just a matter of simple arithmetic with 64 bit precision, they are far more computation than just hashing for codewords. Likewise, we've introduced the same arithmetic in our decoder, in addition to a binary search for the right character on each output iteration. By comparison the Huffman decoder only needed to walk down a tree approximately $\log(CHARS)$ length to decode each character.
There are a number a things we could do to speed up our implementation. First, we could use binary arithmetic explicitly and do the rescaling operations with bit shifting. Also, as suggest by reference (1) and (3) we could allow for more rounding and less precision for certain operations. These improvements are outside of the scope of this project.
fix, ax = plt.subplots(1,1, figsize=(14,8))
for title, data in results_bucket.items():
x = [row[0] for row in data['rows']]
y = [row[1] for row in data['rows']]
ax.plot(x,y,label=title.replace('_',' '), lw=3, alpha=.8)
i += 1
ax.set_xlabel('context length')
ax.set_ylabel('code bits / char')
ax.set_title('Adaptive Arithmetic Results for Ngram Models')
ax.legend();
We see a similar plot of compression performance as a function of n
as we did for Adaptive Huffman encoding. Larger context lengths don't work with ngrams models because while decoding the current context becomes increasingly unlikely to have been seen during training. Recall that in those cases the model must default to global counts to infer probabilities.
One way of taking advantage of larger contexts, as suggested by reference (3), is to train a model for all context lengths up to n
. Then, when we come upon a novel context of lenth n
during compression, we default to intervals for n
=n
-1 rather than for n
=0. We can continue to decrease n
until our current context was seen during model training. In this way we can take great advantage of larger contexts when they are predictive. For example, suppose the word 'summer' appears in a text we're encoding. An n
=6 context would be able to inform us that almost certainly the r
comes after the e
. We would also have a very low entropy distribution for characters after the r
, peaked at ' '
. On the other hand, n
=6 will likely be poor for proper nouns unseen during training, but n
=2 or n
=1 would still be helpful here to pick up on spelling conventions in english, such as u
comes after q
.
Our Arithmetic encoder would be able to take better advantage of these peaked distributions for greater n
than Huffman encoding because it need not compromise itself to approximating probabilities with powers of 2.
We will now go on to explore an idea potentially more interesting and modern than using variable length n
. We will develop and LSTM model and plug it into our Arithmetic encoder.
Developed in the 1990's, and LSTM is a recurrent neural network for making predictions from sequence inputs. Today LSTMs are some of the most powerful langauge models and are used by many companies for text predictions. They are also well known for their ability to generate langauge.
We use keras to build a simple character LSTM model.
# should ideally give the model 1M char text to train on
class LSTM_Model:
def __init__(self, chars, n=20, units=128, lr=.01):
self.chars = sorted(list(chars))
self.char2idx = dict((c, i) for i, c in enumerate(self.chars))
self.idx2char = self.chars
self.n = n
self.model = self._compile_model(n, self.chars, units, lr)
# code from reference (9)
def _compile_model(self, n, chars, units, lr):
model = keras.Sequential()
model.add(keras.Input(shape=(n, len(chars))))
model.add(layers.LSTM(units))
model.add(layers.Dense(len(chars), activation="softmax"))
optimizer = keras.optimizers.RMSprop(learning_rate=lr)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
return model
# code from reference (9)
# we use the training text to generate context and next char pairs
# the contexts play the role of x and next chars the role of y
# x and y are really one-hot-encoded binary vectors that represent
# characters by index
def _get_x_y(self, text):
n = self.n
chars = self.chars
step = 3
contexts = [text[i : i + n] for i in range(0, len(text) - n, step)]
next_chars = [text[i + n] for i in range(0, len(text) - n, step)]
x = np.zeros((len(contexts), n, len(chars)), dtype=np.bool)
y = np.zeros((len(contexts), len(chars)), dtype=np.bool)
for i, context in enumerate(contexts):
for t, char in enumerate(context):
x[i, t, self.char2idx[char]] = 1
y[i, self.char2idx[next_chars[i]]] = 1
return x, y
def train(self, train_text, epochs):
x, y = self._get_x_y(train_text)
self.model.fit(x, y, batch_size=128, epochs=epochs)
# code from reference (9)
# here we build X as the binary vector representation of the context
# then we feed X to the model.predict function provided by keras
def _predict(self, context):
X = np.zeros((1, self.n, len(self.chars)))
for t, ch in enumerate(context):
X[0, t, self.char2idx[ch]] = 1.0
return self.model.predict(X, verbose=0)[0]
def predict(self, context):
probs = self._predict(context)
return dict((self.chars[i], prob) for i,prob in enumerate(probs))
# getting intervals is similar to what we've done earlier
# we use lru_cache so that the model does not have to recompute intervals
# for contexts it has already been asked to predict on
@lru_cache()
def get_intervals(self, context, r=500):
padded_context = context[-self.n:].rjust(self.n, ' ')
preds = self._predict(padded_context)
unnormalized_probs = [int(r*prob) or 1 for prob in preds]
start = 0
intervals = []
r_true = sum(unnormalized_probs)
for i, prob in enumerate(unnormalized_probs):
intervals.append((start, start + prob))
start += prob
return intervals, r_true
We use n
=7 for model training. Note that for shorter contexts input to the model will simply be padded with ' '. One of the key benefits of the LSTM model is that now we need not worry about unseen contexts; we can predict on them all the same. This allows us to use a larger n
than was practical for the ngrams models.
# train on 'a christmas carol' for 20 epochs
lstm_model = LSTM_Model(CHARS, n=7)
epochs = 20
text_train = TEXTS['a_christmas_carol']
# test text is a slice to speed up model evaluation times during training
TEXTS['a_christmas_carol_slice'] = TEXTS['a_christmas_carol'][5000:10000]
text_test = TEXTS['a_christmas_carol_slice']
acc_hist = dict()
entropy_hist = dict()
for e in range(1, epochs+1):
lstm_model.train(text_train, epochs=1)
# evaluate model every 4 training epochs
if not e%4:
acc_hist[e] = evaluate(lstm_model, text_test)[0]
entropy_hist[e] = adaptive_entropy(lstm_model, text_test)
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(acc_hist, 'LSTM Accuracy', 'Training Epochs', 'Accuracy', ax[0])
plot_dict(entropy_hist, 'LSTM Entropy', 'Training Epochs', 'Entropy', ax[1])
We observe that model accuracy increases during training, while entropy of the text under the model stays quite low after the first few epochs.
LSTM_models = {lstm_model.n: lstm_model}
run_arithmetic_tests('a_christmas_carol_slice', LSTM_models)
We have successfully used our LSTM model to encode and decode a slice of a christmas carol
. With the more powerful model, we've managed to push both our entropy and compression rate to near 2 bits per character. Note that this is a savings of about .4 bits per character from our best earlier compression rate on this text.
One serious problem, however, is that this compression rate savings came at an unacceptable cost in compute time. It took us about a minute to encode and another minute to decode even this small slice of the text. Earlier we could have encoded and decoded a text of this length in a small fraction of a second.
We will continue by evaluating the affect of n
on LSTM performance.
We will now train LSTMs for different values of n
and understand the affect of context length on accuracy and entropy.
acc_hist = dict()
entropy_hist = dict()
for n in [1,3,7,10,15]:
model = LSTM_Model(CHARS, n=n)
model.train(text_train, epochs=15)
acc_hist[n] = evaluate(model, text_test)[0]
entropy_hist[n] = adaptive_entropy(model, text_test)
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(acc_hist, 'LSTM Accuracy', 'Context Length', 'Accuracy', ax[0])
plot_dict(entropy_hist, 'LSTM Entropy', 'Context Length', 'Entropy', ax[1])
We can see that our LSTM architecture is too simple to take advantage of very long context lengths. However, we are able to do better than we did with the ngrams models, where we could not take advantage of context lengths more than about 3.
We'll take a context length of 7 to be about ideal and train a final LSTM model.
final_lstm_model = LSTM_Model(CHARS, n=7)
# again we'll train on just all of the Dicken's books
final_lstm_model.train(TEXTS['all_dickens']+EOF, epochs=20)
# since our LSTM arithmetic encoding is very slow, we'll create slices
# of all texts to use for all tests
s1, s2 = 5000, 10000
for title in list(TEXTS.keys()):
TEXTS[title + '_slice'] = TEXTS[title][s1:s2]
test_title = 'a_christmas_carol_slice'
col_names = ['test title', 'accuracy', 'relaxed accuracy', 'mean confidence', 'entropy / char']
test_text = TEXTS[test_title]
metrics = evaluate(final_lstm_model, test_text)
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(final_lstm_model, test_text)]])
test_title = 'the_war_of_the_worlds_slice'
test_text = TEXTS[test_title]
metrics = evaluate(final_lstm_model, test_text)
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(final_lstm_model, test_text)]])
Recall that our ngrams models has accuracies around .55 and .45 for text in the training and text outside respectively. We also see an improvement in our entropy rates, improving them by a few tenths from what we saw from our ngrams models.
prompt = 'The queen is'
print(generate_text(final_lstm_model, prompt))
More intuitively speaking, we can see from out LSTM's generated text that the model is far more coherent than our ngrams models were. It can spell fairly well, and understands the lengths of words and some punctuation expected in english.
LSTM_models = {final_lstm_model.n: final_lstm_model} # just 1 model to test
results_dict = dict()
titles_to_test = ['a_christmas_carol', 'great_expectations', 'a_tale_of_two_cities',
'leviathan', 'war_and_peace', 'walden', 'peter_pan', 'the_war_of_the_worlds']
# run tests
for title in titles_to_test:
run_arithmetic_tests(title + '_slice', LSTM_models, results_dict, show=False)
# show test results
col_names = ['title'] + results_dict['a_christmas_carol_slice']['col_names']
rows = [[title] + list(d['rows'][0]) for title, d in results_dict.items()]
show_table(col_names, rows)
All of our encoding and decodings were successful, although again, the time to encode and decode even small slices of text is prohibitive for the LSTM model. Compression rates are close to entropy, but not as close as they were earlier. Perhaps we should have used a greater r
in this final more powerful model.
fix, ax = plt.subplots(1,1, figsize=(14,8))
code_lengths = dict((row[0], row[2]) for row in rows)
ax.bar(*zip(*code_lengths.items()), color='tan', width=.35, alpha=.3, label='encoding', align='edge')
entropies = dict((row[0], row[3]) for row in rows)
ax.bar(*zip(*entropies.items()), color='orange', width=.35, alpha=.3, label='entropy')
plt.xticks(rotation=90)
ax.set_ylabel('bits / char')
ax.set_title('Arithmetic Results for LSTM Model')
ax.legend();
Finally we will compare our best compression rates with Huffman encoding to our best with Arithmetic encoding. Note that this is not just a comparison of Huffman vs Arithmetic, but also takes into account the extra work we put into the modeling for our Arithmetic encoder.
# get average compression rate for the arithmetic encoder on Dickens texts
code_lengths['all_dickens_slice'] = np.mean([code_lengths['a_christmas_carol_slice'],
code_lengths['great_expectations_slice'],
code_lengths['a_tale_of_two_cities_slice']])
code_lengths_renamed = dict((title[:-6], length) for title,length in code_lengths.items())
for title in ['a_christmas_carol', 'great_expectations', 'a_tale_of_two_cities']:
del code_lengths_renamed[title]
# get the best of earlier Huffman compression rates
best_huffman_results = dict()
for title in ADAPTIVE_HUFFMAN_RESULTS.keys():
rows = ADAPTIVE_HUFFMAN_RESULTS[title]['rows']
min_length = min(row[1] for row in rows)
best_huffman_results[title] = min_length
# plot comparison
fix, ax = plt.subplots(1,1, figsize=(14,8))
ax.bar(*zip(*code_lengths_renamed.items()), color='tan', width=.35, alpha=.3, label='Arithmetic LSTM', align='edge')
ax.bar(*zip(*best_huffman_results.items()), color='orange', width=.35, alpha=.3, label='Huffman Ngrams')
plt.xticks(rotation=90)
ax.set_ylabel('bits / char')
ax.set_title('Arithmetic LSTM vs Huffman Ngrams Encoding Rates')
ax.legend();
We have a mixed bag of results. By our work with the Arithmetic encoder and the LSTM model we've managed to push down compression rates for Walden
, Peter Pan
, The War of the Worlds
and all of Dickens to the best we've seen yet. However, our performance now suffers for Leviathan
and War and Peace
. Perhaps the reason for this is that these titles are less similar to the Dickens we trained on than the rest of the titles. Note that the LSTM with context length 7 should be more sensitive to overfitting than an ngrams model with context length 3.
The goal of this project was not to achieve the very best compression rates or speeds, but rather to implement the main ideas of Adaptive Huffman and Arithmetic encoding and experiment. Throughout I have sought to keep this work self contained and readable by progressively building on ideas and code. It is my hope that anyone with a basic understanding of Huffman and Arithmetic coding will be able to follow my implementations and understand how they work in detail.
We have implemented Huffman and Arithmetic character level encoders for natural language text compression. Our encoders are capable of encoding at less than 3 bits per character. Considering we have nearly 128 characters in our alphabet, it is noteworthy that our encoders can specify characters with just 2 or 3 bits. In fact, this compression rate is better than we get from the standard zip algorithm built into MacOS, which compresses text at about 3-4 bits per character.
Our compression rates can be attributed to both our models and the fact that our encoders approach the entropy of the models. We trained ngrams context models as well as LSTMs, all capable of achieving text entropies considerably below those of a global count based model.
Our Huffman encoder and decoder both operated within a few seconds, even to code and decode War and Peace
, which is over 3 million characters long. And although our Arithmetic encoder is slower than our Huffman implementation, this is a well known drawback to Arithmetic encoding. We also note that our Arithmetic encoder models can train much faster than those for Huffman.
There are several improvements we can make to speed up our Arithmetic encoder, namely using explicit bit-shift arithmetic and more sophisticated rounding. We may also look to ANS encoding as a promising alternative (reference 8). While Arithmetic encoding uses two values to represent an interval, ANS uses a single value to represent the current state. This simplifies the required arithmetic to encode and decode each character.
One of the key benefits of all our encoding algorithms is that they treat the model as a black box that need only implement one main method to return character distributions. As language models continue to rapidly improve in terms of both accuracy and speed, we'll be able to harness these models and employ the very same code developed here to improve our performance. This decoupling of the encoder from the model is powerful because it allows us to take advantage of the vast work being done in NLP.