COMSE 6998 Information Theory with Omri Weinstein

Final Project: Adaptive Entropy Encoding Implementation

Author: John Daciuk



Introduction

Entropy coding refers to a family of lossless compression schemes that attempt to achieve compression rates close to the entropy of the data. Entropy encoders spend fewer bits on probable symbols at the expense of spending more bits on unlikely symbols. Since no entropy coder can compress below the entropy of its model, the model used to estimate symbol probabilities is of critical importance.

Rather than using a static model that makes a single global estimate of probabilities, adaptive coders leverage a model that returns symbol distributions as a function of some window of nearby data. Adaptive coding can make entropy encoders more efficient in terms of both computation and compression rate. Moreover, as context-dependent predictive models continue to improve, adaptive coding will too.

This is an implementation-based project focused on lossless compression of natural language text documents. Embedded within is a presentation and analysis of core ideas in adaptive entropy encoding. Since it is particularly amenable to adaptive models, Arithmetic coding is the heart of the project. We implement an Arithmetic coder that treats the model as a black box and is capable of achieving compression rates very close to entropy. To plug into the encoder, we implement ngram count-based models as well as more powerful LSTMs using Keras.

Before moving into the Arithmetic coding discussion, it will be important to develop ideas and build models. For these reasons we begin with Huffman encoding, a much simpler alternative to Arithmetic encoding that can also be made adaptive. Our implementation of Huffman encoding also gives us a richer context within which to analyze the performance of our Arithmetic encoder.

The focus of this project is not to achieve the best compression rates or computational efficiency. Rather, the goal is to express fundamental ideas in Python code and experiment with our understanding of those ideas. To this end I have sought not to reference others' implementations, preferring instead to explore by writing my own code whenever possible. In particular, I did not consult any other Huffman implementations, and only referenced pseudocode for the base of my Arithmetic implementation. I have commented any occasions where I did reference code or pseudocode, with credit to the source.

My hope is that anyone who has a basic understanding of how Huffman and Arithmetic coding work will be able to read this as a self-contained work and deepen their knowledge of how to implement these algorithms. I also hope to convey in concrete terms how well these algorithms work in practice, and to weigh their pros and cons.

Questions to Address

  1. What are the problems with Huffman coding, and how does Arithmetic coding help?
  2. What are the implementation details required to make Arithmetic coding work in practice?
  3. How can we tweak Arithmetic coding to make it more efficient?
  4. How much can adaptive models improve compression rates?
  5. What modifications can we make to our algorithms to tailor to text based data?
  6. How might future advancements in NLP improve text compression?

Table of Contents

  1. Exploration of Books to Encode
  2. Huffman Encoding
  3. Ngrams Context Aware Model
  4. Adaptive Huffman Encoding with Ngrams Model
  5. Arithmetic Encoding Basics
  6. Finite Precision Arithmetic Encoding
  7. Finite Precision Arithmetic Decoding
  8. Adaptive Arithmetic Encoding with Ngrams Model
  9. LSTM Model
  10. Test Adaptive Arithmetic Encoding with LSTM Model
  11. Huffman vs Arithmetic Encoding
  12. Conclusion

References

  1. Paul Howard and Jeffrey Vitter. “Practical Implementations of Arithmetic Coding”. In: (1992).
  2. David A. Huffman. “A Method for the Construction of Minimum-Redundancy Codes”. In: Proceedings of the I.R.E. 1952.
  3. Alistair Moffat, Radford M. Neal, and Ian H. Witten. “Arithmetic Coding Revisited”. In: ACM Transactions on Information Systems (1998).
  4. Mark Nelson. Data Compression with Arithmetic Coding. 2014. URL: https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html (visited on 11/06/2020).
  5. Mathematicalmonk YouTube Arithmetic Coding playlist: https://youtu.be/ouYV3rBtrTI (visited on 12/18/2020).
  6. Amir Said. “Introduction to Arithmetic Coding - Theory and Practice”. In: Lossless Compression Handbook. 2004.
  7. Ian H. Witten, Radford M. Neal, and John G. Cleary. “Arithmetic Coding for Data Compression”. In: Communications of the ACM. 1987.
  8. Jarek Duda. “Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding”. In: (2014).
  9. Keras character-level LSTM text generation example: https://keras.io/examples/generative/lstm_character_level_text_generation/ (visited on 12/10/20).



Imports

In [1]:
import sys 
from collections import defaultdict, Counter
import heapq
from bitstring import BitArray
from copy import copy
import pandas as pd
import time
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import random
from functools import lru_cache
import requests
import math
import itertools

# for plotting style
pd.set_option('display.max_rows', 10)
plt.rcParams["font.size"] = 22
plt.rcParams["figure.figsize"] = (12,6) 
plt.rcParams["ytick.labelsize"] = 19
plt.rcParams["xtick.labelsize"] = 19 
plt.rcParams["lines.linewidth"] = 2
plt.rcParams["axes.titlesize"] = 26



Chapter 1: Exploration of Books to Encode

1.1 Download books and establish character set

We download books from Project Gutenberg and use them as example data throughout the project. We also create CHARS which will act as our symbol universe throughout.

In [39]:
def download_book(url):
    return requests.get(url).content.decode("utf-8") 

URLS = {
           'a_christmas_carol':     'https://www.gutenberg.org/files/24022/24022-0.txt',
           'great_expectations':    'https://www.gutenberg.org/files/1400/1400-0.txt',
           'a_tale_of_two_cities':  'https://www.gutenberg.org/files/98/98-0.txt',
           'leviathan':             'http://www.gutenberg.org/cache/epub/3207/pg3207.txt',
           'war_and_peace':         'https://www.gutenberg.org/files/2600/2600-0.txt',
           'walden':                'https://www.gutenberg.org/files/205/205-0.txt',
           'peter_pan':             'https://www.gutenberg.org/files/16/16-0.txt',
           'the_war_of_the_worlds': 'https://www.gutenberg.org/files/36/36-0.txt'
        }

# a dictionary holding all the above texts, keyed by title
TEXTS = dict((title, download_book(url)) for title,url in URLS.items())
# the universe of symbols
CHARS = set().union(*[set(text) for text in TEXTS.values()])

print('Length of CHARS: {}'.format(len(CHARS)))
Length of CHARS: 125

1.2 Inspect books

Here we will get a sense of how long the books are. Our compression algorithms will work at the character level, so we use the character as the unit of length. Also relevant is the entropy of each book under some baseline model. The baseline model we use computes the probability of a character as its relative frequency in the text.

Note that with about 128 characters, the most naive encoding would assign a codeword of length $\log_2(128)=7$ to each character. By later comparing our models and compression rates to this count-based entropy, we get a less naive comparison point.

Get lengths and frequency based entropies

In [3]:
# for showing data in pandas table
def show_table(names, rows, d=3):
    Round = lambda x,d: round(x,d) if type(x)!=str else x
    data = {names[i]:[Round(row[i],d) for row in rows] for i in range(len(names))}
    display(pd.DataFrame(data).set_index(names[0]))
    
# get the entropy of a text assuming a naive count based probability model
# this will be a useful point of comparison to evaluate our compression rates
def count_entropy(text):
    counts = Counter(text)
    probs = dict((ch,cnt/len(text)) for ch,cnt in counts.items())
    return -sum([np.log2(probs[ch]) for ch in text])
    
names = ['title', 'length', 'count entropy / char', 'probability of \'e\'']
rows = [[title, len(text), count_entropy(text)/len(text), text.count('e')/len(text)]
       for title,text in TEXTS.items()]
show_table(names, rows)
title                   length    count entropy / char    probability of 'e'
a_christmas_carol         189055  4.608                   0.089
great_expectations       1035188  4.556                   0.090
a_tale_of_two_cities      793715  4.548                   0.094
leviathan                1254851  4.573                   0.099
war_and_peace            3293673  4.557                   0.095
walden                    664852  4.443                   0.095
peter_pan                 284841  4.593                   0.095
the_war_of_the_worlds     363747  4.511                   0.096

We see that the frequency-based entropy is already much lower than the 7 bits/character of a uniform model. This motivates us to start with a count-based model when we begin compressing these books.

Get distribution of most frequent characters

In [38]:
# distribution of top symbols
to_show = 35
all_texts = ''.join([text for text in TEXTS.values()])
freqs = sorted([(ch, all_texts.count(ch) / len(all_texts)) for ch in CHARS], 
               reverse=1, key=lambda x: x[1])
top_freqs = dict(freqs[:to_show])

fig, ax = plt.subplots(1, 1, figsize=(14,4))
ax.bar(*zip(*top_freqs.items()), color='b', width=.45, alpha=.5)
ax.set_ylabel('Frequency', fontsize=19)
ax.set_title('Frequency of Characters in All Books');

As expected, ' ' is the most frequent character, followed by 'e'. Characters become quite infrequent after the top 35 or so. We will be able to take advantage of such a skewed distribution while compressing.



Chapter 2: Huffman Encoding

Designed by David Huffman while a student at MIT, Huffman encoding is among the simplest of entropy encoders. By beginning with Huffman, we will have the opportunity to experiment with some fundamental ideas and models before progressing to trickier Arithmetic algorithms.

Huffman encoding assigns to each symbol a binary code which is not the prefix of any other code. Because the codes are prefix-free, decoding a concatenated string of codewords is unambiguous. Thus the code for the whole message can just be this concatenated string. Moreover, the expected length of a Huffman codeword over the symbol distribution is optimal among all prefix-free symbol codes. As we'll later discuss, Huffman encoding can get very close to entropy, but its discrete nature carries a potentially problematic cost when probabilities are not well approximated by negative powers of 2.

A set of prefix-free codewords can be mapped to a binary tree in which each path from root to leaf corresponds to a distinct codeword and associated symbol. Huffman's innovation was to build this tree bottom up. The algorithm initializes each symbol, with its probability, as a node. We then take the two nodes with the smallest probabilities and combine them into a single node whose probability is the sum of the two. Inside this node is now a subtree with branches for each component symbol; however, from the global scope of the algorithm, this interior structure is abstracted away. We then repeat: each iteration decreases the number of nodes by 1 until we've combined all the nodes into a single tree.

2.1 Get Huffman codewords

In [1150]:
# get codewords for each character present in the counter
def get_codewords(counts):
    # codewords will be a bit string for each character
    # we start with each codeword an empty string and build according to Huffman's scheme
    codewords = defaultdict(str)
    # sort counts into heap for easy access of the two smallest
    # initially each of these counts represents a leaf in the huffman tree
    counts = sorted([(cnt, [ch]) for ch,cnt in counts.items()])
    # until all nodes are merged, grab two smallest count nodes and merge their counts and chars
    while len(counts) > 1:
        cnt1, chars1 = heapq.heappop(counts)
        cnt2, chars2 = heapq.heappop(counts)
        # after merging these char sets all future bits added to their
        # strings will be the same.  The following creates the fork in the tree.
        for ch in chars1: codewords[ch] += '0'
        for ch in chars2: codewords[ch] += '1'
        heapq.heappush(counts, (cnt1+cnt2, chars1+chars2))
    # we reverse all codewords since we started from the leaves and worked backwards
    codewords = dict((ch, codeword[::-1]) for ch,codeword in codewords.items())
    return codewords

The code above follows exactly the plan we stated, except that we use counts as a substitute for probabilities. As mentioned, our first model is count based. Because of the sorting and heap operations, we get codewords in $O(n\log n)$ time, where $n$ is the number of symbols.

2.2 Create codewords for A Christmas Carol

In [1153]:
text = TEXTS['a_christmas_carol']

# we want codewords for all characters in our universe so that we can potentially 
# encode any of our books with the same codewords.  
# we append the characters 'A Christmas Carol' is missing to it.
missing_chars = CHARS - set(text)
counts = Counter(text + ''.join(missing_chars))
codewords = get_codewords(counts)

# show codewords
cols = ['character','count','codeword']
rows = [(repr(ch), cnt, codewords[ch]) for ch,cnt in sorted(list(counts.items()), reverse=1, key=lambda x:x[1])]
show_table(cols, rows)
character    count   codeword
' '          31332   110
'e'          16878   000
't'          12282   1001
'o'          11229   1000
'a'          10356   0110
...          ...     ...
'”'              1   111101100011101010
'ê'              1   101111110101111010
'ü'              1   111101100011100011
'ý'              1   111101100011100100
'ô'              1   111101100011100000

125 rows × 2 columns

As planned, we can see that the most frequent characters have the shortest codewords. Also note that, importantly, no codeword is a prefix of any other.
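
As a quick sanity check, here is a small brute-force verification of the prefix-free property (a sketch, using the codewords dictionary computed above):

# check that no codeword is a prefix of any other codeword
codes = list(codewords.values())
prefix_free = all(not codes[j].startswith(codes[i])
                  for i in range(len(codes)) for j in range(len(codes)) if i != j)
print('Codewords are prefix-free: {}'.format(prefix_free))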

2.3 Encode A Christmas Carol

With codewords in hand, we'll try encoding our first text, A Christmas Carol. The encoder now only needs to build the code by concatenating the codewords for each character in the text. This runs in time linear in the length of the text.

In [1154]:
# encoding is now straightforward and lightning fast
def encode(text, codewords):
    return ''.join([codewords[ch] for ch in text])

code = encode(text, codewords)
print('Code: {}...'.format(code[:50]))
print('Compressed text length: {} bits, or {:.3} bits/character'
      .format(len(code), len(code)/len(text)))
print('Entropy of text under count based model is {:.3} bits/character'
      .format(count_entropy(text)/len(text)))
Code: 11110110001110101111110110100100001101011111010111...
Compressed text length: 879077 bits, or 4.65 bits/character
Entropy of text under count based model is 4.61 bits/character

Considering UTF-8 encodes at a rate of at least 8 bits per character, we're doing OK even with this context-oblivious counts model. We can also see that we've managed to get very close to the entropy of our counts model, as Huffman is expected to do.

2.4 Decode A Christmas Carol

The decoding is slightly more involved. The simplest idea is perhaps to invert the character-to-codeword dictionary. Then, as we read each bit of the code, we can query this dictionary to see whether the bits read so far match a codeword. Once they do, we append the corresponding character to our output and start fresh at the next bit. This is very easy to code and runs in linear time, although it requires a hash lookup at every bit.
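
A minimal sketch of that dictionary-based alternative (not the implementation we actually use below) might look like this:

# invert the codeword dictionary and read bits until the buffer matches a codeword
def decode_with_dict(code, codewords):
    inverse = dict((codeword, ch) for ch, codeword in codewords.items())
    output, buffer = '', ''
    for bit in code:
        buffer += bit
        # because the code is prefix-free, the first match is unambiguous
        if buffer in inverse:
            output += inverse[buffer]
            buffer = ''
    return output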

Alternatively, and perhaps more instructive and faster for decoding, we opt to actually build the Huffman tree. Decoding then becomes an act of walking down the tree as we read in the code bits. When we reach a leaf, we append the character at that leaf to the output and start fresh at the top of the tree. This saves us from having to do any hashing while decoding, and as we can see in the code, each step of the decoder is computationally very cheap. The downside is that building the tree takes $O(n\log n)$ time, where $n$ is the number of distinct characters.

Our tree building algorithm is recursive. We first separate all the codewords by whether they start with 0 or 1. Then we recursively build the left tree from those starting with 0 and the right tree from those starting with 1. Finally, we output a root with those trees as its left and right children.

In [1155]:
# define node for the huffman tree
class node:
    def __init__(self, ch=None, left=None, right=None):
        self.ch = ch
        self.left = left
        self.right = right

# build huffman tree recursively from codewords
# the huffman tree let's us efficiently decode
# we can walk down the tree to reach chars at the leaves
def get_huffman_tree(codewords):
    # if there's only 1 codeword, we've reached a leaf and just return that node
    if len(codewords) == 1:
        return node(codewords.popitem()[0])
    # otherwise the tree is a root with two huffman tree children
    # the left child's leaves will be chars whose codewords start with a 0
    # the right child's leaves will be chars whose codewords start with a 1
    left = get_huffman_tree({k:v[1:] for k,v in codewords.items() if v[0]=='0'})
    right = get_huffman_tree({k:v[1:] for k,v in codewords.items() if v[0]=='1'})
    return node(left=left, right=right)

# to decode, we just keep walking down the huffman tree as we read the code
def decode(code, tree):
    output, cur = '', tree
    for bit in code:
        # walk down the current tree path
        cur = cur.left if bit=='0' else cur.right
        # if we reach a leaf, save that char to output and restart at the root
        if cur.ch:
            output += cur.ch
            cur = tree
    return output

tree = get_huffman_tree(codewords)
output = decode(code, tree)
print('First 50 chars of text:', text[:50])
print('First 50 chars of output:', output[:50])
print('Decode matches original text: {}'.format(text == output))
First 50 chars of text: The Project Gutenberg EBook of A Christmas Carol,
First 50 chars of output: The Project Gutenberg EBook of A Christmas Carol,
Decode matches original text: True

And we have successfully decoded A Christmas Carol! It may still be somewhat unsatisfying that our code is a Python string, where each '0' or '1' character really takes up far more than 1 bit of space. Thus we provide a quick demo of how to use the Python bitstring library to actually write a code file that takes up only about as many bits as the length of the code (rounded up to a whole byte).

2.5 Demo of code output to file

In [1156]:
# file write demo, just to show we can actually write a file with that many bits
def write(code, filename):
    compressed = BitArray(bin=code)
    with open(filename, 'wb') as f:
        compressed.tofile(f)
        
def read(file):
    return BitArray(filename=file)

write(code, 'text_compressed')
output = decode(str(read('text_compressed').bin)[:len(code)], tree)
print('Decode matches original text: {}'.format(text == output))
Decode matches original text: True



Chapter 3: Ngrams Context Aware Model

Since we cannot expect Huffman encoding to beat the entropy of the underlying model, to get a more efficient encoding we need a better model. Thus, we will now develop an adaptive model that will serve us for both Huffman and Arithmetic encoding.

One of the most natural ideas is to use an ngram-based model, and this is in fact what we will do. More concretely, we pick a context length $n$. Then to compute the probability of a character ch, we look at the $n$ characters before ch and call this string the context, or ngram. The probability is then the number of times ch appears after ngram in the text divided by the number of times ngram appears. To ensure that no model probabilities ever go to 0, we add 1 to all counts. We also need to take care to still be able to predict in the case of unseen context. For this we use a default mapping of characters to probabilities based on global frequencies.
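
As a tiny worked example of this estimate (a toy string rather than one of our books):

# with n=2, count what follows the context 'ra' in a toy text
toy_text, n = 'abracadabra', 2
follows = Counter(toy_text[i+n] for i in range(len(toy_text)-n) if toy_text[i:i+n] == 'ra')
print(follows)  # Counter({'c': 1}); the final 'ra' has no successor
# with add-one smoothing over an alphabet of A characters, the model below
# would estimate P('c' | 'ra') = (1+1) / (A+1)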

3.1 Code base class for Ngrams models

In [1158]:
# create an ngrams model that returns a probability of next char for a given context

def counts2probs(counts):
    total = sum(counts.values())
    return dict((ch, cnt/total) for ch,cnt in counts.items())
    
# base class to be used for Huffman and Arithmetic encoding
class Ngrams_Model:
    def __init__(self, n, chars):
        self.n = n # context length
        self.chars = chars
        self.probs = None
        self.default_probs = None
        
    def train(self, text):
        n = self.n
        # get count for each next char for each observed context
        # start each count at 1 so as not to have any zero probabilities
        context_cnts = defaultdict(lambda:{ch:1 for ch in self.chars})
        for i in range(n, len(text)):
            context_cnts[text[i-n:i]][text[i]] += 1
            
        # rescale counts to probabilities
        # this will generalize the model so we can also apply it to arithmetic coding
        context_probs = dict((ngram, counts2probs(cnts)) 
                             for ngram, cnts in context_cnts.items())
        self.probs = context_probs
        
        # we use default probs in case of unseen context
        self.default_probs = dict((ch, (text.count(ch)+1)/len(text)) 
                                  for ch in self.chars)
                             
    def predict(self, context):
        return self.probs.get(context) or self.default_probs

3.2 Instantiate and train an ngrams model

In [1159]:
ngrams_model = Ngrams_Model(n=3, chars=CHARS)
ngrams_model.train(TEXTS['a_christmas_carol'])

3.3 Inspect the model

We can now check to see how our model is behaving.

Look at top predictions for context = 'the'

In [1160]:
def get_top_preds(model, context, k):
    preds = [(ch,p) for ch,p in model.predict(context).items()]
    preds.sort(reverse=1, key=lambda x:x[1])
    return preds[:k]

col_names = ['next char pred', 'prob']
top_preds = get_top_preds(ngrams_model, 'the', 3)
show_table(col_names, [(repr(ch),p) for ch,p in top_preds])
next char pred    prob
' '               0.602
'r'               0.122
'y'               0.061

Indeed our model seems to give us reasonable predictions for this context. Let's evaluate it further by looking at its accuracy, confidence and entropy.

Further evaluate model's accuracy, confidence and entropy

In [1161]:
def evaluate(model, text):
    n = model.n
    relax = 3
    correctness, relax_correctness, confidence = [], [], []
    for start in range(len(text) - n - 1):
        context = text[start:start + n]
        top_preds = get_top_preds(model, context, relax)
        confidence.append(top_preds[0][1])
        correctness.append(text[start+n] == top_preds[0][0])
        relax_correctness.append(text[start+n] in [pred[0] for pred in top_preds])
    return np.mean(correctness), np.mean(relax_correctness), np.mean(confidence)

def adaptive_entropy(model, text):
    Sum = 0
    for i,ch in enumerate(text):
        context = text[i-model.n:i] if i>=model.n else ''
        p = model.predict(context)[ch]
        Sum += -np.log2(p)
    return Sum/len(text)

test_title = 'a_christmas_carol'
col_names = ['test title', 'accuracy', 'relaxed accuracy', 'mean confidence', 'entropy / char']
metrics = evaluate(ngrams_model, TEXTS[test_title])
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(ngrams_model, TEXTS[test_title])]])
test title           accuracy    relaxed accuracy    mean confidence    entropy / char
a_christmas_carol    0.553       0.795               0.254              3.295
In [1162]:
test_title = 'walden'
metrics = evaluate(ngrams_model, TEXTS[test_title])
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(ngrams_model, TEXTS[test_title])]])
test title    accuracy    relaxed accuracy    mean confidence    entropy / char
walden        0.454       0.69                0.236              3.715

As we can see above, our model predicts the next character correctly about 50% of the time. For comparison, our global frequency model would always predict ' ' as the most likely character, and would only be correct around 17% of the time. As expected, accuracy is better for A Christmas Carol because this is the text we trained the model on.

Note that relaxed accuracy is the probability that one of the model's top 3 predictions for the next character is correct, and mean confidence refers to the probability the model assigns to its top next character prediction.

The most revealing dimension of these results is the entropy / char. We've managed to push these numbers down by about 1 bit from what they were under the global count model. This will surely result in better compression.

To conclude our ngrams model inspection, let's use the model to generate language. To do this we start with a prompt, sample a character from the model's character distribution, append this character to the prompt and keep iterating.

Use the ngrams model to generate language

In [1163]:
# For sampling from model preds. If we generate language with 
# the top pred we always would get the same repetitive result
def sample(preds):
    chars = list(preds.keys())
    probs = [preds[ch] for ch in chars]
    probs = [v/sum(probs) for v in probs] # make sure probs are normalized
    # take multinomial sample
    sampled_idx = np.argmax(np.random.multinomial(1, probs, 1))
    return chars[sampled_idx]

# For using a model to generate language
def generate_text(model, prompt):
    n = model.n
    # just keep generating the next char from the current context
    for _ in range(100):
        preds = model.predict(prompt[-n:])
        prompt += sample(preds)
    prompt = prompt.replace('\n', ' ').replace('\r', ' ')
    return 'Generated language: ' + prompt

prompt = 'The queen is'
print(generate_text(ngrams_model, prompt))
Generated language: The queen isBdue# cIhyaje.hn,rh tâtaH morôeoo öhbs! dp ,  esPodmiK sai/uo  ç9e rop	 isUginXt a” e h°s 'SQ  gê B

By using models to generate language, we can get a more intuitive sense of how well our models understand text. Apparently our ngrams model is still relatively weak. Its predictions are often far from what a human would anticipate. This is at least in part due to the fact that we set $n=3$. Perhaps with a greater context length our model would be stronger, and we will certainly try $n>3$.



Chapter 4: Adaptive Huffman Encoding with Ngrams Model

Recall that for Huffman, we need not just character probabilities, but also codewords, and in our implementation, trees. Here we add methods to the ngrams base class to provide these features. Now to prepare a model for use, we must train it and call these methods to create the underlying data structures.

4.1 Extend ngrams base class to build Huffman data structures

In [1164]:
# to write this code, we can rely on the functions we wrote earlier
# namely get_codewords() and get_huffman_trees()
class Ngrams_Huffman_Model(Ngrams_Model):
    def make_context_codewords(self):
        if not self.probs or not self.default_probs:
            print('Error: model must be trained before making context codewords')
        self.context_codewords = dict((ngram, get_codewords(probs)) 
                                      for ngram,probs in self.probs.items())
        self.default_codewords = get_codewords(self.default_probs)
        
    def make_context_trees(self):
        if not self.context_codewords or not self.default_codewords:
            print('Error: model must make context codewords before trees')
        self.context_trees = dict((ngram, get_huffman_tree(codewords)) 
                                 for ngram,codewords in self.context_codewords.items())
        self.default_tree = get_huffman_tree(self.default_codewords)     

4.2 Train models on all of the Dickens texts for n from 1 to 6

In [1167]:
# all_dickens will just be a concatenation of the Dickens texts
TEXTS['all_dickens'] = TEXTS['great_expectations'] + \
                       TEXTS['a_christmas_carol'] + \
                       TEXTS['a_tale_of_two_cities']
text_train = TEXTS['all_dickens']
max_n = 6

# we record training times and space required
ngram_huffman_models = dict()
training_times = dict()
model_sizes = dict()
for n in range(1, max_n + 1):
    start = time.perf_counter()
    
    model = Ngrams_Huffman_Model(n, CHARS)
    model.train(text_train)
    model.make_context_codewords()
    model.make_context_trees()
    
    training_times[n] = time.perf_counter() - start
    model_sizes[n] = sys.getsizeof(model.context_codewords) + sys.getsizeof(model.context_trees)
    model_sizes[n] /= 1000  # kilobytes
    ngram_huffman_models[n] = model

Show training results

In [1168]:
def plot_dict(data, title, xlabel, ylabel, ax=None):
    if not ax:
        fig, ax = plt.subplots(1,1)
    ax.plot(*zip(*data.items()), color='k')
    ax.set_ylabel(ylabel)
    ax.set_xlabel(xlabel)
    ax.set_title(title)
    
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(training_times, 'Time to Build Huffman Models', 'context length', 'seconds', ax[0])
plot_dict(model_sizes, 'Model Space Requirements', 'context length', 'kilobytes', ax[1])

As we increase the context length, training gets more expensive in terms of both time and space. This may or may not be acceptable. If we intend to train one model and use it to compress many texts, then the training resources become negligible. However, if we wish to improve compression rates by training a model for each text we compress, greater context lengths become prohibitive. In fact, even just the model space can exceed the space of the original text, defeating our whole purpose.

4.3 Make Huffman Encoding and Decoding adaptive

We must now make slight modifications to our Huffman encoding and decoding algorithms. For each, we need to fetch the underlying data structures based on the current context. Importantly, because the context is based only on previous characters, the decoder will see exactly the same contexts that the encoder did, making faithful decoding possible.

Note that we need not worry about the context before the first character because of our default codewords and trees.

In [1207]:
def encode(text, model):
    code = ''
    for i,ch in enumerate(text):
        # get context if there is enough of it to get
        context = text[i-model.n:i] if i>=model.n else ''
        # get codewords for this context
        codewords = model.context_codewords.get(context) or model.default_codewords
        code += codewords[ch]
    return code

def decode(code, model):
    output = ''
    # start with default tree
    cur = model.default_tree
    for bit in code:
        cur = cur.left if bit=='0' else cur.right
        if cur.ch:
            output += cur.ch
            context = output[-model.n:] if len(output)>=model.n else ''
            # get tree for this context
            cur = model.context_trees.get(context) or model.default_tree
    return output

4.4 Test Adaptive Huffman Encoding and Decoding

We now test adaptive Huffman encoding for all the ngrams models we've trained (n from 1 to 6). For each test we time encoding and decoding, as well as collect other performance metrics. All the heavy lifting is done by functions we've already created.

In [1205]:
def test_adaptive_huffman(test_text, model):
    # code
    start = time.perf_counter()
    code = encode(test_text, model)
    code_time = time.perf_counter() - start
    
    bits_per_char = len(code) / len(test_text)
    
    # decode
    start = time.perf_counter()
    decoded = decode(code, model)
    decode_time = time.perf_counter() - start
    
    # see if output matches the original
    decode_success = 'yes' if decoded == test_text else 'no'
    
    entropy = adaptive_entropy(model, test_text)
    
    return (model.n, bits_per_char, entropy, code_time, decode_time, decode_success)

ADAPTIVE_HUFFMAN_RESULTS = dict()
def run_huffman_tests(test_title, models, show=True):
    test_text = TEXTS[test_title]
    max_n = max(models.keys())
    rows = [test_adaptive_huffman(test_text, models[n]) for n in range(1, max_n + 1)]
    col_names = ['context_length', 'coding bits / char', 'entropy / char', 
                 'code time', 'decode time', 'decoded matches original']
    ADAPTIVE_HUFFMAN_RESULTS[test_title] = {'col_names':col_names, 'rows':rows}
    if show: show_table(col_names, rows)

All Dickens results

In [1172]:
run_huffman_tests(test_title='all_dickens', models=ngram_huffman_models)
context_length    coding bits / char    entropy / char    code time    decode time    decoded matches original
1                 3.542                 3.479             0.818        1.383          yes
2                 2.922                 2.833             1.262        1.997          yes
3                 2.633                 2.511             1.652        4.143          yes
4                 2.849                 2.721             3.246        4.562          yes
5                 3.365                 3.239             5.537        4.034          yes
6                 3.951                 3.827             6.954        34.566         yes

Leviathan Results

In [1173]:
run_huffman_tests(test_title='leviathan', models=ngram_huffman_models)
context_length    coding bits / char    entropy / char    code time    decode time    decoded matches original
1                 3.675                 3.613             0.537        0.951          yes
2                 3.226                 3.150             0.863        1.414          yes
3                 3.207                 3.098             0.941        3.058          yes
4                 3.566                 3.452             2.261        4.990          yes
5                 4.058                 3.955             4.026        5.425          yes
6                 4.509                 4.418             7.725        9.402          yes

War and Peace Results

In [1174]:
run_huffman_tests(test_title='war_and_peace', models=ngram_huffman_models)
context_length    coding bits / char    entropy / char    code time    decode time    decoded matches original
1                 3.636                 3.584             1.543        2.205          yes
2                 3.121                 3.048             1.897        2.991          yes
3                 2.962                 2.856             2.089        5.466          yes
4                 3.246                 3.139             3.733        8.105          yes
5                 3.790                 3.686             6.227        12.535         yes
6                 4.319                 4.223             11.479       23.050         yes

Peter Pan Results

In [1175]:
run_huffman_tests(test_title='peter_pan', models=ngram_huffman_models)
context_length    coding bits / char    entropy / char    code time    decode time    decoded matches original
1                 3.568                 3.503             0.134        0.222          yes
2                 3.020                 2.934             0.243        0.432          yes
3                 2.844                 2.729             0.376        1.011          yes
4                 3.140                 3.022             0.803        1.736          yes
5                 3.686                 3.579             1.565        3.292          yes
6                 4.220                 4.121             2.741        5.200          yes

4.5 Results Analysis

We have successfully encoded and decoded all the texts we tested, even War and Peace, which is more than 3 million characters long. Our code lengths are all very close to the entropies of the models, and we see substantial improvement compared to the non-adaptive Huffman encoding. Coding and decoding time increases with $n$ because our data structures grow quickly with $n$. For $n<4$ we have reasonable coding and decoding times, each at most a few seconds even for millions of characters.

Finally, we will run two more tests without tables, and then look at the code lengths as a function of $n$ for each text.

In [1208]:
run_huffman_tests(test_title='the_war_of_the_worlds', models=ngram_huffman_models, show=False)
run_huffman_tests(test_title='walden', models=ngram_huffman_models, show=False)

Plot compression rates as a function of $n$

In [1209]:
fig, ax = plt.subplots(1, 1, figsize=(14,8))
for title, data in ADAPTIVE_HUFFMAN_RESULTS.items():
    x = [row[0] for row in data['rows']]
    y = [row[1] for row in data['rows']]
    ax.plot(x, y, label=title.replace('_',' '), lw=3, alpha=.8)
ax.set_xlabel('context length')
ax.set_ylabel('code bits / char')
ax.set_title('Adaptive Huffman Results for Ngram Models')
ax.legend();

As expected, performance is best for the all-Dickens concatenation because this is what we trained the models on. For all texts, $n=3$ appears to be ideal. Beyond that, our model starts to overfit and we must rely on the default codewords too often because we face more unseen contexts.

We do not pay too much of a price for using a model trained on one text to encode another. In fact, Dickens seems to be a pretty good proxy for all of these other books. This is encouraging because using a single model means we do not have to train and store a model for each encoding job.

To improve Huffman's performance, we could consider a technique called 'blocking', which encodes symbols in chunks rather than one at a time. For example, if we used a word-level model, our encoder could theoretically code a very likely next word as a single bit. By contrast, working at the character level forces us to encode each word with a number of bits at least equal to its length. However, since these texts are not just words each separated by a single space, a word-level model would introduce new problems.
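
As a rough sketch of the blocking idea (hypothetical, and not used elsewhere in this project), a word-level model could tokenize the text into alternating word and separator runs and feed those token counts to the same get_codewords function:

import re

# split text into 'blocks': runs of word characters and runs of everything else
def word_level_counts(text):
    tokens = re.findall(r'\w+|\W+', text)
    return Counter(tokens)

# e.g. get_codewords(word_level_counts(TEXTS['a_christmas_carol'])) would then
# assign a single short codeword to frequent tokens such as 'the' or ', '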

This concludes our work with Huffman encoding. With an understanding of the basics and a working ngrams model base class, we now turn to attempting to improve compression rates and overhead with Arithmetic encoding.



Chapter 5: Arithmetic Encoding Fundamentals

An Arithmetic encoder will code a document as a number between 0 and 1. We will see exactly how this works with the code below, but at a high level consider the following toy example:

Suppose we have just 10 characters, a through j, and a uniform distribution model. If we want to encode the word 'hi' we can deal with the characters in turn. For the h, since it is the 8th of our characters, we restrict our final output to be in $[.7, .8)$. This way, no matter what else we do, at decoding time we will see that our final code is in the 70-80th percentile interval between 0 and 1, and hence we can infer that the first character of the message was h. Now to encode our next character i, we note that i is our 9th character. Therefore we pick a final code in $[.78, .79)$, say .785. When decoding, after recognizing the first character as h, we'll see that the code .785 is in the 80-90th percentile of the $[.7, .8)$ interval that h restricted us to. This is precisely the information we need to realize the encoder must have seen i as the second character. Assuming infinite precision arithmetic, we can continue like this for any message length.
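
A quick numeric check of this toy example (floating point, uniform model over a through j):

low, high = 0.0, 1.0
for ch in 'hi':
    idx = 'abcdefghij'.index(ch)  # h -> 7, i -> 8
    width = high - low
    low, high = low + width*idx/10, low + width*(idx+1)/10
print(low, high)  # approximately 0.78, 0.79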

One can imagine how easily this extends to arbitrary probability distributions. The characters that are very likely will 'own' large intervals that require only a few bits to represent, which is what we want. By contrast, the less probable characters will 'own' smaller intervals, which will in turn require more bits of precision to represent in the final output code, as we should expect.

The beautiful thing about Arithmetic coding is that we no longer have codewords or trees to create. For this reason, Arithmetic coding lends itself much more naturally to adaptive models than Huffman coding does. Also, as a consequence of not using codewords, we are not forced to assign integer bit lengths to each character. Consider the benefit we get from this additional freedom. For example, if we had characters a,b and c with equal probabilities, with Huffman the best we could do is length 1,2,2 codewords. Intuitively, breaking the symmetry seems like a waste and we should be better off if only we could assign each of these characters a codeword of length between 1 and 2. By comparison, Arithmetic encoding would have no trouble assigning each of these characters a 1/3 interval.

Also, as we mentioned earlier, Huffman must assign a codeword of at least 1 bit to each character. For an example of how wasteful this can be, suppose that we only have two characters, a and b, with probabilities .999 and .001. Huffman must resort to assigning the codewords 0 and 1 and reaps no benefit from the very low binary entropy of the distribution. In our context, to encode a word, no matter how probable, a character-level Huffman encoder must use a number of bits at least the length of the word. Arithmetic encoding does not have this wasteful restriction.
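
To quantify the two examples above (a quick calculation, using nothing beyond the entropy formula):

# three equiprobable symbols: Huffman expected length vs entropy
print((1+2+2)/3, math.log2(3))  # about 1.67 vs 1.58 bits/symbol

# two symbols with probabilities .999 and .001: Huffman is stuck at 1 bit/symbol
print(1, -(.999*math.log2(.999) + .001*math.log2(.001)))  # 1 vs about 0.011 bits/symbol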

In fact, with Arithmetic encoding we are able to approach within a few bits of the entropy of the entire message, no matter how long the message is. We will further discuss this fact in chapter 6. Also, as we will see, we can encode and decode in time linear in the message length. These facts together make a very strong case for Arithmetic coding.

5.1 Infinite precision

Assuming for a moment we have infinite precision, programming an Arithmetic encoder is quite easy. We start by giving each character in our alphabet an interval of length proportional to its probability.

During encoding, for each character we reduce the scope of our current interval as discussed above. Finally, we just output a code in the middle of our final interval.

To decode, we determine each output character by iterating through all possible candidates and seeing which one is consistent with the current code. We then rescale the code to be its percentile within the last interval we've determined we're in. We use an end-of-file symbol so that the decoder can determine when to stop.

See the references section for a more complete discussion of Arithmetic coding fundamentals.

Infinite precision encoder and decoder algorithms

In [1178]:
# End of file symbol.  Importantly this does not appear in any of our texts
EOF = '•' 
chars = 'abcdefghijklmnopqrstuvwxyz!' + EOF  # simple char set for this example
# create a model with a uniform distribution over these chars
width = 1/len(chars) 
# the intervals are really all the model needs to provide
intervals = dict((ch, (width*i, width*(i+1))) for i,ch in enumerate(chars))

# example text to try encoding and decoding
text = 'peekaboo!' + EOF

# low and high tracks our current interval
# initially the text can encode to anywhere on the 0 to 1 interval
# encoding is an iterative process of narrowing this interval
def encode(text, intervals):
    low, high = 0, 1
    for ch in text:
        l,h = intervals[ch]
        width = high-low
        low, high = low + width*l, low + width*h
    # output any number in the interval for simplicity
    return (low + high)/2 

# this function naively iterates over all chars to find the one that matches
# the interval found in the code for the current iteration
def decode_step(code, intervals):
    for ch, interval in intervals.items():
        if interval[0] <= code < interval[1]:
            return ch, interval
        
# get the char that matches the current interval, then rescale the code
def decode(code, intervals):
    output = ''
    while True:
        ch, interval = decode_step(code, intervals)
        output += ch
        width = interval[1] - interval[0]
        code = (code - interval[0])/width
        if ch == EOF: break
    return output

code = encode(text, intervals)
output = decode(code, intervals)
print('Original text: {}\nCode: {}\nDecoded text: {}'.format(text, code, output))
    
Original text: peekaboo!•
Code: 0.5410148146892083
Decoded text: peekaboo!•

We can see that our encoding and decoding worked; however, none of this was practical. Clearly, if our original text were longer we would run out of precision with which to do the arithmetic and output the final code. Also note that our code, written out as a decimal string, is longer than the original text! We should be more careful about exactly what we choose as the final output value. Moreover, we'll also need the code to be in binary.

Before moving on, let's be sure to add the end of file char we introduced to our global CHAR set.

EOF joins our character set

In [1179]:
CHARS |= {EOF}



Chapter 6: Finite Precision Arithmetic Encoder

The first step to extending these ideas to finite precision is to note how binary numbers work after the decimal point. We can think of each bit after the decimal point as specifying which half of a possible interval we are in. Just as $.1$ in decimal is $1/10$ and $.01$ is $1/10^2$, $.1$ in binary is $1/2$ and $.01$ is $1/2^2$. So .101 in binary tells us that we're in the second half of 0 to 1, then the first half of .5 to 1, and then the second half of that, or .625.

For the sake of practicality, to express numbers between 0 and 1 in binary with finite precision, we'll drop the decimal point from all of our calculations. Instead, we'll pick a precision $p$ and refer to $2^p$ as whole. We then simply think of whole as being 1 in our units. If we choose $p=64$, with 64 bits we can express any integer between 0 and $2^{64}-1$, which can in turn be mapped to numbers between 0 and 1 with a precision on the order of $1/2^{64}$.
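
A small illustration of both points:

# the binary fraction .101 is 1/2 + 0/4 + 1/8
print(int('101', 2) / 2**3)  # 0.625

# fixed precision view: with p bits, the integer n stands for n / 2**p
p = 8
whole, half, quarter = 2**p, 2**(p-1), 2**(p-2)
print(whole, half/whole, quarter/whole)  # 256 0.5 0.25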

The core to our finite precision encoder is the rescale function below. Note that as we encode, we narrow the interval we're in. To continue to calculate at narrower and narrower intervals, we don't actually need to be concerned with how we got to the current range; that's already encoded in our code output so far. Instead, all we need to do is to keep adding bits to the code to specify which half of the current interval we need to go to next. We then rescale that half interval to the new whole.

The tricky part arises when the next interval straddles 1/2 and the interval is narrow enough that we must rescale to get more precision. For example, if the current character during encoding tells us, after all rescaling, that our next interval is from .4 to .6, we cannot yet know whether our next code bit should be a 1 or a 0. In fact, the next code bit can only be determined by looking at more characters of the text. In this case, we wait to output more code bits and instead start appending bits to saved_bits. Rather than scale one or the other half of the interval to the new whole, we rescale .25 to .75 as the new whole. Recall that our interval of interest, .4 to .6, is entirely contained in this middle half. In a pathological case, which is sure to arise when encoding a text even just thousands of characters long, we will have to rescale the middle multiple times before we can safely determine that our interval of interest lies entirely on one side or the other. In these cases, the length of saved_bits effectively tracks how many of these middle rescalings we have done.

For a more thorough explanation of finite precision Arithmetic coding, see Mark Nelson's Data Compression with Arithmetic Coding (2014), reference (4).

Although the Python code below and its modifications are my own, I credit the pseudocode used as a base to Mathematicalmonk's YouTube Arithmetic Coding playlist: https://youtu.be/ouYV3rBtrTI. This is also a great resource for a more detailed explanation than is within the scope of this project.

6.1 Rescaling and narrowing encoder helper functions

In [1184]:
# a and b represent the ends of the current interval
# pseudocode from reference (5)
def rescale(a, b, saved_bits, code, half, quarter):
    # while the interval lies entirely in one half, we can output bits and rescale
    while b < half or a > half:
        if b < half:
            # interval of interest is in first half: output 0 and saved bits, then rescale
            code += '0' + '1'*saved_bits
            saved_bits, a, b = 0, 2*a, 2*b
        elif a > half:
            # interval of interest is in second half: output 1 and saved bits, then rescale
            code += '1' + '0'*saved_bits
            saved_bits, a, b = 0, 2*(a-half), 2*(b-half)
    # interval of interest straddles the middle: blow up the middle and save bits
    while a > quarter and b < 3*quarter:
        saved_bits, a, b = saved_bits+1, 2*(a-quarter), 2*(b-quarter)
    return a, b, saved_bits, code
    
# given the current interval [a,b) and the new interval to narrow to [c/r,d/r)
# this function returns the new a and b after narrowing
# r represents 1 whole in the units of precision we use for c and d, and will be further discussed
def narrow(a,b,width,c,d,r):
    b = a + (width*d)//r
    a = a + (width*c)//r
    return a,b

6.2 Encoder

For the encoder we progress straight to assuming an adaptive model. We treat the model as a black box, but it must have a property n which tells us how much previous context to feed it. The model must also support the method get_intervals, which takes as input a context and returns intervals. Each interval corresponds to the probability of the character it is assigned to. Since we wish to do all arithmetic with integers, the character intervals partition the integer range from 0 to r. For example, if a character has a probability of 1/10 and r is 500, we may assign that character the interval $[100,150)$. It is the responsibility of the model to produce these intervals from its underlying probability distribution and to decide on an appropriate precision r.

In [1186]:
# we move straight to adaptive models for this Arithmetic encoder
# hence the first step on each character is to get the previous context
# pseudocode from reference (5)
def encode(text, model, precision):
    whole, half, quarter = 2**precision, 2**(precision-1), 2**(precision-2)
    code = ''
    a, b, saved_bits = 0, whole, 0
    
    for i,ch in enumerate(text):
        # get current context
        context = text[i-model.n:i] if i >= model.n else ''
        # get model prediction
        intervals, r = model.get_intervals(context)
        # get interval for current character being encoded
        c, d = intervals[model.char2idx[ch]]
        # narrow and rescale
        a, b = narrow(a, b, b-a, c, d, r)
        a, b, saved_bits, code = rescale(a, b, saved_bits, code, half, quarter)
    
    # finish
    saved_bits += 1
    return code + ('0' + '1'*saved_bits if a <= quarter else '1' + '0'*saved_bits)

Assuming some fixed model prediction time, it should be fairly clear that our encoder runs in time linear in the length of the text. We now briefly discuss why the expected length of the encoding will be very close to the entropy of the text under the model. We keep this discussion at a very high level as it is mostly out of our scope here.

Suppose that we're encoding a text with model probability $p$. Roughly speaking, once we have rescaled enough that the interval representing the text is at least half the size of our entire interval of interest, we can output a number within the text's interval with just 1 or 2 more bits. Now, note that every time we rescale, the interval representing the text blows up by a factor of 2. Thus the number of rescalings we need is about $\log_2(1/p)$. Since our code is within a few bits of the number of rescalings, our code has length roughly $\lceil\log_2(1/p)\rceil$. Taken over all possible texts $t$ we could have seen, our expected code length becomes $\sum_t p_t \log_2(1/p_t)$, which is just the entropy of the source under the model.
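
In symbols, the standard guarantee discussed in the Arithmetic coding references is roughly that a message $t$ with model probability $p_t$ is encoded in at most $\ell(t) \le \log_2(1/p_t) + c$ bits, so that $E[\ell] \le \sum_t p_t \log_2(1/p_t) + c$, where $c$ is a small constant (a couple of bits, depending on the termination convention) that does not grow with the message length.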



Chapter 7: Finite Precision Arithmetic Decoder

Of critical importance, the decoder does the same arithmetic that the encoder did. It rescales in exactly the same way, and it rounds in the narrowing function in exactly the same way. This control we have when working with integer arithmetic is the reason why these algorithms work for arbitrarily long texts.

7.1 Helper functions

These helper functions are analogous to what we had for the encoder. z is the current chunk of code we're trying to decode, within the precision we're working in. The binary search is an additional helper function that I introduced to speed up decoding. Recall from the infinite precision example that the decoder must search through the characters to find the one associated with the current interval specified by the code. With our 126 characters, binary search takes this process from up to 126 steps down to about 7 steps. The net effect of the binary search is to speed up the whole decoder by a factor of about 3.

In [1187]:
# pseudocode from reference (5)
def decoder_half_rescale(a,b,z,i,code,half):
    if b < half: 
        a*=2; b*=2; z*=2
    elif a > half:
        a,b,z = 2*(a-half), 2*(b-half), 2*(z-half)
    i += 1
    if i < len(code) and code[i] == '1':
        z += 1
    return a, b, z, i

# pseudocode from reference (5)
def decoder_quarter_rescale(a, b, z, i, code, quarter):
    a,b,z = 2*(a-quarter), 2*(b-quarter), 2*(z-quarter)
    i += 1
    if i < len(code) and code[i] == '1':
        z += 1
    return a, b, z, i    

# to speed up the search for the character that fits the current interval
# specified by z, the chunk of code currently considered
def binary_search(a, b, z, intervals, r):
    low, high = 0, len(intervals)-1
    while low != high:
        mid = (low + high) // 2
        c,d = intervals[mid]
        A, B = narrow(a,b,b-a,c,d,r)
        if z < B: # must be lower than mid
            high = mid
        else: # must be at least mid
            low = mid+1   
    return low
            

7.2 Decoder algorithm

In [1188]:
# decoding is made more straightforward with the helper functions
# importantly we must use the same precision we used for the encoding
# to output each character, we feed the model the current context to get intervals,
# binary search for the character/ interval that fits the code
# and then rescale a and b in exactly the same way we did while coding
# when we hit EOF, we return

# pseudocode from reference (5)
def decode(code, model, precision):
    whole, half, quarter = 2**precision, 2**(precision-1), 2**(precision-2)
    i = precision-1
    z = int(code[:i+1].ljust(precision, '0'),2)
    output = ''
    a, b = 0, whole

    while 1:
        context = output[-model.n:] if len(output)>=model.n else ''
        intervals, r = model.get_intervals(context)
        I = binary_search(a, b, z, intervals, r)
        ch = model.chars[I]
        c,d = intervals[I]
        a,b = narrow(a,b,b-a,c,d,r)
        output += ch
        if ch == EOF: return output
                
        while b < half or a > half:
            a, b, z, i = decoder_half_rescale(a,b,z,i,code, half)
            
        while a > quarter and b < 3*quarter:
            a, b, z, i = decoder_quarter_rescale(a, b, z, i, code, quarter)

Note that assuming a constant number of characters and a fixed time for the model to compute the intervals, the decoder runs in time linear in the output.



Chapter 8: Adaptive Arithmetic Encoding with Ngrams Model

8.1 Extend the ngrams base model from earlier

Recall that our Arithmetic encoder treats the model as a black box; however, the model must support the get_intervals method. Given a context, get_intervals returns the interval it assigns to each character. These intervals are put into a list, where the interval at each index corresponds to the character at that same index in the model.chars list. To implement this functionality below, we take the probabilities our ngrams base model already has and essentially blow up the space by a factor of r. Each character's unnormalized weight is then the rounded-down r times probs[ch], or 1 if that would be zero.

For example, if we had probabilities .2, .5 and .3 and if r=100, then we would output model intervals as the list of tuples [(0,20), (20,70), (70,100)]. We also are sure to make the model return default intervals in the case of a context unseen during training. To produce the default intervals, we use the base class' default_probs.

In [ ]:
class Ngrams_Arithmetic_Model(Ngrams_Model):
    def _make_intervals(self, probs, r):
        unnormalized_probs = [(int(r*probs[ch]) or 1) for ch in self.chars]
        r_true = sum(unnormalized_probs)
        intervals, start = [], 0
        for prob in unnormalized_probs:
            intervals.append((start, start + prob))
            start += prob
        return intervals, r_true
    
    def make_context_intervals(self, r):
        if not self.probs or not self.default_probs:
            print('Error: model must be trained before making context intervals')
        self.chars = sorted(list(self.chars))
        self.char2idx = dict((ch,i) for i,ch in enumerate(self.chars))
        
        default_intervals, default_r = self._make_intervals(self.default_probs, r)
        self.intervals = defaultdict(lambda:default_intervals)
        self.r = defaultdict(lambda:default_r)
        for context, probs in self.probs.items():
            intervals, r_true = self._make_intervals(probs, r)
            self.intervals[context] = intervals
            self.r[context] = r_true
            
    def get_intervals(self, context):
        return self.intervals[context], self.r[context]
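
As a quick sanity check of the interval construction (a toy three-character model, matching the example above):

demo = Ngrams_Arithmetic_Model(n=1, chars={'a', 'b', 'c'})
demo.chars = sorted(demo.chars)  # fix an order, as make_context_intervals does
print(demo._make_intervals({'a': .2, 'b': .5, 'c': .3}, r=100))
# ([(0, 20), (20, 70), (70, 100)], 100)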

8.2 Train our first ngrams Arithmetic Model

Let's train a simple model and see how the choice of r influences the length of an encoding. For this example, we will set the context length to 3 and encode A Christmas Carol. We are not yet concerned with decoding, but will soon rigorously test our algorithms.

Recall that r controls how precisely the model can express probabilities. If r=500 for example, the model must estimate all probabilities to the closest 1/500th.

In [1189]:
model = Ngrams_Arithmetic_Model(n=3, chars=CHARS)
model.train(TEXTS['a_christmas_carol'])

bits_per_char = dict()
for r in [0,50,100,200,400,800]:
    model.make_context_intervals(r=r)
    # note that we must now always append EOF to our texts
    # here we use an encoder precision of 64 bits, going to 32 or 128 makes little difference
    code = encode(TEXTS['a_christmas_carol'] + EOF, model, 64)
    bits_per_char[r] = len(code)/len(TEXTS['a_christmas_carol'] + EOF)
    
plot_dict(bits_per_char, 'The Effect of R', 'r', 'bits per character')

Note that the r we pass into make_context_intervals is not exactly the true r the model will use. The true r accounts for the rounding the model needed to do in making the intervals, including rounding all unnormalized weights of less than 1 up to 1. Thus, when r=0, we essentially force the model to give uniform intervals, and with 126 characters, an encoding of about 7 bits per character is expected.

As we increase r our compression rate generally improves. Once r reaches about 400, additional precision buys little. Based on this analysis, we set the default r in our models to 500 and leave it there.

8.3 Train Models for ngrams up to length 6

Similar to what we did for Huffman encoding, we will train ngrams Arithmetic models for context lengths from 1 to 6. If we intend to train a model every time we encode a text, training performance matters a great deal, so we analyze it and compare it with what we saw for the Huffman models. Again we train on the concatenation of all the Dickens texts.

In [1192]:
max_n = 6
text_train = TEXTS['all_dickens'] + EOF
ngram_arithmetic_models = dict()
training_times = dict()
model_sizes = dict()
for n in range(1, max_n + 1):
    start = time.perf_counter()
    
    model = Ngrams_Arithmetic_Model(n=n, chars=CHARS)
    model.train(text_train)
    model.make_context_intervals(r=500)
    
    training_times[n] = time.perf_counter() - start
    model_sizes[n] = sys.getsizeof(model.intervals) / 1000  # bytes to kilobytes
    ngram_arithmetic_models[n] = model
In [1193]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(training_times, 'Time to Build Arithmetic Models', 'context length', 'seconds', ax[0])
plot_dict(model_sizes, 'Model Space Requirements', 'context length', 'kilobytes', ax[1])

As a point of comparison, the Huffman ngrams model for n=6 took about 600 seconds to train. For high $n$ we now train roughly 20 times faster. The large savings in training cost comes from not having to build codewords or trees; we only have to build intervals, which is cheap. The saving is also reflected in the n=6 model's space coming down by a factor of about 2.

Note that for short contexts, training a model for each encoding becomes practical. This is potentially useful as we've already seen that ngram models provide lower entropies for texts they were trained on.

8.4 Test ngrams Arithmetic Encoding and Decoding

It's now time to test all the Arithmetic encoding tools we've developed. We will choose a number of texts from our book set to encode and decode. We will time the encoding and decoding computations, record the entropies of the models, and verify that decoding reproduces the original text.

In [1194]:
# simple test script to time encoding, decoding and check correctness
def test_arithmetic(test_text, model):
    start = time.perf_counter()
    code = encode(test_text, model, precision=64)
    code_time = time.perf_counter() - start
    
    bits_per_char = len(code) / len(test_text)
    
    start = time.perf_counter()
    output = decode(code, model, precision=64)
    decode_time = time.perf_counter() - start
    
    entropy = adaptive_entropy(model, test_text)
    
    success = 'yes' if test_text == output else 'no'
    
    return model.n, bits_per_char, entropy, code_time, decode_time, success
    
# function to run test on a dictionary of models
# here we'll use our ngrams models that we just trained
def run_arithmetic_tests(test_title, models, results_dict, show=True):
    test_text = TEXTS[test_title] + EOF
    max_n = max(models.keys())
    min_n = min(models.keys())
    rows = [test_arithmetic(test_text, models[n]) for n in range(min_n, max_n + 1)]
    col_names = ['context_length', 'coding bits / char', 'entropy / char', 
                 'code time', 'decode time', 'decoded matches original']
    results_dict[test_title] = {'col_names':col_names, 'rows':rows}
    if show: show_table(col_names, rows)
    
# place to store results
results_bucket = dict()
In [1196]:
run_arithmetic_tests('a_christmas_carol', ngram_arithmetic_models, results_bucket)
context_length  coding bits / char  entropy / char  code time (s)  decode time (s)  decoded matches original
1               3.872               3.629           2.546          1.321            yes
2               3.147               2.929           2.340          1.326            yes
3               2.733               2.629           1.661          1.314            yes
4               2.830               2.869           1.981          1.448            yes
5               3.259               3.379           2.059          1.527            yes
6               3.770               3.940           2.993          1.872            yes
In [1197]:
run_arithmetic_tests('walden', ngram_arithmetic_models, results_bucket)
context_length  coding bits / char  entropy / char  code time (s)  decode time (s)  decoded matches original
1               3.768               3.530           27.493         5.004            yes
2               3.213               2.986           23.096         4.784            yes
3               2.916               2.800           19.000         4.774            yes
4               3.109               3.133           21.732         5.218            yes
5               3.671               3.752           27.957         7.258            yes
6               4.226               4.311           39.264         6.215            yes
In [1198]:
run_arithmetic_tests('peter_pan', ngram_arithmetic_models, results_bucket)
context_length  coding bits / char  entropy / char  code time (s)  decode time (s)  decoded matches original
1               3.743               3.504           5.054          2.068            yes
2               3.152               2.935           4.554          2.496            yes
3               2.842               2.731           4.625          1.984            yes
4               3.011               3.027           4.581          2.051            yes
5               3.508               3.585           5.296          2.183            yes
6               4.035               4.128           6.701          2.770            yes
In [1199]:
run_arithmetic_tests('the_war_of_the_worlds', ngram_arithmetic_models, results_bucket)
context_length  coding bits / char  entropy / char  code time (s)  decode time (s)  decoded matches original
1               3.758               3.514           11.069         2.606            yes
2               3.200               2.974           7.241          2.560            yes
3               2.878               2.765           6.434          2.507            yes
4               3.060               3.081           6.629          2.662            yes
5               3.578               3.657           7.882          2.909            yes
6               4.098               4.187           10.454         3.280            yes

Unfortunately our compression rates are about the same as they were for Huffman encoding with the ngrams models. However, both encoders compress at rates very close to entropy, so there is little room left for improvement. In fact, our Arithmetic encoder even occasionally compresses below entropy. The reason is that we compute the entropy from the model's probabilities but compute the compression from the model's intervals, and the intervals involve rounding: probabilities are rounded when we create the intervals, and further rounding happens during encoding when we divide intervals by r. This rounding occasionally works in our favor, pushing the measured compression rate below the entropy computed from the raw model probabilities.
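
To make this concrete, here is a small made-up example (a three-character alphabet and r=10 for readability, not one of our trained models) comparing the ideal code length under the raw probabilities with the code length implied by the rounded intervals. A character like 'b' below is slightly cheaper to encode than its raw probability suggests, which is exactly the effect that can push the measured rate below the model-probability entropy.

In [ ]:
import math

# hypothetical probabilities and their rounded intervals with r=10
probs = {'a': 0.28, 'b': 0.52, 'c': 0.20}
widths = {ch: (int(10 * p) or 1) for ch, p in probs.items()}   # a:2, b:5, c:2
r_true = sum(widths.values())                                  # 9

for ch in probs:
    ideal  = -math.log2(probs[ch])            # cost under the raw probability
    actual = -math.log2(widths[ch] / r_true)  # cost implied by the interval
    print(ch, round(ideal, 3), round(actual, 3))
# b costs about 0.848 bits instead of 0.943; a costs more (2.170 vs 1.837)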

One serious issue with our implementation is the encode and decode times. Although we can train models much faster for the Arithmetic encoder, encoding and decoding are far slower than for Huffman encoding. Recall that for our Huffman implementation, once the model is trained, encoding is just a matter of hashing to get the codeword for each character. Our Arithmetic encoder also runs in linear time, but with far more overhead: it must narrow and rescale the interval for each character in the text. Even though these operations are just simple arithmetic at 64-bit precision, they are far more computation than hashing for codewords. The decoder carries the same arithmetic, plus a binary search for the right character on each output iteration, whereas the Huffman decoder only needed to walk down a tree of depth roughly $\log(|CHARS|)$ to decode each character.

There are a number of things we could do to speed up our implementation. First, we could work in binary explicitly and perform the rescaling operations with bit shifts. Also, as suggested by references (1) and (3), we could tolerate more rounding and less precision for certain operations. These improvements are outside the scope of this project, but a rough sketch of the bit-shifting idea follows.
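
For illustration only, here is one way the interval-doubling steps of rescaling might be phrased with shifts and masks. The variable names, the low/high representation, and the omission of the underflow (E3) case are all assumptions for the sketch; this is not a description of our Chapter 6 encoder.

In [ ]:
# illustrative sketch: rescaling via bit operations (underflow case omitted)
PRECISION = 64
FULL = 1 << PRECISION
HALF = FULL >> 1

def rescale_with_shifts(low, high, out_bits):
    # while the interval sits entirely in one half, emit that half's bit
    # and double the interval: x -> (2x) mod FULL is a shift plus a mask
    while high < HALF or low >= HALF:
        out_bits.append(int(low >= HALF))
        low  = (low  << 1) & (FULL - 1)
        high = ((high << 1) | 1) & (FULL - 1)
    return low, high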

In [1201]:
fig, ax = plt.subplots(1,1, figsize=(14,8))
for title, data in results_bucket.items():
    x = [row[0] for row in data['rows']]
    y = [row[1] for row in data['rows']]
    ax.plot(x, y, label=title.replace('_',' '), lw=3, alpha=.8)
ax.set_xlabel('context length')
ax.set_ylabel('code bits / char')
ax.set_title('Adaptive Arithmetic Results for Ngram Models')
ax.legend();

We see a similar picture of compression performance as a function of n as we did for adaptive Huffman encoding. Larger context lengths don't help the ngrams models because, as n grows, the current context becomes increasingly unlikely to have been seen during training. Recall that in those cases the model must fall back on global counts to infer probabilities.

One way of taking advantage of larger contexts, as suggested by reference (3), is to train a model for every context length up to n. Then, when we come upon a novel context of length n during compression, we fall back to the intervals for context length n-1 rather than all the way to n=0. We can keep shortening the context until we reach one that was seen during training. In this way we can take great advantage of larger contexts when they are predictive. For example, suppose the word 'summer' appears in a text we're encoding. An n=6 context could tell us that the r almost certainly follows the e, and the distribution over the character after the r would have very low entropy, peaked at ' '. On the other hand, n=6 will likely do poorly on proper nouns unseen during training, where n=2 or n=1 would still pick up on English spelling conventions, such as u following q.
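
Here is a hedged sketch of that backoff scheme, wrapping the ngram_arithmetic_models we trained above. It assumes a model with context length n keys self.intervals by the previous n characters and that the encoder only needs the attributes our other models expose; the class name and details are illustrative rather than an implementation we tested.

In [ ]:
# illustrative backoff wrapper around the models in ngram_arithmetic_models
class Backoff_Ngrams_Model:
    def __init__(self, models_by_n):
        self.models = models_by_n
        self.max_n = max(models_by_n)
        self.n = self.max_n
        self.chars = models_by_n[self.max_n].chars
        self.char2idx = models_by_n[self.max_n].char2idx

    def get_intervals(self, context):
        # try the longest context first, then progressively shorter suffixes
        for n in range(self.max_n, 0, -1):
            model = self.models[n]
            suffix = context[-n:]
            if suffix in model.intervals:      # this context was seen in training
                return model.get_intervals(suffix)
        # nothing matched: fall back to the unigram model's default intervals
        return self.models[1].get_intervals(context[-1:])

# hypothetical usage:
# backoff = Backoff_Ngrams_Model(ngram_arithmetic_models)
# code = encode(TEXTS['walden'] + EOF, backoff, 64)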

Our Arithmetic encoder would be able to take better advantage of these peaked distributions for large n than Huffman encoding, because it need not approximate probabilities by negative powers of 2: a Huffman codeword always costs at least one whole bit.
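
As a small numerical illustration (the 0.95 is hypothetical), consider a context where the model is 95% sure of the next character: Huffman must still spend at least one whole bit on it, while an Arithmetic coder pays roughly -log2(0.95) bits.

In [ ]:
import math

p = 0.95                              # hypothetical peaked prediction
huffman_cost = 1                      # any Huffman codeword is at least 1 bit
arithmetic_cost = -math.log2(p)       # about 0.074 bits for this character
print(huffman_cost, round(arithmetic_cost, 3))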

We will now go on to explore an idea potentially more interesting and modern than variable-length contexts: we will develop an LSTM model and plug it into our Arithmetic encoder.



Chapter 9: LSTM Model

Developed in the 1990s, the LSTM is a recurrent neural network for making predictions from sequence inputs. Today LSTMs are among the most powerful language models and are used by many companies for text prediction. They are also well known for their ability to generate language.

9.1 Arithmetic encoder LSTM Model

We use Keras to build a simple character-level LSTM model.

In [992]:
# should ideally give the model 1M char text to train on
class LSTM_Model:
    def __init__(self, chars, n=20, units=128, lr=.01):
        self.chars = sorted(list(chars))
        self.char2idx = dict((c, i) for i, c in enumerate(self.chars))
        self.idx2char = self.chars
        self.n = n
        self.model = self._compile_model(n, self.chars, units, lr)

    # code from reference (9)
    def _compile_model(self, n, chars, units, lr):
        model = keras.Sequential()

        model.add(keras.Input(shape=(n, len(chars))))
        model.add(layers.LSTM(units))
        model.add(layers.Dense(len(chars), activation="softmax"))

        optimizer = keras.optimizers.RMSprop(learning_rate=lr)
        model.compile(loss="categorical_crossentropy", optimizer=optimizer)
        return model

    # code from reference (9)
    # we use the training text to generate context and next char pairs
    # the contexts play the role of x and next chars the role of y
    # x and y are really one-hot-encoded binary vectors that represent
    # characters by index
    def _get_x_y(self, text):
        n = self.n
        chars = self.chars
        step = 3
        contexts = [text[i : i + n] for i in range(0, len(text) - n, step)]
        next_chars = [text[i + n] for i in range(0, len(text) - n, step)]

        x = np.zeros((len(contexts), n, len(chars)), dtype=bool)  # np.bool is removed in recent NumPy
        y = np.zeros((len(contexts), len(chars)), dtype=bool)
        for i, context in enumerate(contexts):
            for t, char in enumerate(context):
                x[i, t, self.char2idx[char]] = 1
            y[i, self.char2idx[next_chars[i]]] = 1
        return x, y
        
    def train(self, train_text, epochs):
        x, y = self._get_x_y(train_text)
        self.model.fit(x, y, batch_size=128, epochs=epochs)

    # code from reference (9)
    # here we build X as the binary vector representation of the context
    # then we feed X to the model.predict function provided by keras
    def _predict(self, context): 
        X = np.zeros((1, self.n, len(self.chars)))
        for t, ch in enumerate(context):
            X[0, t, self.char2idx[ch]] = 1.0
        return self.model.predict(X, verbose=0)[0]
    
    def predict(self, context):
        probs = self._predict(context)
        return dict((self.chars[i], prob) for i,prob in enumerate(probs))
    
    # getting intervals is similar to what we've done earlier
    # we use lru_cache so that the model does not have to recompute intervals
    # for contexts it has already been asked to predict on
    @lru_cache(maxsize=None)  # cache every context we've already predicted on
    def get_intervals(self, context, r=500):
        padded_context = context[-self.n:].rjust(self.n, ' ')
        preds = self._predict(padded_context)
        unnormalized_probs = [int(r*prob) or 1 for prob in preds]
        start = 0
        intervals = []
        r_true = sum(unnormalized_probs)
        for prob in unnormalized_probs:
            intervals.append((start, start + prob))
            start += prob
        return intervals, r_true

9.2 Train the LSTM model

We use n=7 for model training. Note that shorter contexts are simply left-padded with ' ' before being fed to the model. One of the key benefits of the LSTM model is that we no longer need to worry about unseen contexts; the model can predict on any context all the same. This allows us to use a larger n than was practical for the ngrams models.

In [ ]:
# train on 'a christmas carol' for 20 epochs
lstm_model = LSTM_Model(CHARS, n=7)
epochs = 20
text_train = TEXTS['a_christmas_carol']

# test text is a slice to speed up model evaluation times during training
TEXTS['a_christmas_carol_slice'] = TEXTS['a_christmas_carol'][5000:10000]
text_test = TEXTS['a_christmas_carol_slice']
acc_hist = dict()
entropy_hist = dict()
for e in range(1, epochs+1):
    lstm_model.train(text_train, epochs=1)
    # evaluate model every 4 training epochs
    if not e%4:
        acc_hist[e] = evaluate(lstm_model, text_test)[0]
        entropy_hist[e] = adaptive_entropy(lstm_model, text_test)

Training evaluation

In [975]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(acc_hist, 'LSTM Accuracy', 'Training Epochs', 'Accuracy', ax[0])
plot_dict(entropy_hist, 'LSTM Entropy', 'Training Epochs', 'Entropy', ax[1])

We observe that model accuracy increases over the course of training, while the entropy of the text under the model stays quite low after the first few epochs.

9.3 Test first LSTM model in our Arithmetic encoder

In [987]:
LSTM_models = {lstm_model.n: lstm_model}
run_arithmetic_tests('a_christmas_carol_slice', LSTM_models, dict())
context_length  coding bits / char  entropy / char  code time (s)  decode time (s)  decoded matches original
7               2.331               2.181           60.724         59.331           yes

We have successfully used our LSTM model to encode and decode a slice of A Christmas Carol. With the more powerful model, we've pushed both the entropy and the compression rate down to roughly 2.2-2.3 bits per character. Note that this is a savings of about .4 bits per character over our best earlier compression rate on this text.

One serious problem, however, is that this compression rate savings came at an unacceptable cost in compute time. It took us about a minute to encode and another minute to decode even this small slice of the text. Earlier we could have encoded and decoded a text of this length in a small fraction of a second.

We will continue by evaluating the effect of n on LSTM performance.

9.4 Evaluate the Effect of Context Length on LSTM Performance

We will now train LSTMs for different values of n to understand the effect of context length on accuracy and entropy.

In [ ]:
acc_hist = dict()
entropy_hist = dict()
for n in [1,3,7,10,15]:
    model = LSTM_Model(CHARS, n=n)
    model.train(text_train, epochs=15)
    acc_hist[n] = evaluate(model, text_test)[0]
    entropy_hist[n] = adaptive_entropy(model, text_test)
In [991]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
fig.tight_layout(w_pad=4)
plot_dict(acc_hist, 'LSTM Accuracy', 'Context Length', 'Accuracy', ax[0])
plot_dict(entropy_hist, 'LSTM Entropy', 'Context Length', 'Entropy', ax[1])

We can see that our LSTM architecture is too simple to take advantage of very long contexts. Still, it does better than the ngrams models, which could not exploit context lengths beyond about 3.

We'll take a context length of 7 to be about ideal and train a final LSTM model.

9.5 Train and Evaluate the Final LSTM

In [ ]:
final_lstm_model = LSTM_Model(CHARS, n=7)
# again we'll train on all of the Dickens books
final_lstm_model.train(TEXTS['all_dickens']+EOF, epochs=20)
In [1008]:
# since our LSTM arithmetic encoding is very slow, we'll create slices
# of all texts to use for all tests
s1, s2 = 5000, 10000
for title in list(TEXTS.keys()):
    TEXTS[title + '_slice'] = TEXTS[title][s1:s2]

test_title = 'a_christmas_carol_slice'
col_names = ['test title', 'accuracy', 'relaxed accuracy', 'mean confidence', 'entropy / char']
test_text = TEXTS[test_title]
metrics = evaluate(final_lstm_model, test_text)
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(final_lstm_model, test_text)]])
test title               accuracy  relaxed accuracy  mean confidence  entropy / char
a_christmas_carol_slice  0.6       0.781             0.651            2.349
In [1009]:
test_title = 'the_war_of_the_worlds_slice'
test_text = TEXTS[test_title]
metrics = evaluate(final_lstm_model, test_text)
show_table(col_names, [[test_title] + list(metrics) + [adaptive_entropy(final_lstm_model, test_text)]])
test title                   accuracy  relaxed accuracy  mean confidence  entropy / char
the_war_of_the_worlds_slice  0.511     0.72              0.575            2.735

Recall that our ngrams models had accuracies of around .55 on text from the training set and .45 on text outside it. We also see an improvement of a few tenths of a bit in our entropy rates compared with the ngrams models.

Generate language with final LSTM

In [1146]:
prompt = 'The queen is'
print(generate_text(final_lstm_model, prompt))
Generated language: The queen is a virture, as prisoner. He asked by tells, I faded out the  giv’ now obrowf he hotes under reyeh,” 

Speaking more intuitively, we can see from our LSTM's generated text that the model is far more coherent than our ngrams models were. It can spell fairly well, and it has picked up word lengths and some of the punctuation expected in English.



Chapter 10: Test Adaptive Arithmetic Encoding with LSTM

In [1289]:
LSTM_models = {final_lstm_model.n: final_lstm_model} # just 1 model to test
results_dict = dict()
titles_to_test = ['a_christmas_carol', 'great_expectations', 'a_tale_of_two_cities',
                  'leviathan', 'war_and_peace', 'walden', 'peter_pan', 'the_war_of_the_worlds']

# run tests
for title in titles_to_test:
    run_arithmetic_tests(title + '_slice', LSTM_models, results_dict, show=False)

# show test results
col_names = ['title'] + results_dict['a_christmas_carol_slice']['col_names']
rows = [[title] + list(d['rows'][0]) for title, d in results_dict.items()]
show_table(col_names, rows)
title                        context_length  coding bits / char  entropy / char  code time (s)  decode time (s)  decoded matches original
a_christmas_carol_slice      7               2.341               2.044           69.650         66.018           yes
great_expectations_slice     7               2.336               2.044           80.999         87.385           yes
a_tale_of_two_cities_slice   7               2.549               2.260           79.030         80.824           yes
leviathan_slice              7               4.685               4.378           68.751         69.442           yes
war_and_peace_slice          7               4.221               3.927           40.862         40.101           yes
walden_slice                 7               2.789               2.496           82.314         81.754           yes
peter_pan_slice              7               2.628               2.329           78.206         84.486           yes
the_war_of_the_worlds_slice  7               2.791               2.498           85.033         84.095           yes

All of our encodings and decodings were successful, although again the time to encode and decode even small slices of text is prohibitive with the LSTM model. Compression rates are close to entropy, but not as close as they were earlier; perhaps we should have used a larger r with this final, more powerful model.

Plot compression rates vs model entropies

In [1290]:
fig, ax = plt.subplots(1,1, figsize=(14,8))

code_lengths = dict((row[0], row[2]) for row in rows) 
ax.bar(*zip(*code_lengths.items()), color='tan', width=.35, alpha=.3, label='encoding', align='edge')

entropies = dict((row[0], row[3]) for row in rows) 
ax.bar(*zip(*entropies.items()), color='orange', width=.35, alpha=.3, label='entropy')

plt.xticks(rotation=90)
ax.set_ylabel('bits / char')
ax.set_title('Arithmetic Results for LSTM Model')
ax.legend();



Chapter 11: Huffman vs Arithmetic Encoding

Finally we will compare our best compression rates with Huffman encoding to our best with Arithmetic encoding. Note that this is not just a comparison of Huffman vs Arithmetic, but also takes into account the extra work we put into the modeling for our Arithmetic encoder.

In [1291]:
# get average compression rate for the arithmetic encoder on Dickens texts
code_lengths['all_dickens_slice'] = np.mean([code_lengths['a_christmas_carol_slice'],
                                       code_lengths['great_expectations_slice'],
                                       code_lengths['a_tale_of_two_cities_slice']])
code_lengths_renamed = dict((title[:-6], length) for title,length in code_lengths.items())
for title in ['a_christmas_carol', 'great_expectations', 'a_tale_of_two_cities']:
    del code_lengths_renamed[title]
In [1292]:
# get the best of earlier Huffman compression rates
best_huffman_results = dict()
for title in ADAPTIVE_HUFFMAN_RESULTS.keys():
    rows = ADAPTIVE_HUFFMAN_RESULTS[title]['rows']
    min_length = min(row[1] for row in rows)
    best_huffman_results[title] = min_length
In [1293]:
# plot comparison
fig, ax = plt.subplots(1,1, figsize=(14,8))

ax.bar(*zip(*code_lengths_renamed.items()), color='tan', width=.35, alpha=.3, label='Arithmetic LSTM', align='edge')
ax.bar(*zip(*best_huffman_results.items()), color='orange', width=.35, alpha=.3, label='Huffman Ngrams')

plt.xticks(rotation=90)
ax.set_ylabel('bits / char')
ax.set_title('Arithmetic LSTM vs Huffman Ngrams Encoding Rates')
ax.legend();

We have a mixed bag of results. Through our work with the Arithmetic encoder and the LSTM model we've managed to push the compression rates for Walden, Peter Pan, The War of the Worlds and all of Dickens to the best we've seen yet. However, performance now suffers for Leviathan and War and Peace. Perhaps these titles are simply less similar to the Dickens we trained on than the rest. Note that an LSTM with context length 7 should be more sensitive to this kind of overfitting than an ngrams model with context length 3.



Chapter 12: Conclusion

The goal of this project was not to achieve the very best compression rates or speeds, but rather to implement the main ideas of Adaptive Huffman and Arithmetic encoding and experiment. Throughout I have sought to keep this work self contained and readable by progressively building on ideas and code. It is my hope that anyone with a basic understanding of Huffman and Arithmetic coding will be able to follow my implementations and understand how they work in detail.

We have implemented Huffman and Arithmetic character level encoders for natural language text compression. Our encoders are capable of encoding at less than 3 bits per character. Considering we have nearly 128 characters in our alphabet, it is noteworthy that our encoders can specify characters with just 2 or 3 bits. In fact, this compression rate is better than we get from the standard zip algorithm built into MacOS, which compresses text at about 3-4 bits per character.

Our compression rates can be attributed to both our models and the fact that our encoders approach the entropy of the models. We trained ngrams context models as well as LSTMs, all capable of achieving text entropies considerably below those of a global count based model.

Our Huffman encoder and decoder both ran within a few seconds, even when coding and decoding War and Peace, which is over 3 million characters long. And although our Arithmetic encoder is slower than our Huffman implementation, this is a well-known drawback of Arithmetic encoding. We also note that models for our Arithmetic encoder train much faster than those for Huffman.

There are several improvements we could make to speed up our Arithmetic encoder, namely explicit bit-shift arithmetic and more tolerant rounding. We may also look to ANS encoding as a promising alternative (reference 8). Whereas Arithmetic encoding uses two values to represent an interval, ANS uses a single value to represent the current state, which simplifies the arithmetic required to encode and decode each character.
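
Purely as an illustration (not part of this project's code), the core rANS step below uses the same (start, end) intervals and total r that our models already produce. Streaming/renormalization and the fact that symbols come back out in reverse order are glossed over; the function names are hypothetical.

In [ ]:
# illustrative rANS symbol step: one integer state instead of a (low, high) pair
def rans_encode_symbol(state, interval, r):
    start, end = interval
    freq = end - start
    return (state // freq) * r + start + (state % freq)

def rans_decode_symbol(state, intervals, r):
    slot = state % r
    for idx, (start, end) in enumerate(intervals):   # find the interval containing slot
        if start <= slot < end:
            freq = end - start
            return idx, (state // r) * freq + (slot - start)

# round trip on the toy intervals from the earlier example
intervals, r = [(0, 20), (20, 70), (70, 100)], 100
state = rans_encode_symbol(1 << 16, intervals[2], r)
print(rans_decode_symbol(state, intervals, r))   # (2, 65536)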

One of the key benefits of all our encoding algorithms is that they treat the model as a black box that need only implement one main method returning character distributions. As language models continue to improve rapidly in both accuracy and speed, we'll be able to harness them with the very same code developed here. This decoupling of the encoder from the model is powerful because it lets us take advantage of the vast amount of work being done in NLP.
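
To summarize that interface, here is a hedged sketch of the simplest possible model the encoder could accept: a uniform model that gives every character the same interval width. The attribute names mirror the models above; whether the encoder needs anything beyond chars, char2idx and get_intervals depends on the earlier chapters.

In [ ]:
# minimal illustrative model: uniform intervals over the alphabet
class Uniform_Model:
    def __init__(self, chars):
        self.chars = sorted(list(chars))
        self.char2idx = dict((ch, i) for i, ch in enumerate(self.chars))
        self.n = 1                                     # context length (unused here)

    def get_intervals(self, context):
        intervals = [(i, i + 1) for i in range(len(self.chars))]
        return intervals, len(self.chars)

# hypothetical usage: code = encode(text + EOF, Uniform_Model(CHARS), 64)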