Bayes’ theorem (though developed primarily by Pierre-Simon Laplace) is the theoretical underpinning of many generative statistical learning models used in industry.
It states that you can reverse the conditional in probability statements through the equality:
P(A | B) = P(B | A) * P(A) / P(B).
Why is this useful? Often probabilities are difficult to evaluate in one direction, but trivial in the other. For example: imagine you wanted to know the probability that it’s raining given that it’s cloudy, or in math “P(raining | cloudy)”. This isn’t trivial. But the other direction is: P(cloudy | raining) = 1.
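To make the reversal concrete, here is a tiny sketch of the rain example in Python. The base rates are made up purely for illustration.

```python
# Toy illustration of reversing a conditional with Bayes' theorem.
# The base rates below are assumptions, chosen only for the example.
p_cloudy_given_raining = 1.0   # if it's raining, it's cloudy
p_raining = 0.1                # assumed base rate of rain
p_cloudy = 0.4                 # assumed base rate of cloudiness

p_raining_given_cloudy = p_cloudy_given_raining * p_raining / p_cloudy
print(p_raining_given_cloudy)  # 0.25
```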
I was inspired* to write an app that looks as though it does something sentient with little more than a few lines of python and a large amount of data. I chose to build a language classifier. Effectively, I wanted to compute: P(english | text), P(french | text), etc, and then declare the language with the highest probability the winner.
On the surface, this is a hard problem.
Judicious application of Bayes trivializes it:
P(english | text) = P(text | english) * P(english) / P(text)
P(french | text) = P(text | french) * P(french) / P(text)
OK, great. But what on earth does P(text | french) even mean? One interpretation, called the generative model, is to assume you have some statistical model of french. The question you are answering is “what is the probability this statistical model would generate the text?” More concretely, imagine you have a bucket full of french words in the proportion that they occur in the wild. You repeatedly reach into the bucket and extract a word. Each time you do this, you write the word down on your pad of paper and return it to the bucket. After pulling out as many words as you have text, what’s the probability that the words on your pad match the text?
This bucket analogy assumes that the language model is stateless or “naive”. In other words, unlike in real language, the probability of each word drawn from the bucket is independent of the word that came before it. As it turns out, with enough data, this usually works reasonably (surprisingly?) well: P(text | french) = P(term_1 | french bucket) * P(term_2 | french bucket) * … * P(term_n | french bucket). Modelling some of the word dependencies is possible by adding n-grams (pairs, triplets, etc. of words) to the bucket as well. Care must be taken with n-grams: they tend to explode the size of your data. (e.g. if there are m terms, there are mP2 = m*(m-1) possible bigrams! 3-grams and higher get even worse!)
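Here is a minimal sketch of that naive scoring, assuming a bucket is just a dict mapping terms (unigrams and bigrams) to their probabilities. The function names are mine, not the app’s actual code.

```python
def ngrams(words, n):
    """All contiguous n-grams from a list of words, joined with spaces."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def naive_likelihood(text, bucket):
    """P(text | bucket) under the naive assumption: a product of
    independent per-term probabilities over unigrams and bigrams."""
    words = text.lower().split()
    terms = ngrams(words, 1) + ngrams(words, 2)
    p = 1.0
    for term in terms:
        p *= bucket.get(term, 0.0)  # unseen terms zero the score; see smoothing below
    return p
```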
So that takes care of P(text | french). What about P(french) and P(text)? Well, P(text) is easy: we don’t care. Why not? The order of fractions with identical denominators is the order of their numerators, and all we care about is the ordering of the probabilities. As for P(french), we can either assume the probability of each language is identical (so we can just ignore the term for similar reasons to P(text)) or we can use the distribution of texts used to generate the language models.
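Putting those pieces together, the classifier only needs to compare numerators. This sketch reuses naive_likelihood from above and assumes buckets maps language names to term-probability dicts and priors holds P(language); with equal priors it reduces to comparing the likelihoods directly.

```python
def classify(text, buckets, priors):
    """Return the language whose numerator P(text | lang) * P(lang) is
    largest; P(text) is the same for every language, so it is ignored."""
    scores = {lang: naive_likelihood(text, bucket) * priors.get(lang, 1.0)
              for lang, bucket in buckets.items()}
    return max(scores, key=scores.get)
```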
Great. So let’s say we wanted to implement this using 1- and 2-grams for our language model. Now all we need are buckets for each language. We’re looking for large amounts of data in various languages. The internet certainly has that. However, we need to know which bucket to put the words in. One source that comes to mind is Wikipedia. They have en.wikipedia.org, fr.wikipedia.org, etc. The vast majority of text on these hosts is in the host language. Downloading a large sampling of data from all these hosts, segmenting and counting term and bigram frequencies, we end up with the electronic equivalent of the bucket for each language. It turns out that even with a minuscule subset of text from Wikipedia (crawled with scrapy, an open source web crawler), this data gets ridiculously large. Even after removing everything from the bucket that doesn’t occur at least 100 times, the data is still approximately 300 megs in CSV format.
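A sketch of the bucket-building step, assuming the crawled pages are already available as plain-text strings (the crawling itself was done with scrapy). It reuses the ngrams helper from the earlier sketch and drops terms seen fewer than 100 times, as described.

```python
from collections import Counter

def build_bucket(pages, min_count=100):
    """Count unigram and bigram frequencies across crawled pages and turn
    them into probabilities, dropping rare terms to keep the data small."""
    counts = Counter()
    for page in pages:
        words = page.lower().split()
        counts.update(ngrams(words, 1))
        counts.update(ngrams(words, 2))
    counts = {t: c for t, c in counts.items() if c >= min_count}
    total = float(sum(counts.values()))
    return {t: c / total for t, c in counts.items()}
```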
I planned to run this on Google AppEngine. Importing that much data into the BigTable backend is possible, but on the free CPU quotas it would have taken far too long. Instead, I sharded the records across 5000 or so data files, using a hash of each term to decide which file its record goes into so it can be retrieved efficiently. Unfortunately, at the free quota levels, AppEngine has fairly tight limits on app size (150 MB). This meant I needed to reduce the amount of language data by half. I didn’t want to throw out even more data (though I’d have to experiment to see how much this would actually have affected accuracy). What I ended up doing was zipping the shards. This gave an approximately 40% space savings. Still not enough. Reducing the number of shards to 750, so that zip could take advantage of more redundancy across the data, increased the space savings to 47%. Sorting the records within each shard increased the redundancy enough to get a 58% space reduction and the program up on App Engine.
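The sharding idea, roughly: hash each term to a shard number so a lookup only ever touches one file, sort within each shard, and compress. This sketch uses gzip and illustrative file names as stand-ins for the zipped CSV shards described above.

```python
import csv
import gzip
import zlib

NUM_SHARDS = 750

def shard_for(term):
    """A stable hash of the term picks the shard that holds its record,
    so retrieving a term only requires opening a single file."""
    return zlib.crc32(term.encode('utf-8')) % NUM_SHARDS

def write_shards(bucket, lang):
    """Spread (term, language, probability) records across shards, sorting
    within each shard so the compressor can exploit the redundancy."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for term, prob in bucket.items():
        shards[shard_for(term)].append((term, lang, prob))
    for i, records in enumerate(shards):
        with gzip.open('shard_%04d.csv.gz' % i, 'wt') as f:
            csv.writer(f).writerows(sorted(records))
```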
Needless to say, the performance of this initial version was terrible. To look up the probability of each word of user-entered text under the various language models, the code might need to unzip yet another of the 750 shards to extract that word’s records. To combat this sluggishness, three levels of cache were added in front of this horribly inefficient structure: an in-RAM cache within the app, AppEngine’s memcache API, and BigTable. It is still slow, but is now within the realm of usable.
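The layered lookup might look something like the following, with plain dicts standing in for AppEngine’s memcache API and BigTable just to show the order the layers are consulted in. This is a simplification, not the app’s actual code.

```python
class LayeredLookup(object):
    """Check progressively slower caches before unzipping a shard."""
    def __init__(self, load_from_shard):
        self.ram = {}          # in-RAM cache within the app
        self.memcache = {}     # stand-in for the memcache API
        self.datastore = {}    # stand-in for BigTable
        self.load_from_shard = load_from_shard

    def get(self, term):
        for layer in (self.ram, self.memcache, self.datastore):
            if term in layer:
                return layer[term]
        value = self.load_from_shard(term)   # slow path: unzip a shard
        self.ram[term] = self.memcache[term] = self.datastore[term] = value
        return value
```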
While we’re talking about practical matters, two issues that naturally arise are worth discussing.
First, the probabilities are usually tiny, even for the correct language model (what’s the chance you pull “the rain in Spain” even out of the English bucket? Extremely small). Since even a double can only represent numbers as small as 10^-308 (and even then, very inaccurately), it is often impossible to perform this math accurately enough using the blind definitions above. The textbook solution, which works exceptionally well, is to take the logarithm of each probability and add them up. (Note that log(ab) = log(a) + log(b).)
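In code, the fix is just to sum logs instead of multiplying raw probabilities. This sketch skips missing terms entirely, which the next paragraph deals with.

```python
import math

def log_likelihood(terms, bucket):
    """log P(text | bucket): summing log-probabilities avoids underflowing
    a double when the raw product would be far smaller than 10**-308."""
    return sum(math.log(bucket[t]) for t in terms if t in bucket)
```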
Secondly, there can be words that do not appear in some language models. In a strict implementation of the approach described above, a single word in the text that comes from a foreign language, is made up, or is simply a typo would drive the probability of that language to exactly zero, regardless of how much other evidence supports it. For example, the text “The french usually use the phrase ‘pommes de terre’ when speaking of potatoes.” is English, but the absence of those French words from the English model would result in the generative model giving English a score of zero – the probability of you pulling ‘terre’ from the English bucket is zero. This can also happen if, as in the implementation discussed here, part of the model is thrown away for size reasons. To correct for this, we admit that our generative model might not be authoritative and assign a non-zero probability to words missing from the language model. In this case, the assigned probability is low enough that it contributes less to a score than the lowest score a word in a language model can provide (i.e. 100 / # terms in the largest model). This ensures that these words reduce the probability of languages that are missing them more than they reduce the probability of languages where they exist but occur rarely.
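One reading of that floor in code; the exact constant is a judgment call, so treat this as illustrative rather than the app’s actual choice.

```python
import math

def oov_floor(buckets):
    """A probability just below anything a retained term can contribute."""
    smallest = min(p for bucket in buckets.values() for p in bucket.values())
    return smallest / 10.0

def term_log_prob(term, bucket, floor):
    """Unknown terms get the floor instead of zero, so one out-of-vocabulary
    word (like 'terre' in English text) can't zero out the whole language."""
    return math.log(bucket.get(term, floor))
```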
Laplace smoothing is also a very popular way of performing this adjustment. In this case, rather than counting raw term frequencies, you count the number of documents that contain each term. To “smooth”, you add a single “fake” document to your collection that includes every possible term. Since that would be an infinitely long document, you simply doctor the math to have the same net effect. This is really just the intuition behind some of the monkeying around described in the previous paragraph. Peter Norvig, in his discussion of word segmenters in Chapter 14 of Beautiful Data, describes a slightly more complex model for out-of-vocabulary (OOV) words. You could likely come up with something similar for this problem (though blind application of his approach wouldn’t be suitable), perhaps by modelling character n-grams within the words of a language and falling back to that classifier for such OOV terms.
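A minimal sketch of the document-frequency version of Laplace smoothing described here: the imaginary document containing every term bumps every term’s document count by one, and the total document count by one as well.

```python
def laplace_prob(term, doc_freq, num_docs):
    """P(term) with add-one smoothing over document frequencies: pretend one
    extra 'fake' document exists that contains every possible term."""
    return (doc_freq.get(term, 0) + 1.0) / (num_docs + 1.0)
```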
With that all said and done, here it is.
*This post was inspired by Peter Norvig, who has several superb posts similar to this one on spell checking and segmentation. His website is full of very interesting things. You should go read it. A great book on the applications of Bayes over the years is The Theory That Wouldn’t Die, whose author I have had the pleasure of hearing speak.