keyan pishdadian

"The man of simple faith who loves learning for his burden is love"

01 23 2015

That title was generated by my Markov chain text generator that I just finished working on. Even though it's stochastically generated text, I feel like it is speaking to me. That might sound strange, that's because it is.

This was a side project I did after a fellow Recurser and friend, Amanda, talked about a twitter bot she wrote that used Markov chains to make tweets. I have always been interested in working on an NLP project and decided this would be perfect to do in-between other things this week.

For the uninitiated, Markov chains are mathematical systems that transition from one state to another. Each transition depends only on the current state and not on the transitions that bring the system to the present state. We can design Markov chain models of a system to describe the likelihood of any particular transition occurring. For instance, constructing a Markov chain model for a sentence would allow us to know the probability that a certain word would follow any other word in that sentence. For example in the sentence:

The lazy boy poked the lazy dog

We know that if our current state is "lazy" then the probabilities of the two possible future states are 50% "boy" and 50% "dog". In this project I used this idea to process a large input text (my favorite is The Sayings of Confucius) and use the weighted probabilities of individual state changes to stochastically generate text which (hopefully) looks and sounds similar to the input.

I did this by creating a dictionary which kept a count of the instances that certain individual words followed each two word phrase in the input. I used one large dictionary where each key was a two word phrase, and each value was a defaultdict. The code for this is below:

if current_token in markov_dictionary:
    self.update_dictionary(markov_dictionary,
                           tokens_list,
                           current_token,
                           next_index)
else:
    markov_dictionary[current_token] = defaultdict(int)
    self.update_dictionary(markov_dictionary,
                           tokens_list,
                           current_token,
                           next_index)

After this I just used the word counts to calculate a probability (0.0 - 1.0) that a word would follow the current state and used this to make a weighted randomized choice:

if current_token in markov_dictionary:
denominator = sum(value for key, value in current_word)
weights = [float(value) / denominator for key, value in current_word]
words = [key for key, value in current_word]
word = list(numpy.random.choice(words,
                                size=1,
                                replace=False,
                                p=weights))

The result of all this is text that is oftentimes gibberish, but sometimes strangely poetic:

In awe he is stable a man with enquiries to another state

Come from afar do we not rejoice to live unknown

They are unprincipled stern men of arts and learning delights

His face changed when he was asked what is meant by kindness without waste

The master said when a man of endless craving who never tires of his laughter he only takes when he has lost the way

The rest of the code is available here. I'm hoping to work more on my BitTorrent client this week, as I spent most of this past week fixing bugs in bpython. I'll probably talk more about that some other time.