Alexandre Trilla, PhD - Data Scientist
 

Blog

-- Thoughts on data analysis, software development and innovation management. Comments are welcome.


Post 53

Natural Language Generator (NLG) released

01-Nov-2011

I recall with much interest the tutorial on Sequence Models that Noah Smith gave at LxMLS. Motivated and inspired by his explanation, today I release a Natural Language Generator (NLG) based on n-gram Language Models (see the CODE section).

Under the generative modelling paradigm, the history-based model P(w_n|w_1,w_2,...,w_{n-1}) (see further analysis in the NLG docs) may be graphically represented by a finite-state machine/automaton such as the Co-Occurrence Network (Mihalcea and Radev, 2011) that appears in Alias et al. (2008); see the figure below. I find it self-explanatory (a picture is worth a thousand words; it is funny to put it this way from a language processing perspective). Although this network is limited to a first-order history (the NLG generalises its topology to higher orders), the tutorial did not show such a clear representation of the model.

[Figure: Co-Occurrence Network representation of the history-based language model, after Alias et al. (2008)]
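As an illustration of this structure (a minimal sketch of my own, not the released NLG code), a first-order co-occurrence network can be stored as an adjacency map whose weighted edges count how often one word follows another:

    // Minimal sketch (illustrative only): a first-order co-occurrence
    // network stored as an adjacency map with transition counts.
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Count how often each word follows each other word in the text.
    std::map<std::string, std::map<std::string, int>>
    buildCooccurrenceNetwork(const std::string& text) {
        std::map<std::string, std::map<std::string, int>> network;
        std::istringstream stream(text);
        std::string prev, word;
        while (stream >> word) {
            if (!prev.empty()) {
                ++network[prev][word];  // edge prev -> word, weighted by count
            }
            prev = word;
        }
        return network;
    }

    int main() {
        const std::string text = "the horseman rode and the horseman vanished";
        auto network = buildCooccurrenceNetwork(text);
        for (const auto& node : network) {
            for (const auto& edge : node.second) {
                std::cout << node.first << " -> " << edge.first
                          << " (" << edge.second << ")\n";
            }
        }
        return 0;
    }

Generating text then amounts to a random walk over this graph, picking each outgoing edge in proportion to its count.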

I especially enjoyed the complexity issues that were raised regarding the length of the considered history (i.e., the order of the Markov model): from the Bag-Of-Words (BOW) model, with few parameters and strong independence assumptions, to the infinite-history-based model, with a rich expressive power to represent language. Prof. Smith conducted an example experiment where he used a corpus of 2.8M words of American political blog text to show how this expressive power affects the learning and generation of natural language. First, he showed how a unigram model (i.e., a BOW) could not produce anything that made any sense. Second, he showed how a bigram model could only produce a few phrases with sense. The experiment went on up to a 100-gram language model, which just copied text straight from the training instances. Imagine how the aforementioned graphical network would look in these scenarios, from a fully connected network of words (i.e., unigrams) to a mesh of higher-order grams. He ended by noting that in the past few years, "webscale" n-gram models have become very popular because they are very hard to beat.
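To make the role of the model order explicit, the following minimal sketch (my own illustration, not Prof. Smith's experiment nor the released NLG code) estimates P(w_n|w_1,...,w_{n-1}) from raw counts over a toy corpus and generates text by sampling each next word given the last n-1 words; the order n is a parameter, so the same code spans the unigram-to-higher-order spectrum discussed above:

    // Minimal sketch (illustrative only): order-n generation by sampling
    // the next word conditioned on the last n-1 words, with probabilities
    // estimated from raw counts.
    #include <iostream>
    #include <map>
    #include <random>
    #include <sstream>
    #include <string>
    #include <vector>

    using History = std::vector<std::string>;
    using Model = std::map<History, std::map<std::string, int>>;

    // Count the continuations of every (n-1)-word history in the training text.
    Model train(const std::vector<std::string>& tokens, std::size_t n) {
        Model model;
        for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
            History history(tokens.begin() + i, tokens.begin() + i + n - 1);
            ++model[history][tokens[i + n - 1]];
        }
        return model;
    }

    // Extend the seed history by sampling continuations proportionally to the counts.
    std::vector<std::string> generate(const Model& model, History history,
                                      std::size_t length, std::mt19937& rng) {
        std::vector<std::string> output(history.begin(), history.end());
        for (std::size_t i = 0; i < length; ++i) {
            auto it = model.find(history);
            if (it == model.end()) break;  // unseen history: stop generating
            std::vector<std::string> words;
            std::vector<int> counts;
            for (const auto& entry : it->second) {
                words.push_back(entry.first);
                counts.push_back(entry.second);
            }
            std::discrete_distribution<std::size_t> pick(counts.begin(), counts.end());
            const std::string next = words[pick(rng)];
            output.push_back(next);
            history.erase(history.begin());  // slide the history window
            history.push_back(next);
        }
        return output;
    }

    int main() {
        std::istringstream corpus("ichabod rode on and ichabod rode away and ichabod vanished");
        std::vector<std::string> tokens;
        for (std::string w; corpus >> w; ) tokens.push_back(w);

        std::mt19937 rng(42);
        const std::size_t n = 3;  // trigram model
        Model model = train(tokens, n);
        for (const auto& w : generate(model, {"ichabod", "rode"}, 8, rng)) {
            std::cout << w << ' ';
        }
        std::cout << '\n';
        return 0;
    }

Raising n makes each history more specific, so fewer continuations are observed and the output drifts towards verbatim copies of the training text, which is exactly the behaviour of the 100-gram model above.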

In this post I reproduce the experiment with the NLG using "The Legend Of Sleepy Hollow", by Washington Irving, thanks to the e-book provided by Project Gutenberg (I could not find a more appropriate book after Halloween). What follows are some of the generated outputs:

Unigram model
"Ichabod who still purses his patched that crossed for it screech prisoner of seems beset of so Far on and into and sometimes a his roasting with in dead fly so as Dutch in To Tassel"

Bigram model
"But it were grunting in former times of the battle in the quietest places which last was found favor in with snow which he heard in a knot of doors of the fence Ichabod stole forth now came to look behind"

Trigram model
"As Ichabod jogged slowly on his haunches and unskilful rider that he was according to the lot of a snowy night With what wistful look did he shrink with curdling awe at the mention of the screech owl"

5-gram model
"About two hundred yards from the tree a small brook crossed the road and ran into a marshy and thickly wooded glen known by the name of the Headless Horseman of Sleepy Hollow"

As can be observed, as the model gains expressive power through its increased order, it generates higher-quality natural language.

Finally, as is customary, the source code organisation of the NLG follows common FLOSS conventions (src and doc folders, README, HACKING, COPYING, etc.). Its only external dependency is the Boost Iostreams Library for tokenising text, and it uses the premake build script generation tool. I hope you enjoy it.

--
[Mihalcea and Radev, 2011] R. Mihalcea and D. Radev, "Graph-based Natural Language Processing and Information Retrieval", New York, NY, USA: Cambridge University Press, 2011.
[Alias et al., 2008] F. Alias, X. Sevillano, J. C. Socoro and X. Gonzalvo, "Towards High-Quality Next-Generation Text-to-Speech Synthesis: A Multidomain Approach by Automatic Domain Classification", IEEE TASLP, vol. 16, no. 7, pp. 1340-1354, Sep. 2008.


