Alexandre Trilla, PhD - Research Engineer

Blog

-- Thoughts on data analysis, software development and innovation management. Comments are welcome

Sentiment analysis with NLTK

13-Aug-2010

At the beginning of the month, Streamhacker set up a demo of sentiment analysis using the Natural Language Toolkit (NLTK), a powerful set of open-source Python modules for research and development in natural language processing and text analytics.

It is interesting to compare its features with EmoLib and see how different technologies tackle the same problem. For instance, Streamhacker's system is trained on movie reviews and identifies positive and negative sentiment, while EmoLib is trained on news headlines and provides positive, negative and neutral sentiment tags, since speech-synthesis-oriented applications need the neutral state. Regarding their innards, Streamhacker's system uses high-information words and collocations as features to train a Naive Bayes classifier and a Maximum Entropy classifier. EmoLib, in turn, represents the emotional words of a given text in a circumplex (a dimensional space of emotion) and yields a sentiment label according to a nearest centroid criterion, i.e., the label of the sentiment category whose centroid lies closest to the text.
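To make the nearest centroid criterion more concrete, here is a minimal Python sketch. It is not EmoLib's actual code: the two-dimensional (valence, activation) circumplex, the centroid coordinates, the toy lexicon and the averaging of word coordinates are all assumptions made up for illustration.

```python
import math

# Hypothetical category centroids in a 2-D circumplex (valence, activation).
# These coordinates are invented for illustration, not taken from EmoLib.
CENTROIDS = {
    "positive": (0.7, 0.3),
    "neutral": (0.0, 0.0),
    "negative": (-0.7, 0.3),
}

# Hypothetical toy lexicon mapping emotional words to circumplex coordinates.
LEXICON = {
    "win": (0.8, 0.5),
    "happy": (0.9, 0.4),
    "crash": (-0.8, 0.6),
    "fear": (-0.7, 0.7),
}


def text_to_point(text):
    """Average the coordinates of the emotional words found in the text.

    Texts without any known emotional word map to the origin (assumption).
    """
    points = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not points:
        return (0.0, 0.0)
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nearest_centroid(point):
    """Return the label whose centroid has the smallest Euclidean distance."""
    def distance(label):
        cx, cy = CENTROIDS[label]
        return math.hypot(point[0] - cx, point[1] - cy)
    return min(CENTROIDS, key=distance)


def classify(text):
    """Map a text to a sentiment label via its circumplex representation."""
    return nearest_centroid(text_to_point(text))


if __name__ == "__main__":
    for headline in ("Markets crash amid fear",
                     "Team happy after win",
                     "Council meets on Tuesday"):
        print(headline, "->", classify(headline))
```

Note that in this sketch a headline without any emotional word falls on the origin and is therefore tagged as neutral, which is one simple way to obtain the neutral state mentioned above.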

Overall, the two systems work similarly. A deeper performance analysis would be necessary to draw further conclusions. Nevertheless, IMO and according to the results I obtained for my dissertation, systems working directly with textual features are expected to perform better than systems working with emotion dimensions, at the expense of somewhat poorer generalisation capabilities: they are bound to the domains of their training texts, hence the interest in a more general method like emotion dimensions. In this sense, on the datasets I used for the experiments, MaxEnt methods and Vector Space Model approaches performed the best.