Blog
-- Thoughts on data analysis, software
development and innovation management. Comments are welcome
Post 35
Sentiment analysis with NLTK
13-Aug-2010
At the beginning of the month, Streamhacker set up a demo of sentiment
analysis using the Natural Language Toolkit (NLTK), a powerful set of
open source Python modules for research and development in natural
language processing and text analytics.
It is interesting to compare its features with EmoLib and see how
different technologies tackle the same problem.
For instance, Streamhacker's system is trained on movie reviews and
identifies positive and negative sentiment, while EmoLib is trained
on news headlines and provides positive, negative and neutral sentiment
tags, because speech-synthesis-oriented applications need the neutral
state. Regarding their innards, Streamhacker's system
uses high-information words and collocations to train a
Naive Bayes classifier and a Maximum Entropy classifier, while EmoLib
represents the emotional words of a given text in a circumplex
(a dimensional space of emotion) and yields a sentiment label according
to a nearest-centroid criterion (the sentiment category whose centroid
lies closest to the text).
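
To make the nearest-centroid idea more concrete, here is a minimal
sketch in plain Python. It is not EmoLib's code: the two-dimensional
(valence, activation) coordinates, the word list and the category
centroids are all made-up values for illustration, and averaging the
word coordinates is just one plausible way to place a text in the
circumplex.

```python
import math

# Hypothetical (valence, activation) coordinates for a few emotional words.
# These numbers are invented for illustration; they are not EmoLib's lexicon.
WORD_COORDS = {
    "happy": (0.8, 0.5),
    "win": (0.7, 0.6),
    "sad": (-0.7, -0.4),
    "crash": (-0.8, 0.6),
    "report": (0.0, 0.0),
}

# Hypothetical centroids of the three sentiment categories.
CENTROIDS = {
    "positive": (0.7, 0.4),
    "negative": (-0.7, 0.2),
    "neutral": (0.0, 0.0),
}


def text_point(text):
    """Average the coordinates of the known emotional words in the text."""
    points = [WORD_COORDS[w] for w in text.lower().split() if w in WORD_COORDS]
    if not points:
        return (0.0, 0.0)  # no emotional words found: fall back to the origin
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def classify(text):
    """Return the label whose centroid is closest (Euclidean) to the text."""
    x, y = text_point(text)
    return min(
        CENTROIDS,
        key=lambda label: math.hypot(x - CENTROIDS[label][0],
                                     y - CENTROIDS[label][1]),
    )


print(classify("Happy fans celebrate win"))
print(classify("Markets crash"))
print(classify("The quarterly report"))
```

With these toy values, a headline with no emotional words lands on the
origin and so gets the neutral tag, which is the behaviour a speech
synthesis front end would want.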
Overall, the two systems work similarly. A deeper performance analysis
would be needed to draw further conclusions. Nevertheless, in my opinion
and according to the results I obtained for my dissertation,
systems that work directly with textual features are expected to
perform better than systems that work with emotion dimensions,
at the expense of somewhat poorer
generalisation capabilities (they are bound to their training
text domains, hence the interest in a more general method like
emotion dimensions). In this sense, on the datasets I used
for the experiments, MaxEnt methods and Vector Space Model approaches
performed the best.
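
As an illustration of what "working directly with textual features"
means in practice, here is a small bag-of-words Naive Bayes sketch in
plain Python. It is an assumption-laden toy, not Streamhacker's system:
it does not use NLTK, high-information word selection or collocations,
and the four training sentences are invented. It also hints at the
domain problem mentioned above, since words never seen in training are
simply ignored.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data in the spirit of movie reviews.
TOY_REVIEWS = [
    ("a great and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot and dull acting", "neg"),
]


def train(labelled_texts):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words unseen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_REVIEWS)
print(classify("great story", model))
print(classify("dull plot", model))
```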