Alexandre Trilla, PhD - Data Scientist

Blog

-- Thoughts on data analysis, software development and innovation management. Comments are welcome

Spelling correction and the death of words

23-Mar-2012

One of the topics covered in this second week of the Natural Language Processing class at Coursera is spelling correction (also covered in the Artificial Intelligence class). It's wonderful to have tools that help us proofread manuscripts, but this comes at the expense of impoverishing our own expressive ability. This newspaper article, which links to the original research conducted by Alexander Petersen, Joel Tenenbaum, Shlomo Havlin and Eugene Stanley, states that spelling correction (not only computerised, but also the human kind practised in the editorial industry) homogenises language, and that this eventually shrinks the lexicon (old words die at a faster rate than new words are created). So, is this fancy NLP topic actually hurting NLP? What a headache...

Anyway, I find the field of spelling correction very appealing because it shows a direct link with speech (i.e., spoken language) through the use of a phonetic criterion in the spelling error model. This points to the Metaphone algorithm, which generates the same key for similar-sounding words. Metaphone is reported to be more accurate than Soundex because it encodes the basic rules of English pronunciation. As far as spelling correction goes, Metaphone is used in GNU Aspell and, to my surprise, it's already built into the latest versions of PHP! Along with the edit distance topic covered in the first week, this should make a nice new addition (e.g., a phonetic similarity module) to the NLP toolkit I'm beginning to work on!
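
To make the idea a bit more concrete, here is a rough Python sketch of what such a phonetic similarity module could look like. Note that it uses Soundex rather than Metaphone, simply because Soundex fits in a few lines (Metaphone's rule set is much longer), and it pairs it with a plain Levenshtein edit distance. The function names (soundex, edit_distance, suggest) and the tiny lexicon are made up for illustration; this is not how Aspell or PHP implement it.

```python
# Letter groups of the classic (American) Soundex code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (key + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions cost 1)."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def suggest(word, lexicon):
    """Words in the lexicon with the same phonetic key, closest first."""
    key = soundex(word)
    matches = [w for w in lexicon if soundex(w) == key]
    return sorted(matches, key=lambda w: edit_distance(word, w))


# Hypothetical toy lexicon, just to exercise the functions.
lexicon = ["robert", "rupert", "rubin", "kitten"]
print(soundex("Robert"), soundex("Rupert"))  # both map to the same key
print(suggest("robbert", lexicon))
```

The phonetic key narrows the candidates down to words that sound alike, and the edit distance then ranks them by how close the spelling is. Swapping Soundex for a proper Metaphone implementation would only mean replacing the key function.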