java.lang.Object
  emolib.util.proc.TextDataProcessor
    emolib.stemmer.Stemmer
public abstract class Stemmer
The Stemmer abstract class defines the general structure for performing the stemming process, which eliminates the inflexion of words.
The idea of stemming is to improve Information Retrieval (IR) performance by bringing under one heading the variant forms of a word that share a common meaning.
Stemming is feasible for the Indo-European languages because they share common patterns of word structure. For languages that are more highly inflected than English (and most of them are), stemming yields greater improvements. Thus, for a Romance language like Spanish, the results are expected to be reasonably good.
Assuming that words are written left to right, the stem, or root, of a word is on the left, and zero or more suffixes may be added on the right. If the root is modified by this process, the change normally occurs at its right-hand end. Prefixes may also be added on the left, but they usually alter the meaning of the word radically, so they are best left in place. Suffixes, by contrast, can be removed in certain circumstances, which makes suffix stripping a practical aid in IR.
Here, stem and root are used interchangeably, although there is a finer distinction between them: the stem is the residue of the stemming process, while the root is the inner word from which the stem derives. This Stemmer class description does not go to that level of detail.
There is a lot more to be said about the stemming process: stemming errors, the use of dictionaries, stop words, irregularities, etc. For an extensive description of the stemming process, refer to Dr. Porter's article "Snowball: A language for stemming algorithms".
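The suffix-stripping idea described above can be illustrated with a minimal, self-contained sketch. It is not the emolib implementation and does not use the Stemmer class: the class name `SuffixStripSketch`, the suffix list and the minimum stem length are all assumptions chosen for illustration. A real stemmer, such as those written in Snowball, applies far more elaborate rules.

```java
import java.util.List;
import java.util.Locale;

/** Naive longest-match suffix stripper. Illustrative only, not emolib code. */
final class SuffixStripSketch {

    // Hypothetical suffix list, ordered longest first so the longest match wins.
    private static final List<String> SUFFIXES =
            List.of("amente", "mente", "iendo", "ando", "es", "s");

    // Assumed lower bound on the stem, so short words are not stripped away.
    private static final int MIN_STEM_LENGTH = 3;

    /** Removes at most one suffix from the right-hand end; prefixes are never touched. */
    public static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        return lower; // no suffix matched, or the remaining stem would be too short
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cantando", "libros", "claramente", "los"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

With this suffix list, "cantando" and "libros" are reduced to "cant" and "libro", while "los" is left untouched because stripping the final "s" would leave a stem shorter than the assumed minimum.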
| Constructor Summary | |
|---|---|
| Stemmer() | Main constructor of the Stemmer. |
| Method Summary | | |
|---|---|---|
| abstract void | applyStemming(TextData inputTextDataObject) | Method to perform the stemming process. |
| Data | getData() | Obtains the TextData from the previous module, processes it and makes it available to the rest of the text processing chain. |
| void | initialize() | Method to initialize the Stemmer. |
| void | newProperties(PropertySheet ps) | This method is called when this configurable component has new data. |
| void | register(java.lang.String name, Registry registry) | Register my properties. |
| Methods inherited from class emolib.util.proc.TextDataProcessor | 
|---|
| flush, getName, getPredecessor, setPredecessor, toString | 
| Methods inherited from class java.lang.Object | 
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait | 
| Constructor Detail | 
|---|
public Stemmer()
| Method Detail | 
|---|
public void register(java.lang.String name,
                     Registry registry)
              throws PropertyException
Description copied from interface: Configurable
Specified by: register in interface Configurable
Overrides: register in class TextDataProcessor
Parameters:
name - the name of the component
registry - the registry for this component
Throws:
PropertyException
public void newProperties(PropertySheet ps)
                   throws PropertyException
Description copied from interface: Configurable
Specified by: newProperties in interface Configurable
Overrides: newProperties in class TextDataProcessor
Parameters:
ps - a property sheet holding the new data
Throws:
PropertyException - if there is a problem with the properties.
public Data getData()
             throws DataProcessingException
Specified by: getData in interface DataProcessor
Overrides: getData in class TextDataProcessor
Throws:
DataProcessingException - If there is a processing error.
public void initialize()
Specified by: initialize in interface DataProcessor
Overrides: initialize in class TextDataProcessor
public abstract void applyStemming(TextData inputTextDataObject)
Parameters:
inputTextDataObject - The TextData object to process.
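The split between a concrete getData() and an abstract applyStemming() suggests a template-method arrangement: the base class pulls data from its predecessor and delegates the language-specific work to a subclass. The sketch below shows that shape using stand-in types only. `WordBatch`, `Stage`, `StemmerStage` and `PluralStripper` are hypothetical names, not emolib classes, and the actual body of Stemmer.getData() is not documented on this page, so this is one plausible reading rather than the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class StemmerChainSketch {

    /** Hypothetical stand-in for TextData: a mutable list of words. */
    static final class WordBatch {
        final List<String> words;
        WordBatch(List<String> words) { this.words = new ArrayList<>(words); }
    }

    /** Hypothetical stand-in for a module in the text processing chain. */
    interface Stage {
        WordBatch getData();
    }

    /** Mirrors the assumed shape of Stemmer: getData() is fixed, applyStemming() is the hook. */
    abstract static class StemmerStage implements Stage {
        private final Stage predecessor;
        StemmerStage(Stage predecessor) { this.predecessor = predecessor; }

        @Override
        public final WordBatch getData() {
            WordBatch batch = predecessor.getData(); // obtain data from the previous module
            if (batch != null) {
                applyStemming(batch);                // language-specific step
            }
            return batch;                            // make it available to the next module
        }

        abstract void applyStemming(WordBatch batch);
    }

    /** Toy subclass: strips a trailing "s" from words longer than three letters. */
    static final class PluralStripper extends StemmerStage {
        PluralStripper(Stage predecessor) { super(predecessor); }

        @Override
        void applyStemming(WordBatch batch) {
            batch.words.replaceAll(w -> w.length() > 3 && w.endsWith("s")
                    ? w.substring(0, w.length() - 1) : w);
        }
    }

    /** Runs a two-stage chain: a source that emits the input, then the toy stemmer. */
    public static List<String> run(List<String> input) {
        Stage source = () -> new WordBatch(input);
        return new PluralStripper(source).getData().words;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("casas", "perros", "los")));
    }
}
```

Under this reading, a concrete stemmer for a given language would only need to override applyStemming, while the chaining logic stays in the base class.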