java.lang.Object
  emolib.util.proc.TextDataProcessor
      emolib.tokenizer.Tokenizer
public abstract class Tokenizer
The Tokenizer abstract class defines the general structure to perform the tokenisation process, which splits a text string into individual units, called tokens.
The Tokenizer provides structured data retrieval from the incoming text: the words being analysed (lexical units) and their corresponding word classes (lexical categories). The Tokenizer is therefore the text data feeder of the processing pipeline, also known as the INPUTTER, which is why it implements the "inputData" method.
These tokens are expressed in regular patterns as established by the grammar of the language.
The child classes that inherit from this abstract parent class are generated by JavaCC, the parser/scanner generator for Java. This tool eases the production of a grammar through the specification of its lexicon (words/tokens) and its syntax, and then creates the corresponding Java class, i.e., the lexical analyser, which detects matches to this grammar.
This implementation achieves two important goals:
First, this open definition of the tokenisation process (parsing the incoming text with a well-defined grammar) enables a wide range of applications such as Named Entity Tagging, Part-Of-Speech Tagging, morphological analysis, etc. The definition of the grammar (lexicon + syntax) must be suited to the task.
Second, once the text is tokenised (and thus categorised) it becomes more manageable. In EmoLib's case, for example, the words of no interest are those that neither convey nor modify affect, i.e., "stop words" such as function words, and they are easier to filter out once the input text is tokenised.
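The filtering step described above can be sketched as follows. This is a minimal illustration, not EmoLib code: the `TaggedWord` type, the `StopWordFilterSketch` class and the word-class labels (`DET`, `PREP`, `CONJ`, `PRON`) are assumptions introduced for the example, and the tag set produced by a real grammar may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical token pair: a word and its word class (not an EmoLib type). */
final class TaggedWord {
    final String word;
    final String wordClass;

    TaggedWord(String word, String wordClass) {
        this.word = word;
        this.wordClass = wordClass;
    }
}

final class StopWordFilterSketch {
    // Assumed labels for function-word classes; a real grammar may use other tags.
    static final Set<String> FUNCTION_CLASSES =
            new HashSet<>(Arrays.asList("DET", "PREP", "CONJ", "PRON"));

    /** Keeps only the tokens whose word class is not a function-word class. */
    static List<TaggedWord> dropStopWords(List<TaggedWord> tokens) {
        List<TaggedWord> kept = new ArrayList<>();
        for (TaggedWord token : tokens) {
            if (!FUNCTION_CLASSES.contains(token.wordClass)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Because every token already carries its word class, the filter only needs a set lookup per token and never has to inspect the word itself.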
The words that actually carry affective content, namely nouns, verbs and adjectives, are marked as affective containers. Other words that carry no affect by themselves but influence the nearby affective words are marked as modifiers. These are basically quantitative adverbs. EmoLib defines three levels of modification, whose values are set in the external configuration file. The sign of the value indicates whether the adverb denotes positive or negative intention.
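To make the three modification levels concrete, here is a small sketch of how such values might be held and looked up. The class `ModifierLevelsSketch`, its method names and the word-class labels are hypothetical, and the numeric values in the usage note are illustrative only; the real values come from the external configuration file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ModifierLevelsSketch {
    // Assumed labels for the affective word classes; the real tag set may differ.
    private static final Set<String> AFFECTIVE_CLASSES =
            new HashSet<>(Arrays.asList("NOUN", "VERB", "ADJ"));

    private final float[] positive; // index 0..2 holds levels 1..3
    private final float[] negative; // index 0..2 holds levels 1..3

    ModifierLevelsSketch(float p1, float p2, float p3,
                         float n1, float n2, float n3) {
        this.positive = new float[] {p1, p2, p3};
        this.negative = new float[] {n1, n2, n3};
    }

    /** Nouns, verbs and adjectives are treated as affective containers. */
    static boolean isAffectiveContainer(String wordClass) {
        return AFFECTIVE_CLASSES.contains(wordClass);
    }

    /** Returns the configured value for a level (1 to 3) and an intention. */
    float modifierValue(int level, boolean positiveIntention) {
        if (level < 1 || level > 3) {
            throw new IllegalArgumentException("level must be 1, 2 or 3");
        }
        return positiveIntention ? positive[level - 1] : negative[level - 1];
    }
}
```

For instance, with `new ModifierLevelsSketch(0.25f, 0.5f, 0.75f, -0.25f, -0.5f, -0.75f)`, the call `modifierValue(2, false)` returns `-0.5f`.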
| Field Summary | |
|---|---|
|  float | negation | 
|  float | negativeModifier1 | 
|  float | negativeModifier2 | 
|  float | negativeModifier3 | 
|  float | positiveModifier1 | 
|  float | positiveModifier2 | 
|  float | positiveModifier3 | 
| static java.lang.String | PROP_NEGATION | 
| static java.lang.String | PROP_NEGATIVE_MODIFIER_1 | 
| static java.lang.String | PROP_NEGATIVE_MODIFIER_2 | 
| static java.lang.String | PROP_NEGATIVE_MODIFIER_3 | 
| static java.lang.String | PROP_POSITIVE_MODIFIER_1 | 
| static java.lang.String | PROP_POSITIVE_MODIFIER_2 | 
| static java.lang.String | PROP_POSITIVE_MODIFIER_3 | 
| Constructor Summary | |
|---|---|
| Tokenizer() | |
| Method Summary | | |
|---|---|---|
|  void | fillConfigurationValues(float positiveModifier1, float positiveModifier2, float positiveModifier3, float negativeModifier1, float negativeModifier2, float negativeModifier3, float negation) | Method to fill this Tokenizer with the appropriate configuration values. |
|  Data | getData() | Generates the TextData available to the rest of the text processing chain. |
| abstract  Tokenizer | getNew(java.lang.String initialization) | Function to obtain a new initialized instance of the Tokenizer. |
|  java.lang.String | getPossibleEmotionalContent() | Function to retrieve the possible word emotional content of this Tokenizer. |
|  java.lang.String | getWord() | Function to retrieve the words of this Tokenizer. |
|  java.lang.String | getWordClass() | Function to retrieve the word-classes of this Tokenizer. |
|  java.util.ArrayList | getWordModifierValue() | Function to retrieve the list of word modifier values. |
|  void | initialize() | Method to initialize the Tokenizer. |
|  void | inputData(java.lang.String theDataToBeInputted) | Method to input text data into the system. |
|  void | newProperties(PropertySheet ps) | This method is called when this configurable component has new data. |
| abstract  void | parseGrammar() | Method to parse the incoming text with the well-defined grammar. |
|  void | putModifierValue(float modifierValue) | Method to put a modifier value into the queue. |
|  void | putWord(java.lang.String insertionWord) | Method to put a word into the system. |
|  void | putWordClass(java.lang.String insertionWordClass) | Method to put a word class into the system. |
|  void | register(java.lang.String name, Registry registry) | Register my properties. |
|  void | setPossibleEmotionalContent(java.lang.String insertionPossibleEmotion) | Method to put a possible emotional word into the system. |
| Methods inherited from class emolib.util.proc.TextDataProcessor | 
|---|
| flush, getName, getPredecessor, setPredecessor, toString | 
| Methods inherited from class java.lang.Object | 
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait | 
| Field Detail | 
|---|
public static final java.lang.String PROP_POSITIVE_MODIFIER_1
public static final java.lang.String PROP_POSITIVE_MODIFIER_2
public static final java.lang.String PROP_POSITIVE_MODIFIER_3
public static final java.lang.String PROP_NEGATIVE_MODIFIER_1
public static final java.lang.String PROP_NEGATIVE_MODIFIER_2
public static final java.lang.String PROP_NEGATIVE_MODIFIER_3
public static final java.lang.String PROP_NEGATION
public float positiveModifier1
public float positiveModifier2
public float positiveModifier3
public float negativeModifier1
public float negativeModifier2
public float negativeModifier3
public float negation
| Constructor Detail | 
|---|
public Tokenizer()
| Method Detail | 
|---|
public void register(java.lang.String name,
                     Registry registry)
              throws PropertyException
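Description copied from interface: Configurable
Register my properties.
Specified by: register in interface Configurable
Overrides: register in class TextDataProcessor
Parameters: name - the name of the component
registry - the registry for this component
Throws: PropertyException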
public void newProperties(PropertySheet ps)
                   throws PropertyException
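Description copied from interface: Configurable
This method is called when this configurable component has new data.
Specified by: newProperties in interface Configurable
Overrides: newProperties in class TextDataProcessor
Parameters: ps - a property sheet holding the new data
Throws: PropertyException - if there is a problem with the properties.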
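As an illustration of reading a float-valued setting such as the modifier values, the sketch below uses `java.util.Properties` as a stand-in, since the `PropertySheet` API is not described on this page. The class `ModifierConfigSketch`, the helper `readFloat` and the key string `"negation"` are assumptions made for the example; the actual value of `PROP_NEGATION` is not given here.

```java
import java.util.Properties;

final class ModifierConfigSketch {
    // Hypothetical key name; the real PROP_NEGATION value is not documented here.
    static final String KEY_NEGATION = "negation";

    /** Reads a float property, falling back to a default when the key is absent. */
    static float readFloat(Properties props, String key, float fallback) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            // Surface malformed configuration instead of silently using the default.
            throw new IllegalArgumentException(
                    "Property '" + key + "' is not a float: " + raw, e);
        }
    }
}
```

Failing loudly on a malformed value is a design choice of this sketch, in the same spirit as the `PropertyException` declared by `newProperties`.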
public Data getData()
             throws DataProcessingException
Specified by: getData in interface DataProcessor
Overrides: getData in class TextDataProcessor
Throws: DataProcessingException - If there is a processing error.
public void initialize()
Specified by: initialize in interface DataProcessor
Overrides: initialize in class TextDataProcessor
public void putWord(java.lang.String insertionWord)
Parameters: insertionWord - The word to be inserted.
public java.lang.String getWord()
public void putWordClass(java.lang.String insertionWordClass)
Parameters: insertionWordClass - The word class to be inserted.
public java.lang.String getWordClass()
public void setPossibleEmotionalContent(java.lang.String insertionPossibleEmotion)
Parameters: insertionPossibleEmotion - The chance to have affective content.
public java.lang.String getPossibleEmotionalContent()
public void putModifierValue(float modifierValue)
Parameters: modifierValue - The modifier value.
public java.util.ArrayList getWordModifierValue()
public void fillConfigurationValues(float positiveModifier1,
                                    float positiveModifier2,
                                    float positiveModifier3,
                                    float negativeModifier1,
                                    float negativeModifier2,
                                    float negativeModifier3,
                                    float negation)
Parameters: positiveModifier1 - The positive modifier value, level 1.
positiveModifier2 - The positive modifier value, level 2.
positiveModifier3 - The positive modifier value, level 3.
negativeModifier1 - The negative modifier value, level 1.
negativeModifier2 - The negative modifier value, level 2.
negativeModifier3 - The negative modifier value, level 3.
public abstract void parseGrammar()
                           throws java.lang.Exception
Throws: java.lang.Exception - If a ParseException occurs.
public abstract Tokenizer getNew(java.lang.String initialization)
Parameters: initialization - The string to initialize the new Tokenizer.
public void inputData(java.lang.String theDataToBeInputted)
               throws java.lang.Exception
Parameters: theDataToBeInputted - The text to be inputted.
Throws: java.lang.Exception
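The division of labour between the abstract parent and a concrete child can be sketched with a self-contained stand-in. Nothing below is EmoLib code: `SketchTokenizer` only mirrors the shape of the methods documented above, `WhitespaceTokenizer` replaces a JavaCC-generated class with a plain whitespace split, the labels `WORD` and `OTHER` are invented, and the assumption that `inputData` triggers `parseGrammar` is not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in with the same shape as Tokenizer; not the EmoLib class. */
abstract class SketchTokenizer {
    private final List<String> words = new ArrayList<>();
    private final List<String> wordClasses = new ArrayList<>();
    protected String text = "";

    void putWord(String word) { words.add(word); }
    void putWordClass(String wordClass) { wordClasses.add(wordClass); }
    List<String> words() { return words; }
    List<String> wordClasses() { return wordClasses; }

    /** Supplied by the concrete (normally generated) child class. */
    abstract void parseGrammar() throws Exception;

    /** Returns a fresh instance initialized with the given string. */
    abstract SketchTokenizer getNew(String initialization);

    /** Assumed flow: store the text, reset the results, then parse. */
    void inputData(String data) throws Exception {
        text = data;
        words.clear();
        wordClasses.clear();
        parseGrammar();
    }
}

/** Toy child class: splits on whitespace instead of using a real grammar. */
final class WhitespaceTokenizer extends SketchTokenizer {
    @Override
    void parseGrammar() {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return; // nothing to tokenise
        }
        for (String token : trimmed.split("\\s+")) {
            putWord(token);
            // Tokens made only of letters count as words, anything else as OTHER.
            boolean lettersOnly = token.chars().allMatch(Character::isLetter);
            putWordClass(lettersOnly ? "WORD" : "OTHER");
        }
    }

    @Override
    SketchTokenizer getNew(String initialization) {
        WhitespaceTokenizer fresh = new WhitespaceTokenizer();
        fresh.text = initialization;
        return fresh;
    }
}
```

The parent owns the storage and the input entry point, while the child only decides how text becomes tokens, which is the part a generated lexical analyser would supply.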