emolib.tokenizer
Class Tokenizer

java.lang.Object
  extended by emolib.util.proc.TextDataProcessor
      extended by emolib.tokenizer.Tokenizer
All Implemented Interfaces:
Configurable, DataProcessor
Direct Known Subclasses:
EnglishLexer, SpanishLexer

public abstract class Tokenizer
extends TextDataProcessor

The Tokenizer abstract class defines the general structure of the tokenisation process, which splits a text string into individual units called tokens.

The Tokenizer provides structured data retrieval from the incoming text: the words under analysis (lexical units) and their corresponding word-classes (lexical categories). The Tokenizer thus feeds Text Data to the processing pipeline, i.e., it acts as the INPUTTER, and therefore implements the "inputData" method.

These tokens are expressed in regular patterns as established by the grammar of the language.

The child classes that inherit the methods of this abstract parent class are generated by JavaCC, a parser/scanner generator for Java. This tool eases the production of a grammar through the specification of its lexicon (words/tokens) and its syntax, and then creates the corresponding Java class, i.e., the lexical analyser, which detects matches to this grammar.

This implementation achieves two important goals:

This open definition of the tokenisation process (parsing the incoming text with a well-defined grammar) enables a wide range of applications such as Named Entity Tagging, Part-Of-Speech Tagging, morphological analysis, etc. The definition of the grammar (lexicon + syntax) must be suited to the task.

Once the text is tokenised (and thus categorised), it becomes more manageable. In the case of EmoLib, for example, words that neither convey nor modify affect, i.e., "stop words" such as function words, are of no interest, and they are more easily filtered out once the input text is tokenised.

The words that actually have affective content, namely nouns, verbs and adjectives, are marked as affective containers. Other words that carry no affect by themselves but influence the nearby affective words are marked as modifiers. These are basically quantitative adverbs. EmoLib defines 3 levels of modification, whose values are set in the external configuration file. A positive or negative value denotes that the adverb has a positive or negative intention, respectively.
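
As a rough illustration of the three modification levels, the sketch below selects a modifier value from a level and a polarity. The class ModifierLevels, the method valueFor and any numeric values passed to it are hypothetical and not part of EmoLib; the actual values come from the external configuration file, and how EmoLib applies them to nearby words is not described here.

```java
/** Hypothetical holder for the six modifier values; not part of EmoLib. */
final class ModifierLevels {
    private final float[] positive;
    private final float[] negative;

    ModifierLevels(float p1, float p2, float p3, float n1, float n2, float n3) {
        this.positive = new float[] {p1, p2, p3};
        this.negative = new float[] {n1, n2, n3};
    }

    /** Returns the configured value for a level in 1..3 and a polarity. */
    float valueFor(int level, boolean isPositive) {
        if (level < 1 || level > 3) {
            throw new IllegalArgumentException("level must be 1, 2 or 3: " + level);
        }
        return isPositive ? positive[level - 1] : negative[level - 1];
    }
}
```

Keeping the values in two small arrays indexed by level is only one possible layout; the Tokenizer itself exposes them as seven separate public float fields.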

Author:
Alexandre Trilla (atrilla@salle.url.edu)

Field Summary
 float negation
           
 float negativeModifier1
           
 float negativeModifier2
           
 float negativeModifier3
           
 float positiveModifier1
           
 float positiveModifier2
           
 float positiveModifier3
           
static java.lang.String PROP_NEGATION
           
static java.lang.String PROP_NEGATIVE_MODIFIER_1
           
static java.lang.String PROP_NEGATIVE_MODIFIER_2
           
static java.lang.String PROP_NEGATIVE_MODIFIER_3
           
static java.lang.String PROP_POSITIVE_MODIFIER_1
           
static java.lang.String PROP_POSITIVE_MODIFIER_2
           
static java.lang.String PROP_POSITIVE_MODIFIER_3
           
 
Constructor Summary
Tokenizer()
           
 
Method Summary
 void fillConfigurationValues(float positiveModifier1, float positiveModifier2, float positiveModifier3, float negativeModifier1, float negativeModifier2, float negativeModifier3, float negation)
          Method to fill this Tokenizer with the appropriate configuration values.
 Data getData()
          Generates the TextData available to the rest of the text processing chain.
abstract  Tokenizer getNew(java.lang.String initialization)
          Function to obtain a new initialized instance of the Tokenizer.
 java.lang.String getPossibleEmotionalContent()
          Function to retrieve the possible word emotional content of this Tokenizer.
 java.lang.String getWord()
          Function to retrieve the words of this Tokenizer.
 java.lang.String getWordClass()
          Function to retrieve the word-classes of this Tokenizer.
 java.util.ArrayList getWordModifierValue()
          Function to retrieve the list of word modifier values.
 void initialize()
          Method to initialize the Tokenizer.
 void inputData(java.lang.String theDataToBeInputted)
          Method to input text data into the system.
 void newProperties(PropertySheet ps)
          This method is called when this configurable component has new data.
abstract  void parseGrammar()
          Method to parse the incoming text with the well defined grammar.
 void putModifierValue(float modifierValue)
          Method to put a modifier value into the queue.
 void putWord(java.lang.String insertionWord)
          Method to put a word into the system.
 void putWordClass(java.lang.String insertionWordClass)
          Method to put a word class into the system.
 void register(java.lang.String name, Registry registry)
          Register my properties.
 void setPossibleEmotionalContent(java.lang.String insertionPossibleEmotion)
          Method to put a possible emotional word into the system.
 
Methods inherited from class emolib.util.proc.TextDataProcessor
flush, getName, getPredecessor, setPredecessor, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

PROP_POSITIVE_MODIFIER_1

public static final java.lang.String PROP_POSITIVE_MODIFIER_1
See Also:
Constant Field Values

PROP_POSITIVE_MODIFIER_2

public static final java.lang.String PROP_POSITIVE_MODIFIER_2
See Also:
Constant Field Values

PROP_POSITIVE_MODIFIER_3

public static final java.lang.String PROP_POSITIVE_MODIFIER_3
See Also:
Constant Field Values

PROP_NEGATIVE_MODIFIER_1

public static final java.lang.String PROP_NEGATIVE_MODIFIER_1
See Also:
Constant Field Values

PROP_NEGATIVE_MODIFIER_2

public static final java.lang.String PROP_NEGATIVE_MODIFIER_2
See Also:
Constant Field Values

PROP_NEGATIVE_MODIFIER_3

public static final java.lang.String PROP_NEGATIVE_MODIFIER_3
See Also:
Constant Field Values

PROP_NEGATION

public static final java.lang.String PROP_NEGATION
See Also:
Constant Field Values

positiveModifier1

public float positiveModifier1

positiveModifier2

public float positiveModifier2

positiveModifier3

public float positiveModifier3

negativeModifier1

public float negativeModifier1

negativeModifier2

public float negativeModifier2

negativeModifier3

public float negativeModifier3

negation

public float negation

Constructor Detail

Tokenizer

public Tokenizer()

Method Detail

register

public void register(java.lang.String name,
                     Registry registry)
              throws PropertyException
Description copied from interface: Configurable
Register my properties. This method is called once early in the life of the component, shortly after the component is constructed. The component should register any configuration properties that it needs. If this configurable extends another configurable, super.register should also be called.

Specified by:
register in interface Configurable
Overrides:
register in class TextDataProcessor
Parameters:
name - the name of the component
registry - the registry for this component
Throws:
PropertyException

newProperties

public void newProperties(PropertySheet ps)
                   throws PropertyException
Description copied from interface: Configurable
This method is called when this configurable component has new data. The component should first validate the data. If it is bad the component should return false. If the data is good, the component should record the data internally and return true.

Specified by:
newProperties in interface Configurable
Overrides:
newProperties in class TextDataProcessor
Parameters:
ps - a property sheet holding the new data
Throws:
PropertyException - if there is a problem with the properties.

getData

public Data getData()
             throws DataProcessingException
Generates the TextData available to the rest of the text processing chain.

Specified by:
getData in interface DataProcessor
Specified by:
getData in class TextDataProcessor
Returns:
The next available Data object, returns null if no Data object is available.
Throws:
DataProcessingException - If there is a processing error.

initialize

public void initialize()
Method to initialize the Tokenizer.

Specified by:
initialize in interface DataProcessor
Overrides:
initialize in class TextDataProcessor

putWord

public void putWord(java.lang.String insertionWord)
Method to put a word into the system.

Parameters:
insertionWord - The word to be inserted.

getWord

public java.lang.String getWord()
Function to retrieve the words of this Tokenizer.

Returns:
The words of this Tokenizer.

putWordClass

public void putWordClass(java.lang.String insertionWordClass)
Method to put a word class into the system. It is assumed that this method is called right after the insertion of a word, so that the lengths of the parsed-word string and the word-class string are kept consistent.

Parameters:
insertionWordClass - The word class to be inserted.
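
The calling order assumed above (a word first, then its word class) can be sketched with a small stand-in class. TokenBuffer is hypothetical and does not reproduce the Tokenizer's internal storage, which is not documented here; it assumes space-separated strings and uses made-up word-class labels, purely to show why the two calls must alternate.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the word and word-class accumulation; not EmoLib code. */
final class TokenBuffer {
    private final List<String> words = new ArrayList<>();
    private final List<String> wordClasses = new ArrayList<>();

    void putWord(String word) {
        words.add(word);
    }

    /** Must directly follow putWord so that both lists keep the same length. */
    void putWordClass(String wordClass) {
        if (wordClasses.size() + 1 != words.size()) {
            throw new IllegalStateException("putWordClass must be called right after putWord");
        }
        wordClasses.add(wordClass);
    }

    /** Assumed output format: entries joined by single spaces. */
    String getWord() {
        return String.join(" ", words);
    }

    String getWordClass() {
        return String.join(" ", wordClasses);
    }
}
```

A caller would alternate the two calls, for example putWord("very") followed by putWordClass("ADVERB"). The stand-in rejects a word class that has no preceding word, whereas the documented method only states the ordering as an assumption.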

getWordClass

public java.lang.String getWordClass()
Function to retrieve the word-classes of this Tokenizer.

Returns:
The word-classes of this Tokenizer.

setPossibleEmotionalContent

public void setPossibleEmotionalContent(java.lang.String insertionPossibleEmotion)
Method to put the possible emotional content of a word into the system. It is assumed that this method is called right after the insertion of a word, so that the lengths of the parsed-word string and the word-class string are kept consistent.

Parameters:
insertionPossibleEmotion - The chance to have affective content.

getPossibleEmotionalContent

public java.lang.String getPossibleEmotionalContent()
Function to retrieve the possible word emotional content of this Tokenizer.

Returns:
The possible word emotional content.

putModifierValue

public void putModifierValue(float modifierValue)
Method to put a modifier value into the queue.

Parameters:
modifierValue - The modifier value.

getWordModifierValue

public java.util.ArrayList getWordModifierValue()
Function to retrieve the list of word modifier values.

Returns:
The list of word modifier values.

fillConfigurationValues

public void fillConfigurationValues(float positiveModifier1,
                                    float positiveModifier2,
                                    float positiveModifier3,
                                    float negativeModifier1,
                                    float negativeModifier2,
                                    float negativeModifier3,
                                    float negation)
Method to fill this Tokenizer with the appropriate configuration values. This method exists because the objects declared in it need these values and cannot inherit them.

Parameters:
positiveModifier1 - The positive modifier value, level 1.
positiveModifier2 - The positive modifier value, level 2.
positiveModifier3 - The positive modifier value, level 3.
negativeModifier1 - The negative modifier value, level 1.
negativeModifier2 - The negative modifier value, level 2.
negativeModifier3 - The negative modifier value, level 3.
negation - The negation value.

parseGrammar

public abstract void parseGrammar()
                           throws java.lang.Exception
Method to parse the incoming text with the well defined grammar.

Throws:
java.lang.Exception - If a ParseException occurs.

getNew

public abstract Tokenizer getNew(java.lang.String initialization)
Function to obtain a new initialized instance of the Tokenizer. The real (not abstract) tokenizers should override this function.

Parameters:
initialization - The string to initialize the new Tokenizer.
Returns:
The new Tokenizer.
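
To make the roles of the two abstract methods concrete, the sketch below uses a stand-in base class that mirrors only the parseGrammar and getNew signatures. SketchTokenizer and WhitespaceLexer are hypothetical names, and splitting on whitespace is a simplification: the real subclasses (EnglishLexer, SpanishLexer) are generated by JavaCC from a grammar and extend emolib.tokenizer.Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in that mirrors only the two abstract signatures of Tokenizer. */
abstract class SketchTokenizer {
    protected final List<String> words = new ArrayList<>();

    abstract void parseGrammar() throws Exception;

    abstract SketchTokenizer getNew(String initialization);
}

/** Hypothetical whitespace-based subclass; a real lexer would be generated by JavaCC. */
final class WhitespaceLexer extends SketchTokenizer {
    private final String text;

    WhitespaceLexer(String text) {
        this.text = text;
    }

    /** Splits the text on runs of whitespace and stores each non-empty token. */
    @Override
    void parseGrammar() {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
    }

    /** Returns a fresh instance initialised with the given text. */
    @Override
    SketchTokenizer getNew(String initialization) {
        return new WhitespaceLexer(initialization);
    }
}
```

In this sketch getNew behaves like a factory method: an existing instance serves as a prototype that produces a new tokenizer for each input string. Whether the Tokenizer uses it in exactly this way is an assumption.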

inputData

public void inputData(java.lang.String theDataToBeInputted)
               throws java.lang.Exception
Method to input text data into the system. It is prepared for the optional use of a parser implementation (through the instantiation of a new class). This method labels the Tokenizer module as an INPUTTER.

Parameters:
theDataToBeInputted - The text to be inputted.
Throws:
java.lang.Exception