Tokenizer (EmoLib)

emolib.tokenizer
Class Tokenizer

java.lang.Object
  emolib.util.proc.TextDataProcessor
      emolib.tokenizer.Tokenizer

All Implemented Interfaces:: Configurable, DataProcessor

Direct Known Subclasses:: EnglishLexer, SpanishLexer

public abstract class Tokenizer
extends TextDataProcessor

The Tokenizer abstract class defines the general structure to perform the tokenisation process, which splits a text string into individual units, called tokens.

The Tokenizer provides structured data retrieval from the incoming text: the words being analysed (lexical units) and their corresponding word classes (lexical categories). The Tokenizer therefore feeds text data into the processing pipeline, i.e., it acts as the INPUTTER, which is why it implements the "inputData" method.

These tokens are expressed in regular patterns as established by the grammar of the language.

The child classes that inherit from this abstract class are generated by JavaCC, a parser/scanner generator for Java. Given a grammar specified by its lexicon (words/tokens) and its syntax, this tool creates the corresponding Java class, i.e., the lexical analyser, which detects matches to the grammar.

This implementation achieves two important goals:

The determination of the admissible words of a language, achieved by matching the incoming text units to the tokens described in the grammar.
The verification that the order of the tokens matches the rules described by the syntax of the grammar.
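
As a rough illustration of these two goals, the following sketch uses java.util.regex in place of a JavaCC-generated lexer. The class RegexLexerSketch, its toy lexicon and its syntax rule are assumptions made for this example; they are not part of EmoLib and do not reflect its actual grammars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLexerSketch {
    // Hypothetical toy lexicon: each named group stands in for one token definition.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<ADVERB>very|quite)|(?<ADJECTIVE>happy|sad)|(?<SPACE>\\s+)");

    // Goal 1: every text unit must match a token of the lexicon.
    static List<String> tokenClasses(String text) {
        List<String> classes = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            // The next token must start exactly where the previous one ended.
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("Inadmissible text at offset " + pos);
            }
            if (m.group("ADVERB") != null) {
                classes.add("ADVERB");
            } else if (m.group("ADJECTIVE") != null) {
                classes.add("ADJECTIVE");
            } // whitespace is matched but not recorded
            pos = m.end();
        }
        return classes;
    }

    // Goal 2: the token order must follow the syntax.
    // Toy rule: any number of adverbs followed by exactly one adjective.
    static boolean matchesSyntax(List<String> classes) {
        return String.join(" ", classes).matches("(ADVERB )*ADJECTIVE");
    }

    public static void main(String[] args) {
        List<String> classes = tokenClasses("very happy");
        System.out.println(classes + " valid order: " + matchesSyntax(classes));
    }
}
```

In this sketch an unknown word fails the first check with an exception, while known words in the wrong order pass it and fail only the second.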

This open definition of the tokenisation process (parsing the incoming text with a well-defined grammar) enables a wide range of applications such as Named Entity Tagging, Part-Of-Speech Tagging, morphological analysis, etc. The definition of the grammar (lexicon + syntax) must be suited to the task.

Once the text is tokenised (and thus categorised) it becomes more manageable. In EmoLib, for example, words that neither convey nor modify affect (i.e., "stop words" such as function words) are of no interest, and they are more easily filtered out once the input text is tokenised.

The words that actually have an affective content, namely nouns, verbs and adjectives, are marked as affective containers. Other words that carry no affect themselves but influence nearby affective words are marked as modifiers; these are basically quantitative adverbs. EmoLib defines three levels of modification, whose values are set in the external configuration file. The sign of a modifier value indicates whether the adverb denotes a positive or a negative intention.
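
The role assignment described above might be sketched as follows. The class AffectRoleSketch and the word-class labels ("noun", "adverb", "determiner", and so on) are hypothetical; the labels actually produced by the EmoLib lexers may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class AffectRoleSketch {
    enum Role { AFFECTIVE_CONTAINER, MODIFIER, STOP_WORD }

    // Maps a (hypothetical) word-class label to the role it plays for affect.
    static Role roleOf(String wordClass) {
        switch (wordClass) {
            case "noun":
            case "verb":
            case "adjective":
                return Role.AFFECTIVE_CONTAINER;
            case "adverb":
                return Role.MODIFIER;
            default:
                return Role.STOP_WORD;
        }
    }

    // Keeps the words that convey or modify affect and drops the stop words.
    static List<String> keepAffectRelevant(String[] words, String[] wordClasses) {
        if (words.length != wordClasses.length) {
            throw new IllegalArgumentException("words and word classes must be aligned");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (roleOf(wordClasses[i]) != Role.STOP_WORD) {
                kept.add(words[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] words = {"the", "very", "good", "film"};
        String[] classes = {"determiner", "adverb", "adjective", "noun"};
        System.out.println(keepAffectRelevant(words, classes));
    }
}
```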

Author:: Alexandre Trilla (atrilla@salle.url.edu)

Field Summary
`float`	`negation`
`float`	`negativeModifier1`
`float`	`negativeModifier2`
`float`	`negativeModifier3`
`float`	`positiveModifier1`
`float`	`positiveModifier2`
`float`	`positiveModifier3`
`static java.lang.String`	`PROP_NEGATION`
`static java.lang.String`	`PROP_NEGATIVE_MODIFIER_1`
`static java.lang.String`	`PROP_NEGATIVE_MODIFIER_2`
`static java.lang.String`	`PROP_NEGATIVE_MODIFIER_3`
`static java.lang.String`	`PROP_POSITIVE_MODIFIER_1`
`static java.lang.String`	`PROP_POSITIVE_MODIFIER_2`
`static java.lang.String`	`PROP_POSITIVE_MODIFIER_3`

Constructor Summary
`Tokenizer()`

Method Summary
`void`	`fillConfigurationValues(float positiveModifier1, float positiveModifier2, float positiveModifier3, float negativeModifier1, float negativeModifier2, float negativeModifier3, float negation)` Method to fill this Tokenizer with the appropriate configuration values.
`Data`	`getData()` Generates the TextData available to the rest of the text processing chain.
`abstract Tokenizer`	`getNew(java.lang.String initialization)` Function to obtain a new initialized instance of the Tokenizer.
`java.lang.String`	`getPossibleEmotionalContent()` Function to retrieve the possible word emotional content of this Tokenizer.
`java.lang.String`	`getWord()` Function to retrieve the words of this Tokenizer.
`java.lang.String`	`getWordClass()` Function to retrieve the word-classes of this Tokenizer.
`java.util.ArrayList`	`getWordModifierValue()` Function to retrieve the list of word modifier values.
`void`	`initialize()` Method to initialize the Tokenizer.
`void`	`inputData(java.lang.String theDataToBeInputted)` Method to input text data into the system.
`void`	`newProperties(PropertySheet ps)` This method is called when this configurable component has new data.
`abstract void`	`parseGrammar()` Method to parse the incoming text with the well-defined grammar.
`void`	`putModifierValue(float modifierValue)` Method to put a modifier value into the queue.
`void`	`putWord(java.lang.String insertionWord)` Method to put a word into the system.
`void`	`putWordClass(java.lang.String insertionWordClass)` Method to put a word class into the system.
`void`	`register(java.lang.String name, Registry registry)` Register my properties.
`void`	`setPossibleEmotionalContent(java.lang.String insertionPossibleEmotion)` Method to put a possible emotional word into the system.

Methods inherited from class emolib.util.proc.TextDataProcessor
`flush, getName, getPredecessor, setPredecessor, toString`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

PROP_POSITIVE_MODIFIER_1

public static final java.lang.String PROP_POSITIVE_MODIFIER_1

See Also:: Constant Field Values

PROP_POSITIVE_MODIFIER_2

public static final java.lang.String PROP_POSITIVE_MODIFIER_2

See Also:: Constant Field Values

PROP_POSITIVE_MODIFIER_3

public static final java.lang.String PROP_POSITIVE_MODIFIER_3

See Also:: Constant Field Values

PROP_NEGATIVE_MODIFIER_1

public static final java.lang.String PROP_NEGATIVE_MODIFIER_1

See Also:: Constant Field Values

PROP_NEGATIVE_MODIFIER_2

public static final java.lang.String PROP_NEGATIVE_MODIFIER_2

See Also:: Constant Field Values

PROP_NEGATIVE_MODIFIER_3

public static final java.lang.String PROP_NEGATIVE_MODIFIER_3

See Also:: Constant Field Values

PROP_NEGATION

public static final java.lang.String PROP_NEGATION

See Also:: Constant Field Values

positiveModifier1

public float positiveModifier1

positiveModifier2

public float positiveModifier2

positiveModifier3

public float positiveModifier3

negativeModifier1

public float negativeModifier1

negativeModifier2

public float negativeModifier2

negativeModifier3

public float negativeModifier3

negation

public float negation

Constructor Detail

Tokenizer

public Tokenizer()

Method Detail

register

public void register(java.lang.String name,
                     Registry registry)
              throws PropertyException

Description copied from interface: Configurable

Register my properties. This method is called once early in the lifetime of the component, shortly after the component is constructed. This component should register any configuration properties that it needs to register. If this configurable extends another configurable, super.register should also be called.

Specified by:: register in interface Configurable
Overrides:: register in class TextDataProcessor

Parameters:: name - the name of the component; registry - the registry for this component
Throws:: PropertyException

newProperties

public void newProperties(PropertySheet ps)
                   throws PropertyException

Description copied from interface: Configurable

This method is called when this configurable component has new data. The component should first validate the data. If it is bad the component should return false. If the data is good, the component should record the data internally and return true.

Specified by:: newProperties in interface Configurable
Overrides:: newProperties in class TextDataProcessor

Parameters:: ps - a property sheet holding the new data
Throws:: PropertyException - if there is a problem with the properties.

getData

public Data getData()
             throws DataProcessingException

Generates the TextData available to the rest of the text processing chain.

Specified by:: getData in interface DataProcessor
Specified by:: getData in class TextDataProcessor

Returns:: The next available Data object, returns null if no Data object is available.
Throws:: DataProcessingException - If there is a processing error.

initialize

public void initialize()

Method to initialize the Tokenizer.

Specified by:: initialize in interface DataProcessor
Overrides:: initialize in class TextDataProcessor

putWord

public void putWord(java.lang.String insertionWord)

Method to put a word into the system.

Parameters:: insertionWord - The word to be inserted.

getWord

public java.lang.String getWord()

Function to retrieve the words of this Tokenizer.

Returns:: The words of this Tokenizer.

putWordClass

public void putWordClass(java.lang.String insertionWordClass)

Method to put a word class into the system. It is assumed that this method is called right after the insertion of a word, so that the strings of parsed words and of word classes are kept consistent in length.

Parameters:: insertionWordClass - The word class to be inserted.
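
The calling convention for putWord and putWordClass can be illustrated with a small stand-alone buffer. This is a minimal sketch, not EmoLib code: the class ParallelTokenBuffer is hypothetical, it assumes that words and classes are accumulated as space-separated strings, and it enforces the call order with exceptions, whereas the Tokenizer only assumes it.

```java
final class ParallelTokenBuffer {
    private final StringBuilder words = new StringBuilder();
    private final StringBuilder wordClasses = new StringBuilder();
    private int wordCount = 0;
    private int classCount = 0;

    // A new word is accepted only when the previous word already has its class.
    void putWord(String word) {
        if (wordCount != classCount) {
            throw new IllegalStateException("previous word has no class yet");
        }
        append(words, word);
        wordCount++;
    }

    // A class is accepted only right after a word, which keeps both strings aligned.
    void putWordClass(String wordClass) {
        if (classCount + 1 != wordCount) {
            throw new IllegalStateException("putWordClass must follow putWord");
        }
        append(wordClasses, wordClass);
        classCount++;
    }

    String getWord() {
        return words.toString();
    }

    String getWordClass() {
        return wordClasses.toString();
    }

    // Appends an item, separating it from earlier items with a single space.
    private static void append(StringBuilder sb, String item) {
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(item);
    }

    public static void main(String[] args) {
        ParallelTokenBuffer buffer = new ParallelTokenBuffer();
        buffer.putWord("very");
        buffer.putWordClass("adverb");
        buffer.putWord("happy");
        buffer.putWordClass("adjective");
        System.out.println(buffer.getWord() + " / " + buffer.getWordClass());
    }
}
```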

getWordClass

public java.lang.String getWordClass()

Function to retrieve the word-classes of this Tokenizer.

Returns:: The word-classes of this Tokenizer.

setPossibleEmotionalContent

public void setPossibleEmotionalContent(java.lang.String insertionPossibleEmotion)

Method to put a possible emotional word into the system. It is assumed that this method is called right after the insertion of a word, so that the strings of parsed words and of possible emotional content are kept consistent in length.

Parameters:: insertionPossibleEmotion - The chance to have affective content.

getPossibleEmotionalContent

public java.lang.String getPossibleEmotionalContent()

Function to retrieve the possible word emotional content of this Tokenizer.

Returns:: The possible word emotional content.

putModifierValue

public void putModifierValue(float modifierValue)

Method to put a modifier value into the queue.

Parameters:: modifierValue - The modifier value.

getWordModifierValue

public java.util.ArrayList getWordModifierValue()

Function to retrieve the list of word modifier values.

Returns:: The list of word modifier values.

fillConfigurationValues

public void fillConfigurationValues(float positiveModifier1,
                                    float positiveModifier2,
                                    float positiveModifier3,
                                    float negativeModifier1,
                                    float negativeModifier2,
                                    float negativeModifier3,
                                    float negation)

Method to fill this Tokenizer with the appropriate configuration values. This method exists because the objects declared in it need these values and they can't be inherited.

Parameters:: positiveModifier1 - The positive modifier value, level 1.; positiveModifier2 - The positive modifier value, level 2.; positiveModifier3 - The positive modifier value, level 3.; negativeModifier1 - The negative modifier value, level 1.; negativeModifier2 - The negative modifier value, level 2.; negativeModifier3 - The negative modifier value, level 3.; negation - The negation value.
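
A minimal sketch of how a component might hold these seven values and look one up by level and sign. The class ModifierConfigSketch, its modifierFor method and the numeric values used below are assumptions made for illustration; the real values come from the external configuration file.

```java
final class ModifierConfigSketch {
    private final float[] positive = new float[3];
    private final float[] negative = new float[3];
    private float negation;

    // Mirrors the parameter order of fillConfigurationValues documented above.
    void fillConfigurationValues(float positiveModifier1, float positiveModifier2,
                                 float positiveModifier3, float negativeModifier1,
                                 float negativeModifier2, float negativeModifier3,
                                 float negation) {
        positive[0] = positiveModifier1;
        positive[1] = positiveModifier2;
        positive[2] = positiveModifier3;
        negative[0] = negativeModifier1;
        negative[1] = negativeModifier2;
        negative[2] = negativeModifier3;
        this.negation = negation;
    }

    // Hypothetical lookup: returns the configured value for a level in 1..3.
    float modifierFor(int level, boolean isPositive) {
        if (level < 1 || level > 3) {
            throw new IllegalArgumentException("level must be 1, 2 or 3");
        }
        return isPositive ? positive[level - 1] : negative[level - 1];
    }

    float getNegation() {
        return negation;
    }

    public static void main(String[] args) {
        ModifierConfigSketch config = new ModifierConfigSketch();
        // Made-up values, chosen only so the example runs.
        config.fillConfigurationValues(0.25f, 0.5f, 0.75f, -0.25f, -0.5f, -0.75f, -1.0f);
        System.out.println(config.modifierFor(2, true));
    }
}
```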

parseGrammar

public abstract void parseGrammar()
                           throws java.lang.Exception

Method to parse the incoming text with the well-defined grammar.

Throws:: java.lang.Exception - If a ParseException occurs.

getNew

public abstract Tokenizer getNew(java.lang.String initialization)

Function to obtain a new initialized instance of the Tokenizer. The real (not abstract) tokenizers should override this function.

Parameters:: initialization - The string to initialize the new Tokenizer.
Returns:: The new Tokenizer.

inputData

public void inputData(java.lang.String theDataToBeInputted)
               throws java.lang.Exception

Method to input text data into the system. It is prepared for the optional use of a parser implementation (through the instantiation of a new class). This method labels the Tokenizer module as an INPUTTER.

Parameters:: theDataToBeInputted - The text to be inputted.
Throws:: java.lang.Exception
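
The interplay between getNew, parseGrammar and inputData might look like the sketch below. It is a guess at the intended flow, not the EmoLib implementation: SketchTokenizer is a hypothetical stand-in for the abstract Tokenizer, and WhitespaceLexer is a hand-written substitute for a JavaCC-generated lexer such as EnglishLexer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the abstract Tokenizer; only the methods
// discussed here are mirrored.
abstract class SketchTokenizer {
    final List<String> words = new ArrayList<>();

    abstract void parseGrammar() throws Exception;

    abstract SketchTokenizer getNew(String initialization);

    // Assumed flow: create a fresh parser over the text, run it, and
    // copy the tokens it found into this instance.
    void inputData(String theDataToBeInputted) throws Exception {
        SketchTokenizer parser = getNew(theDataToBeInputted);
        parser.parseGrammar();
        words.clear();
        words.addAll(parser.words);
    }
}

// Concrete tokenizer that splits on whitespace instead of using a grammar.
final class WhitespaceLexer extends SketchTokenizer {
    private final String text;

    WhitespaceLexer(String text) {
        this.text = text;
    }

    @Override
    void parseGrammar() {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
    }

    @Override
    SketchTokenizer getNew(String initialization) {
        return new WhitespaceLexer(initialization);
    }

    public static void main(String[] args) throws Exception {
        SketchTokenizer tokenizer = new WhitespaceLexer("");
        tokenizer.inputData("a very good film");
        System.out.println(tokenizer.words);
    }
}
```

Because getNew is abstract, the parent class can create a language-specific parser without knowing which concrete subclass it is dealing with.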