emolib.classifier.machinelearning
Class ARNReduced

java.lang.Object
  extended by emolib.util.proc.TextDataProcessor
      extended by emolib.classifier.Classifier
          extended by emolib.classifier.machinelearning.ARNReduced
All Implemented Interfaces:
Configurable, DataProcessor

public class ARNReduced
extends Classifier

The ARNReduced classifies according to a cosine similarity in a weighted Vector Space Model with co-occurrences, which are assumed to capture the style in text.

The Associative Relational Network - Reduced (ARN-R) is a word co-occurrence network-based model (see the figure below) that constructs a Vector Space Model (VSM) with a term selection method applied "on the fly", based on the observation of test features (Alías et al., 2008). This term selection refinement is reported to improve the classical VSM for classification. Dense vectors representing the input text and the class are retrieved (no learning process is involved) and evaluated by the cosine similarity measure. The basic hypothesis in using the ARN-R for classification is the contiguity hypothesis, where terms in the same class form a contiguous region and regions of different classes do not overlap.

Associative Relational Network - Reduced

The ARN-R also provides several methods, i.e., criteria, 1) to weight the features in order to enhance their discriminative power, and 2) to select the most relevant features in order to reduce the sparsity in the VSM. These approaches are intended to simplify the model so that it generalises better.

In addition, the ARNReduced provides a classical VSM implementation which enables the retrieval of sparse vectors, and therefore standardises the interface to the textual features for any vector-based classifier.
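As an illustration of what a sparse term vector in a classical VSM can look like, the following minimal sketch stores only the terms that actually occur, together with their frequencies. The class and method names are hypothetical and are not part of EmoLib; the internal representation used by ARNReduced may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not an EmoLib class.
final class SparseVectorSketch {

    // Counts how often each term occurs. Terms that never occur are not
    // stored at all, which is what makes the representation sparse.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            vector.merge(token, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies(List.of("good", "day", "good")));
    }
}
```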

--
(Alías et al., 2008) Francesc Alías, Xavier Sevillano, Joan Claudi Socoró and Xavier Gonzalvo, "Towards high quality next-generation Text-to-Speech synthesis: a Multidomain approach by automatic domain classification", IEEE Transactions on Audio, Speech and Language Processing (Special issue on New Approaches to Statistical Speech and Text Processing) (ISSN 1558-7916), vol. 16 (7), pp. 1340-1354, September 2008.

Author:
Alexandre Trilla (atrilla@salle.url.edu)

Nested Class Summary
 class ARNReduced.Graph
          Generic graph inner class.
 class ARNReduced.GraphElement
          Inner class representing an element of the graph.
 
Field Summary
static java.lang.String PROP_EXTERNAL_FILE
          Property to indicate a pre-trained classifier.
 
Constructor Summary
ARNReduced()
          Main constructor of this classifier.
 
Method Summary
 void applyModelTermWeighing(ARNReduced.Graph inputGraph, int cat)
          Method to weight the terms corresponding to the model vector.
 void applyTermWeighing(ARNReduced.Graph inputGraph, int cat)
          Method to apply a term weighting methodology to the given graph.
 ARNReduced.Graph buildFullGraph(ARNReduced.Graph input)
          Function to build a full graph with the term frequencies given by the input terms.
 ARNReduced.Graph buildGraph(FeatureBox inputFeatures)
          Function to build a graph from input features.
 int getBigramVocabularySize(int bigramFreqThreshold, java.lang.String cat)
          Function to retrieve the number of terms (vocabulary size, bigrams alone) whose frequency is greater than the given threshold, with respect to a given category.
 java.lang.String getCategory(FeatureBox inputFeatures)
          The function that decides the most appropriate emotional category.
 java.util.ArrayList<ARNReduced.Graph> getCategoryGraphs()
          Function to recover the category-specific graphs.
 java.util.HashMap getCategoryHash()
          Function to retrieve a hash map of the categories to deal with.
 java.util.ArrayList<java.lang.String> getCategoryList()
          Function to retrieve a list of the categories to deal with.
 int getCorpusSize(java.lang.String cat)
          Function to retrieve the corpus size (number of words) of the given category.
 int getCorpusTupleSize(java.lang.String cat)
          Function to retrieve the corpus size of tuples of the given category.
 void getOrderedList(java.lang.String cat, java.util.ArrayList<java.lang.String> wList, java.util.ArrayList<java.lang.Integer> fList)
          Function to retrieve a sorted list (in frequency descending order) of words.
 void getOrderedTupleList(java.lang.String cat, java.util.ArrayList<java.lang.String> wList, java.util.ArrayList<java.lang.Integer> fList)
          Function to retrieve a sorted list (in frequency descending order) of tuples.
 float getSimilarity(FeatureBox inputText, java.lang.String cat)
          Function to retrieve the similarity of a given text with a given category.
 ARNReduced.Graph getVocabularyGraph()
          Function to recover the full vocabulary graph.
 int getVocabularySize(int wordFreqThreshold, java.lang.String cat)
          Function to retrieve the number of terms (vocabulary size, words alone) whose frequency is greater than the given threshold, with respect to a given category.
 void initialize()
          Method to initialize the Classifier.
 void load(java.lang.String path)
          Generic function to load a previously saved classifier.
 void newProperties(PropertySheet ps)
          This method is called when this configurable component has new data.
 void register(java.lang.String name, Registry registry)
          Register my properties.
 void resetExamples()
          Method to reset the classifier and flush the training examples.
 void save(java.lang.String path)
          Generic method to save the fully fledged classifier into a given file path.
 void setCOF(boolean flag)
          Method to set the assessment of co-occurrence frequencies (tuples actually).
 void setFeatSelChi2(boolean flag, int numFeats)
          Method to set the Chi square global feature selection.
 void setFeatSelMI(boolean flag, int numFeats)
          Method to set the Mutual Information global feature selection.
 void setFeatSelTF(boolean flag, int numFeats)
          Method to set the Term Frequency global feature selection.
 void setPOS(boolean flag)
          Method to set the assessment of POS tags (grammatical analysis).
 void setSimilarityMeasure(java.lang.String simil)
          Method to set the similarity measure.
 void setStems(boolean flag)
          Method to set the assessment of stemmed terms.
 void setSynonyms(boolean flag)
          Method to set the assessment of synonyms.
 void setTermWeighingMeasure(java.lang.String twm)
          Method to set the term weighting measure.
 void simpleClassification()
          Functionality test.
 void trainingProcedure()
          Generic training procedure.
 
Methods inherited from class emolib.classifier.Classifier
applyClassification, getData, getListOfExampleCategories, getListOfExampleFeatures, inputTrainingExample, train
 
Methods inherited from class emolib.util.proc.TextDataProcessor
flush, getName, getPredecessor, setPredecessor, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

PROP_EXTERNAL_FILE

public static final java.lang.String PROP_EXTERNAL_FILE
Property to indicate a pre-trained classifier.

See Also:
Constant Field Values

Constructor Detail

ARNReduced

public ARNReduced()
Main constructor of this classifier. By default, it assigns the term frequency as the term weighting method, the cosine similarity as the similarity measure, and no co-occurrence frequencies.

Method Detail

register

public void register(java.lang.String name,
                     Registry registry)
              throws PropertyException
Description copied from interface: Configurable
Register my properties. This method is called once early in the lifetime of the component, shortly after the component is constructed. This component should register any configuration properties that it needs to register. If this configurable extends another configurable, super.register should also be called.

Specified by:
register in interface Configurable
Overrides:
register in class Classifier
Parameters:
name - the name of the component
registry - the registry for this component
Throws:
PropertyException

newProperties

public void newProperties(PropertySheet ps)
                   throws PropertyException
Description copied from interface: Configurable
This method is called when this configurable component has new data. The component should first validate the data. If it is bad the component should return false. If the data is good, the component should record the data internally and return true.

Specified by:
newProperties in interface Configurable
Overrides:
newProperties in class Classifier
Parameters:
ps - a property sheet holding the new data
Throws:
PropertyException - if there is a problem with the properties.

initialize

public void initialize()
Method to initialize the Classifier.

Specified by:
initialize in interface DataProcessor
Overrides:
initialize in class Classifier

getCategoryList

public java.util.ArrayList<java.lang.String> getCategoryList()
Function to retrieve a list of the categories to deal with. This list is the preferred way to iterate over the category labels, because the iteration order of a hash map is not guaranteed to remain constant over time.

Returns:
The list of the categories to deal with.
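The ordering concern above can be sketched as follows: a list keeps the category labels in a fixed order, while a hash map gives fast lookup from a label to its position. This is only a sketch under the assumption that categories are addressed by an integer index; the class name, method names and labels are hypothetical and not taken from EmoLib.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class CategoryIndexSketch {
    private final Map<String, Integer> index = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    // Registers a label once; its index is its position of first insertion.
    void add(String label) {
        if (!index.containsKey(label)) {
            index.put(label, labels.size());
            labels.add(label);
        }
    }

    // Returns a copy of the labels in their stable insertion order.
    List<String> labels() {
        return new ArrayList<>(labels);
    }

    // Returns the index of a label, or -1 if the label is unknown.
    int indexOf(String label) {
        Integer i = index.get(label);
        return i == null ? -1 : i;
    }

    public static void main(String[] args) {
        CategoryIndexSketch categories = new CategoryIndexSketch();
        categories.add("positive");
        categories.add("negative");
        categories.add("neutral");
        for (String label : categories.labels()) {
            System.out.println(categories.indexOf(label) + " " + label);
        }
    }
}
```

Iterating over `labels()` rather than over the key set of the map yields the same order on every run.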

getCategoryHash

public java.util.HashMap getCategoryHash()
Function to retrieve a hash map of the categories to deal with.

Returns:
The hash map of the categories to deal with.

getCategoryGraphs

public java.util.ArrayList<ARNReduced.Graph> getCategoryGraphs()
Function to recover the category-specific graphs.

Returns:
The list of category graphs.

getVocabularyGraph

public ARNReduced.Graph getVocabularyGraph()
Function to recover the full vocabulary graph.

Returns:
The full graph.

getCorpusSize

public int getCorpusSize(java.lang.String cat)
Function to retrieve the corpus size (number of words) of the given category. That is, the sum of all term frequencies (terms considered to be words alone).

Parameters:
cat - The given category.
Returns:
The corpus size.
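The definition above (the sum of all term frequencies) can be illustrated with a minimal sketch. The map of term frequencies is a hypothetical stand-in for the category data, which ARNReduced holds internally.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class CorpusSizeSketch {

    // Sums the frequencies of all terms, i.e., counts the running words.
    static int corpusSize(Map<String, Integer> termFrequencies) {
        int total = 0;
        for (int frequency : termFrequencies.values()) {
            total += frequency;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(corpusSize(Map.of("good", 3, "day", 2, "bad", 1)));
    }
}
```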

getCorpusTupleSize

public int getCorpusTupleSize(java.lang.String cat)
Function to retrieve the corpus size of tuples of the given category. That is, the sum of all term frequencies (terms considered to be tuples).

Parameters:
cat - The given category.
Returns:
The corpus tuple size.

getVocabularySize

public int getVocabularySize(int wordFreqThreshold,
                             java.lang.String cat)
Function to retrieve the number of terms (vocabulary size, words alone) whose frequency is greater than the given threshold, with respect to a given category.

Parameters:
wordFreqThreshold - The word frequency threshold.
cat - The given category.
Returns:
The vocabulary size.
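A minimal sketch of a thresholded vocabulary count follows. It assumes a strict comparison, as the wording "greater than" suggests; the map of term frequencies and the names used are hypothetical.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class VocabularySizeSketch {

    // Counts the distinct terms whose frequency is strictly greater
    // than the threshold.
    static int vocabularySize(Map<String, Integer> termFrequencies, int threshold) {
        int count = 0;
        for (int frequency : termFrequencies.values()) {
            if (frequency > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = Map.of("good", 3, "day", 2, "bad", 1);
        System.out.println(vocabularySize(frequencies, 1));
    }
}
```

With a threshold of 0, the result is the plain number of distinct terms.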

getBigramVocabularySize

public int getBigramVocabularySize(int bigramFreqThreshold,
                                   java.lang.String cat)
Function to retrieve the number of terms (vocabulary size, bigrams alone) whose frequency is greater than the given threshold, with respect to a given category.

Parameters:
bigramFreqThreshold - The bigram frequency threshold.
cat - The given category.
Returns:
The vocabulary size.

getOrderedList

public void getOrderedList(java.lang.String cat,
                           java.util.ArrayList<java.lang.String> wList,
                           java.util.ArrayList<java.lang.Integer> fList)
Function to retrieve a sorted list (in descending order of frequency) of words. The sorting algorithm used is bubble sort.

Parameters:
cat - The given category.
wList - The list of words to produce.
fList - The list of frequencies to produce.
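The following sketch shows one way to fill two output lists and bubble-sort them in parallel by descending frequency. It mirrors the output-parameter style of the signature above, but the input map and all names are hypothetical; it is not the EmoLib implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class OrderedListSketch {

    // Fills wList and fList from the given frequencies and sorts both
    // lists together so that the highest frequency comes first.
    static void orderedList(Map<String, Integer> termFrequencies,
                            List<String> wList, List<Integer> fList) {
        wList.clear();
        fList.clear();
        for (Map.Entry<String, Integer> entry : termFrequencies.entrySet()) {
            wList.add(entry.getKey());
            fList.add(entry.getValue());
        }
        int n = fList.size();
        boolean swapped = true;
        // Bubble sort: each pass moves the smallest remaining value to the end.
        for (int pass = 0; pass < n - 1 && swapped; pass++) {
            swapped = false;
            for (int i = 0; i < n - 1 - pass; i++) {
                if (fList.get(i) < fList.get(i + 1)) {
                    Collections.swap(fList, i, i + 1);
                    Collections.swap(wList, i, i + 1);
                    swapped = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        List<Integer> frequencies = new ArrayList<>();
        orderedList(Map.of("day", 2, "good", 3, "bad", 1), words, frequencies);
        System.out.println(words + " " + frequencies);
    }
}
```

Bubble sort takes quadratic time in the worst case, so for large vocabularies a library sort may be preferable.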

getOrderedTupleList

public void getOrderedTupleList(java.lang.String cat,
                                java.util.ArrayList<java.lang.String> wList,
                                java.util.ArrayList<java.lang.Integer> fList)
Function to retrieve a sorted list (in descending order of frequency) of tuples. The sorting algorithm used is bubble sort.

Parameters:
cat - The given category.
wList - The list of tuples to produce.
fList - The list of frequencies to produce.

setTermWeighingMeasure

public void setTermWeighingMeasure(java.lang.String twm)
Method to set the term weighting measure.

Parameters:
twm - The term weighting measure.

setSimilarityMeasure

public void setSimilarityMeasure(java.lang.String simil)
Method to set the similarity measure.

Parameters:
simil - The similarity measure.

setCOF

public void setCOF(boolean flag)
Method to set the assessment of co-occurrence frequencies (tuples actually).

Parameters:
flag - The set flag.
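To illustrate what counting co-occurrence tuples can mean, the sketch below counts pairs of adjacent words. Treating a tuple as an ordered pair of adjacent tokens is an assumption made here for illustration (suggested by the term "bigram" used elsewhere in this class); the exact tuple definition in EmoLib is not stated in this page, and the names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class CoOccurrenceSketch {

    // Counts each ordered pair of adjacent tokens, keyed as "first second".
    static Map<String, Integer> countAdjacentPairs(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String pair = tokens.get(i) + " " + tokens.get(i + 1);
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countAdjacentPairs(List.of("a", "b", "a", "b")));
    }
}
```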

setPOS

public void setPOS(boolean flag)
Method to set the assessment of POS tags (grammatical analysis).

Parameters:
flag - The set flag.

setSynonyms

public void setSynonyms(boolean flag)
Method to set the assessment of synonyms.

Parameters:
flag - The set flag.

setStems

public void setStems(boolean flag)
Method to set the assessment of stemmed terms.

Parameters:
flag - The set flag.

setFeatSelMI

public void setFeatSelMI(boolean flag,
                         int numFeats)
Method to set the Mutual Information global feature selection.

Parameters:
flag - The set flag.
numFeats - The number of feats per class to select.

setFeatSelChi2

public void setFeatSelChi2(boolean flag,
                           int numFeats)
Method to set the Chi square global feature selection.

Parameters:
flag - The set flag.
numFeats - The number of feats per class to select.
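For reference, the chi-square statistic commonly used for feature selection in text classification can be computed from a 2x2 contingency table of a term against a class. The sketch below uses that standard textbook formulation; whether EmoLib counts documents or term occurrences, and how it ranks the resulting scores, is not stated in this page. All names are hypothetical.

```java
// Hypothetical helper for illustration only; not an EmoLib class.
final class ChiSquareSketch {

    // a: items of the class that contain the term
    // b: items of other classes that contain the term
    // c: items of the class that lack the term
    // d: items of other classes that lack the term
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denominator = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denominator == 0.0) {
            return 0.0; // a degenerate table carries no evidence of association
        }
        double difference = (double) a * d - (double) c * b;
        return n * difference * difference / denominator;
    }

    public static void main(String[] args) {
        System.out.println(chiSquare(5, 0, 0, 5));
    }
}
```

A score of zero indicates that the term and the class look independent; larger scores indicate a stronger association.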

setFeatSelTF

public void setFeatSelTF(boolean flag,
                         int numFeats)
Method to set the Term Frequency global feature selection.

Parameters:
flag - The set flag.
numFeats - The number of feats per class to select.
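A term-frequency feature selection can be sketched as keeping the numFeats most frequent terms of a class. The tie-breaking rule (alphabetical order) and all names below are assumptions made for this illustration, not documented EmoLib behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class TopFrequencySketch {

    // Returns up to numFeats terms, most frequent first; ties are broken
    // alphabetically so that the result is deterministic.
    static List<String> topByFrequency(Map<String, Integer> termFrequencies, int numFeats) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(termFrequencies.entrySet());
        entries.sort((x, y) -> {
            int byFrequency = Integer.compare(y.getValue(), x.getValue());
            return byFrequency != 0 ? byFrequency : x.getKey().compareTo(y.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(numFeats, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = Map.of("good", 5, "bad", 3, "day", 3, "the", 1);
        System.out.println(topByFrequency(frequencies, 2));
    }
}
```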

getSimilarity

public float getSimilarity(FeatureBox inputText,
                           java.lang.String cat)
Function to retrieve the similarity of a given text with a given category.

Parameters:
inputText - The given text.
cat - The given category.
Returns:
The resulting similarity.
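Since cosine similarity is the measure named in the class description, a minimal sketch of it over two sparse term-weight vectors is given below. The map-based vector representation and the names are hypothetical; ARNReduced works on its own Graph and FeatureBox types.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class CosineSketch {

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two sparse vectors; terms missing from a
    // vector count as zero. Returns 0 if either vector is all zeros.
    static double cosine(Map<String, Double> text, Map<String, Double> category) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : text.entrySet()) {
            Double other = category.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double lengths = norm(text) * norm(category);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        Map<String, Double> text = Map.of("happy", 1.0, "day", 1.0);
        Map<String, Double> category = Map.of("happy", 2.0, "smile", 1.0);
        System.out.println(cosine(text, category));
    }
}
```

For non-negative weights the result lies between 0 (no shared terms) and 1 (same direction).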

applyModelTermWeighing

public void applyModelTermWeighing(ARNReduced.Graph inputGraph,
                                   int cat)
Method to weight the terms corresponding to the model vector. In some cases the term weighting method needs information from the original domain model (e.g., the |T^k| in (Alías et al., 2008) for the ITF), which is why this method is used instead of the general applyTermWeighing.

Parameters:
inputGraph - The given graph to weight.
cat - The given category for supervised term weighting methods.

applyTermWeighing

public void applyTermWeighing(ARNReduced.Graph inputGraph,
                              int cat)
Method to apply a term weighting methodology to the given graph. If the term weighting strategy of the ARN is supervised, a category is also provided. If no weighting strategy is needed, the input graph is not modified (it just contains the default frequencies of the terms within).

Parameters:
inputGraph - The given graph to weight.
cat - The given category for supervised term weighting methods.

buildGraph

public ARNReduced.Graph buildGraph(FeatureBox inputFeatures)
Function to build a graph from input features.

Parameters:
inputFeatures - The input features.
Returns:
The resulting graph.

buildFullGraph

public ARNReduced.Graph buildFullGraph(ARNReduced.Graph input)
Function to build a full graph with the term frequencies given by the input terms.

Parameters:
input - The input text graph.
Returns:
The full graph.

getCategory

public java.lang.String getCategory(FeatureBox inputFeatures)
Description copied from class: Classifier
The function that decides the most appropriate emotional category. This is required for any classifier. The classifier in question has to previously run any training algorithm in order to provide the required prediction.

Specified by:
getCategory in class Classifier
Parameters:
inputFeatures - The input emotional features.
Returns:
The most appropriate emotional category.
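Given one similarity score per category, a natural decision rule is to return the category with the highest score. The sketch below assumes that rule and resolves ties in favour of the category listed first; neither detail is confirmed by this page, and the names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not an EmoLib class.
final class DecisionSketch {

    // Returns the category with the highest similarity, visiting the
    // categories in list order; returns null if no score is available.
    static String mostSimilar(List<String> categories, Map<String, Double> similarity) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : categories) {
            Double score = similarity.get(category);
            if (score != null && score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> categories = List.of("negative", "neutral", "positive");
        Map<String, Double> scores = Map.of("negative", 0.1, "neutral", 0.4, "positive", 0.7);
        System.out.println(mostSimilar(categories, scores));
    }
}
```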

trainingProcedure

public void trainingProcedure()
Description copied from class: Classifier
Generic training procedure. It trains the classifier in question with the input training examples.

Specified by:
trainingProcedure in class Classifier

save

public void save(java.lang.String path)
Description copied from class: Classifier
Generic method to save the fully fledged classifier into a given file path. It is recommended to use a plain text file (such as XML) to save the classifier's configuration since it's readable directly.

Specified by:
save in class Classifier
Parameters:
path - The file path to save the classifier.

load

public void load(java.lang.String path)
Description copied from class: Classifier
Generic function to load a previously saved classifier. This function should be consistent with the design followed in the saving procedure.

Specified by:
load in class Classifier
Parameters:
path - The path of the file which contains the previously saved classifier.

resetExamples

public void resetExamples()
Description copied from class: Classifier
Method to reset the classifier and flush the training examples. This method only makes sense if the classifier in question is trainable and already has some training examples.

Overrides:
resetExamples in class Classifier

simpleClassification

public void simpleClassification()
Functionality test.