Blog
-- Thoughts on data analysis, software
development and innovation management. Comments are welcome.
Post 64
Natural Language Processing: forthcoming online classes at Stanford, and a unified approach with Multilayer Neural Networks
22-Jan-2012
Needless to say, I was very eager to begin the
nlp-class tomorrow.
It is a pity that it has just been delayed for administrative
reasons. But good things take time, so we'd better be patient
'cos this looks very good
(45,000 people awaiting its launch; this is incredibly exciting).
Having used Manning's and Jurafsky's masterpieces for years, I find it
enthralling to now have the chance to attend their classes and to get
my hands dirty with the classical problems, techniques and the
whole lot of details of statistical NLP.
With time, though, one becomes a little agnostic about
idolising any particular method for solving a given problem. Let me
elaborate on this. For instance, given a classification problem
and a fixed complexity for the discriminating function, does
it really matter whether the boundary is learnt with a Maximum Entropy
principle (e.g., Logistic Regression) or with a Maximum Margin criterion
(e.g., a Support Vector Machine)? Will the optimised functions differ much?
Assuming that the two classifiers are successfully trained (i.e., they are
pertinently regularised, so no overfitting problems exist, and they
are still complex enough for the problem at hand, so no underfitting either),
my experience tells me that the impact of the particular learning
philosophy/scheme (MaxEnt vs. Max-Margin) is minimal. And this
feeling (also supported by evidence) is shared by other researchers
such as Alexander Osherenko (see our
public discussion
on the Corpora-List), and by Ronan Collobert, who even developed a fully
unified approach
based on Multilayer Neural Networks (MNN) as universal learners (*) to tackle a
variety of typical NLP tasks, including Part-Of-Speech tagging, chunking,
Named Entity Recognition and Semantic Role Labelling.
In his own words: "it is often difficult to figure
out which ideas are most responsible for the state-of-the-art performance of
a large ensemble (of classifiers)". In this regard, Remark 4
(Conditional Random Fields) in (Collobert, et al., 2011) specifically
notes how other techniques
such as CRFs have been applied to the same NLP tasks with similar results.
So there are plenty of choices to attain the same goals, and the interest (IMO)
comes down to discussing which one best fits the situation/problem at hand.
To me this is synonymous with contributing to knowledge, although MNNs are
generally regarded as black boxes that lack human interpretability. In fact,
this is the seed of a critical discussion (Collobert, et al., 2011).
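To make this point tangible, here is a minimal sketch (assuming Python with scikit-learn, my choice for illustration and not part of any of the works cited here) comparing the boundary orientations learnt under the two criteria on the same synthetic problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Two well-separated Gaussian classes in 2D
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2) + [2, 2], rng.randn(200, 2) - [2, 2]])
y = np.hstack([np.ones(200), np.zeros(200)])

# MaxEnt criterion (Logistic Regression) vs. Max-Margin criterion (linear SVM),
# both pertinently regularised through C
maxent = LogisticRegression(C=1.0).fit(X, y)
maxmargin = LinearSVC(C=1.0).fit(X, y)

# Compare the orientations of the two linear boundaries (normalised weights)
w1 = maxent.coef_.ravel() / np.linalg.norm(maxent.coef_)
w2 = maxmargin.coef_.ravel() / np.linalg.norm(maxmargin.coef_)
print(np.dot(w1, w2))  # close to 1.0, i.e., nearly identical boundaries
```

With both models pertinently regularised, the two normal vectors end up almost parallel, i.e., practically the same boundary.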
(*) The proposed generic neural network discovers useful features in large unlabelled
datasets, which enable it to perform close to the state of the art. This approach
is presented in contrast to relying on a priori knowledge (e.g., domain heuristics).
Task-specific engineering (i.e., engineering that does not generalise to other tasks)
is not desirable in this multi-task setting (Collobert, et al., 2011).
--
[Collobert, et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M.,
Kavukcuoglu, K. and Kuksa, P., "Natural Language Processing (Almost) from Scratch",
Journal of Machine Learning Research, vol. 12, pp. 2493-2537, Aug 2011.
Post 63
L-systems, fractal plants and grammars
24-Dec-2011
One beautiful example of life is vegetation. In fact, since I was little I
had always believed that ferns were supposed to be the first form of
life on Earth, but now I see that Wikipedia debunks this belief
here.
Anyhow, plants do deserve a lot of
admiration and respect; they are fascinating to me.
I myself grow a vegetable garden (especially for cultivating tomatoes
with the seeds that my grandfather carefully selected for over 40 years),
I annually visit the St. Boi and
Molins agriculture and farming fairs, and I own two copies of
John Seymour's books on horticulture. However, since I am also a computer
enthusiast, seeing such beauty in the form of an algorithm is a doubly
rewarding experience to me. In this regard, I here release my
implementation of an L-system (see the
CODE section).
L-systems are a mathematical formalism originally aimed at modelling
the growth processes of plant development.
The basic idea is to define complex objects by successively replacing
parts of a simple object using a set of rewriting rules, aka productions.
These production rules can be carried out recursively and in parallel
(in contrast to Chomsky grammars, which are applied sequentially).
The graphical motivation of L-systems is to interpret a character string
as a sequence of drawing instructions. Here is where turtle graphics
come in handy; otherwise, the generated object would be very difficult
to describe with standard Euclidean geometry. The Logo programming language
is one example of such a turtle graphics system.
The fractal plant and tree definitions that L-system provides generate
remarkably organic figures.
The linguistic motivation of L-systems lies in the grammar that defines
them. The previous examples were produced with deterministic grammars, which
means that there is exactly one production rule for each symbol in the
grammar's alphabet. However, this implementation of L-system also allows
defining multiple rules per symbol, i.e., a stochastic grammar,
where the choice is made according to a uniform distribution among the
available options. Taking this concept to a textual form, a pair of
sentences with part-of-speech patterns such as "N V D N" and "N V V D N"
(where N stands for noun, V for verb and D for determiner)
may be produced with a single L-system with the following definition:
Symbol | Production rule |
S | N VP |
VP | V VP |
VP | V NP |
NP | D N |
(where S stands for sentence, NP for noun phrase and VP for verb phrase)
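To make the rewriting mechanism concrete, here is a minimal sketch of the parallel, stochastic derivation process (in Python for illustration only; it is not the released implementation, see the CODE section for that):

```python
import random

# Stochastic grammar: each symbol maps to a list of equally likely productions
RULES = {
    'S':  [['N', 'VP']],
    'VP': [['V', 'VP'], ['V', 'NP']],
    'NP': [['D', 'N']],
}

def rewrite(symbols):
    """Apply one parallel rewriting step: every symbol is replaced at once."""
    out = []
    for s in symbols:
        productions = RULES.get(s)
        if productions:
            out.extend(random.choice(productions))  # uniform choice among rules
        else:
            out.append(s)  # terminal symbols are copied verbatim
    return out

derivation = ['S']
for _ in range(3):
    derivation = rewrite(derivation)
print(' '.join(derivation))  # e.g., "N V D N"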
The source code organisation of L-system follows common FLOSS
conventions (such as an src folder, doc, README, HACKING, COPYING, etc.).
Its only external dependency is the Boost Iostreams Library, used for
tokenising the definition file, and it makes use of the premake build script
generation tool. Additionally, a Logo code interpreter is needed to
graphically represent the production. I hope you enjoy it.
By the way, Merry Xmas!
Post 62
Principal Component Analysis for Blind Source Separation? Shouldn't you use Independent Component Analysis instead?
02-Dec-2011
This week's work in the Machine Learning class has covered the Principal
Component Analysis (PCA) procedure to reduce the dimensionality of the
feature space. The use of the Singular Value Decomposition (SVD) of the covariance
matrix of the features has reminded me of the introductory video on unsupervised
learning where Prof. Ng applies the SVD to blindly separate speech sound sources
(e.g., the cocktail party problem). This puzzled me
because, as far as I know, the problem of Blind Source Separation
(BSS) boils down to finding a linear representation in which the components are
statistically independent; uncorrelatedness is not enough
(Hyvarinen, et al., 2001).
Independence is a much stronger property than uncorrelatedness. Considering the
BSS problem, many different uncorrelated representations of the signals
could be found that would not be independent and would not separate the sources.
Uncorrelatedness in itself is thus not enough to separate the components. This is also the
reason why PCA (or factor analysis) cannot separate the signals: they give components
that are uncorrelated, but little more. In this regard,
Independent Component Analysis (ICA) finds underlying factors or
components from multivariate (multidimensional) statistical data that are
both statistically independent and nongaussian.
Nongaussianity is a fundamental requirement that also explains the main difference between
ICA and PCA: in PCA, the nongaussianity of the data is not taken into
account (for Gaussian data, uncorrelated components are always independent).
In fact, ICA could be considered as nongaussian factor analysis.
In reality, the data often does not follow a Gaussian distribution, and the
situation is not as simple as PCA assumes. For example, many real-world
signals have supergaussian distributions, e.g., the speech signal follows a
Laplacian distribution (Gazor and Zhang, 2003).
This means that the random variables, i.e., the signals,
take values that are either very close to zero or very large relatively more often.
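To see this difference empirically, here is a minimal sketch (assuming Python with numpy and scikit-learn, which are not part of the class material) that mixes two supergaussian sources and tries to recover them with PCA and with FastICA:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Two supergaussian (Laplacian) sources, linearly mixed
rng = np.random.RandomState(0)
S = rng.laplace(size=(10000, 2))
A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix
X = S.dot(A.T)

# PCA yields uncorrelated components, but does not separate the sources
pca_est = PCA(n_components=2).fit_transform(X)
# ICA additionally exploits nongaussianity and recovers the sources
ica_est = FastICA(n_components=2, random_state=0).fit_transform(X)

def max_abs_corr(est, src):
    # Best absolute correlation of each estimated component with any source
    c = np.corrcoef(est.T, src.T)[:2, 2:]
    return np.abs(c).max(axis=1)

print('PCA:', max_abs_corr(pca_est, S))  # clearly below 1: sources still mixed
print('ICA:', max_abs_corr(ica_est, S))  # close to 1: sources recovered
```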
Seeing Prof. Ng solve the BSS problem with PCA does indeed surprise me,
and I certainly doubt that the solution can be happily
explained with a single Matlab/Octave one-liner. I have myself done some
experimentation (for the sake of empirical evidence) with the
noise cancellation
scenario that is suggested in my signal processing class, and I can definitely
state that PCA cannot separate voice from noise. In actual fact, both the speech
signal and the noise signal differ considerably from the Gaussian
distributions that are assumed in PCA, see (Gazor and Zhang, 2003) and the
figure below, respectively.
The noise used to produce the histogram in the figure corresponds to the rotor
of a helicopter.
--
[Hyvarinen, et al., 2001] Hyvarinen, A., Karhunen, J. and Oja, E.,
"Independent Component Analysis", New York: John Wiley & Sons, Inc.,
2001, ISBN: 0-471-22131-7
[Gazor and Zhang, 2003] Gazor, S. and Zhang, W., "Speech probability distribution",
IEEE Signal Processing Letters, vol. 10, no. 7, pp. 204-207, Jul. 2003,
ISSN: 1070-9908
Post 61
Support Vector Machines and the importance of regularisation to approach the optimal boundary
27-Nov-2011
In this post I would like to show some empirical evidence of the
importance of regularisation to control model complexity,
so as not to overfit the data and to generalise better.
To this end, I make use of this week's work
on Support Vector Machines (SVM) in the Machine Learning class.
Firstly, a scenario similar to
Post 52 is set up.
Then, the optimal Bayes boundary is drawn, along with an unregularised
SVM (C=100). Up to this point, the performance of this
classifier is expected to be as modest as the former
unregularised Logistic Regression: since it is tightly
dependent on the observed samples, the slope of the linear boundary
differs a lot from the optimum. Instead, when a considerable amount of
regularisation is applied,
i.e., an SVM with C=1, the model is more conservative
and reluctant to accommodate outliers, so its final slope approaches the
optimum, thus increasing its expected effectiveness.
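As an illustrative sketch of this effect (assuming Python with scikit-learn rather than the ml-class Octave code, and a scenario akin to the one in Post 52, i.e., two overlapping Gaussian classes):

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes centred at (1, 1) and (-1, -1)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])
y = np.hstack([np.ones(50), np.zeros(50)])

# Large C: little regularisation, the boundary chases individual samples
svm_loose = SVC(kernel='linear', C=100).fit(X, y)
# Small C: considerable regularisation, a more conservative boundary
svm_reg = SVC(kernel='linear', C=1).fit(X, y)

# The slope of each linear boundary (w0*x + w1*y + b = 0 => slope = -w0/w1),
# to be compared against the optimal Bayes slope (-1 for these class centres)
for name, clf in [('C=100', svm_loose), ('C=1', svm_reg)]:
    w = clf.coef_.ravel()
    print(name, 'slope:', -w[0] / w[1])
```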
The Octave code used to generate the figure is available
here.
Note that the rest of the ml-class files need to be present alongside it.
Post 60
On figuring out the similarity between the Mean Squared Error and the Negative Corpus LogLikelihood in the cost function for optimising Multilayer Neural Networks
23-Nov-2011
This long post title links with my previous
Post 59,
where I discussed using different criteria for the cost function and for the
derivation of the gradient when training Multilayer Neural Networks (MNN)
with Backpropagation. After a fruitful chat with
Prof. Xavier Vilasis
and some careful reading of (Bishop, 2006) and (Hastie, et al., 2003),
I have eventually come to realise there is no mystery, as long as the right
assumptions are made.
The gist of the discussion is the error of the trained model wrt the
data, specifically its form and its distribution.
Bishop (2006) shows that this error
function can be motivated as the maximum likelihood solution under an
assumed Gaussian noise model. Note that the normal distribution has
maximum entropy among all real-valued distributions with specified mean
and standard deviation. Applying some calculus,
the negative log of a Gaussian
yields a squared term, which determines the form
of the cost function, just as in the Mean Squared Error (MSE). Hence, the two
optimisation criteria converge to a similar quadratic form, and this
explains why they can be interchanged. Although not all
real-life problems are suitable for such a Gaussian noise model (e.g.,
the quantisation error has an approximately uniform distribution),
a normally distributed error is a plausible assumption in
the majority of situations. What follows is some experimentation considering
1) the impact of the nature of the data and 2) the fitting of the model to
attain this Gaussian noise model. In this regard, (1) is evaluated by
classifying data from a well-defined Gaussian distribution against,
first, data from a purely random distribution (generated by
radioactive decay)
and, second, data from its "equivalent" Gaussian model (obtained by computing
the mean and the variance of the points in the random sample). Given that
two Gaussians with the same variance are to be discriminated in
the end, the optimal Bayes boundary is a line perpendicular to the
segment that links their means, passing through its mid-point, see
Post 52. Therefore,
the MNN model does not need to be very complex. This is accomplished
with one hidden layer with two units, and (2) is evaluated by
applying different levels of regularisation (this design is ensured not
to underfit the data). If the assumption of a Gaussian noise model holds,
the error should be normally distributed,
and the effectiveness rates for the MSE and the Neg Corpus LogLikelihood
costs should be similar.
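Recalling the Bishop argument above, the calculus step is short. For a single target t and network output y, under a Gaussian noise model with fixed variance sigma^2, the negative log-likelihood reduces to a squared error plus terms that do not depend on y:

```latex
-\ln p(t \mid y)
  = -\ln\left[\frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\left(-\frac{(t-y)^2}{2\sigma^2}\right)\right]
  = \frac{(t-y)^2}{2\sigma^2} + \frac{1}{2}\ln\left(2\pi\sigma^2\right)
```

Summed over the corpus, minimising this quantity amounts to minimising the squared error.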
Without regularisation
Problem setting | Acc. MSE | Acc. Neg Corpus LogLikelihood |
Pure random | 63.80% | 68.40% |
Gaussians | 63.50% | 69.50% |
With regularisation
Problem setting | Acc. MSE | Acc. Neg Corpus LogLikelihood |
Pure random | 69.30% | 71.30% |
Gaussians | 68.50% | 71.20% |
Given the results, it is observed that the actual nature of
the data has little impact on the analysis, as the effectiveness
trends (measured with the Accuracy rate in the tables) are similar
for both scenarios in all settings. The determining factor for
assuming a normally distributed error, which embraces the
Gaussian noise model, is the fitting of the model wrt the data. This
leads to allowing some regularisation.
When the model overfits the data (i.e., without regularisation), its
poor generalisation yields a widely varied range of error samples that
departs from the normal distribution, hence the differing
effectiveness rates between the MSE and Neg LogLikelihood costs.
Instead, when the model fits the data well (i.e., with
regularisation, resulting in a complex-enough model that can
generalise well), the range of error samples is narrower (i.e., the
error samples are more comparable)
and its distribution approaches the Gaussian form, hence the
similar effectiveness rates between the MSE and Neg LogLikelihood costs.
Note that only in this maximum entropy setting does the classifier
contain the maximum amount of (Fisher) information from the data, so
this is the optimum model (i.e., the optimum parameter values)
for a given network structure. The code used to conduct the
experiments is available here.
--
[Bishop, 2006] Bishop, C. M., "Pattern Recognition and
Machine Learning (Information Science and Statistics)",
New York: Springer Science + Business Media, LLC, 2006,
ISBN: 978-0387310732
[Hastie, et al., 2003] Hastie, T., Tibshirani, R. and
Friedman, J. H., "The Elements of Statistical Learning",
Springer, 2003, ISBN: 978-0387952840
Post 59
Different cost criteria in Multilayer Neural Network training
19-Nov-2011
It is customary to train Multilayer Neural Networks (MNN) with the
Mean Squared
Error (MSE) criterion for the cost function (Duda, et al., 2001), especially
when the Backpropagation algorithm is used, which is presented as a
natural extension of the Least Mean Square algorithm for linear systems,
with a good deal of lexical coincidence around "mean square" altogether. Nevertheless,
Prof. Ng in the ml-class presented a somewhat different flavour
of cost function for training MNN,
resorting to the "un-log-likelihood" error, i.e., the
negative of the corpus log-likelihood, that typically
characterises the Logistic
Regression error wrt the data, and for a good reason:
Cost function | Effectiveness (accuracy) | Training time (sec) |
Neg. Corpus LogLikelihood | 95.06% | 57.94 |
MSE | 92.88% | 113.27 |
Not only is the Neg. Corpus LogLikelihood a more effective cost function
than the traditional MSE, it is also twice as fast to train using
Backpropagation with an MNN, at least for the digits
recognition task. Check out the
cost function code here. In
addition, it shows that the advanced optimisation method does its
job regardless of whether the cost function and the gradients derive
from matching criteria. That's awesome.
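For concreteness, here is a sketch of the two cost criteria themselves (in Python for illustration; the actual assignment code is in Octave), assuming a one-hot target matrix Y and sigmoid network outputs Y_hat, with regularisation terms omitted:

```python
import numpy as np

def cost_mse(Y, Y_hat):
    """Mean Squared Error between one-hot targets Y and network outputs Y_hat."""
    return 0.5 * np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

def cost_neg_loglik(Y, Y_hat, eps=1e-12):
    """Negative corpus log-likelihood (cross-entropy) for sigmoid outputs."""
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(np.sum(Y * np.log(Y_hat)
                           + (1 - Y) * np.log(1 - Y_hat), axis=1))
```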
--
[Duda, et al., 2001] Duda, R.O., Hart, P.E. and Stork, D.G., "Pattern
Classification", New York: John Wiley & Sons, 2001, ISBN: 0-471-05669-3
Post 58
Inter-language evaluation of sentiment analysis with granular semantic term expansion
17-Nov-2011
Yesterday, my Master's student Isaac Lozano defended his thesis with
honours. The topic of his work gives this post its title: inter-language
evaluation of sentiment analysis with granular semantic term expansion.
His work is mainly framed by our former publication (Trilla et al., 2010),
but he has extended it by porting EmoLib to Spanish in its entirety and
evaluating its performance wrt English, which is the default working
language, and also the one best supported by the research community.
His contribution principally consists in improving the
Word-Sense Disambiguation
module in order to profit from the knowledge contained in the
Spanish WordNet ontology,
and then expanding the term feature space with the synsets of the
observed words in their correct senses.
To accomplish this goal, a Lucene index of WordNet needs to
be created first, along with the Information-Content-based measures
that help determine the semantic similarity/relatedness among the terms
(e.g., with Jiang-Conrath, Resnik, etc.), a la
JavaSimLib.
In fact, to perform a more
granular analysis, several WordNet indices need to be
created, one for each word class that can carry affect, i.e.,
content words such as nouns, verbs and adjectives.
This granularity wrt the Part-Of-Speech is
assumed to help improve the identification of affect in text,
see (Pang and Lee, 2008). Nevertheless,
he concluded his work by showing that such intuitive considerations
do not deliver a clear improvement in either of the two languages,
at least for the datasets at hand: the
Semeval 2007 and the Advertising Database.
However, he pointed out several interesting ideas regarding the flaws he
observed during the development of his work, such as the consideration
of lemmas instead of stems in order to increase the retrieval
rate from WordNet, or the consideration of antonyms instead of synonyms
in case negations are observed in the text under analysis. Allow me to
express my congratulations.
Now that EmoLib performs in Spanish as well as in English, the new
language version of the demo service has just been made available
here.
Enjoy.
--
[Trilla et al., 2010] Trilla, A., Alias, F. and Lozano, I., "Text
classification of domain-styled text and sentiment-styled text for
expressive speech synthesis", In Proceedings of VI Jornadas en Tecnologia
del Habla (FALA2010) (ISBN: 978-84-8158-510-0), pp. 75-78, 2010,
November, Vigo, Spain.
[Pang and Lee, 2008] Pang B. and Lee L., "Opinion mining and sentiment
analysis", Foundations and Trends in Information Retrieval, vol. 2, no. 1-2,
pp. 1-135, 2008.
Post 57
Spanish Government presidential candidates face-off analytics
15-Nov-2011
This post is in line with a recent article that presents a novel
application to signal the polarity of opinions in Twitter posts, see
this
piece of news published by Universidad de Alicante. The original work is
centred on the face-off debate between Rajoy and Rubalcaba, the two main
candidates for the Spanish Government presidential election that
will take place next Sunday.
I here reproduce a similar experiment with the default configuration of
EmoLib,
which is grounded on a centroid-based classifier representing the words'
emotional dimensions in the circumplex, but with the transcription of the
whole debate, which is available for the two candidates:
Rajoy,
on behalf of the PP, and
Rubalcaba,
on behalf of the PSOE. What follows is a table that shows some linguistic
statistics regarding their speeches (bearing in mind that they spoke for the same
amount of time):
Feature | Rajoy (PP) | Rubalcaba (PSOE) |
Total paragraphs | 97 | 112 |
Positive paragraphs | 32 | 40 |
Neutral paragraphs | 46 | 56 |
Negative paragraphs | 19 | 16 |
Total unigrams | 8885 | 8859 |
Total bigrams | 8788 | 8747 |
Unigram vocabulary | 2207 | 2194 |
Bigram vocabulary | 6133 | 5933 |
In contrast to the original article, where only positive and
negative tags were provided, EmoLib allows a more granular
analysis with the additional "neutral" label, which lies in between the
opposing extreme positive and negative labels. According to the statistics,
both candidates have a similar speaking style.
Perhaps Rubalcaba is more inclined to present his ideas incrementally while
Rajoy prefers to put them forward at once, since the former articulated
15 paragraphs more than the latter in the same amount of time.
Anyhow, the richness of the vocabulary employed is similar (taken as
the ratio between
the size of the vocabulary and the total amount of spoken words). Maybe
what is most surprising is that the affective figures delivered by
this study end up being contrary to the original ones, since Rajoy
spoke positively 32.99% of the time while Rubalcaba did so 35.71% of the time.
The rates remain more or less comparable, but the trend is inverted
wrt what was originally published.
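For what it's worth, the linguistic statistics in the table above are straightforward to compute. A minimal sketch (in Python; EmoLib itself is not involved here, and the tokens variable stands for a hypothetical tokenised transcription):

```python
def ngram_stats(tokens, n):
    """Return the total number of n-grams and the n-gram vocabulary size."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(grams), len(set(grams))

# Hypothetical usage with a candidate's tokenised speech:
tokens = "so are the words of a speech tokenised one by one".split()
total_uni, vocab_uni = ngram_stats(tokens, 1)
total_bi, vocab_bi = ngram_stats(tokens, 2)
print(vocab_uni / total_uni)  # vocabulary richness rate
```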
Post 56
Multi-class Logistic Regression in Machine Learning (revisited)
10-Nov-2011
In my former post about Multi-class Logistic Regression in ML (see
Post 55),
I questioned the ml-class's use of a multi-class generalisation
strategy, i.e., One-Versus-All (OVA),
with a plain dichotomic Logistic Regression
classifier when there is a more concise multi-class model, i.e., the
Multinomial Logistic Regression (MLR), which is already in extensive use in ML.
I foolishly concluded (to err is human)
that the choice of OVA must have been due to its higher effectiveness,
assuming that my MLR implementation,
optimised with batch Gradient Descent, was directly comparable
to the one provided in class, which is based on an advanced optimisation
procedure "with steroids", such as Conjugate Gradient or L-BFGS. Specifically,
the course provides a method based on the Polack-Ribiere gradient search
technique, which uses first and second order derivatives to determine the search
direction, and the Wolfe-Powell conditions, which lead to fast minimisation of
the criterion function, altogether suited for efficiently dealing with a large
number of parameters.
But as Prof. Ng remarked in a previous lecture, the optimisation
procedure has a notable impact on the final fitting of the model wrt the
training data, and this definitely makes the difference. I have recoded the MLR
with the parameter unrolling method shown this week in order to interface with the
same advanced optimisation method (available
here),
so the results are now truly directly
comparable. The MLR yields an effectiveness of 95.22% in 99.58 seconds,
which makes it 0.2% more effective and 10.91 times faster than OVA-LR. Therefore,
the class-wise parameter dependence in the MLR predictions yields a
classifier that is much faster to train and slightly more effective than OVA-LR,
for this problem.
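The parameter unrolling trick itself is simple: the optimiser expects a single parameter vector, so the MLR weight matrix is flattened before optimisation and reshaped back inside the cost function. A minimal sketch (in Python; the actual assignment code is in Octave):

```python
import numpy as np

def unroll(Theta):
    """Flatten the (K-1) x D parameter matrix into one vector for the optimiser."""
    return Theta.ravel()

def roll(theta_vec, num_classes, num_features):
    """Reshape the optimiser's flat vector back into the (K-1) x D matrix."""
    return theta_vec.reshape(num_classes - 1, num_features)
```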
Nevertheless, the question wrt which strategy
is preferable still remains open because
other multi-class strategies such as Pairwise classification are yet to be
studied.
Post 55
Multi-class Logistic Regression in Machine Learning
06-Nov-2011
This week's assignment of the ml-class deals with multi-class classification
with Logistic Regression (LR) and Neural Networks. In this post I would
like to focus on the former method, though, in line with
Post 52.
There I missed the use of the Multinomial LR (MLR) to tackle multi-class
problems, questioning the need for a multi-category generalisation
strategy, i.e., One-Versus-All (OVA), when there is already a model that
inherently integrates the multi-class issue, i.e., the MLR. Now, after
conducting
some experimentation (see the table below), I shall conclude that the choice
of OVA must be due to its higher effectiveness, measured with the accuracy rate,
at least for the proposed digit-recognition problem.
Classifier | Effectiveness (accuracy) | Performance (sec) |
OVA-LR | 95.02% | 1086.21 |
MLR | 92.12% | 128.85 |
Both these methods learn some discriminant functions and assign a test
instance to the category corresponding to the largest discriminant
(Duda, et al., 2001). Specifically, the OVA-LR learns as many discriminant
functions as the number of classes, but the MLR learns one function less
because it firstly sets the parameters for one class (the null vector)
and then learns the rest in concordance with this setting
without loss of generality (Carpenter, 2008).
Therefore, the essential difference is that OVA-LR learns the
discriminants independently, while MLR needs all class-wise discriminants
for each prediction, so they cannot be trained independently. This
dependence characteristic may then harm the final system effectiveness
a little (in fact, it is only about 3% worse), but in contrast, the model
trains at a much faster rate (8.43 times faster given this data).
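To make the difference concrete, here is a sketch of the two prediction rules (in Python for illustration; the assignment code is in Octave). The MLR parameterisation pins one class to the null vector, following (Carpenter, 2008):

```python
import numpy as np

def predict_mlr(Theta, X):
    """Multinomial LR: Theta holds K-1 parameter vectors; the remaining class
    is pinned to the null vector, and all discriminants interact via softmax."""
    scores = np.hstack([np.zeros((X.shape[0], 1)), X.dot(Theta.T)])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1)

def predict_ova(Thetas, X):
    """OVA-LR: K independently trained dichotomic discriminants; the test
    instance is assigned to the largest one (the sigmoid is monotonic, so
    comparing the raw scores suffices)."""
    return X.dot(np.asarray(Thetas).T).argmax(axis=1)
```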
The code that implements the MLR, which is available
here
(it substitutes "oneVsAll.m" in "mlclass-ex3"),
is based on (Carpenter, 2008). Nevertheless, a
batch version has been produced in order to avoid the imprecision
introduced by the online approximation,
see Post 51,
and hence to be directly comparable to OVA-LR.
The question remains open as to whether the multi-class
generalisation of a dichotomic
classifier is generally preferable to a unified multi-class model,
because different decision criteria (effectiveness vs performance
(Manning, et al., 2008)) point to different classifiers,
and the results with other multi-class
strategies such as Pairwise classification are yet to be studied.
Actually, this MLR is not directly comparable to the OVA-LR developed
in class because it is
optimised with a different strategy. A fair comparison regarding this
optimisation aspect is conducted and explained in the following post,
see
Post 56.
--
[Duda, et al., 2001] Duda, R.O., Hart, P.E. and Stork, D.G., "Pattern
Classification", New York: John Wiley & Sons, 2001, ISBN: 0-471-05669-3
[Carpenter, 2008] Carpenter, B., "Lazy Sparse Stochastic Gradient Descent
for Regularized Multinomial Logistic Regression", 2008.
[Manning, et al., 2008] Manning, C. D., Raghavan, P. and Schutze, H.,
"Introduction to Information Retrieval", Cambridge: Cambridge University
Press, 2008, ISBN: 0521865719