Blog
-- Thoughts on data analysis, software development and innovation management. Comments are welcome.
Post 74
The abridged build-measure-learn loop: innovate and seek excellence
12-Feb-2013
The principal objective of a tech startup
(Research and Development fits the shoes of tech
entrepreneurship equally well) is to learn how to
build and run a sustainable business, where value is created when a new
technological invention is matched to a customer need. Validated
learning is therefore a fundamental concern in this uncertain
quest for success.
Ideas are born of initial leaps of faith. They then need to be
conceptualised, sketched, implemented and tested
through a minimum viable product; making use of
innovation accounting and actionable metrics, the results are
evaluated and the decision is made whether to pivot or persevere.
It is true that simulation is useful to understand the
impact of uncertainty on the distribution of expected outcomes,
but the real world is much harder to debug than a piece of code and there is
always the need to iterate a business idea with real people (i.e., prospective
customers) in order to discover their actual needs.
Similarly, in innovation management,
it is said that the innovation that moves
along the technology and market curves is incremental (persevere),
in contrast to the innovation that is disruptive, which introduces a
discontinuity and shifts to new curves (pivot).
A pivot is a special kind of structured change designed to test a new
fundamental hypothesis about the product, business model and
engine of growth. It is the heart of the Lean Startup method (in fact,
the runway of a startup is the number of pivots it can still make),
and it is what makes a company resilient in the face of failures (which are not
mistakes; that is a different issue). However, there is (at least)
one caveat to the Lean Startup method, one that is left out of
the pivoting topic: if you have true
expertise in a particular field, you are likely to succeed and
end up building something of value for the customers you discovered,
but there is no guarantee that this will be a rewarding experience for you.
In that situation, you cannot do great work (unless you have very broad
or changing tastes, your work preferences will prevent you from doing a
great job).
We know from Steve Jobs that "the only way to do great work is to love
what you do", so one might still need to pivot in that situation, too.
Joel Spolsky also proclaims this message in his "careers badge":
Love your job.
Or else, pivot, and have a read of Cal Newport's
book So Good They Can't Ignore You, which argues that what
you do for a living is much less important than how
you do it: focus on the hard
work required to become excellent at something valuable instead of
pivoting over and over until all variables fit your taste.
In a recent podcast,
though, Cal emphasises the importance of craftsmanship, which is somewhat
contradictory, because
craftsmanship is usually associated with passion.
Anyhow, it is a sensible link, and it is always reasonable to bear in mind the
reverse side of an argument.
Post 73
A New Year's resolution: get over specialisation and embrace generalisation to face real world industry problems
01-Jan-2013
Regularisation is a recurrent issue in Machine Learning (and so it is in this
blog, see this post).
Prof. Hinton also borrowed the concept in his neural networked view of
the world, and used a shocking term like "unlearning" to refer to it.
Interesting as it sounds, to achieve greater effectiveness one must
not learn the idiosyncrasies of the data; one must remain a little
ignorant in order to discover its true behaviour. In this post,
I revisit typical weight penalties like Tikhonov (L2 norm), Lasso (L1
norm) and Student-t (sum of logs of squared weights), which function as
model regularisers:
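In standard notation (a reconstruction, with $w$ the weight vector and $\lambda$ the regularisation strength):

$$E_{\mathrm{L2}}(w) = \lambda \sum_i w_i^2, \qquad E_{\mathrm{L1}}(w) = \lambda \sum_i |w_i|, \qquad E_{\mathrm{t}}(w) = \lambda \sum_i \log\left(1 + w_i^2\right)$$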
And their representation in the feature space is shown as follows
(the code is available
here;
this time I used the Nelder-Mead Simplex algorithm to
fit the linear discriminant functions):
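For illustration, here is a minimal sketch of such a fit (my own reconstruction in Python, not the linked code: it assumes scipy, a toy two-class Gaussian dataset and a logistic discriminant with a squared-error cost):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Toy two-class Gaussian dataset, targets in {0, 1}
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

penalties = {
    "none":      lambda w: 0.0,
    "tikhonov":  lambda w: np.sum(w ** 2),            # L2 norm
    "lasso":     lambda w: np.sum(np.abs(w)),         # L1 norm
    "student_t": lambda w: np.sum(np.log1p(w ** 2)),  # sum of logs of squared weights
}

def cost(w, penalty, lam=0.1):
    y = 1.0 / (1.0 + np.exp(-Xb @ w))  # logistic discriminant
    return np.sum((t - y) ** 2) + lam * penalties[penalty](w)

for name in penalties:
    # Nelder-Mead is derivative-free, so even the non-smooth L1 penalty
    # can be plugged in without worrying about subgradients
    res = minimize(cost, x0=np.zeros(3), args=(name,), method="Nelder-Mead")
    print(name, res.x)
```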
As expected, the regularised models generalise better because they approach
the optimal solution,
although the differences are small for the problem
at hand. Further regularisation proposals could
still be suggested using model ensembles through bagging, dropout, etc.,
but are they really necessary? Does one need to bother learning them?
The obtained results are more or less the same, anyway.
What is more, not every situation may
come down to optimising a model with a fancy smoothing method.
For example, see the discussion about product improvement in Eric Ries' "The Lean
Startup" (page 126, Optimisation Versus Learning), where
optimising under great uncertainty can lead to a totally useless product,
in addition to a big waste of time and effort (as the true objective
function, i.e., the success indicator the product needs to become great,
is unknown). And further still,
not in the startup scenario but in a more established industry like
rail transport,
David Briginshaw (Editor-in-Chief of the International Railway Journal,
October 2012) wrote:
"Specialisation leads to people becoming blinkered with a very narrow view
of their small field of activity, which is bad for their
career development, (...),
and can hamper their ability to make good judgements."
So, a lack of generalisation (as happens with overfitted models)
leads to a uselessly skewed vision of the world. Abraham Maslow already put it
in different words: if you only have a hammer, you tend to see every
problem as a nail. This reflection inevitably brings into the picture the people
at the crest of specialisation: the PhDs. Is there any place for
them outside the fancy world of academia, where they usually dwell and
solve imaginary problems? Are they ready to face
the real, tangible problems (which are not only technical) commonly found
in industry? The world is harder to debug than any snippet of fancy
code. Daniel Lemire has long discussed these aspects, stating that
training more PhDs in some targeted areas might fail to improve research output in those areas.
Creating new research jobs would instead be a preferable choice, as it is
usually the case that academic papers do not suit many engineering needs,
and those fancy (reportedly enhanced) methods
are thus never adopted by industry.
His articles are worth a read.
Research is indeed necessary to solve real-world problems,
but it must be driven by added-value objectives, lest it be of no use at all.
Free, happy-go-lucky
research should not be a choice nowadays (has anyone heard of the
financial abyss in academia?).
Post 72
Sure, you can do that... and still get an IEEE published article
24-Dec-2012
This year has been rather prolific with respect to the number of
research publications attained. The most noteworthy is the one in the IEEE
Transactions on Audio, Speech and Language Processing (TASLP), which is entitled
"Sentence-based Sentiment Analysis for Expressive Text-to-Speech". Its
abstract is as follows:
"Current research to improve state of the art Text-To- Speech (TTS) synthesis studies both the processing of input text and the ability to render natural expressive speech. Focusing on the former as a front-end task in the production of synthetic speech, this article investigates the proper adaptation of a Sentiment Analysis procedure (positive/neutral/negative) that can then be used as an input feature for expressive speech synthesis. To this end, we evaluate different combinations of textual features and classifiers to determine the most appropriate adaptation procedure. The effectiveness of this scheme for Sentiment Analysis is evaluated using the Semeval 2007 dataset and a Twitter corpus, for their affective nature and their granularity at the sentence level, which is appropriate for an expressive TTS scenario. The experiments conducted validate the proposed procedure with respect to the state of the art for Sentiment Analysis."
In addition, three other publications at the
SEPLN 2012 Conference (see
Publications) have allowed me to focus on specific aspects as subsets of a
greater whole (i.e., the IEEE TASLP article). This has been hard work,
indeed. And I'm proud of it.
Nonetheless, I cannot help being objective about it and admit that this
line of research falls into the
"data porn"
category (check out the "publication Markov Chain" that is being mocked
there). In any case, the addressed problem is a real one, and alternative
sources of knowledge have been considered to solve it, so this is an
altogether good lesson learnt.
By the way, Merry Xmas!
Post 71
Perceptron learning with the overused Least Squares method
02-Nov-2012
Following Geoffrey Hinton's lectures on
Neural Networks for Machine Learning,
this post overviews the Perceptron, a single-layer artificial neural
network that provides a lot of learning power, especially by tuning
the strategy that is used for training the weights (note that Support Vector
Machines are Perceptrons in the end). To keep things simple, 1) no
regularisation issues will be covered here, and 2) the weight optimisation
criterion
will be the minimisation of the squared error cost function, which can
be happily overused. In another
post,
the similarity between the least squares method and the
cross-entropy cost obtained through the negative log-likelihood function
under a Gaussian error assumption (as reviewed in class) was already
discussed, so using one or the other won't yield much of an
effectiveness improvement on a classic
toy dataset sampled from two Gaussian
distributions.
Therefore, the ability of the perceptron to excel in classification tasks
effectively relies on its activation function. In the lectures, the
following functions are reviewed: binary, linear, logit and softmax.
All of them provide their own singular learning capability, but the nature
of the data for the problem at hand is always a determining factor to
consider. The binary activation function is mainly used for describing
the Perceptron rule, which updates the weights in the direction of steepest
descent. Although this method is usually presented as an isolated golden
rule, not linked to the gradient, the math is clearer than the
wording:
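In standard form (a reconstruction under the squared-error criterion of this post, with learning rate $\eta$, input $x$, target $t$ and output $y$):

$$\text{a)}\quad \Delta w = \eta\,(t - y)\,x, \qquad y = \mathbf{1}[w^\top x > 0]$$
$$\text{b)}\quad \Delta w = \eta\,(t - y)\,x, \qquad y = w^\top x$$
$$\text{c)}\quad \Delta w = \eta\,(t - y)\,y\,(1 - y)\,x, \qquad y = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$$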
Eq. a) corresponds to the update rule with a binary activation function,
Eq. b) with a linear function and Eq. c) with a logit function.
The gradient for the logit is appended above to show how a
different activation function (and thus a different cost function to
minimise) provides an equivalent discriminant function
(note that the softmax is a generalisation of the logit to multiple
categories, so it makes little sense here):
As can be observed in the plot, the form of the activation indeed
shapes the decision function under the same cost criterion (not of much
use here, though). In certain situations, this can make the difference
between a good model and an astounding one. Note that different optimisation
functions require different learning rates to reach convergence (you may
check the code here).
And this process can be further
studied with many different activation functions (have a look at the
variety of sigmoids
available) as long as the cost function is well behaved (i.e.,
it is convex).
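As a minimal sketch of this study (again a reconstruction, not the linked code; numpy only, toy Gaussian data), the three update rules can be trained side by side:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy dataset: two Gaussian classes, targets in {0, 1}
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])
Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column

def train(activation, eta, epochs=200):
    w = np.zeros(3)
    for _ in range(epochs):
        z = Xb @ w
        if activation == "binary":
            y = (z > 0).astype(float)
            grad_factor = t - y                   # Perceptron rule, Eq. a)
        elif activation == "linear":
            y = z
            grad_factor = t - y                   # delta rule, Eq. b)
        elif activation == "logit":
            y = 1.0 / (1.0 + np.exp(-z))
            grad_factor = (t - y) * y * (1 - y)   # squared-error gradient, Eq. c)
        w += eta * Xb.T @ grad_factor             # batch weight update
    return w

# Note the activation-dependent learning rates needed for convergence
for act, eta in [("binary", 0.01), ("linear", 0.001), ("logit", 0.1)]:
    print(act, train(act, eta))
```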
Just for the record, the Perceptron as we know it is attributed to
Rosenblatt, but similar discussions can be found with respect to
the Adaline model, by Widrow and Hoff. Don't let fancy scientific
digressions disguise such a useful machine learning model!
Post 70
Least Squares regression with outliers is tricky
23-Jul-2012
If reams of disorganised data are all you can see around you, a Least Squares regression may be
a sensible tool to make some sense out of them (or at least to approximate
them within a reasonable interval, making the analysis problem more tractable).
Fitting functions to data is a pervasive issue in many aspects of data engineering.
But since the devil is in the
details, different objective criteria may cause the optimisation results to diverge
considerably (especially if outliers are present), misleading the interpretation
of the study, so this aspect cannot be treated carelessly.
For the sake of simplicity, linear regression is considered in this post.
In the following lines, Ordinary Least Squares (OLS), aka Linear Least
Squares, Total Least Squares (TLS) and Iteratively Reweighted Least
Squares (IRWLS) are used to regress a set of points that
follow a linear function but include an outlying nuisance, in order to evaluate
the ability of each method to cope with such a noisy instance (this
is fairly usual in a real-world setting).
OLS is the most common and naive method to regress data. It is based on
the minimisation of a squared distance objective function, which is the vertical
residual between the measured values and their corresponding current predicted
values. In some problems, though, instead of having measurement errors along one particular axis, the
measured points have uncertainty in all directions, which is known as the errors-in-variables
model. In this case, using TLS with mean subtraction (beware of heteroskedastic
settings, which seem quite likely to appear with outliers,
since the process is then not statistically optimal) could be a better choice
because it minimises the sum of orthogonal squared distances to the regression line.
Finally, IRWLS with a bisquare weighting function is regarded as a robust
regression method to mitigate the influence of outliers, linking with
M-estimation in robust statistics. The results are shown as follows:
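For illustration, a minimal Python sketch of the three fits (a reconstruction with numpy, not the linked Matlab code; 4.685 is the customary bisquare tuning constant):

```python
import numpy as np

rng = np.random.default_rng(2)
# Linear data y = 2x + 1 with mild noise, plus one gross outlier
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 0.5, x.size)
y[-1] += 25  # the outlying nuisance

# OLS: minimise vertical squared residuals
slope_ols, icept_ols = np.polyfit(x, y, 1)

# TLS: mean subtraction, then smallest right singular vector of [x-mx, y-my]
mx, my = x.mean(), y.mean()
_, _, Vt = np.linalg.svd(np.column_stack([x - mx, y - my]))
a, b = Vt[-1]                  # normal vector of the line a*(x-mx) + b*(y-my) = 0
slope_tls = -a / b
icept_tls = my - slope_tls * mx

# IRWLS with Tukey's bisquare weights (robust M-estimation)
w = np.ones_like(x)
A = np.column_stack([x, np.ones_like(x)])
for _ in range(20):
    W = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * W[:, None], y * W, rcond=None)
    r = y - A @ coef
    s = 1.4826 * np.median(np.abs(r - np.median(r)))    # robust scale (MAD)
    u = r / (4.685 * s)
    w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)   # bisquare weights

print("OLS:  ", slope_ols, icept_ols)
print("TLS:  ", slope_tls, icept_tls)
print("IRWLS:", coef)
```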
According to the results, OLS and TLS (with mean subtraction)
display a similar behaviour despite their differing optimisation criteria,
and both are somewhat affected by the outlier (TLS more so than OLS).
In contrast, IRWLS with a bisquare weighting function maintains the overall
spirit of the data distribution and pays little attention to the skewed
information provided by the outlier. So, next time reliable regression results
are needed, the inner workings of the regression method in use deserve
careful consideration.
Note: I used Matlab for this experiment (the code is available
here).
I still do support the use of
open-source tools for educational purposes, as it is a most enriching
experience to discover (and master) the traits and flaws of OSS and proprietary
numerical computing platforms, but for once, I followed Joel Spolsky's
9th principle for better code:
use the best tools money can buy.
Post 69
On using Hacker News to validate a product idea involving NLP and PHP
12-Jul-2012
The first step to creating a valuable product is to discover what exactly
is wanted or needed by the target customers. The Lean Startup
process states it straight, and the Pragmatic Programmer even provides
a means to find it out: asking Hacker News (HN). HN is a vibrant
community of tech people, hackers in the broadest sense... and
entrepreneurs (these concepts need not be disjoint), who can provide
a lot of insight into the value of a product idea.
Now, my product idea: a general-purpose Natural Language Processing (NLP)
toolkit coded in PHP. This is certainly a long
wanted
product
(note that the two links date back to 2008),
and for a sensible reason: the Internet is bloated with textual
content, so let's develop an NLP tool focused on processing
text on the web. In this sense, the PHP programming language,
by definition the Hypertext Preprocessor, should be a practical choice
with which to do it. Moreover, PHP is the default platform available
on a web server. So all the elements seem to be in the right place.
The problem seems to be addressed logically this way, but the idea
still needs positive feedback from the end users (the developers)
to succeed. Note that none of the currently available
NLP toolkits reported in the Wikipedia list
has been developed in PHP; so is there a niche of opportunity here,
or is there something wrong going on? Why is that?
Perhaps the product was not interesting a few years ago, maybe it did not
catch on because of marketing issues, or perhaps using the many bindings
and wrappers available was just enough, compared to putting in the effort of doing
it all again from scratch... Therefore, the question naturally arises:
is it really interesting to the community? If so, to what extent?
Is it worth the bother? Will this be a profitable project? Would it
be nuts to rely solely on Ian Barber's
opinion?
These questions require some scientific experimentation, so I built a
prototype
(mainly based on text classification, which has 24 GitHub watchers
at the time of writing; thanks for your interest, indeed) and
submitted it
to HN. What I found out was contrary to what I expected: the general interest in this
kind of product is essentially nonexistent, just in line with what had already happened
with the previous approaches. I failed. OK. At least I now know by myself it's nonsense to
invest in this product. I'd better do something else. Fine. Let's keep
engineering. The upside is that I practised some PHP (my skills with this language
were getting a little rusty) and, more importantly, I learnt that businesses
need solutions, not tools to develop solutions (this conclusion derives
directly from the only, ironic, comment that appeared on HN, motivated
by the demo app I provided, in which I trained the classifier
with a popular research dataset only as a proof of concept). That's awesome! If I had
dismissed the so-valuable Lean Startup directive, assuming that the world
was just how I saw it, I would have "wasted" (please note the quotation
marks) a whole lot of time
developing something nobody would pay for (I'm being rather like
Edison here, I know). This is an undoubtedly good
"lesson learned".
Needless to say, though, if I ever obtain financial support for its
development, I will gladly resume the coding phase!
Post 68
The Passionate Programmer in the late-2000s recession
03-Jun-2012
The present receding economy presents a scenario that is wildly unfamiliar,
and this inevitably affects the attitude we take towards our
careers, reminding us all of the crucial importance of always heading to
where the magic happens.
In addition to the renowned advice to
not settle,
what is utterly valuable is to
stay hungry
in this continuously changing world.
In this regard, the Passionate (and Pragmatic) Programmer provides some insight
that is worth noting.
In this post, I review some of its guidelines to "create a remarkable career in
software development", and the many connections with the present situation
arise naturally (the book was published three years ago):
- Pursue the bleeding edge of technology (the Next Big Thing), out of the
comfort zone.
- Seek salient features as a professional (this was related to the author's
stay in India for recruitment issues, which reminded me of the
Aspiring Minds Machine Learning Competition).
- Choose your crowd wisely. The people around you affect your own
performance. Hang out with the greats. Stand on the shoulders of
giants; there are many ways to paraphrase this.
- Practise at your limits to improve. If you always do what you've always done,
you will always get what you've always got.
- Work with a mission. Attain daily accomplishments (consider
the pomodoro technique).
- You are what you can explain, so don't make a fool of
yourself with (unnecessary) rigid values (avoid monkey traps).
- Be intentional about your choice of career path and how to
invest in your professional self. Career choices should be sought
after and decided upon with intention. Each choice should be
part of a greater whole (connecting the dots...).
And last but not least: you can't creatively help a business
until you know how it works. In this regard,
the next book in my reading list is The Lean
Startup.
Post 67
Numerical computation platform for the technical university: values to decide on a proprietary or open-source software model
30-Mar-2012
A discussion on the adequacy of a proprietary numerical computation
platform like Matlab or a free open source alternative like Octave is
an old story
already. But I feel it would be inadequate to stick to rigid
values only because of one's preference for a particular software
development model. To me, one is just as good as the other. And
I speak with some authority, being the TA at the university who
led the migration from Matlab to Scilab for the practice sessions of
Discrete-Time Signal Processing (a graphical interface for simulating
dynamic systems was required, hence
xcos was needed), which
is part of the Master's degree programme in Telecommunications Engineering.
Needless to say, I have a preference for open-source software, but in
an educational environment such as the university, choosing an open
platform for teaching is more of an act of responsibility than it is
of taste. Here are the reasons why
I bothered to remake from scratch the whole set of practice sessions
(see my teaching
materials) with Scilab:
- It enables the students to drill down into every detail of their
implementations from all analysis perspectives. This is the
essence of hacking
in the end, which in turn was born at MIT,
one of the most prestigious technology universities in the
world.
- It saves students (and the university) a good deal of money as they are
not forced to buy a proprietary software licence for conducting
the lab experiments (the over-dimensioned commercial product
is simply not necessary).
- It does not entice students into the illegal activity of
software piracy, i.e., into breaking the contract they are forced to
accept with a proprietary software licence.
With these arguments I don't mean that Matlab is a bad product at all in
any sense! On the contrary, as long as people acquire it,
I assume it must provide some differentiated
solutions for specific needs. But in a university environment, where the
gist of a technical class is teaching off-the-shelf methods to
fledgling engineers, open-source software packages like Octave, Scilab or
SciPy offer high-quality numerical computation platforms that are
orders of magnitude more powerful than is needed. What is more, I have
used them myself for more serious computing tasks, and in my experience,
strictly speaking, they are truly comparable to their proprietary
counterparts.
Now, while this seems to be a reasonable and sound argument (IMHO), the
companies concerned seem to disagree, complaining to the university to prevent
the publication of such opinions and to remove the teaching materials
that we instructors offer for free for the sake of education.
If such complaints are not simply dismissed, as a
university that is supposed to always protect educational freedom should do,
students are at risk of being entangled in the monopoly dictated by
these companies (and shamefully accepted by the university). This is
what happened to Guillem Borrell with Mathworks and the Universidad
Politecnica de Madrid. And since I share his indignation about this issue,
I wanted to echo his
open letter to Mathworks.
I wonder if this company will also complain to
Andrew Ng for the
similar opinions he expressed in the materials he prepared for the
Machine Learning class.
Post 66
Spelling correction and the death of words
23-Mar-2012
One of the topics treated in this second week of the
Natural Language Processing class
at Coursera is spelling correction (also treated in the
Artificial Intelligence class).
It's wonderful to have tools that help proofread manuscripts,
but this comes at the expense of impoverishing our own ability of expression.
This
newspaper article, which links to the
original research work
conducted by Alexander Petersen, Joel Tenenbaum, Shlomo Havlin and Eugene Stanley,
states that spelling correction (not only computerised but also human-made in the
editorial industry) causes language to be homogenised, and this eventually reduces
the lexicon (old words die at a faster rate than new words are created).
So, is this fancy NLP topic actually hurting NLP itself? What a headache...
Anyway, I find this spelling correction field very appealing because it
shows a direct link with speech (i.e., spoken language) through the consideration
of a phonetic criterion in the spelling error model. This points to the
metaphone algorithm, which
creates the same key for similar-sounding words. It is reported that metaphone is
more accurate than soundex as it knows the basic rules of English pronunciation.
Regarding spelling correction, metaphone is used in GNU Aspell, and to my
surprise, it's already integrated in the
latest versions of PHP!
Along with the edit distance topic treated in the first week, this shall make a new
addition (e.g., a phonetic similarity module) to
the NLP toolkit I'm beginning to work on!
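As a first sketch of such a module (pure-Python Levenshtein distance; the phonetic key is assumed to come from the third-party jellyfish package, and the candidate list is purely illustrative):

```python
import jellyfish  # assumed third-party package providing metaphone()

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def rank_candidates(word, candidates):
    """Prefer candidates that sound alike, then those with fewer edits."""
    key = jellyfish.metaphone(word)
    return sorted(candidates,
                  key=lambda c: (jellyfish.metaphone(c) != key,
                                 edit_distance(word, c)))

print(rank_candidates("thompsen", ["thompson", "thomson", "tension"]))
```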
Post 65
Hacking with Multinomial Naive Bayes
29-Feb-2012
Today it's the most significant day of a leap year, and I won't
miss the chance to blog a little. I think I can put
Udacity aside for a moment to
note the importance of Naive Bayes in the
hacker world.
Despite its naive assumption of feature independence, which does
not hold for text data due to the grammatical structure of language, the
classification decisions (based on the Bayes decision rule) of this
oversimplified model are surprisingly good. I am particularly
fond of implementing
the Multinomial version of Naive Bayes as defined in
(Manning, et al., 2008), and I must say that for certain problems
(namely for sentiment analysis) it improves the state-of-the-art
baseline straightaway. My open source implementation is available
here,
as well as a couple of example applications on
sentiment analysis
and
topic detection.
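For reference, a minimal Python sketch of Multinomial Naive Bayes in the spirit of (Manning, et al., 2008), with add-one (Laplace) smoothing (a toy reconstruction, not the PHP implementation linked above):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (class_label, list_of_tokens) pairs."""
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)   # per-class term frequencies
    vocab = set()
    for label, tokens in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    condprob = {}
    for c, counts in token_counts.items():
        total = sum(counts.values())
        # add-one smoothing over the vocabulary
        condprob[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return priors, condprob

def classify(priors, condprob, tokens):
    """Argmax of log prior plus summed log conditional probabilities."""
    scores = {c: priors[c] + sum(condprob[c][t] for t in tokens if t in condprob[c])
              for c in priors}
    return max(scores, key=scores.get)

docs = [("pos", "great fun great".split()),
        ("neg", "boring awful".split())]
model = train(docs)
print(classify(*model, "great movie".split()))  # -> 'pos'
```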
UPDATE on 07-Mar-2012: A book entitled "Machine Learning for Hackers" has
just been published.
--
(Manning, et al., 2008) Manning, C. D., Raghavan, P. and Schütze, H., "Introduction to Information Retrieval", Cambridge: Cambridge University Press, 2008. ISBN: 0521865719.