Post 55
Multi-class Logistic Regression in Machine Learning
06-Nov-2011
This week's assignment of the ml-class deals with multi-class classification
with Logistic Regression (LR) and Neural Networks. In this post I would
like to focus on the former method, in line with
Post 52.
There I missed the use of the Multinomial LR (MLR) to tackle multi-class
problems, which called into question the need for a multi-category
generalisation strategy, i.e., One-Versus-All (OVA), when there is already
a model that inherently handles the multi-class setting, i.e., MLR. Now,
after conducting some experimentation (see the table below), I conclude
that the preference for OVA must be due to its higher effectiveness,
measured here as the accuracy rate, at least for the proposed
digit-recognition problem.
Classifier | Effectiveness (accuracy) | Performance (training time, sec) |
OVA-LR | 95.02% | 1086.21 |
MLR | 92.12% | 128.85 |
Both these methods learn a set of discriminant functions and assign a test
instance to the category corresponding to the largest discriminant
(Duda, et al., 2001). Specifically, OVA-LR learns as many discriminant
functions as there are classes, whereas MLR learns one function fewer
because it first fixes the parameters of one (reference) class to the null
vector and then learns the rest relative to this setting, without loss of
generality (Carpenter, 2008).
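In other words, with K classes and parameter vectors \theta_1, ..., \theta_K,
fixing the reference class at zero gives the usual softmax posterior,

p(y = k \mid x) = \exp(\theta_k^\top x) / \sum_{j=1}^{K} \exp(\theta_j^\top x), with \theta_K = 0,

so only the K - 1 vectors \theta_1, ..., \theta_{K-1} actually have to be
learned.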
Therefore, the essential difference is that OVA-LR learns its discriminants
independently, whereas MLR needs all class-wise discriminants for each
prediction, so they cannot be trained independently. This coupling may hurt
the final effectiveness a little (here, roughly 3 points of accuracy), but
in exchange MLR learns much faster (8.43 times faster on this data).
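The coupling shows up directly in the batch gradient of the (average)
negative log-likelihood J: for class k,

\partial J / \partial \theta_k = (1/m) \sum_{i=1}^{m} ( p(y = k \mid x_i) - [y_i = k] ) x_i,

and since p(y = k \mid x_i) involves the softmax denominator over all
classes, the update for one class cannot be computed without the current
parameters of all the others.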
The code that implements the MLR is available
here
(it replaces "oneVsAll.m" in "mlclass-ex3")
and is based on (Carpenter, 2008). However, a
batch version has been produced in order to avoid the imprecision
introduced by the online approximation,
see Post 51,
and hence to be directly comparable to OVA-LR.
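For illustration only (this is not the file linked above), a minimal sketch
of such a batch MLR trainer in Octave could look as follows; the function
name, the plain gradient-descent loop and the fixed learning rate are
assumptions made for the example:

function all_theta = trainMLRBatch(X, y, num_labels, lambda, alpha, num_iters)
  % A sketch for illustration, not the linked implementation.
  % X: m x n feature matrix (without bias); y: m x 1 labels in 1..num_labels.
  % Returns one row of parameters per class, the last row being the
  % reference class (kept at the null vector).
  m = size(X, 1);
  X = [ones(m, 1) X];                       % prepend the bias term
  n = size(X, 2);
  Theta = zeros(num_labels - 1, n);         % reference class is fixed at zero
  Y = bsxfun(@eq, y, 1:num_labels);         % one-hot targets, m x K

  for iter = 1:num_iters
    Z = [X * Theta', zeros(m, 1)];          % class scores; reference score is 0
    Z = bsxfun(@minus, Z, max(Z, [], 2));   % shift for numerical stability
    P = bsxfun(@rdivide, exp(Z), sum(exp(Z), 2));  % softmax posteriors, m x K
    % Batch gradient of the regularised negative log-likelihood (bias excluded)
    grad = (P(:, 1:end-1) - Y(:, 1:end-1))' * X / m;
    grad(:, 2:end) = grad(:, 2:end) + (lambda / m) * Theta(:, 2:end);
    Theta = Theta - alpha * grad;
  end

  all_theta = [Theta; zeros(1, n)];         % append the reference class row
end

Since the reference class keeps a zero parameter row, a prediction can still
be taken as the arg max of the class scores X * all_theta', so the output
keeps the same parameter layout as the matrix returned by "oneVsAll.m" in
the exercise.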
The question remains open as to whether the multi-class generalisation of a
dichotomous classifier is generally preferable to a unified multi-class
model, since different decision criteria (effectiveness vs. performance
(Manning, et al., 2008)) point to different classifiers, and the results
with other multi-class strategies, such as pairwise classification, are
yet to be studied.
Actually, this MLR is not directly comparable to the OVA-LR developed
in class because it is
optimised with a different strategy. A fair comparison with respect to this
optimisation aspect is conducted and explained in the following post, see
Post 56.
--
[Duda, et al., 2001] Duda, R.O., Hart, P.E. and Stork, D.G., "Pattern
Classification", New York: John Wiley & Sons, 2001, ISBN: 0-471-05669-3
[Carpenter, 2008] Carpenter, B., "Lazy Sparse Stochastic Gradient Descent
for Regularized Multinomial Logistic Regression", 2008.
[Manning, et al., 2008] Manning, C. D., Raghavan, P. and Schütze, H.,
"Introduction to Information Retrieval", Cambridge: Cambridge University
Press, 2008, ISBN: 0-521-86571-9