Alexandre Trilla, PhD - Data Scientist


-- Thoughts on data analysis, software development and innovation management. Comments are welcome

Post 60

On figuring out the similarity between the Mean Squared Error and the Negative Corpus LogLikelihood in the cost function for optimising Multilayer Neural Networks


This long post title links with my previous Post 59, where I discussed using different criteria for the cost function and for the derivation of the gradient when training Multilayer Neural Networks (MNN) with Backpropagation. After a fruitful chat with Prof. Xavier Vilasis and some careful reading (Bishop, 2006) and (Hastie, et al., 2003), I have eventually come to realise there is no mystery, as long as the right assumptions are made.

The gist of the discussion is the error of the trained model with respect to the data, along with its form and/or its distribution. Bishop (2006) shows that the sum-of-squares error function can be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Note that the normal distribution has maximum entropy among all real-valued distributions with a specified mean and standard deviation. With some calculus, the negative log of a Gaussian yields a squared term, which gives the cost function the same form as the Mean Squared Error (MSE). Hence, the two optimisation criteria converge to a similar quadratic form, which explains why they can be interchanged. Not all real-life problems suit such a Gaussian noise model (e.g., the quantisation error has an approximately uniform distribution), but a normally distributed error is a plausible assumption in the majority of situations.

What follows is some experimentation considering 1) the impact of the nature of the data and 2) the fitting of the model to attain this Gaussian noise model. In this regard, (1) is evaluated by classifying data from a well-defined Gaussian distribution against, first, data from a pure random distribution (generated by radioactive decay) and, second, data from its "equivalent" Gaussian model (obtained by computing the mean and the variance of the points in the random sample). Given that two Gaussians with the same variance are to be discriminated in the end, the optimal Bayes boundary is a line perpendicular to the segment that links their means, passing through its mid-point (see Post 52). Therefore, the MNN model does not need to be very complex: one hidden layer with two units is used. In turn, (2) is evaluated by applying different levels of regularisation (this design is ensured not to underfit the data).
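The step from the Gaussian noise model to the squared term can be written out explicitly. What follows is only a sketch under stated assumptions: the symbols (targets t_n, model output y(x_n, w), noise variance sigma^2, sample size N) are notation introduced here for illustration, loosely following Bishop (2006), and the noise variance is assumed to be fixed.

```latex
% Assumed noise model: target = model output + zero-mean Gaussian noise
t_n = y(\mathbf{x}_n, \mathbf{w}) + \epsilon_n,
\qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2)

% Negative log-likelihood of N independent samples
-\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl( t_n - y(\mathbf{x}_n, \mathbf{w}) \bigr)^2
  + \frac{N}{2} \ln \sigma^2
  + \frac{N}{2} \ln (2\pi)

% Relation to the Mean Squared Error
\mathrm{MSE}(\mathbf{w})
  = \frac{1}{N} \sum_{n=1}^{N} \bigl( t_n - y(\mathbf{x}_n, \mathbf{w}) \bigr)^2
\quad \Longrightarrow \quad
-\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  = \frac{N}{2\sigma^2} \, \mathrm{MSE}(\mathbf{w}) + \text{const}
```

For a fixed noise variance, the negative log-likelihood is therefore an affine function of the MSE with a positive slope, so both costs share the same minimiser with respect to the weights.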
If the Gaussian noise assumption holds, the error should be normally distributed, and the effectiveness rates for the MSE and the Neg Corpus LogLikelihood costs should be similar.
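As a small numerical illustration of why similar behaviour is expected under that assumption, the sketch below compares the MSE with a Gaussian negative log-likelihood on a toy problem. It is not the code used for the experiments in this post: the one-parameter linear model, the function names `mse` and `gaussian_nll`, and the noise level `sigma = 0.5` are all hypothetical choices made for illustration.

```python
import math
import random


def mse(targets, preds):
    """Mean Squared Error between targets and predictions."""
    return sum((t - y) ** 2 for t, y in zip(targets, preds)) / len(targets)


def gaussian_nll(targets, preds, sigma=1.0):
    """Negative log-likelihood under i.i.d. Gaussian noise with fixed sigma."""
    n = len(targets)
    sq = sum((t - y) ** 2 for t, y in zip(targets, preds))
    return sq / (2.0 * sigma ** 2) + 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)


# Toy data (assumption): a line through the origin plus Gaussian noise.
random.seed(0)
sigma = 0.5
xs = [i / 10.0 for i in range(50)]
targets = [2.0 * x + random.gauss(0.0, sigma) for x in xs]

# Evaluate both costs over a grid of candidate slopes from 1.00 to 3.00.
slopes = [1.0 + 0.01 * k for k in range(201)]
mse_costs = [mse(targets, [w * x for x in xs]) for w in slopes]
nll_costs = [gaussian_nll(targets, [w * x for x in xs], sigma) for w in slopes]

best_mse = slopes[mse_costs.index(min(mse_costs))]
best_nll = slopes[nll_costs.index(min(nll_costs))]
print("slope minimising the MSE:", best_mse)
print("slope minimising the NLL:", best_nll)
```

Both costs select the same slope on the grid (up to floating-point ties), because with a fixed sigma the negative log-likelihood is just the MSE rescaled and shifted by a constant.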

Without regularisation

Problem setting    Acc. MSE    Acc. Neg Corpus LogLikelihood
Pure random        63.80%      68.40%
Gaussians          63.50%      69.50%

With regularisation

Problem setting    Acc. MSE    Acc. Neg Corpus LogLikelihood
Pure random        69.30%      71.30%
Gaussians          68.50%      71.20%

Given the results, the actual nature of the data has little impact on the analysis, as the effectiveness trends (measured with the Accuracy rate in the tables) are similar for both scenarios in all settings. The determining factor for assuming a normally distributed error, which embraces the Gaussian noise model, is the fitting of the model with respect to the data, and this calls for some regularisation. When the model overfits the data (i.e., without regularisation), its poor generalisation yields a wide range of error samples that departs from the normal distribution, hence the different effectiveness rates between the MSE and Neg LogLikelihood costs. Instead, when the model fits the data well (i.e., with regularisation, resulting in a complex-enough model that can generalise well), the range of error samples is narrower (i.e., the error samples are more comparable) and its distribution approaches the Gaussian form, hence the similar effectiveness rates between the MSE and Neg LogLikelihood costs. Note that only in this maximum entropy model setting does the classifier contain the maximum amount of (Fisher) information from the data, so this is the optimum model (i.e., optimum parameter values) for a given network structure. The code used to conduct the experiments is available here.
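One simple way to check how close a set of error samples is to the Gaussian form is to look at its sample skewness and excess kurtosis, both of which are zero for a normal distribution. The sketch below is an assumption-laden illustration and not part of the experiments reported here: the helper `skewness_and_excess_kurtosis` is a hypothetical name, and the two residual samples are synthetic (a Gaussian one, and a uniform one in the spirit of the quantisation error mentioned earlier).

```python
import random


def skewness_and_excess_kurtosis(residuals):
    """Return the (biased) sample skewness and excess kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    centred = [r - mean for r in residuals]
    m2 = sum(c ** 2 for c in centred) / n  # second central moment
    m3 = sum(c ** 3 for c in centred) / n  # third central moment
    m4 = sum(c ** 4 for c in centred) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Synthetic residuals (assumption): stand-ins for real model errors.
random.seed(42)
gaussian_like = [random.gauss(0.0, 1.0) for _ in range(20000)]
uniform_like = [random.uniform(-0.5, 0.5) for _ in range(20000)]

g_skew, g_kurt = skewness_and_excess_kurtosis(gaussian_like)
u_skew, u_kurt = skewness_and_excess_kurtosis(uniform_like)
print(f"Gaussian residuals: skew={g_skew:+.3f}, excess kurtosis={g_kurt:+.3f}")
print(f"Uniform residuals:  skew={u_skew:+.3f}, excess kurtosis={u_kurt:+.3f}")
```

Values near zero for both statistics are compatible with a normally distributed error, whereas a uniform error shows an excess kurtosis close to -1.2. In practice the residuals of the trained network on held-out data would take the place of the synthetic samples.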

[Bishop, 2006] Bishop, C. M., "Pattern Recognition and Machine Learning (Information Science and Statistics)", New York: Springer Science + Business Media, LLC, 2006, ISBN: 978-0387310732
[Hastie, et al., 2003] Hastie, T., Tibshirani, R. and Friedman, J. H., "The Elements of Statistical Learning", Springer, 2003, ISBN: 978-0387952840

All contents © Alexandre Trilla 2008-2024