Blog
 Thoughts on data analysis, software
development and innovation management. Comments are welcome
Post 60
On figuring out the similarity between the Mean Squared Error and the Negative Corpus Log-Likelihood in the cost function for optimising Multilayer Neural Networks
23-Nov-2011
This long post title links with my previous
Post 59,
where I discussed using different criteria for the cost function and for the
derivation of the gradient when training Multilayer Neural Networks (MNN)
with Backpropagation. After a fruitful chat with
Prof. Xavier Vilasis
and some careful reading (Bishop, 2006) and (Hastie, et al., 2003),
I have eventually come to realise there is no mystery, as long as the right
assumptions are made.
The gist of the discussion is the error of the trained model with respect to
the data, along with its form and/or its distribution.
Bishop (2006) shows that the sum-of-squares error
function can be motivated as the maximum likelihood solution under an
assumed Gaussian noise model. Note that the normal distribution has
maximum entropy among all real-valued distributions with specified mean
and standard deviation. Taking the negative log of a Gaussian likelihood
yields a squared term, which determines the form
of the cost function, just as in the Mean Squared Error (MSE). Hence, the two
optimisation criteria lead to a similar quadratic form, which
explains why they can be interchanged. Although not all
real-life problems suit such a Gaussian noise model (e.g.,
the quantisation error has an approximately uniform distribution),
a normally distributed error is a plausible assumption in
the majority of situations. What follows is some experimentation considering
1) the impact of the nature of the data and 2) the fitting of the model to
attain this Gaussian noise model. In this regard, (1) is evaluated by
classifying data from a well-defined Gaussian distribution against data
(first) from a pure random distribution (generated by
radioactive decay),
and (second) from its "equivalent" Gaussian model (obtained by computing the
mean and the variance of the points in the random sample). Given that
two Gaussians with the same variance are to be discriminated in
the end, the optimal Bayes boundary is a line perpendicular to the
segment that links their means, passing through its midpoint (see
Post 52). Therefore,
the MNN model does not need to be very complex: one hidden layer with
two units is enough. In turn, (2) is evaluated by
applying different levels of regularisation (care is taken that this design
does not underfit the data). If the assumption of a
Gaussian noise model holds, the error should be normally distributed
and the effectiveness rates for the MSE and the Neg Corpus Log-Likelihood
costs should be similar.
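The step from the Gaussian noise model to a squared-error cost, mentioned above, can be written out explicitly. What follows is only a sketch in the spirit of Bishop (2006). The notation is assumed here and does not come from the experiments: targets t_n, network output y(x_n, w), noise variance sigma^2, and corpus size N.

```latex
% Assumption: each target is the network output plus zero-mean Gaussian noise
t_n = y(\mathbf{x}_n, \mathbf{w}) + \epsilon_n,
\qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2)

% Likelihood of the whole corpus, assuming independent samples
p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left( -\frac{\bigl(t_n - y(\mathbf{x}_n, \mathbf{w})\bigr)^2}{2\sigma^2} \right)

% Negative corpus log-likelihood
-\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)
  = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl(t_n - y(\mathbf{x}_n, \mathbf{w})\bigr)^2
    + \frac{N}{2} \ln\left(2\pi\sigma^2\right)
  = \frac{N}{2\sigma^2} \, \mathrm{MSE}(\mathbf{w})
    + \frac{N}{2} \ln\left(2\pi\sigma^2\right)
```

Since neither sigma^2 nor N depends on the weights w, minimising this negative log-likelihood over w is the same as minimising the MSE, as long as the Gaussian noise assumption holds.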
Without regularisation

Problem setting | Acc. MSE | Acc. Neg Corpus Log-Likelihood
Pure random     | 63.80%   | 68.40%
Gaussians       | 63.50%   | 69.50%

With regularisation

Problem setting | Acc. MSE | Acc. Neg Corpus Log-Likelihood
Pure random     | 69.30%   | 71.30%
Gaussians       | 68.50%   | 71.20%
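As a point of reference for the problem setup, the optimal Bayes boundary between two equal-variance Gaussians can be sketched in a few lines. This is a minimal illustration assuming equal priors and an isotropic variance. The function names, the means, and the standard deviation are made up here and are not those of the experiments.

```python
import math
import random


def bayes_boundary(mu_a, mu_b):
    """Return (w, bias) of the line w.x + bias = 0 that is perpendicular to
    the segment joining the two means and passes through its midpoint."""
    w = [b - a for a, b in zip(mu_a, mu_b)]
    midpoint = [(a + b) / 2.0 for a, b in zip(mu_a, mu_b)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def bayes_accuracy(mu_a, mu_b, sigma):
    """Accuracy of that boundary for equal priors: Phi(d / (2 * sigma)),
    where d is the distance between the means (requires Python 3.8+)."""
    d = math.dist(mu_a, mu_b)
    return 0.5 * (1.0 + math.erf(d / (2.0 * sigma * math.sqrt(2.0))))


def empirical_accuracy(mu_a, mu_b, sigma, n_per_class=10000, seed=0):
    """Sample both classes and classify each point by the sign of w.x + bias."""
    rng = random.Random(seed)
    w, bias = bayes_boundary(mu_a, mu_b)
    correct = 0
    # Class A should fall on the negative side, class B on the positive side.
    for mu, sign in ((mu_a, -1.0), (mu_b, 1.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(m, sigma) for m in mu]
            score = sum(wi * xi for wi, xi in zip(w, x)) + bias
            correct += (score * sign) > 0
    return correct / (2 * n_per_class)


if __name__ == "__main__":
    # Hypothetical means and standard deviation, chosen only for illustration.
    mu_a, mu_b, sigma = [0.0, 0.0], [2.0, 0.0], 1.0
    print("boundary (w, bias):", bayes_boundary(mu_a, mu_b))
    print("theoretical accuracy:", bayes_accuracy(mu_a, mu_b, sigma))
    print("empirical accuracy:", empirical_accuracy(mu_a, mu_b, sigma))
```

Under these assumptions no classifier can be expected to beat the theoretical figure on average, so it acts as a ceiling for whatever a trained network achieves on the same kind of data.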
The results show that the actual nature of
the data has little impact on the analysis, as the effectiveness
trends (measured with the accuracy rates in the tables) are similar
for both scenarios in all settings. The determining factor for
assuming a normally distributed error, and hence the
Gaussian noise model, is how well the model fits the data, which
calls for some regularisation.
When the model overfits the data (i.e., without regularisation), its
poor generalisation yields a wide range of error samples that
departs from the normal distribution, hence the different
effectiveness rates between the MSE and the Neg Log-Likelihood costs.
Instead, when the model fits the data well (i.e., with
regularisation, resulting in a complex-enough model that can
generalise well), the range of error samples is narrower (i.e., the
error samples are more comparable)
and their distribution approaches the Gaussian form, hence the
similar effectiveness rates between the MSE and the Neg Log-Likelihood costs.
Note that only in this maximum entropy model setting does the classifier
contain the maximum amount of (Fisher) information from the data, so
this is the optimum model (i.e., optimum parameter values)
for a given network structure. The code used to conduct the
experiments is available here.
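One simple way to inspect whether a set of error samples is close to normal, in the sense discussed above, is to look at its skewness and excess kurtosis, which are both zero for a Gaussian. The sketch below is not the code used in the experiments: the function name is hypothetical and the samples are synthetic, drawn from a Gaussian and from a uniform distribution (the latter standing in for the quantisation error).

```python
import random


def skewness_and_excess_kurtosis(errors):
    """Return (skewness, excess kurtosis) from the central moments.
    Both values are 0 for a normal distribution."""
    n = len(errors)
    mean = sum(errors) / n
    m2 = sum((e - mean) ** 2 for e in errors) / n
    m3 = sum((e - mean) ** 3 for e in errors) / n
    m4 = sum((e - mean) ** 4 for e in errors) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for error samples (assumption, not real results).
    gaussian_errors = [rng.gauss(0.0, 1.0) for _ in range(50000)]
    uniform_errors = [rng.uniform(-0.5, 0.5) for _ in range(50000)]
    # Gaussian errors: both statistics should be near 0.
    print("gaussian:", skewness_and_excess_kurtosis(gaussian_errors))
    # Uniform errors: the excess kurtosis should be near -1.2.
    print("uniform: ", skewness_and_excess_kurtosis(uniform_errors))
```

Applied to the residuals of a trained network, values far from zero would suggest that the Gaussian noise assumption is not a good description of the errors.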

[Bishop, 2006] Bishop, C. M., "Pattern Recognition and
Machine Learning (Information Science and Statistics)",
New York: Springer Science + Business Media, LLC, 2006,
ISBN: 9780387310732
[Hastie, et al., 2003] Hastie, T., Tibshirani, R. and
Friedman, J. H., "The Elements of Statistical Learning",
Springer, 2003, ISBN: 9780387952840
