Machine learning is now ubiquitous in computer vision: there are very few real-world applications that do not make use of it. In fact, computer vision research has always been closely associated with machine learning because humans clearly learn to see.
The standard way of evaluating computer vision algorithms is statistical, generally employing a large number of ground-truthed images: images for which one or (better) more domain experts agree on what they contain. As most effective machine learning techniques also require ground-truthed imagery for training, the twin requirements of training and testing are closely related. In this experiment, you will train up a number of vision techniques and evaluate the trained systems on the popular MNIST example, which comprises 60,000 training images and 10,000 test ones. MNIST is now regarded as being quite easy, so some researchers have produced a drop-in replacement that uses images of different types of clothing; this fashion MNIST database is regarded as being somewhat harder. (In fact, I regard one of its images as being truly impossible.)
To carry out this experiment, you will need a zip-file of the software etc. This is a big file at about 700 Mbytes but will allow you to carry out all the training yourself, though some of it will take too long on the machines in CSEE's Software Labs.
The zip-file contains a program called ml.py (for machine learning) which is able to train a system on a variety of vision tasks and test it, generating output that can be fed into the FACT program you have used in earlier experiments. Many of the machine learning algorithms discussed in lectures are implemented in it. The simplest way of using it is something like:
python ml.py -learner=svm train recog.kb train/*
This tells the program to train a recognizer, saving it in the file recog.kb (the .kb is for knowledge base, but you can use any extension you like); the remainder of the command-line arguments are the data files on which the recognizer is trained. The -learner qualifier tells ml.py to use a support vector machine for training; the possibilities are:
cnn: convolutional neural network
eigen: the eigenfaces algorithm described in Chapter 8 of the notes. My implementation of this is particularly simplistic, which means it yields rather poor results and takes ages to run: \(O(N^2)\) rather than \(O(N)\).
mlp: multi-layer perceptron
rf: random forests, not described in the lecture notes but quite widely used in practice.
svm: support vector machine
wisard: WISARD
so you can see I'm training an SVM here.
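To train one of the other learners, you would simply change the qualifier; for example, something like the following (the output filename is just a suggestion) would train a multi-layer perceptron instead:
python ml.py -learner=mlp train recog-mlp.kb train/*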
This approach of specifying all the training images on the command line is quite elegant but ultimately runs out of steam: the buffer into which the Unix shell expands wildcards (train/* above) is of finite size, and if the number of filenames is very large it can overflow --- and this is certainly the case for the 60,000 training images in MNIST. As an alternative, you can specify a task file which details both the training images (and corresponding classes) and the tests. The zip-file contains one for all of MNIST, called mnist.task, and also one for a subset of it, mnist-part.task. As usual, these are text files so feel free to look at them. With a task file, the command would instead be something like:
python ml.py -learner=svm train recog.kb mnist.task
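If you are curious about how close the wildcard expansion gets to the shell's limit, a quick check along these lines will show you (this sketch assumes a Unix-like system and that the MNIST training images have been unpacked into train/):
import glob, os

files = glob.glob("train/*")                   # the same wildcard the shell would expand
needed = sum(len(name) + 1 for name in files)  # rough size of the resulting argument list
print(len(files), "filenames needing roughly", needed, "bytes;")
print("the system's argument-list limit is", os.sysconf("SC_ARG_MAX"), "bytes")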
As well as training up recognizers, ml.py can test them. A typical invocation would then be:
python ml.py test recog.kb mnist.task
The recognizer saved from the training phase, recog.kb here, has the name of the training algorithm stored within it, so ml.py uses the appropriate algorithm when working through the tests. The output from ml.py is a FACT-compatible transcript which you can save into a file using command-line redirection in precisely the same way as in earlier experiments, and you use FACT on it in the same way.
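For example, something like the following would save the transcript for later analysis (the filename recog.res is merely a suggestion):
python ml.py test recog.kb mnist.task > recog.res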
Each of the machine learning algorithms in ml.py is configured to work fairly well on the MNIST task, but where an algorithm has tuning parameters that affect its performance, they can be set using command-line qualifiers. For the WISARD algorithm, for example, you can set both the number of image locations that form one of its "tuples" and the number of tuples via ml.py's -nlocs and -ntuples qualifiers respectively. The command
python ml.py -h
will give you all the gory details.
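For instance, to train a WISARD recognizer with the tuple settings used later in this script, you might use something like the following (the output filename is only illustrative):
python ml.py -learner=wisard -ntuples=50 -nlocs=10 train wisard.kb mnist.task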
If your project involves machine learning, you are welcome to use ml.py as the basis of your own software --- but if you do, remember that you need to acknowledge it to avoid being accused of cheating. The ability to read task files and output FACT transcripts is especially useful, and a number of students have used them successfully in the past.
The zip-file contains trained versions of the recognizers listed above on MNIST, in filenames such as mnist-svm.kb for SVM. I have also run ml.py on them in test mode, yielding the transcripts stored in files mnist-svm.res and so on. The training and testing times in seconds on the author's (rather fast) laptop are shown in the following table:
learner | train time (s) | test time (s)
---|---|---
EIGEN | 11 | 2,040
MLP | 108 | 1
RF | 86 | 2
SVM | 140 | 59
WISARD | 54 | 10
Some of these times include significant speed-ups through the use of multiple cores and GPUs, so they are indicative of the kinds of times you may experience yourself rather than of the amount of computation involved. WISARD was run with -ntuples=50 -nlocs=10. For all the learners but SVM, which is deterministic, the random number generator was initialized with -seed=1 on the ml.py command line. There is no result from CNN because, at the time of writing, TensorFlow was unhappy on the author's computer --- and the CNN code in ml.py is untested for the same reason, though the code in the lecture notes was tested.
Your task is to analyse the performance of the various learning algorithms on these results using FACT --- you should know how to do that from a previous laboratory script. Try rank-ordering the algorithms in terms of accuracy. Would this order be the same if you ranked them in terms of specificity or some other measure?
If you have oodles of CPU time to spare, you might train up (say) the MLP recognizer a few times without fixing the seed on the command line, which means it will use a different series of random numbers in each run. Then use FACT to ascertain whether the different trained versions yield different accuracies. The commands involved will be something like:
python ml.py train -data=mnist/train haveago1.kb mnist.task -learner=mlp
python ml.py test -data=mnist/test haveago1.kb mnist.task > haveago1.res
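A second run would then be along the same lines, just with a different output name (haveago2 here is purely illustrative):
python ml.py train -data=mnist/train haveago2.kb mnist.task -learner=mlp
python ml.py test -data=mnist/test haveago2.kb mnist.task > haveago2.res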
You can also ascertain whether there are any statistically-significant differences in performance (see below) from the runs.
As discussed in lectures, a good test harness should allow one to compare the performances of algorithms in a statistically-valid way. FACT does this by using McNemar's test. To use it, you run FACT in compare mode with a pair of transcript files. A typical invocation here would be:
python fact.py compare mnist-svm.res mnist-mlp.res
In fact, you can use the command:
python fact.py compare mnist-*.res
in which case FACT does pair-wise comparisons between all the transcript files.
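FACT does the arithmetic for you, but as a reminder of what McNemar's test involves, here is a minimal sketch of the statistic in its usual continuity-corrected form; the disagreement counts n01 and n10 below are made-up numbers purely for illustration:
import math

# Hypothetical counts of disagreements between two recognizers:
# n01 = tests the first got wrong but the second got right
# n10 = tests the first got right but the second got wrong
n01, n10 = 120, 85

# McNemar's Z with the usual continuity correction
z = (abs(n01 - n10) - 1) / math.sqrt(n01 + n10)
print("Z =", round(z, 2), "-- compare against the critical value")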
There are some subtleties when comparing more than two algorithms because you need to adapt the critical value for significance of 1.96 described in Chapter 6 of the lecture notes. When you choose a pair of algorithms to compare, you are choosing from an ensemble (to use the correct statistical nomenclature) of all possible algorithms. If you keep doing this many times, you will eventually choose a pair of algorithms for which there appears to be a significant difference in performance purely because of the arrangement of the data. Remember, the critical value of 1.96 given in the lecture notes corresponds to an expectation that, one time in twenty, the data will make one algorithm appear better than another simply as a consequence of the data used --- so if you perform twenty pairwise comparisons, one of them might be expected to appear significant simply because of the data and not because of a genuine performance difference. (Confusing, isn't it? Do talk to a demonstrator about this.)
What this means is that we need to increase the critical value that indicates significance, so that a larger \(Z\) is needed from McNemar's test before a difference is declared significant. The most widely-used such correction is the Bonferroni correction, which divides the significance level by the number of pairwise comparisons being made; the critical value rises correspondingly (to about 2.8 rather than 1.96 for the ten pairwise comparisons among the five learners in the table above, for example). You need to apply this correction when interpreting the results from fact.py compare. When you have done this, you are in a position to judge which is the best algorithm to use on MNIST from all those considered in this experiment.
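If you would rather compute the corrected critical value than look it up in tables, a minimal sketch along these lines will do it (it assumes SciPy is installed; the figure of five learners is simply the number in the table above):
from scipy.stats import norm

n_learners = 5
n_comparisons = n_learners * (n_learners - 1) // 2  # number of pairwise comparisons
alpha = 0.05 / n_comparisons                        # Bonferroni-corrected significance level
critical = norm.ppf(1 - alpha / 2)                  # two-sided critical value for Z
print("compare |Z| against", round(critical, 2))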
Having worked through the MNIST transcripts, it is time to do some training and testing yourself. You have been provided with a face database, the Olivetti Research Laboratory's one, in the directory orl in the zip-file. This contains ten images of each of forty subjects, minuscule by modern standards. We shall retain four images of each subject for testing and use either five or six for training --- see the files orl5.task and orl6.task.
Train up and test each of the learners on orl5.task and orl6.task, retaining their results. For example, to train and test an MLP on the ORL5 case, you'd use commands such as:
python ml.py train -data=orl orl5-mlp.kb orl5.task -learner=mlp
python ml.py test -data=orl orl5-mlp.kb orl5.task > orl5-mlp.res
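The ORL6 case is analogous, just with the other task file (again, the output filenames here are only suggestions):
python ml.py train -data=orl orl6-mlp.kb orl6.task -learner=mlp
python ml.py test -data=orl orl6-mlp.kb orl6.task > orl6-mlp.res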
Then analyse and compare the results using FACT. Does using a larger number of training images make much difference to any of the learners? Are there any significant performance differences between learners with the same number of training images? What is your opinion of the experiment? Do discuss your answers with a demonstrator.