Computer vision systems are normally bespoke, constructed by
an experienced developer to solve a specific task. Many modern vision
systems involve an element of machine learning, but in the vast
majority of cases, the machine learning component is used only to aid
the classification process, one of the final stages of a complete
system. The early stages, which typically involve segmenting features
from background and grouping these features into objects, are produced
manually and usually fine-tuned to work reliably for the task of the
vision system. This process tends to be slow; indeed,
a typical vision PhD student will spend a large proportion of his or
her research time on system tuning.
This research explores a different approach. Our aim is to devise
a set of generic vision components and to use machine learning not to
tune a human-defined pipeline of components but to learn
the best way in which the individual components may be combined into a
complete vision system. This means that the time-consuming stages of
trying out different algorithms to see if they work and then tuning
them to yield their best performance are done by a computer, not a
human.
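
As an illustration only, the idea of letting a computer choose how
generic components are combined can be sketched in a few lines of
Python. Everything here is an assumption made for the example: the
four toy components, the one-dimensional "images", the pixel-accuracy
score and the names run_pipeline, score and search_pipelines are all
hypothetical. The sketch also uses plain exhaustive search as a
stand-in, since the text above does not say which learning method is
used.

```python
from itertools import product

# Hypothetical generic components: each maps a list of pixel values
# (0..255) to a new list of the same length.
def identity(img):
    return list(img)

def invert(img):
    return [255 - p for p in img]

def threshold(img):
    return [255 if p >= 128 else 0 for p in img]

def smooth(img):
    # 3-tap mean filter with edge replication.
    n = len(img)
    return [(img[max(i - 1, 0)] + img[i] + img[min(i + 1, n - 1)]) // 3
            for i in range(n)]

COMPONENTS = {
    "identity": identity,
    "invert": invert,
    "threshold": threshold,
    "smooth": smooth,
}

def run_pipeline(names, img):
    # Apply the named components in order.
    for name in names:
        img = COMPONENTS[name](img)
    return img

def score(names, examples):
    # Fraction of output pixels that equal the target mask.
    correct = total = 0
    for img, target in examples:
        out = run_pipeline(names, img)
        correct += sum(o == t for o, t in zip(out, target))
        total += len(target)
    return correct / total

def search_pipelines(examples, max_len=3):
    # Try every ordering of up to max_len components and keep the first
    # one that achieves the highest score (shorter pipelines win ties).
    best_names, best_score = (), -1.0
    for length in range(1, max_len + 1):
        for names in product(COMPONENTS, repeat=length):
            s = score(names, examples)
            if s > best_score:
                best_names, best_score = names, s
    return best_names, best_score

# Toy training data: dark objects on a bright background, paired with
# the desired object mask (255 = object, 0 = background).
EXAMPLES = [
    ([200, 210, 40, 30, 50, 220, 205, 215], [0, 0, 255, 255, 255, 0, 0, 0]),
    ([190, 60, 45, 230, 240, 20, 200], [0, 255, 255, 0, 0, 255, 0]),
]

if __name__ == "__main__":
    names, accuracy = search_pipelines(EXAMPLES)
    print(names, accuracy)
```

Here the human supplies only the components and the labelled examples;
the choice and ordering of components is made by the search. Exhaustive
enumeration grows exponentially with pipeline length, so a realistic
system would need a more selective search or learning strategy than
this sketch.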