ECJ Tutorial

The following is a walkthrough for newcomers to ECJ, with an introduction to all the major points in ECJ in order to create a relatively non-trivial image processing problem. This tutorial is a work in progress, so if you find any problems or have suggestions for improvements, then please let me know. This tutorial is good for ECJ 13 onwards (at least that's what it was tested on).

The VASE lab here at Essex University is geared towards image processing and computer vision, so this tutorial concerns a simple image processing problem. This is great if you're also doing computer vision, but if not you should be able to see how ECJ may be adapted to solve whatever it is you need.

This tutorial does assume that you know the basics of Genetic Programming (GP) and conerns itself with the fun part - that of writing programs and getting results!

Getting started with ECJ

The first thing to consider is the problem that we'd like to use a Genetic Programming approach to solve. At first it seems a little tricky, as you need to think in a different way to usual, but you'll soon find that any problem can be represented in ECJ. This tutorial is intended to give you a help you get started in that respect, as there is a fair amount of leg-work required to get even a simple example working in ECJ. Once you have started though, creating new GP experiments becomes progressively easier and faster.

We'll start with the example of finding edges on an image. We'll keep it very simple and just concentrate on a binary image for now, but you should see how this methodology can be extended.

ECJ is not used to generate a complete application; we're not expecting it to generate the code to load an image, run through each line and produce a sensible output, we just need it to produce an output for a given pixel in an image, and we'll write the rest of the logic ourselves. If we're clever, we can use just a few ECJ setups to to handle a wide range of different GP individuals.

Giving GP the Tools

Our first task is to ensure that the GP program is able to solve the problem. To do this we need to create some functions; they may be maths functions, boolean functions, if statements, loops, variables - whatever you can think of.

So what functions will our program require? We don't need to know how exactly ECJ will put these functions together, but if we try to solve this edge detection problem with only maths functions and no imaging functions, GP will never find an acceptable solution!

In fact, we'll probably need a function that can return the colour of a pixel in a given location, and since we're dealing with an edge detector, we probably need to be able to look at the pixel values at locations relative to a central pixel when deciding if that pixel is an edge or not.

How about this: int getPixelValue(int offsetX, int offsetY)

This function, I propose, will get the value of a pixel relative to the central point (0,0) which will either be classified as an edge, or a non-edge. getPixelValue(0, 0) will get the value of the central pixel itself. The value will be a greyscale number in the range 0-255.

As you'll probably have read, Koza's Genetic Programming produces trees so we need to implement the function to fit in with ECJ's tree node functionality. In completely practical terms, this means we have to extend ECJ's GPNode object and put our logic within the eval function.

Creating a `GPData` Object

Wait! Before we start creating function nodes, we need to decide how the information is passed between the nodes in the GP tree. This problem is not too tricky actually, a floating point number will usually suffice for most problems, so we'll create an object to hold the value. Nearly all the work in ECJ does involve extending an existing base class, so we'll start by creating DoubleData.java, which, unsurprisingly, allows us to use a double precision number.

More practical stuff: In the case where we're dealing with booleans, we'll just return either a 1.0 or 0.0. Where we want ints, we can just cast the double to an integer, and our ERCs will only output whole numbers.

Writing a `GPNode` function

I've written a class called GetPixel.java. It extends the abstract GPNode class and has to override the eval method which is called when the GP program is run. It is here that you have to put your own logic which is that what we described above.

The eval method signature is a bit messy, and the way you have to run the sub-nodes on a function is pretty messy too, but you get used to it. Better, create functions that you can re-use. ECJ's parameter files, frustrating at first, are very good at allowing you to re-use your code in many different situations.

Essentially, the node has two children, one tells it the offsetX value and one tells it the offsetY value. Both these children need to be executed first, recursively, before we know what these two values are ( just, of course as a normal program is executed). Once this has been done the result can be extracted and saved into a couple of local variables. This ugly stage is involved in all functions that take parameters.

Once we've got the parameters we can execute the actual logic part of the GetPixel function, namely getting the Pixel value of the current pixel. Both a utility clas (PixelLoader.java) to read an image pixel, and the location of the current pixel are stored in public variables on the PixelProblem class. Don't worry about the problem class yet - we're going to concentrate on how the parameters are produced first.

Constants in ECJ (ERCs)

How do we get values for offsetX and offsetY into the GP Tree in the first place? We use Ephemeral Random Constants. This is basically a random number generated by ECJ that remains fixed in value (constant) throughout the lifetime of the program.

An ERC is similar to a function ( it's actually a terminal ), except without any children. Naturally they are very useful, or the tree would have no leaves and would never end. Like anything else, we have to override ECJ's basic ERC class so that we get some control over the kind of random number we're need. Since our edge detector is only going to look around at neighbouring pixels, there is not much sense in having offsets generated that look hundreds of pixels away from the current pixel. We'd prefer to look at just the immediate neighbours, so it will generate one of: -1, 0 or +1. When used in both the X and Y directions, this means the program will be able to look at any of the central pixels 8 immediate neighbours, as well as the central pixel itself.

Sadly, for some unknown reason, ECJ does not offer a neat way of overriding the ERC class. You are required to implement various complex looking functions. The code required for these hardly changes, so I choose to create an ERCAdaptor class that only requires two methods to be overridden - setNumber and name(), and hides the rest. Have a look at the ERCAdaptor class if you wish, but there won't be many occasions when you'll need to tinker with it.

Now we've created an ERC, the getPixel() function should work. We need a couple of other functions to 'glue' everything together. Since the image we use is going to be binarised (either black or white), we'll create a few binary operators: And.java, Or.java, and Not.java. These operators are created in much the same way as the GetPixel() function, have a look at them to see how they work.

Introduction to the `GPProblem` Class

As we've seen already, there is a very important central class at play. Its called a problem class, and it deals with the actual evaluation part of the Genetic Program.

The problem does more than that, it is the (well, your) main entry point into the program, in terms of setting up the training set, evaluating the individual (and, optionally, dealing with the resulting individual)

Creating a Training Set

That was a (very) brief introduction to the problem Class, which we shall implement shortly. However, before we do so, we need to think about how we are going to train our GP program. The population is evaluated against a set of criteria, and the individual program(s) that recieve the best score (that is the score closest to zero), are used as the basis for the next generation.

ECJ is naturally quite vague on this part, it is the evaluation part where you really have to think carefully, and where many strategies have been used, dependent on the kind of problem being undertaken.

In the case of an imaging application such as this one, we'll usually have a set of training images, which consist of the test image itself, and some kind of truth map indentifying which pixels belong to a particular class. In a binary classification problem ( all we want is a 'yes' or 'no' response ) such as this, the truth map most commonly consists of a black and white image, with black pixels indicating non-hits, and white pixels indicating hits.

In terms of implementation, we'll create a class called TrainingData, which contains two images - the test image, and the truth image. The GP program will, in effect, create a truth image of its own, and all we need to do is compare the two images. As soon as the GP-generated truth map is the same as the training truth map, a perfect solution has been found.

Fitness Functions

The comparison between the actual output and the expected output is quantified by the Fitness Function. Most of the time in GP image processing, we end up with two statistics:

Statistic	Description	AKA
True Positives	The percentage of correct pixels that exist, that were indeed identified as true. This number should be as high as possible	Sensitivity
True Negatives	The percentage of incorrect pixels, that were indeed identified as false. This number should also be as high as possible	Specificity

In an ideal world, both would be 100%. However, this is never the case, as making a program more likely to find the more tricky True Positives make it more susceptible to finding False Positives, that is identifying pixels as "true", when in fact they are false.

The False Positive rate is simply equal to 1 - True Negative %.

In most engineering situations, we'll have an idea of what kind of error we're happy with. For instance, it is acceptable to have False Positives on an automated Breast Cancer screen, provided we get all the True Positives. We can bias the fitness function with a TP : FP ratio for this purpose. When running GP experiments, however, I would recommend running your program with a number of different ratios and plotting the results as an ROC curve, which gives a much better representation of the effectiveness of your program.

Let's have a look at a fairly standard Fitness Function that does exactly this:

fitness = (ALPHA * TP) / ((ALPHA * totalTruePixels) + (BETA * FP));

There are two constants here. High values of ALPHA favour high TP values despite increased FP values; ie they value success most highly. High values of BETA favour low FP values, despite decreased TP values; ie they penalise failure mostly. Actually you only need one constant, as ALPHA and BETA make up a single ratio, but it makes it more easy to understand.

So our Problem needs to record the TP and FP values ( not percentages in this case ), and pass them to a FitnessCalculator, which is implemented in FitnessCalculator.java. We're now in a position where we can discuss the Problem Class in more depth.

The Problem Class: Loading Training Data

We'll now start on the definition of the problem class, which we'll call PixelProblem.java. This class extends ECJ's GPProblem class, which allows us to initialise the training data using the setup method, and the opportunity to print or process the individual afterward with the describe method. The problem also implements the SimpleProblemForm interface which allows us to use the eval method, which is called to evaluate each individual in the population.

We'll start with the setup method, where we'll load the training data. Other than the obligatory super call, the code in this method is all yours. We'll load some images into a Vector of training data. To do this we need a TrainingData class, which I've created. It simply holds two images and has an isExpected(int x, int y) method to say whether a pixel is "false" or "true". This is decided on the basis of whether the truth image contains a black or white pixel at that location respectively.

At this stage you'd have your actual training data on hand. I've made some already:

Test Image Truth Image

Ignore for one moment the dullness of the images we're testing. What we're saying to ECJ is "I want to start with this first image, and I want you to generate for me the second image".

Have a look at the setup code in PixelProblem.java to see how these two images are loaded and then added to the trainingData vector.

That's all the initialisation we need to do in the setup method, lets move onto the evaluation part.

The Problem Class: Evaluating Training Data

Our GP program works on a per-pixel basis, giving a "yes" / "no" response. All we need to do is go through each pixel in the image, and count up how many times the Genetic Individual's response was correct. We can then pass these results through the fitness calculator and get a result.

What happens if there's more than one test image? Well, there's more than one fitness value. What do we do with it? We could take the average of the values as the individuals overall fitness, but then good performances on some images would mask terrible performances on others. In my experience, the best way to get a good "all rounder" individual is with a little tough love and say that its fitness is equal to the lowest score we encounter. That's what the PixelProblem does.

Essentially, the evaluate method follows these steps:

Get next test image
Loop through every pixel on the image and get the GP program's output
Compare with the truth image and calculate TP and FP
Calculate fitness.
If this fitness worse than previous, save it.
More images? Go to Step 1
Set the individual's fitness
Tell ECJ the individual has been evaluated.
Print out a dot so ECJ won't look like its crashed
- Done -

Have a look at the eval method on PixelProblem.java to see what this means in practice.

The describe method is not compulsory, but is one that I like, because I like to see the results getting printed out as we go. It is called on the best individual of every generation, and you can run a very similar method to the eval method, collecting the TP/FP information etc, and just printing it onto the screen.

In this case the describe method also prints out a "results" image where it labels the TP and FP pixels with blue and red accordingly so you can get a visual handle on the success of the program.

The Params File

You thought it was all over, didn't you? Well not quite. ECJ runs in a funny way, one that you may find infuriating at first, until you realise "hey, this will save me a load of time!". It may take a few weeks for this epiphany to kick in.

The params is a large Java Properties list, allowing you to control loads of aspects of how ECJ operates. It allows you to re-use your Java files for many different situations without needing to recompile your code. It really is very useful, but yes, a pain to get started with.

So what can I put into a params file?

Population Size
Number of generations
Loads of Tree builder parameters
Strongly Typed GP information
Which classes to use for what functions
Crossover and Mutation Parameters
How to build the tree

Fortunately ECJ has a form of inheritance built into its params files. They can inherit the properties from other params files, so each time you make one you don't need to start from scratch. ECJ has a little hierachy of them which is (for GP):

File	Description
ec.params	High level, admin stuff
simple.params	Which of ECJs classes to use for various GP functions, population size, elite count
koza.params	Defines specific Genetic Programming parameters; tree information

We're going to inherit koza.params, which inherits simple.params, which inherits ec.params. We'll call it PixelProblem.params

Have a look at the params file here

Running ECJ

Okay, we've done just about everything. All you need to do now is add your classes directory to the Java CLASSPATH. Make sure ECJ is on the CLASSPATH too, while you're at it.
( In Linux you can edit your .bashrc file and then use the source command to refresh it.)

Now we need to run ECJ, which can be done with the following command line incantation:

> java ec.Evolve -file PixelProblem.params -p stat.gather-full=true

After the inevitable teething problems with your params file, caused by you typing in index number wrong and so forth, ECJ should start running.

Results

If ECJ runs successfully, you should find a few results in the results folder. Have a look at some I generated earlier Play around and see what effects different parameters have. Create some of your own training images; a simple framework like this is capable of solving quite a few problems.

But how does the program work?

By default, ECJ produces a file called out.stat, which shows you how the best individual in each generation works, in LISP format.

You can actually convert the best individual directly into a Java file with a bit of tree-reading code in the Problem's describe method. To do this you can use this code

Note that you'll need to change the paths used in the PixelProblem file so they are appropriate to your own system.

Download Source

The complete source code, JAR file, test data and build file are available to download:
ecj_tutorial.tar
ecj_tutorial.zip

Browse Source Code

You can browse the source code in your web browser, if you prefer:
Complete Source Code

Back to ECJ Index

Tutorial written by Olly Oechsle, ooechs at essex dot ac dot uk.
Please contact me if you have any queries or suggestions.