An Introduction to Genetic Algorithms


March 1995/An Introduction to Genetic Algorithms


Keith Grant


Keith Grant is a software engineering manager at Standard & Poor's Compustat, a provider of financial data and analysis tools. He also teaches C++ and OO analysis and design. He has a BS in physics from the University of Washington.

Introduction

A surprising number of everyday problems are difficult to solve with traditional algorithms. A problem may qualify as difficult for any of several reasons: the data may be too noisy or irregular, the problem may be hard to model, or it may simply take too long to solve. Examples are easy to find: finding the shortest path connecting a set of cities, dividing a set of different tasks among a group of people to meet a deadline, or fitting a set of various-sized boxes into the fewest trucks. In the past, programmers might have carefully hand-crafted a special-purpose program for each problem; now they can reduce their development time significantly by using a genetic algorithm.

What are Genetic Algorithms?

A genetic algorithm (GA) is one of a relatively new class of stochastic search algorithms. Stochastic algorithms are those that use probability to help guide their search. John Holland developed GAs at the University of Michigan in the mid-1970s. As the name implies, GAs behave much like biological genetics. GAs encode information into strings, just as living organisms encode characteristics into strands of DNA. (The choice of the term string is unfortunate. In the GA community, a string contains the potential solution and bears no relationship to a string in C or C++.)

A string in a GA is analogous to a chromosome in biology. A population of strings competes and those strings that are the fittest procreate, the rest eventually die off, childless. As with biological parents, two strings combine and contribute part of their characteristics to create their offspring, the new individual. This new string joins the population and the fight to produce the next generation. If both parents contribute good building blocks (short sections of the string) to the offspring, it will be more fit and will procreate in its turn. If the building blocks are poor then the offspring will die off without generating offspring. A second — but important — process occurs in GAs: sometimes, very rarely, a mutation occurs and the offspring will incorporate a new building block that came from neither parent. The above cycle of death and birth repeats until an acceptable solution to the problem is found.

Benefits of Genetic Algorithms

One of a GA's most important qualities is its ability to evaluate many possible solutions simultaneously. This ability, called implicit parallelism, is the cornerstone of GA's power. Implicit parallelism results from simultaneous evaluation of the numerous building blocks that comprise the string. Each string may contain millions of these building blocks, and the GA assesses them all simultaneously each time it calculates the string's fitness. In effect, the algorithm selects for patterns inside the string that exhibit high worth, and passes these building blocks on to the next generation. This selection process enables genetic algorithms to perform well where traditional algorithms flounder, such as in problems with huge search spaces.

Robustness

Genetic algorithms also have the quality of robustness. That is, while special-case algorithms may find better solutions to specific problems, GAs perform very well over a large range of problem categories. This robustness results in part because genetic algorithms apply their search to a large set of points rather than to a single point, as calculus-based algorithms do. Because of this, GAs are not trapped by local minima or maxima. Another contribution to their robustness is that GAs use the strings' fitness to direct the search; therefore they do not require any problem-specific knowledge of the search space, and they can operate well on search spaces that have gaps, jumps, or noise.

Miscellaneous Benefits

GAs also perform well on problems whose complexity increases exponentially with the number of input parameters. Such problems, called NP-complete, would take years to solve using traditional approaches. Furthermore, genetic algorithms can produce intermediate solutions; the program can stop at any time if a suboptimal solution is acceptable. Finally, GAs easily lend themselves to parallel processing; they can be implemented on any multiprocessor architecture.

The Algorithm Explored

As the pseudo-code in Figure 1 illustrates, the basic genetic algorithm is quite simple.

While extensive research on GAs has produced no optimal implementation (many aspects of GAs are still debated), there are several algorithms that work quite well in most situations. The example algorithm (Figure 1) is one of these implementations, and it is simple and reliable.

Even with an off-the-shelf GA, the programmer still faces two significant tasks: designing the coding scheme and creating the fitness function. The coding scheme defines how a string will represent a potential solution of the problem at hand. The fitness function uses the coding scheme to evaluate each string's fitness or worth. By combining these two parts the genetic algorithm can calculate how well any string solves the problem.

The Knapsack Problem

To understand how GAs work, consider a concrete example. The knapsack problem is a classic NP-complete problem. (Sedgewick uses this kind of problem in his book, Algorithms [1], to illustrate dynamic programming.) Simply stated: given a pile of items that vary in weight and value, find the combination of items having the greatest total value that does not exceed a maximum weight. In other words, the goal is to fill up a hypothetical knapsack with the most expensive loot it can carry. While it's easy to describe, this goal can be difficult to accomplish. For example, a pile of just 50 items presents 2^50 different possible selections. Assuming a computer could test a million combinations each second, it would still take over 35 years to try them all.

In this article, I show how a GA solves such a problem, but for the sake of illustration I use a smaller number of items. The pile contains fourteen items, so it provides a little more than 16,000 possible combinations to try in the knapsack. There are five different kinds of items, ranging from 3 to 9 in weight and from 4 to 13 in value. The knapsack can hold a maximum weight of 17, so it can carry one A, or two Bs, etc. Table 1 lists all the different items, their weights and values, and maximum number of each type that can fit into the knapsack. The program in Listing 1 illustrates the use of a GA to solve the knapsack problem. Supporting functions and class definitions appear in Listing 2, Listing 3, Listing 4, Listing 5, Listing 6, and Listing 7.

Developing a Coding Scheme

The first step in writing a GA is to create a coding scheme. A coding scheme is a method for expressing a solution in a string. Many successful types have been discovered, but there is no mechanical technique for creating one. Like programming, creating a coding scheme is part science and part art, but also like programming, it gets easier with practice. Early researchers used binary encoded strings exclusively, but higher order alphabets work without loss of efficiency and power. The type of coding scheme to use depends on the problem.

Order defines the number of different characters in the alphabet. Do not confuse the GA term character with ASCII characters. A GA character is analogous to a gene in that it has a position and a value. A binary alphabet has an order of two, meaning that the characters can only have two values, 0 or 1.

The coding scheme I've chosen for the knapsack uses a fixed-length, binary, position-dependent string. The pile in the example contains fourteen items, so each string must have fourteen binary characters, one for each item. The location of each character in the string represents a specific item, and the value of the character indicates whether that item is in the knapsack or left in the pile.

Figure 2 illustrates the coding of fourteen items into a GA string. The coding scheme's equivalent in C++ is the array of ItemDesc structs shown in Listing 1. Each column of the figure represents a character position in the string. The top three lines give the label, weight, and value of each character position. The bottom three lines show strings that define potential solutions to the knapsack problem. In this case a 1 means the item is in the knapsack and a 0 means the item is in the pile. The first string places six items into the knapsack: one each of A, B, C, and D, and two Es, for a total weight of 34 and a total value of 47. The second string places five items in the knapsack: two Ds and three Es, for a weight of 17 and a value of 22. The third string uses just two items: one A and one E, for a weight of 12 and a value of 17.

Creating a Fitness Function

The next step is to create a function that will evaluate how well each string solves the problem — that is, calculate the string's fitness. The knapsack problem requires maximization of the loot's value in the knapsack. If this were the only requirement, a fitness function could simply rank a string by adding up the values of all the items put into the knapsack. The GA would then tell us that the best solution was to put all 14 items in the knapsack. However, a second requirement states that the weight of the items cannot exceed a maximum (17, in this example). So this fitness function fails miserably. In GA terminology, it results in a constraint violation.

GA researchers have explored many approaches to constraint violation but none are perfect. Here are three possibilities:

Elimination

Elimination attempts to determine whether a string violates the constraint before it is ever created. This approach has several problems. For starters, the test may be too expensive to perform, or simply impossible. Second, preventing the creation of violators may cause the GA to overlook perfectly valid solutions, because violators can produce legal (non-violating) offspring that lead to a satisfactory solution more quickly.

High Penalty

This approach imposes a high penalty on violators. It reduces violators' worth while allowing them to occasionally propagate offspring. A weakness of this approach becomes apparent when a population contains a large percentage of violators. In this case, legal strings will dominate the following generations and the violators will be left unexploited. This effect could lead to population stagnation.

Moderate Penalty

This approach imposes a moderate penalty on violators. It increases the probability that violators will procreate, thus reducing the chance of population stagnation. This approach exhibits its own problems, especially when violators rate higher than legal strings. In this case, if the violators do not create legal strings then violators will dominate the following generations. Furthermore, if violators rate higher than legal strings then the criteria for ending the search must incorporate a mechanism for detecting violators.

The knapsack example employs the third technique. Its fitness function (CalcFitness in Listing 1) adds up the value of each item and subtracts a moderate penalty for violators. The penalty is three times the amount of excess weight. Table 2 shows the resulting fitness of the three example strings previously defined.

Initialization

After the coding scheme and fitness function are integrated into the GA, it is ready to run. The GA's first task is to create an initial population of strings. The demo program stores this population in an object (Pop) of class CGAPopulation (defined in Listing 4). Each string in the population is an object of class CGAChromosome (Listing 2).

There are many ways to select an initial population; approaches range from randomly setting each character of every string to modifying the results of a search made previously by a human. The knapsack example uses a modified weighted random design. The initialization function (class CGAPopulation's constructor, Listing 5) creates strings with an increasing likelihood of setting each bit to 1. Figure 3 shows the result of creating ten strings. The probability of setting any bit to 1 for the first string, labeled U, is 10%. The probability increases incrementally for each new string created until all the strings are created and the probability reaches about 50%.

After creating and initializing each string, the constructor creates a complement of that string by calling member function Complement (Listing 3). The complement string has the opposite bit pattern of the original.

Note that in the top half of the figure the U string contains only one 1-bit, whereas each successive string has an increasing number of 1-bits, until the fifth string contains roughly equal numbers of ones and zeros. The bottom half of the figure shows the complements of the original five strings.

The composition of the initial population can dramatically affect the performance of the genetic algorithm. The more diverse the initial population the more opportunities the GA will have to exploit the search space. The above initialization scheme has the advantage of simplicity and diversity. It is simple in that it does not require any information about the problem. The scheme is diverse because the function creates strings ranging from mostly zeros to mostly ones and everything in-between.

How large should the initial population be? The population should be large enough to create a diverse set of individuals for the GA to exploit but not so large that creating the initial population dominates computer time. The knapsack example sets the initial population to 30. I picked this rather small population to better illustrate how GAs work.

Parent Selection and Procreation

After creating an initial population, the GA selects two parents for procreation. Parent selection is based on string fitness. While creating the initial population, the fitness function calculates the worth of each string. This calculation occurs within each string's constructor, CGAChromosome::CGAChromosome (Listing 3). The population constructor CGAPopulation::CGAPopulation then ranks the strings according to their fitness by calling member function Merge (Listing 5).

After the population constructor returns, the main program enters a while loop and stays there until a solution is found. It's unlikely that any string in the initial generation contains a solution; if the while condition is satisfied (no solution found), the program calls CGAPopulation::CreateNextGeneration (Listing 5) to create a new generation of strings.

The first step in creating a new generation is selection of two parents. The GA does not select strings directly by their rank in the population, so the best string is not guaranteed to be a parent. Instead, the string's worth, based on its rank in the population as a whole, biases the probability of that string being selected to parent the next generation. If a string ranks as the 25th best out of 100 strings, then it has a 75% chance of becoming a parent.

A GA's method of selecting parents is very important; it can significantly impact the efficiency of the search. Among the many types of selection functions, the two most widely used techniques are proportional selection and linear rank selection. The knapsack example uses linear rank selection, first because it is easy to implement (see CGAPopulation::GetParent, Listing 5). More important, I suspect that linear rank selection is inherently better behaved than proportional selection, because proportional selection has required many fixes to its original design over the years.

Linear rank selection calculates the fitness of a string and then ranks it within the entire population. This process involves two random operations. First, GetParent randomly selects a candidate string from the population:

      . . .
      Selection = Rand0UpTo1();
      . . .

Next, GetParent determines whether the candidate string will parent an offspring, by performing a weighted probability test (via function Flip, Listing 7) based on the string's rank. If the string does not qualify as a parent, GetParent repeats the cycle and randomly selects another string from the population.

Procreation

Once two parents have been selected, the GA combines them to create the next generation of strings. The GA creates two offspring by combining fragments from each of the two parents. The knapsack example uses uniform crossover to cut fragments from the parents (see function Crossover, Listing 3). The positions where strings are cut into fragments are called the crossover points. Crossover chooses these points at random; uniform crossover means that every point has an equal chance of being a crossover point.

Crossover selection occurs via a crossover mask. Figure 4 illustrates the use of a crossover mask. The first child will receive the first parent's character if the bit is 1 and the second parent's character if the bit is 0. The second child works in reverse.

Uniform crossover implies many crossover points with an even probability distribution across the entire length of each parent string. The fragments from the first parent, combined with their complementary members from the second parent, create two new strings.

Mutation

Sometimes the children undergo mutation. The knapsack example uses an interesting mutation operator (see CGAChromosome::Mutate, Listing 3). Rather than a fixed mutation probability, Mutate uses a probability that changes based on the makeup of the population. Mutate compares the two parents of the child; greater similarity between the parents increases the probability that a mutation will occur in the child. The reason for using a variable mutation probability is to reduce the chance of premature convergence. This condition occurs when the population rushes to a mediocre solution and then simply runs out of steam. There is little diversity left, and the search becomes a random walk among the average. This is similar to a biological species becoming so inbred that it is no longer viable. To reduce premature convergence, the mutation operator kicks in when the population shows little diversity and adds new variety by introducing random mutations.

Finding the Answer

The two children can now replace two older strings from the population. This occurs in function CGAPopulation::ReplaceChromosome (Listing 5). As it does with parent selection, the GA chooses these older strings with a random bias; in this case, however, the worst string has the greatest chance of being removed. After the insertion of the new strings, the population is ranked again. The program stops when any string solves the problem.

This raises a question: if the best solution is unknown, how can the program determine whether it has found the best answer, or at least one of the better answers? One approach is to stop as soon as the program finds a solution that meets some predetermined, minimally acceptable value. A second approach is to run the program until its rate of finding better answers drops off, or the rate of improvement of those answers flattens out.

The knapsack example ran until it found the known best answer, which is 24. It took, on average, about 160 generations to find the solution, evaluating about 350 of the 16,000 possibilities, or roughly 2% of the search space. Indeed, this problem is small enough to solve with traditional methods, but by watching how the code operates in detail you can get a good idea of how GAs work. The example is quite small and easily extended. You can try it on different problems simply by creating a new ItemDesc structure and a matching CalcFitness function. All the I/O is confined to the PrintPop function (Listing 1), so you could easily drop the example into a larger program.

Conclusion

Genetic algorithms balance exploitation with exploration. The crossover and mutation operators control exploration while the selection and fitness functions control exploitation. Increasing exploitation decreases exploration. Mutation increases the ability to explore new areas of the search space but it also disrupts the exploitation of the previous generations by changing them.

Genetic algorithms represent a new, innovative approach to search. Unlike most traditional algorithms, GAs are not deterministic; rather, they exploit the power of probabilistic operations. By definition and design they are adaptive. Survival of the fittest governs the progress of the search, and with the possibility of mutations, GAs may explore completely unexpected avenues. GAs exhibit a chaotic search behavior very similar to how humans conduct searches — part analysis, part intuition, and part luck.

Bibliography

[1] Sedgewick, Robert. Algorithms, 2nd ed. Addison-Wesley, 1988.

[2] Holland, John H. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.

[3] Schaffer, J. David, ed. Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann, 1989.

[4] Goldberg, David E. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.

[5] Levy, Steven. Artificial Life: the Quest for a New Creation. Pantheon Books, 1992.

[6] Davis, Lawrence, ed. Genetic Algorithms and Simulated Annealing (Research Notes in Artificial Intelligence). Morgan Kaufmann, 1987.

