from: DMNews Software Review by David M. Raab, Raab Associates; February 16, 2004
Playing with a new technology just because it is cool is a luxury few can afford. But if a useful technology just happens to be interesting, who is to complain?
So it is with genetic programming and predictive modeling. The genetic approach evolves models by introducing random variations and letting the fittest survive. The analogy to genetics is exact: Each element of a prediction formula is a gene; exchange of genes among successful models is breeding; random changes in formulas are mutations; and the best-performing models have the highest chance to reproduce.
Genetic systems start by randomly combining predictive variables and mathematical functions to build many formulas. Each formula is tested against cases with known outcomes and given a performance score. After all formulas are tested, the most accurate exchange random elements (breed) and undergo a few random changes (mutate) to produce a new generation.
The system then scores the new formulas and repeats the cycle. As the process randomly discovers and retains the most powerful variables and relationships, the models grow more accurate. The rate of improvement slows over time as fewer untried elements remain, and eventually the best-surviving model is chosen.
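For readers who want to see the cycle concretely, here is a minimal Python sketch of the score, breed, mutate and repeat loop described above. It illustrates the general technique and is not any vendor's implementation. The toy "formula" (a list of three weights), the test cases, the function names and all parameter values are assumptions made for the example. The sketch also carries the single best formula forward unchanged, a common design choice that the description above does not mention.

```python
import random

def evolve(population, score, breed, mutate, generations, rng, keep=0.2):
    """Score every formula, let the top fraction breed and mutate, then repeat."""
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:max(2, int(len(ranked) * keep))]
        # Assumption: keep the single best formula unchanged (elitism),
        # so the best score never gets worse from one generation to the next.
        children = [parents[0]]
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            children.append(mutate(breed(a, b, rng), rng))
        population = children
    return max(population, key=score)

# Toy problem (hypothetical): each case is (inputs, known outcome).
CASES = [
    ((1.0, 2.0, 3.0), 14.0),
    ((2.0, 0.0, 1.0), 5.0),
    ((0.0, 1.0, 1.0), 5.0),
    ((3.0, 1.0, 0.0), 5.0),
]

def score(weights):
    # Higher is better: negative sum of squared prediction errors.
    return -sum(
        (sum(w * x for w, x in zip(weights, xs)) - y) ** 2 for xs, y in CASES
    )

def breed(a, b, rng):
    # Each gene is taken at random from one parent or the other.
    return [rng.choice(pair) for pair in zip(a, b)]

def mutate(weights, rng, rate=0.2):
    # Occasionally nudge a gene by a small random amount.
    return [w + rng.gauss(0, 0.5) if rng.random() < rate else w for w in weights]

rng = random.Random(0)
start = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(50)]
best = evolve(start, score, breed, mutate, generations=40, rng=rng)
print(best, score(best))
```

Because the best formula always survives, the winner's score can only match or beat the best score in the starting population. How quickly it improves depends on the random seed and the settings.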
It is important to recognize that survival-of-the-fittest is what lets genetic systems function efficiently. An approach that simply created and tested random formulas could run virtually forever without homing in on the best results. Even with evolutionary assistance, genetic systems create tens of thousands of models before they declare a winner.
The genetic approach has two grand advantages over traditional model development: It takes much less effort by the modeler, and it produces better results. The labor savings are obvious: no need to preselect variables, identify appropriate data transformations, define likely relationships among variables or assess alternative models. Hands-on effort is reduced literally from days to minutes.
Better results share the same origins. The system can test more options and find variables, transformations and relationships that work better than the more obvious choices. Nor is it constrained by the preconceptions and rules of thumb that human modelers must apply to work efficiently.
For example, the system may pick the less common of several closely correlated variables or find multi-way interactions among several variables. Vendors of genetic systems report that their models consistently outperform those built by experienced statisticians by 5 percent to 20 percent.
So if genetic systems are so great, why have more companies not adopted them? It is not because users fear for their job security: Most statisticians would be delighted to find a tool that let them produce better models with less work. But the random, hidden nature of genetic model building makes some people nervous.
This is compounded by the difficulty of interpreting models containing odd variables or calculations. Accepting these uncertainties requires violating a basic rule of modeling: Do not use a model you do not understand, because it might contain a hidden error.
But there are ways to address these issues, and the benefits of genetic approaches are too compelling to ignore. So developers keep trying.
GenIQ© (DM STAT-1 Consulting, 800/367-8281, www.dmstat1 dot com) takes a very pure genetic approach, letting the system try any mathematical relationship among any variables. Scoring formulas are built from two variables connected by a mathematical operator (add, subtract, multiply, etc.).
Each variable may represent another variable/operator/variable combination, and the variables within those combinations may be combinations themselves, and so on. The resulting model formula is thus a set of nested calculations. A typical GenIQ model runs several layers deep and uses about a dozen input variables.
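The nesting is easier to picture with a small example. The Python sketch below represents a formula as either a variable name or an (operator, left, right) triple, which is one straightforward way to model the structure described here. It is not GenIQ's internal format. The variable names and values are hypothetical, and division is left out to keep the example free of divide-by-zero handling.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, record):
    """A node is either a variable name or an (operator, left, right) triple."""
    if isinstance(node, str):
        return record[node]
    op, left, right = node
    return OPS[op](evaluate(left, record), evaluate(right, record))

def depth(node):
    """Number of nested operator layers in the formula."""
    if isinstance(node, str):
        return 0
    _, left, right = node
    return 1 + max(depth(left), depth(right))

def inputs(node):
    """Set of distinct input variables the formula uses."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return inputs(left) | inputs(right)

# Hypothetical formula: (recency * orders) + (income - (age * orders))
formula = ("+", ("*", "recency", "orders"), ("-", "income", ("*", "age", "orders")))
record = {"recency": 2.0, "orders": 3.0, "income": 50.0, "age": 4.0}

print(evaluate(formula, record))  # 44.0
print(depth(formula))             # 3
print(sorted(inputs(formula)))    # ['age', 'income', 'orders', 'recency']
```

Breeding in this representation amounts to swapping a subtree of one formula for a subtree of another, and mutation to replacing an operator or a variable at random.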
In addition to basic genetic techniques, GenIQ applies sophisticated methods to handle missing data, avoid overfitting to data anomalies and remove unnecessary complexity. Users can control the details of these and other options during model development, though the default settings usually suffice.
GenIQ usually builds 250 models per generation and runs about 20 generations. While most modeling systems, genetic and otherwise, aim to make the most accurate predictions across all cases, GenIQ focuses on finding the top-responding file segments. This is measured by lift, that is, the response rate for the top few deciles vs. the rate for the entire group. Maximizing lift is typically the real goal of direct response modeling, so focusing on lift directly lets GenIQ build the most useful model possible.
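As a rough illustration of the lift measure, the Python sketch below ranks cases by model score and divides the response rate in the top deciles by the response rate for the whole file. Expressing lift as a ratio is an assumption, as are the function name and the sample data. The exact calculation GenIQ uses is not described here.

```python
def lift(scores, responses, deciles=1):
    """Response rate in the top-scoring deciles divided by the overall rate."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, len(ranked) * deciles // 10)
    top_rate = sum(r for _, r in ranked[:cutoff]) / cutoff
    overall_rate = sum(responses) / len(responses)
    return top_rate / overall_rate

# Hypothetical file of 20 cases: model scores and known outcomes (1 = responded).
scores = list(range(20, 0, -1))
responses = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1] + [0] * 10

print(lift(scores, responses, deciles=1))  # top 2 cases: 100% vs. 20% overall
print(lift(scores, responses, deciles=2))  # top 4 cases: 50% vs. 20% overall
```

In this made-up file the top decile responds at five times the overall rate. A model that ranked the same responders lower would show less lift even if its predictions were, on average, just as accurate.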
GenIQ displays the model formula as a branching tree, making it easy to read. But it is still virtually impossible to understand, because many calculations will involve apparently unrelated inputs or intuitively meaningless derived values. GenIQ does provide some comfort by giving a report that shows the importance attached to each input used in the model. But users hoping for a comprehensible explanation of the underlying logic will not be satisfied.
Users more interested in results, speed and ease of use are more likely to be pleased. GenIQ runs on a Windows PC and can build a model in 15 minutes on 20,000 test cases. The system accepts flat file input and requires virtually no data preparation. Setting up a model requires specifying the target and predictor variables and selecting other parameters or simply accepting the defaults.
Displays include the model tree, gains chart and variable importance ranks. The model formula can be exported in SAS, SPSS, XML, SQL or Basic formats to use in production scoring. Since there is no preprocessing of test data to create transformations and derived variables, there is no need to recreate such preprocessing on production data before scoring.