In this document we set down our current thoughts on what is, for us, a fairly ambitious development project in intelligent data analysis for a variety of purposes, e.g. modelling from heterogeneous data sources and the use of such models in adaptive control. For some time now, we have been interested in developing an environment that allows rapid prototyping of algorithm designs for the control of nonlinear systems using on-line information obtained from sensors in a real-time frame. This requires that we build models of the state and performance of our system and then use those models to estimate the next time step's control action. This endeavor can be quite mathematical, and the implementation of these mathematical ideas in a robust set of software tools is equally daunting. In addition, we have focused on models built using function approximation techniques loosely based on rather simplistic models of animal neurophysiology: the so-called artificial neural architectures.
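To make this sense/model/act loop concrete, here is a minimal sketch in Python. It is only an illustration: the plant, the recursive least-squares model, and every name in it are assumptions we introduce here, not the actual system described above.

```python
import numpy as np

# Minimal sketch of an online model-plus-control loop (all names hypothetical).
# Each step: (1) read a "sensor", (2) update a linear-in-parameters model of
# the plant by recursive least squares, (3) choose the control that drives the
# one-step-ahead prediction toward a setpoint.

theta = np.zeros(2)        # model parameters: x_{k+1} ~ theta[0]*x_k + theta[1]*u_k
P = np.eye(2) * 100.0      # RLS covariance
setpoint = 1.0
x, u = 0.0, 0.0            # current (sensed) state and control

def plant(x, u):
    """Stand-in for the real sensors/plant; unknown to the controller."""
    return 0.8 * x + 0.5 * u + 0.01 * np.random.randn()

for k in range(200):
    x_next = plant(x, u)                     # 1. sensor reading
    phi = np.array([x, u])                   # regressor
    K = P @ phi / (1.0 + phi @ P @ phi)      # 2. recursive least-squares update
    theta += K * (x_next - phi @ theta)
    P -= np.outer(K, phi @ P)
    if abs(theta[1]) > 1e-6:                 # 3. pick u so the predicted next
        u = (setpoint - theta[0] * x_next) / theta[1]   # state hits the setpoint
    u += 0.05 * np.random.randn()            # small dither keeps the model identifiable
    x = x_next
```

In a real rig the plant function is replaced by sensor reads and actuator writes, and the linear model by whatever function approximator, e.g. one of the neural architectures above, fits the system.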
We have always believed that we could do much better if we could find the right abstraction of the wealth of available neurobiological detail: the right design to guide the development of our software environment. Indeed, we feel very strongly that there are three distinct and equally important areas that must be jointly investigated, understood and synthesized if we are to make major progress. We refer to this as the Software, Hardware and Wetware (SWH) triangle (Figure 1);
the term wetware indicates things of biological and/or neurobiological scope. We use double-headed arrows to indicate that ideas from these disciplines can both enhance and modify ideas from the others. The labels on the edges indicate possible intellectual pathways we can travel in our quest for unification and synthesis: analog VLSI is currently being used to build hardware versions of a variety of low-level biological computational units (aural and retinal substrates are notable achievements in what is now called neuromorphic engineering). In addition, there are the new concepts of what is called evolvable hardware, or EVOLWARE. Here, new hardware primitives (referred to as Field Programmable Gate Arrays, or FPGAs) offer us the ability to program a device's input/output response via a bit string which, in principle, can be chosen in response to environmental input (see 1, 3 and 4). There are two broad ways to do this: online and offline. In online strategies, groups of FPGAs are allowed to interact and "evolve" toward a bit string that solves a given problem; in offline strategies, the evolution is handled via software techniques similar to those used in genetic programming (sketched below), and the solution is then implemented on a chosen FPGA. In a related approach, it is even possible to perform software evolution within a pool of carefully chosen hardware primitives and generate output directly in a standard hardware description language (VHDL), so that the evolved hardware can be fabricated once an appropriate fitness level is reached. Thus, current active areas of research use new, relatively "plastic" hardware elements (whose I/O capabilities are determined at run-time) or carefully reverse-engineered analog VLSI chipsets to provide a means of taking abstractions of neurobiological information and implementing them on silicon substrates; in effect, the traditional responsibilities of hardware and software blur for the kinds of typically event-driven modeling tasks we envision here. Further, abstraction is the principal tool we can use to move back and forth in the fertile grounds of software, neurobiology and hardware.
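As an illustration of the offline strategy, here is a minimal sketch in Python. The bit-string length, the genetic operators, and especially the toy fitness function (matching a target string) are our assumptions, standing in for the measured response of a real device.

```python
import random

# Minimal sketch of "offline" evolution of an FPGA configuration bit string.
# A toy fitness (distance to a target string) stands in for scoring the
# measured I/O behaviour of a real device; everything here is illustrative.

BITS, POP, GENERATIONS = 32, 40, 200
target = [random.randint(0, 1) for _ in range(BITS)]

def fitness(bits):
    """Higher is better; a real system would score measured device behaviour."""
    return sum(b == t for b, t in zip(bits, target))

def mutate(bits, rate=0.02):
    return [1 - b if random.random() < rate else b for b in bits]

def crossover(a, b):
    cut = random.randrange(1, BITS)
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == BITS:   # appropriate fitness level reached;
        break                            # configuration ready to download
    elite = population[: POP // 4]       # keep the best quarter
    population = elite + [
        mutate(crossover(random.choice(elite), random.choice(elite)))
        for _ in range(POP - len(elite))
    ]
```

An online strategy would replace the toy fitness with scores measured from the interacting devices themselves.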
All of these ideas should be brought to bear on the problems we face in modeling from heterogeneous data elements under the rubric of Intelligent Data Analysis. We will begin by reviewing two important plenary talks given at the Second International Symposium on Intelligent Data Analysis (IDA-97), held in London, UK in August 1997. Both were overviews of the current state of the art, and both offered valuable insights. The refereed proceedings of this conference should be consulted for many additional insights as well (see 2). The overviews below are based on our personal notes of the talks and are therefore biased, both overtly and subtly, by what we considered interesting; however, they will at least begin the dialogue.
This juxtaposition implies inevitable tensions, but tensions that bring benefits. These disciplines and others must work in parallel to solve problems using many different techniques. For example, we might compare Computational Models and Statistical Models:
Computational                     | Statistical
----------------------------------|----------------------------------------------
Brain Models                      | Object Classification
Perceptrons                       | LDA, QDA, Logistic
Adaptive Estimation               | Batch, Iterative
Training Set                      | Design Set
Emphasize Non-Overlapping Classes | Emphasize Overlapping Classes, Distribution Functions
Error Rate Criteria               | Separability Criteria
Overfitting                       | Model form prevents Overfitting
Implementation soft               | Much mathematical rigor
Tackle tough problems             | Tackle artificial problems: give rigorous solutions to easy problems
The tension between these two areas has led to new progress and new ideas. Information is data that has been processed with some objective in mind, and Data Analysis is what we do when we turn data into a simulation. Consider the following hierarchy of data:
We want more than just numerical information. We want to know structure and metadata (data about data). It seems the notion of a single data set is old-fashioned. Currently, there are shifts to
These modern data sets will thus require new tools and ideas, such as
To handle these problems, we are developing new model forms:
All of these model forms have several parts:
What emerges from this discussion are some guidelines for what we should do in IDA: we need to make inferences about new data and fit the underlying process, not the underlying data; we do not want to overfit. To make inferences about the process, we are led naturally into modeling questions. In general, there are two types of errors:
The handling of Type B errors dominates the statistical literature, but it is a futile waste of effort to reduce the size of one error (A or B) below the size of the other. We could call this Unintelligent Data Analysis!
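To see the fit-the-process-not-the-data point concretely, consider a minimal sketch; the sine process, noise level, and polynomial degrees are all illustrative assumptions of ours.

```python
import numpy as np

# Minimal sketch: fitting the data versus fitting the process.
# The process is sin(x); we observe noisy samples, fit polynomials of
# increasing degree, and measure error against the noiseless process.

rng = np.random.default_rng(0)
x_train = np.linspace(0, 3, 15)
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(0, 3, 200)
y_test = np.sin(x_test)                  # the underlying process

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: data MSE {train_err:.4f}, process MSE {test_err:.4f}")
```

Past a point, driving the error on the data below the noise floor only inflates the error against the process, which is exactly the futility noted above.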
Currently, Hidden Markov Models are doing very well for sequence alignment, but we are doing rather badly at protein structure prediction. This is probably because of non-local interactions between distant parts of the protein.
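For concreteness, here is a minimal sketch of the Viterbi recursion at the heart of HMM methods, assuming a toy two-state nucleotide model of our own invention; real alignment (profile) HMMs carry match/insert/delete states per column but rest on the same dynamic program.

```python
import numpy as np

# Minimal Viterbi sketch for a toy two-state HMM over a DNA alphabet.
# All probabilities are illustrative; log space avoids underflow.

states = ("AT-rich", "GC-rich")
start = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1],
                [0.1, 0.9]])
emit = {"A": np.log([0.35, 0.15]), "T": np.log([0.35, 0.15]),
        "G": np.log([0.15, 0.35]), "C": np.log([0.15, 0.35])}

def viterbi(seq):
    """Most probable hidden-state path for seq under the toy model."""
    v = start + emit[seq[0]]             # log-prob of the best path so far
    back = []
    for symbol in seq[1:]:
        scores = v[:, None] + trans      # scores[i, j]: arrive at j from i
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + emit[symbol]
    path = [int(v.argmax())]             # backtrack from the best final state
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi("ATATGCGCGCATAT"))
```

The Markov assumption visible in the recursion, where only the previous state matters, is precisely what breaks down for protein structure: residues far apart in the sequence interact in the fold.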
Now the challenges of IDA in this setting are that
We have learned some lessons so far.
We need to apply these techniques and most assuredly new ones to
Finally, we should be inspired by living systems. Computation is to Biology as Mathematics is to Physics: there appears to be a very deep relationship between computation and biology.