
Introduction:

In this document, we attempt to put down on paper our current thoughts on what is for us a fairly ambitious development project in intelligent data analysis for a variety of purposes, e.g. modeling from heterogeneous data sources and the use of such models in adaptive control. For some time now, we have been interested in developing an environment that allows us to do rapid prototyping of algorithm designs for the control of nonlinear systems using on-line information obtained from sensors in a real-time frame. This requires that we build models of the state and the performance of our system and then use those models to estimate the next time step's control action. This endeavor is quite mathematical, and the implementation of these mathematical ideas in a robust set of software tools is equally daunting. In addition, we have focused on models built using function approximation techniques that are loosely based on rather simplistic models of animal neurophysiology, the so-called artificial neural architectures.
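To make that control loop concrete, here is a minimal sketch, under assumed dynamics, of the cycle just described: log sensor data from a hypothetical plant, fit a simple nonlinear approximate model by least squares, and then choose the next control action by a one-step lookahead over candidate actions. The plant, the feature set and the candidate grid are illustrative choices only, not part of our actual environment.

import numpy as np

# Hypothetical plant: the next state depends nonlinearly on the current state and action.
def true_plant(x, u):
    return 0.8 * x + 0.5 * np.tanh(u) + 0.05 * np.random.randn()

# Collect a log of (state, action, next_state) triples from the running system.
rng = np.random.default_rng(0)
X, U, Y = [], [], []
x = 0.0
for _ in range(500):
    u = rng.uniform(-2.0, 2.0)
    x_next = true_plant(x, u)
    X.append(x); U.append(u); Y.append(x_next)
    x = x_next

# Fit an approximate model x_next ~ f(x, u) by least squares on a small nonlinear
# feature set (a crude stand-in for the neural function approximators discussed above).
def features(x, u):
    x, u = np.atleast_1d(x), np.atleast_1d(u)
    return np.column_stack([np.ones_like(x), x, u, np.tanh(u), x * u])

w, *_ = np.linalg.lstsq(features(np.array(X), np.array(U)), np.array(Y), rcond=None)

# One-step lookahead control: pick the candidate action whose predicted next
# state is closest to the desired setpoint.
def next_action(x, setpoint, candidates=np.linspace(-2.0, 2.0, 41)):
    preds = features(np.full(candidates.shape, x), candidates) @ w
    return candidates[np.argmin((preds - setpoint) ** 2)]

print("suggested action from state 0.3 toward setpoint 1.0:", next_action(0.3, 1.0))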

We have always believed that we could do much better if we could find the right abstraction of the wealth of neurobiological detail that is available: the right design to guide us in the development of our software environment. Indeed, we feel very strongly that there are three distinct and equally important areas that must be jointly investigated, understood and synthesized for us to make major progress. We refer to this as the Software, Hardware and Wetware (SWH) triangle, shown in Figure 1.


  
Figure 1: Software-Hardware-Wetware Triangle

The term wetware indicates things of biological and/or neurobiological scope. We use double-headed arrows to indicate that ideas from each of these disciplines can both enhance and modify ideas from the others. The labels on the edges indicate possible intellectual pathways we can travel in our quest for unification and synthesis. Analog VLSI is currently being used to build hardware versions of a variety of low-level biological computational units (aural and eye substrates are notable achievements in what is now called neuromorphic engineering). In addition, there are the new concepts of what is called evolvable hardware, or EVOLWARE. Here, new hardware primitives (referred to as Field Programmable Gate Arrays, or FPGAs) offer the ability to program a device's input/output response via a bit string which, in principle, can be chosen in response to environmental input (see [1], [3] and [4]). There are two broad ways to do this: online and offline. In online strategies, groups of FPGAs are allowed to interact and "evolve" toward an appropriate bit string that solves a given problem; in offline strategies, the evolution is handled via software techniques similar to those used in genetic programming, and the resulting solution is then implemented on a chosen FPGA. In a related approach, it is even possible to perform software evolution within a pool of carefully chosen hardware primitives and generate output directly in a standard hardware description language (VHDL), so that the evolved hardware can be fabricated once an appropriate fitness level is reached. Thus, current active areas of research use new, relatively "plastic" hardware elements (their I/O capabilities are determined at run-time) or carefully reverse-engineered analog VLSI chipsets to provide a means of taking abstractions of neurobiological information and implementing them on silicon substrates; in effect, there is a blurring of the traditional responsibilities of hardware and software for the kinds of typically event-driven modeling tasks we envision here. Abstraction is the principal tool we can use to move back and forth in the fertile ground among software, neurobiology and hardware.
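As a small illustration of the offline strategy, the sketch below evolves a configuration bit string with a simple genetic algorithm. The fitness function here merely counts agreement with a target pattern; in a real evolvable-hardware loop, that score would instead come from downloading each string to an FPGA (or a simulator) and measuring the device's input/output behavior.

import random

random.seed(0)
BITS = 64            # length of the configuration bit string
POP, GENS = 40, 200  # population size and number of generations
TARGET = [random.randint(0, 1) for _ in range(BITS)]  # stand-in for "desired I/O behavior"

def fitness(bits):
    # Illustrative only: count agreement with a target pattern. In offline
    # evolware this score would come from simulating or exercising the device.
    return sum(b == t for b, t in zip(bits, TARGET))

def crossover(a, b):
    cut = random.randrange(1, BITS)
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.02):
    return [1 - b if random.random() < rate else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for gen in range(GENS):
    pop.sort(key=fitness, reverse=True)
    if fitness(pop[0]) == BITS:
        break
    parents = pop[: POP // 2]                      # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

print("best fitness after", gen + 1, "generations:", fitness(pop[0]), "of", BITS)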

All of these ideas should be brought to bear on the problems we face in modeling from heterogeneous data elements under the rubric of Intelligent Data Analysis. We will begin by reviewing two important plenary talks given at the Second International Symposium on Intelligent Data Analysis (IDA-97), held in London, UK in August of 1997. Both were overviews of the current state of the art, and both offered valuable insights. The refereed proceedings of this conference should be consulted for many additional insights as well (see [2]). These overviews are based on our personal notes of the talks and are therefore biased, both overtly and subtly, by what we considered interesting; however, they will at least begin the dialogue.

* Professor David Hand of the Open University:

Intelligent Data Analysis (IDA) combines many languages for attacking the same set of problems: e.g.
* Statistics:
* Computer Science:
* Pattern Recognition:
* Artificial Intelligence:
* Machine Learning:

This implies inevitable tensions, which will bring benefits. These disciplines and others must work in parallel to solve problems using many different techniques. For example, we might compare Computational Models and Statistical Models:

 
Computational                      Statistical
---------------------------------  ---------------------------------------
Brain Models                       Object Classification
Perceptrons                        LDA, QDA, Logistic
Adaptive Estimation                Batch, Iterative
Training Set                       Design Set
Emphasize Non-Overlapping Classes  Emphasize Overlapping Classes,
                                     Distribution Functions
Error Rate Criteria                Separability Criteria
Overfitting                        Model form prevents Overfitting
Implementation is soft             Much mathematical rigor
Tackle tough problems              Tackle artificial problems: give
                                     rigorous solutions to easy problems
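The contrast in the table between adaptive, error-driven estimation and batch statistical fitting can be made concrete. Below, the same toy two-class data is attacked with an online perceptron update and with batch logistic regression fit by gradient ascent; the data generator, learning rate and iteration count are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2)) + np.where(rng.random(n) < 0.5, 2.0, -2.0)[:, None]
y = (X[:, 0] + X[:, 1] > 0).astype(float)           # labels in {0, 1}
Xb = np.column_stack([np.ones(n), X])                # add a bias column

# "Computational" style: perceptron, one online update per presented example.
w_perc = np.zeros(3)
for xi, yi in zip(Xb, 2 * y - 1):                    # perceptron wants labels in {-1, +1}
    if yi * (w_perc @ xi) <= 0:
        w_perc += yi * xi

# "Statistical" style: logistic regression, batch gradient ascent on the likelihood.
w_log = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xb @ w_log)))
    w_log += 0.01 * Xb.T @ (y - p) / n

acc = lambda w: np.mean(((Xb @ w) > 0) == (y > 0.5))
print("perceptron accuracy:", acc(w_perc), " logistic accuracy:", acc(w_log))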

The tension between these two areas has led to new progress and new ideas. Information is data which has been processed with some objective in mind and Data Analysis is what we do when we turn data into a simulation. Consider the following hierarchy of data:

* Classical Data:
* Small, clean, numerical data sets
* Modern Data:
* Large, Very Large, Huge (GigaByte to TeraByte in size)
* Commercial Transaction DataBases
* Electronic Point of Sale (EPOS)
* Space Probes
* Official Statistics
* Number of records and dimensions can vary
* Finding data of interest that can illuminate is tough.
* Data is often collected for a purpose that has nothing to do with subsequent requests for data analysis: e.g., EPOS records that a business wishes to analyze for trends and for advice on stocking, etc.
* Concept of Structured vs. Unstructured Data Mining
* Data is dirty
* missing values
* outliers
* contamination
* not just ordinals

We want more than just numerical information. We want to know structure and metadata (data about data). It seems the notion of a single data set is old-fashioned. Currently, there are shifts toward

* metadata analysis
* Bayesian Methods and Notions
* Secondary Data Analysis, that is, data originally collected for other reasons which we now want to analyze for illumination:
* data merging (different kinds of ``data'' are combined).
* context-free metadata, e.g. temperature
* context specific metadata

These new modern data sets thus will require new tools and ideas such as

* An algebra of metadata to allow us to manipulate metadata
* Methods to handle huge data sets, since their storage requirements are prohibitive and our model estimates are too large to build in memory (a one-pass streaming sketch, also touching on the monitoring items below, appears after this list)
* Interactive graphical tools for IDA exploration
* Anomaly detection
* Continuous Monitoring
* Relevance and Irrelevance of Data
* Procedures to handle the application of IDA toolsets by nonexperts
* More guidance?
* Do we need licenses for operation to avoid poor and/or unscrupulous use of the tools?
* Tool use can have major policy meaning, so is the use of these tools potentially dangerous?
* Models should shift towards Pattern Centered Problems
* object oriented data
* local instead of global structure
* Need autonomous programs that automatically look for data structure. There are already techniques in statistics for the automatic selection of important variables, and we have a burgeoning technology of intelligent software agents that should be adaptable for this purpose.
* How do we pick candidate structure?
* Problems of Nonrandomness: are sampling theory and probability even relevant?
* Nonstationarity of the large data sets
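One way to approach both the memory problem and the continuous-monitoring item above is to keep one-pass summary statistics rather than the data itself. The sketch below uses Welford's online update for the running mean and variance and flags values far from the running mean; the three-standard-deviation threshold and the simulated stream are arbitrary illustrative choices.

import math, random

class StreamMonitor:
    """One-pass mean/variance (Welford's update) with a simple anomaly flag."""
    def __init__(self, z_threshold=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.z_threshold = z_threshold

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomaly(self, x):
        if self.n < 30:                      # wait for a minimal history
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return std > 0 and abs(x - self.mean) > self.z_threshold * std

# Simulated unbounded stream: mostly Gaussian noise with occasional spikes.
monitor = StreamMonitor()
random.seed(2)
for t in range(10_000):
    x = random.gauss(0.0, 1.0) + (25.0 if t % 2500 == 2499 else 0.0)
    if monitor.is_anomaly(x):
        print(f"t={t}: anomalous value {x:.1f}")
    monitor.update(x)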

To handle these problems, we are developing new model forms:

* Rule Based Systems: emphasize modularity and explanation
* Hidden Markov Models (HMM): nice, general formulation and perhaps even realistic
* Neural Models: flexible nonlinear models which are being combined with statistical techniques to give us better understanding
* Genetic Algorithms (GA): nice optimization strategies which are very general
* Bayesian Models: use new integration tools such as Markov Chain Monte Carlo (MCMC) for nonstandard database operations (a minimal sampler sketch follows this list)
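Since MCMC is singled out above as the integration tool behind modern Bayesian models, here is a minimal Metropolis sampler estimating the posterior mean of a Gaussian with known variance. The prior, the proposal width and the simulated data are illustrative assumptions, not a recipe for the database operations mentioned above.

import math, random

random.seed(3)
data = [random.gauss(1.5, 1.0) for _ in range(50)]   # simulated observations

def log_posterior(mu):
    # Broad N(0, 10^2) prior times a Gaussian likelihood with known sigma = 1.
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_likelihood = -0.5 * sum((x - mu) ** 2 for x in data)
    return log_prior + log_likelihood

mu, samples = 0.0, []
for _ in range(20_000):
    proposal = mu + random.gauss(0.0, 0.5)           # random-walk proposal
    log_ratio = log_posterior(proposal) - log_posterior(mu)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        mu = proposal                                # accept; otherwise keep current mu
    samples.append(mu)

burn_in = 5_000
estimate = sum(samples[burn_in:]) / len(samples[burn_in:])
print("posterior mean estimate:", estimate)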

All of these model forms have several parts:

* structural: identify major aspects and summarize data
* predictive: predict future values
* descriptive: phenomenological
* mechanistic: theoretical model based on some underlying theory

What emerges from this discussion are some guidelines for what we should do in IDA: we need to make inferences about new data and fit the underlying process, not the underlying data; we do not want to overfit. To make inferences about the process, we are led naturally into modeling questions. In general, there are two types of errors:

* Type A: uncertainties about the empirical system and the queries we ask our system. Our questions need to be precise enough to permit unambiguous answers. In essence, this is about the level of uncertainty in our model itself.
* Type B: uncertainties about our model's accuracy.

The handling of Type B errors dominates the statistical literature, but it is a futile waste of effort to reduce the size of one error (A or B) below the size of the other. We could call this Unintelligent Data Analysis!

* Dr. Larry Hunter of the National Library of Medicine:

There are big problems in the application of IDA techniques to molecular biology. Some of these are:

* Gene finding: where do genes begin?
* Multiple Sequence Alignment
* Where are the mutations?
* Where are the insertions?
* Where are the deletions?
* Protein Structure Prediction: what is the structure of a given protein?
* Primary: amino acid sequence as a polypeptide
* Secondary: 3D geometry
* Tertiary: local objects combined into functional structures
* Drug Binding Affinity:
* Given a large set of potential drugs, which drugs will bind to which genes and proteins?

Currently, Hidden Markov Models are doing very well for sequence alignment, but we are doing pretty badly at protein structure prediction. This is probably because of non-local interactions between parts of the protein.
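Because HMMs carry so much of the load in sequence work, the sketch below implements the forward algorithm, the core recursion used to score a sequence under an HMM. The two-state model and its emission probabilities are toy values, not a real profile HMM for alignment.

import numpy as np

# Toy two-state HMM over the DNA alphabet; all probabilities are illustrative.
symbols = {"A": 0, "C": 1, "G": 2, "T": 3}
pi = np.array([0.5, 0.5])                      # initial state distribution
A = np.array([[0.9, 0.1],                      # state transition matrix
              [0.1, 0.9]])
B = np.array([[0.35, 0.15, 0.15, 0.35],        # emission probabilities per state
              [0.15, 0.35, 0.35, 0.15]])

def forward_log_likelihood(seq):
    """Log probability of the observed sequence under the HMM (forward algorithm)."""
    obs = [symbols[c] for c in seq]
    alpha = pi * B[:, obs[0]]                  # initialization
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]          # recursion
        scale = alpha.sum()                    # rescale to avoid underflow
        alpha /= scale
        log_p += np.log(scale)
    return log_p + np.log(alpha.sum())

print(forward_log_likelihood("ATATGCGCGCATAT"))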

Now the challenges of IDA in this setting are that

* Data is not in a traditional feature, value format
* Data is not in packets
* Global interactions between local structures predominate
* There is very high dimensionality
* There are high-order correlations and relationships between thousands of positions
* Many biologists are happy to confront your predictions with real data, so algorithms will be used right away and discarded if not useful.
* There is lots of competition from non-IDA points of view, so there is an incredible amount of cross-disciplinary, cross-language interaction and tension

We have learned some lessons so far.

* A combination of local (sliding windows) and global (gene grammar) models is somewhat successful
* Learning methods need to be combined: statistical analyses linked by ANNs or HMMs and constrained by context-free grammars; use majority votes to compare answers and to choose among them (a toy vote-combining sketch follows this list)
* Boosting: apply additional methods to the harder parts of the problem after a first-cut solution has been computed
* We need to insert more biological information into our models
* More biologically realistic ANNs
* Complex Dynamics (GENESIS)
* Several different rules combined in each neuron
* Connectivity is neither complex nor random
* Using L-systems to generate reasonable cytoarchitectures; real data on such connectivity is now becoming available to help us constrain our models
* We can teach systems to learn: see the Skinner bots from Touretzky at CMU, which are trainable via standard animal-training techniques
* GAs need regulatory genes
* We need to look at immune system models
* We need to collect and analyze statistics on intuition. Recent work from Damasio in Iowa seems to suggest that there may be a specific brain region for intuition. Can we gain insight from its cytocellular structure? Its neuronal network structure? Its interconnection patterns?
* We need to look at coevolutionary schemes
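The majority-vote combination mentioned in the list above is simple to state in code. The sketch below merges the per-position labels proposed by several hypothetical predictors and records how lopsided each vote was; the three placeholder predictors stand in for, say, a statistical scorer, an ANN and an HMM.

from collections import Counter

# Placeholder predictors: each maps a sequence to one label per position.
def predictor_a(seq): return ["H" if c in "AEL" else "C" for c in seq]
def predictor_b(seq): return ["H" if c in "AELM" else "C" for c in seq]
def predictor_c(seq): return ["C" if c in "GP" else "H" for c in seq]

def majority_vote(seq, predictors):
    """Combine per-position labels by majority vote; report vote confidence."""
    all_labels = [p(seq) for p in predictors]
    combined = []
    for position_labels in zip(*all_labels):
        counts = Counter(position_labels)
        label, votes = counts.most_common(1)[0]
        combined.append((label, votes / len(predictors)))
    return combined

seq = "MAELGPKA"
for c, (label, conf) in zip(seq, majority_vote(seq, [predictor_a, predictor_b, predictor_c])):
    print(c, label, f"{conf:.2f}")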

We need to apply these techniques and most assuredly new ones to

* The reconstruction of regulatory networks
* Genes turn each other on and off: what is the code?
* If we look at sets of activity levels of gene products, can we infer relationships between genes? (A correlation-based sketch follows this list.)
* Prediction of small-molecule (i.e. drug) activities from data sets of similar activities
* Large-scale screening is beginning
* Chemical structure representation may be useful
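For the gene-activity question in the list above, a simple first cut is to correlate expression profiles across conditions and flag strongly correlated pairs as candidate relationships. The data matrix below is simulated and the 0.8 threshold is arbitrary, so this is only a sketch of the idea, not a method recommendation.

import numpy as np

rng = np.random.default_rng(4)
n_genes, n_conditions = 20, 15
expr = rng.normal(size=(n_genes, n_conditions))                 # simulated activity levels
expr[5] = 0.9 * expr[3] + 0.1 * rng.normal(size=n_conditions)   # plant one relationship

corr = np.corrcoef(expr)                                        # gene-by-gene correlation matrix
threshold = 0.8
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        if abs(corr[i, j]) > threshold:
            print(f"genes {i} and {j} look related (r = {corr[i, j]:.2f})")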

Finally, we should be inspired by living systems. Computation is to Biology as Mathematics is to Physics: there appears to be a very deep relationship between computation and biology.


Jim Peterson
2/28/1998