Assignment #2 (Solution Sketches)
- Comparisons of data mining methods
- The key factor here is "expressiveness".
As described in class, HOODG was designed to overcome two of the drawbacks of ID3,
namely duplication of subtrees in disjunctive concepts (replication) and partitioning
of data into small fragments when a high-arity attribute is tested at a node (fragmentation).
It is most useful when the concept is best represented as a graph and it is important
to understand the structure of the learned concept. This becomes clear when you try to
build a decision graph for something like the XOR (parity) problem, particularly in high-dimensional domains.
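To make the XOR point concrete, here is a small sketch (our own illustration in Python, not HOODG itself)
that counts the nodes a decision tree needs for n-bit parity against a decision graph that merges subtrees
computing the same subfunction:

    # Rough illustration: why a decision *tree* for n-bit parity (XOR) blows up
    # while a decision *graph* stays small.  A tree keeps every subtree
    # separately; a graph merges subtrees that compute the same subfunction.
    from itertools import product

    def parity(bits):
        return sum(bits) % 2

    def subfunction(fixed, n):
        # Truth table of parity once the first len(fixed) inputs are fixed.
        rest = n - len(fixed)
        return tuple(parity(fixed + a) for a in product((0, 1), repeat=rest))

    def count_nodes(n):
        tree_nodes = 0    # one node per partial assignment (a tree never shares)
        graph_nodes = 0   # one node per *distinct* subfunction at each depth
        for depth in range(n + 1):
            assignments = list(product((0, 1), repeat=depth))
            tree_nodes += len(assignments)
            graph_nodes += len({subfunction(a, n) for a in assignments})
        return tree_nodes, graph_nodes

    for n in (2, 4, 8):
        t, g = count_nodes(n)
        print(f"{n}-bit parity: tree needs {t} nodes, graph needs only {g}")

The tree grows as 2^(n+1) - 1 while the graph grows only as 2n + 1, which is exactly the replication
a decision graph representation avoids.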
- PROGOL is naturally better than ID3 in two respects. First, it is not an attribute-value scheme
as ID3 is, and so it is most natural when attributes need to be interpreted in symbolic form.
Second, it learns concepts involving relations from more than one table in a database. This is very
useful in situations where we need to utilize the expressiveness of first order predicate logic,
and in situations where a relation can be naturally thought of as the value of another predicate.
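As a toy illustration (the relation, names and clause below are ours, purely hypothetical), consider
the classic grandparent concept: it is defined by joining the parent relation with itself, so there is
no natural fixed-length attribute vector for ID3, but the first-order clause is exactly the sort of
thing PROGOL can induce.

    # Toy example: a relational concept spread over a "table" joined with itself.
    # The target is the clause  grandparent(X, Y) :- parent(X, Z), parent(Z, Y).
    parent = {("ann", "bob"), ("bob", "cal"), ("ann", "dee"), ("dee", "eve")}

    def grandparent(x, y):
        # The learned clause, written as a join over the parent relation.
        people = {p for pair in parent for p in pair}
        return any((x, z) in parent and (z, y) in parent for z in people)

    print(grandparent("ann", "cal"))   # True
    print(grandparent("bob", "eve"))   # False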
- By a similar argument, ID3 is better than PROGOL when decisions need to be based on the values
of individual attributes at different stages of the mining process. ID3 (as does PROGOL) has the advantage that
the learned concept has a simple interpretation in domain terms.
- A feedforward neural network is best when all we would like to do is to accurately characterize
a numerical functional dependency from input to output, as is the case in curve fitting and function
modeling. If the output needs to be estimated from the input, a neural network is best because
it serves as a universal function approximator by providing sigmoidal units as building blocks.
It was mentioned in class that a 3-layer network can approximate any chosen (continuous) function to arbitrary accuracy.
It is also a model-free estimator: it can approximate a wide variety of functional forms and is
not limited to the hypothesis forms imposed by the learning mechanism, as ID3 and HOODG are.
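Here is a minimal sketch (our own, using numpy; the target curve, network size and learning rate are
arbitrary choices) of that 3-layer idea: one hidden layer of sigmoidal units and a linear output unit,
fit to a 1-D curve by plain gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = np.sin(x)                       # the numerical dependency to model

    H = 20                              # hidden sigmoidal units
    W1 = rng.normal(0, 1, (1, H)); b1 = np.zeros(H)
    W2 = rng.normal(0, 1, (H, 1)); b2 = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.05
    for step in range(10000):
        h = sigmoid(x @ W1 + b1)        # hidden sigmoidal layer
        pred = h @ W2 + b2              # linear output unit
        err = pred - y
        # Backpropagate the squared error.
        dW2 = h.T @ err / len(x); db2 = err.mean(0)
        dh = err @ W2.T * h * (1 - h)
        dW1 = x.T @ dh / len(x); db1 = dh.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print("mean squared error:", float((err ** 2).mean()))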
- ID3 can only learn concepts that are expressible as decision trees, HOODG only decision graphs and
so on.
- As mentioned in class, stacking is a way of combining the performances of different classifiers
and forming a "composite classifier" from several basic ones. Jeff Bradford has constructed
one such classifer from three inducers (a modified decison tree inducer, IB and disc-naive-bayes). Here's his script. His results show
that the IB algorithm yields 100% accuracy over the training set, so the ID3 stacking inducer always
"selects" only this inducer to perform generalization. In a different case study, it is conceivable
that no single classifier would be adjudged best and the decision tree constructed by ID3 would
be truly "composite".
- This was a kind of open question and we wanted to get a feel for your ideas. The kind of mining
depicted here is called "discovery" in AI circles and true enough, AM (Automated Mathematician) was one such program meant
to do scientific discovery. To get a feel for the answer, let's revisit AM and how it does what it does.
AM used a variety of general purpose techniques to identify concepts that are `interesting'. To this end,
it has a set of heuristics that determine if something is worth investigating or not.
For example, here is one such heuristic:
If a function f(x,y) is interesting, consider what happens when x=y.
If function f corresponds to multiplication, then this leads to discovering the concept of squaring.
If f corresponds to division, then this leads to discovering that "A number divided by itself equals 1.".
If f corresponds to subtraction, then this leads to discovering that "A number subtracted from itself
equals nothing.", and so on.
This paragraph is quoted verbatim from "Artificial Intelligence" by Rich and Knight.
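A toy rendition (ours) of that heuristic: specialise a binary operation f(x, y) to g(x) = f(x, x)
and look at the values the specialisation produces.

    operations = {
        "multiplication": lambda x, y: x * y,
        "division":       lambda x, y: x / y,
        "subtraction":    lambda x, y: x - y,
    }

    for name, f in operations.items():
        g = lambda x, f=f: f(x, x)      # the specialised, possibly interesting concept
        print(name, [g(x) for x in (1, 2, 3, 4, 5)])
        # multiplication -> [1, 4, 9, 16, 25]     (squaring)
        # division       -> [1.0, 1.0, 1.0, ...]  (x / x = 1)
        # subtraction    -> [0, 0, 0, ...]        (x - x = 0)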
In one run, AM discovered the concept of prime numbers. How did it do that?
Having stumbled onto the natural numbers AM exploited operations such as addition,
multiplication and their inverses. It created the concept of divisibility and
noticed that some numbers had very few divisors. AM has a built-in heuristic that
tells it to explore extreme cases. It attempted to list all numbers with zero divisors
(finding none), one divisor (finding one: 1), and two divisors. AM was instructed to
call the last concept "primes". Before pursuing this concept, AM went on to list
numbers with three divisors, such as 49. AM tried to relate this property with other
properties of 49, such as its being odd and a perfect square. AM generated other odd numbers
and other perfect squares to test its hypothesis. A side effect of determining the equivalence
of perfect squares with numbers with three divisors was to boost the "interestingness" rating
of the divisor concept. This led AM to investigate ways in which a number could be broken
down into factors. AM then noticed that there was only one way to break a number down into prime
factors (known as the Unique Factorization Theorem) ... [and finally, it hit upon]
Goldbach's Conjecture, which is widely believed to be true but for which no proof
has yet been found.
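The divisor bookkeeping behind that run is easy to mimic; here is a small sketch (ours) that groups
the numbers up to 100 by how many divisors they have and checks the three-divisor observation.

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    by_count = {}
    for n in range(1, 101):
        by_count.setdefault(len(divisors(n)), []).append(n)

    print(by_count.get(0, []))   # []  -- no number has zero divisors
    print(by_count.get(1, []))   # [1] -- exactly one divisor
    print(by_count.get(2, []))   # [2, 3, 5, 7, 11, ...]  -- "primes"
    print(by_count.get(3, []))   # [4, 9, 25, 49]         -- squares of primes
    # The numbers with exactly three divisors are squares of primes -- the
    # regularity that boosted the "interestingness" of the divisor concept.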
Now, we ask ourselves: how do we replicate this behavior in a real-life data mining system?
First we need to ascertain the reasons for AM's success. One source of power is, obviously,
the big lump of heuristics about what constitutes an interesting thing. "The less obvious source
of power is the natural relationship between number theoretical concepts and their compact
representations in AM. AM worked by syntactically mutating old concept definitions
- stored essentially as short LISP programs - in the hope of finding new, interesting concepts.
It turns out that a mutation in a short LISP program very likely results in another well-formed
LISP program. This accounts for AM's ability to generate so many novel concepts."
Another way of saying the same is that AM is exploring the space of small possible LISP programs,
and moreover concepts in number theory have succinct representation as LISP programs.
This leads us to ask: what is best suited from our point of view? I feel genetic
programming is best suited for the reasons stated above (and in fact AM can be viewed
as a very primitive genetic programming exercise). Recall that genetic programming
is different from genetic algorithms in that we are mutating and "crossing-over" actual
programs, which are the members of the population at any given point in time.
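As a bare-bones sketch (ours, with arbitrary primitives, and showing mutation only), represent
concepts as small expression trees; a random mutation of a well-formed tree always yields another
well-formed tree, the same property credited above for AM's ability to generate so many novel concepts.

    import random

    OPS = ("+", "-", "*")

    def random_expr(depth):
        if depth == 0 or random.random() < 0.3:
            return random.choice(["x", "y", 1, 2])          # a terminal
        return (random.choice(OPS), random_expr(depth - 1), random_expr(depth - 1))

    def mutate(expr):
        # Replace a randomly chosen subexpression with a fresh random one.
        if not isinstance(expr, tuple) or random.random() < 0.3:
            return random_expr(2)
        op, left, right = expr
        if random.random() < 0.5:
            return (op, mutate(left), right)
        return (op, left, mutate(right))

    random.seed(1)
    parent_expr = random_expr(3)
    print("parent:", parent_expr)
    print("child :", mutate(parent_expr))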
ILP would also be fruitful provided
you were able to enumerate all possible interesting concepts. This would also entail
some kind of mathematical engine that keeps "feeding" input to your ILP engine.
Note: These are just indicative solutions. More are possible if you can determine other
ways of solving the "What is interesting" problem.