Project 1: Decision Tree Pruning

Start date 7 February; due 23 February at the beginning of class.

Your task for this project is to extend the ID3 classifier (provided in the Weka package) to support cost-sensitive learning and pruning. You may choose the type of pruning you wish to do (there is a fairly simple solution using postpruning, but you may find other techniques more interesting).
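
As one illustration of postpruning, here is a minimal sketch of reduced-error pruning on a toy tree. It is not Weka code: the Node class, the helper names (leaf, split, classify, prune), and the encoding of instances as arrays of attribute-value indices are all assumptions made for the example. It also assumes a held-out validation set and that every value index has a matching child.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReducedErrorPruning {
    /** Hypothetical tree node (not a Weka class); a leaf if it has no children. */
    public static class Node {
        public int attribute = -1;   // index of the nominal attribute tested here
        public int majorityClass;    // majority class of the training data at this node
        public List<Node> children = new ArrayList<>(); // one child per attribute value
        public boolean isLeaf() { return children.isEmpty(); }
    }

    public static Node leaf(int cls) {
        Node n = new Node();
        n.majorityClass = cls;
        return n;
    }

    public static Node split(int attribute, int majorityClass, Node... kids) {
        Node n = leaf(majorityClass);
        n.attribute = attribute;
        n.children.addAll(Arrays.asList(kids));
        return n;
    }

    /** Classifies one instance; instance[i] is the value index of attribute i. */
    public static int classify(Node node, int[] instance) {
        while (!node.isLeaf()) {
            node = node.children.get(instance[node.attribute]);
        }
        return node.majorityClass;
    }

    /** Bottom-up pruning: collapse a subtree when a single leaf does no worse
     *  on the validation instances that reach it (ties favor the smaller tree). */
    public static void prune(Node node, List<int[]> data, List<Integer> labels) {
        if (node.isLeaf()) return;
        // Route the validation instances to each child and prune the children first.
        for (int v = 0; v < node.children.size(); v++) {
            List<int[]> subData = new ArrayList<>();
            List<Integer> subLabels = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                if (data.get(i)[node.attribute] == v) {
                    subData.add(data.get(i));
                    subLabels.add(labels.get(i));
                }
            }
            prune(node.children.get(v), subData, subLabels);
        }
        int subtreeErrors = 0;
        int leafErrors = 0;
        for (int i = 0; i < data.size(); i++) {
            int actual = labels.get(i);
            if (classify(node, data.get(i)) != actual) subtreeErrors++;
            if (node.majorityClass != actual) leafErrors++;
        }
        if (leafErrors <= subtreeErrors) {
            node.children.clear(); // node becomes a leaf predicting majorityClass
        }
    }

    public static void main(String[] args) {
        Node root = split(0, 0, leaf(0), leaf(1));
        List<int[]> data = Arrays.asList(new int[]{0}, new int[]{1}, new int[]{1});
        List<Integer> labels = Arrays.asList(0, 0, 0);
        prune(root, data, labels);
        System.out.println(root.isLeaf()); // prints "true": the split only hurt on validation data
    }
}
```

A cost-based variant could compare misclassification costs instead of raw error counts when deciding whether to collapse a subtree; that choice is left open here.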

The key idea behind cost-sensitive learning is that the cost of an incorrect classification depends on the class. For this project, choose one class for which misclassification is deemed four times as expensive as for the other classes; i.e., if you fail to put an item that belongs in that class into it, the penalty is four times as great as when you fail to correctly put an item into some other class. As a result, the classifier will be biased toward that class and will sometimes put items into it that don't belong.
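
To make the bias concrete, here is a small sketch of how a leaf's label could be chosen by minimum total cost rather than by majority vote. The class and method names are hypothetical (they are not part of Weka), and the cost scheme (4 for missing the designated class, 1 for any other error, 0 for a correct prediction) is one reading of the rule above.

```java
public class CostSensitiveLeaf {
    /** Cost of predicting `predicted` when the true class is `actual`. */
    public static double cost(int actual, int predicted, int expensiveClass) {
        if (actual == predicted) return 0.0;
        // Failing to recognize a member of the expensive class costs four times as much.
        return actual == expensiveClass ? 4.0 : 1.0;
    }

    /** Returns the label that minimizes total cost over the instances at a leaf.
     *  classCounts[c] is the number of instances of class c at the leaf. */
    public static int minCostLabel(int[] classCounts, int expensiveClass) {
        int best = 0;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < classCounts.length; predicted++) {
            double total = 0.0;
            for (int actual = 0; actual < classCounts.length; actual++) {
                total += classCounts[actual] * cost(actual, predicted, expensiveClass);
            }
            if (total < bestCost) {
                bestCost = total;
                best = predicted;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] counts = {3, 7}; // 3 instances of class 0, 7 of class 1
        // Predicting class 0 costs 7 * 1 = 7; predicting class 1 costs 3 * 4 = 12.
        System.out.println(minCostLabel(counts, 0)); // prints 0, although class 1 is the majority
    }
}
```

With these counts the leaf is labeled with the minority class, so seven items are placed in a class they don't belong to in order to avoid three expensive misses.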

For evaluation, please use two datasets. One should be the UCI Machine Learning Repository Iris dataset. The other should be a dataset of your choice.

Project Report

The project report should contain the following:

Description of the method used (e.g., cost-based pruning - note that this isn't identical to cost-based learning).
Documentation for how to use your class (which should probably inherit from weka.classifiers.trees.Id3).
Sample run and results. This should include, for each dataset:
1. Discussion of basic ID3 results (as well as an actual run as an attachment). Talk about both the quality of the results and the complexity of the resulting model.
2. Expectations for how pruning will change this, and why.
3. Discussion of results with pruning (as well as an actual run as an attachment).
4. Comparison of classification accuracy from the cost-based learning vs. non-cost-based learning (you may find it easier to calculate the classification errors and costs manually).
5. Discussion of differences between your expectations and actuality, if any.
Summary commentary: Does it work well (e.g., accuracy, efficiency)? What do you think are the advantages/disadvantages? If you were to do it again, what would you do differently?
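
For item 4 above, the manual calculation of errors and costs can be sketched from a confusion matrix as follows. The class name, method names, and the matrix values are made up for illustration (they are not results from any dataset), and the penalty of 4 for the designated class follows the cost rule described earlier.

```java
public class CostFromConfusion {
    /** Counts misclassified instances; confusion[actual][predicted] holds instance counts. */
    public static int errorCount(int[][] confusion) {
        int errors = 0;
        for (int actual = 0; actual < confusion.length; actual++) {
            for (int predicted = 0; predicted < confusion[actual].length; predicted++) {
                if (actual != predicted) errors += confusion[actual][predicted];
            }
        }
        return errors;
    }

    /** Total cost: each miss of the expensive class costs `penalty`, any other error costs 1. */
    public static double totalCost(int[][] confusion, int expensiveClass, double penalty) {
        double cost = 0.0;
        for (int actual = 0; actual < confusion.length; actual++) {
            for (int predicted = 0; predicted < confusion[actual].length; predicted++) {
                if (actual != predicted) {
                    double unit = (actual == expensiveClass) ? penalty : 1.0;
                    cost += confusion[actual][predicted] * unit;
                }
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not output from a real run.
        int[][] confusion = {
            {10, 0, 0},
            { 0, 8, 2},
            { 0, 3, 7}
        };
        System.out.println(errorCount(confusion));        // prints 5
        System.out.println(totalCost(confusion, 2, 4.0)); // prints 14.0 (2 * 1 + 3 * 4)
    }
}
```

Comparing both numbers for the cost-based and non-cost-based runs shows whether a lower total cost was bought with a higher raw error count.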

Also turn in your code (obviously).

Scoring

Scoring will be based on:

Correctness of execution (1-2 points)
Quality/extensibility of interface defined (1 point)
Quality of documentation (1 point)
Quality/readability of code (1 point)
Difficulty of pruning method used (0-1 point)
Quality of your evaluation methodology (1 point)
Demonstration of understanding of tradeoffs/issues (2-3 points)

Turning in the project

Electronic submission is required. You can tar/zip your files and email them to clifton_nospam@cs_nojunk.purdue.edu. Assume I already have the Weka package (don't include it in your tar file). PDF is the safest format for capturing non-text content.