Project 1: Decision Tree Pruning
Start date: 7 February. Due: 23 February, at the beginning of class.
Your task for this project is to extend the ID3 classifier
(provided in the Weka package) to support cost-sensitive learning
and pruning.
You may choose the type of pruning you wish to do. There is a fairly
simple solution using postpruning, but you may find other techniques
more interesting.
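As one illustration of what postpruning can look like, here is a minimal sketch of reduced-error pruning, a common technique that replaces a subtree with a leaf whenever doing so does not increase the error on held-out data. It is not necessarily the simple solution alluded to above. The dict-based tree layout, the attribute names, and the held-out rows are all hypothetical and are not Weka's Id3 representation.

```python
# Hypothetical tree layout (NOT Weka's Id3 internals):
#   leaf:     {"label": <class>}
#   internal: {"attr": <name>, "majority": <class>, "children": {value: subtree}}

def classify(node, item):
    """Walk the tree; fall back to the node's majority class on unseen values."""
    while "label" not in node:
        child = node["children"].get(item[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]

def errors(node, rows):
    """Count misclassified (item, label) pairs."""
    return sum(1 for item, label in rows if classify(node, item) != label)

def prune(node, rows):
    """Return a pruned copy of node, judged on the held-out rows that reach it."""
    if "label" in node:
        return node
    pruned_children = {}
    for value, child in node["children"].items():
        subset = [(item, label) for item, label in rows
                  if item[node["attr"]] == value]
        pruned_children[value] = prune(child, subset)
    candidate = {"attr": node["attr"], "majority": node["majority"],
                 "children": pruned_children}
    leaf = {"label": node["majority"]}
    # Prefer the simpler leaf when it is no worse on the held-out rows
    # (this also collapses subtrees that no held-out row reaches).
    if errors(leaf, rows) <= errors(candidate, rows):
        return leaf
    return candidate

# Made-up toy tree and held-out rows, for illustration only.
tree = {
    "attr": "outlook", "majority": "yes",
    "children": {
        "sunny": {"attr": "windy", "majority": "yes",
                  "children": {"true": {"label": "no"},
                               "false": {"label": "yes"}}},
        "rainy": {"label": "no"},
    },
}
held_out = [
    ({"outlook": "sunny", "windy": "true"}, "yes"),
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
]
pruned = prune(tree, held_out)
```

In this toy case the "windy" subtree collapses into a single leaf, while the root split is kept because replacing it with a leaf would add held-out errors. A cost-based variant would compare misclassification costs rather than raw error counts at the same decision point.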
The key idea behind cost-sensitive learning is that the cost of an incorrect
classification varies depending on the class. For this project, you
will need to choose one class whose misclassification is
deemed four times as expensive as that of the other classes; i.e., if you fail to put
an item that belongs in that class into it, the penalty is four times
as great as if you fail to correctly classify an item belonging to some other class.
As you can see, this will bias the classifier toward putting items into that class
even when they don't belong there.
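The bias described above can be seen in a small sketch. It labels a leaf by minimizing total misclassification cost instead of taking the majority class. The choice of Iris-virginica as the expensive class and the leaf counts are assumptions made up for illustration; only the factor of four comes from the project description.

```python
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
EXPENSIVE = "Iris-virginica"  # assumed choice of the 4x class
PENALTY = 4.0                 # factor given in the project description

def misclassification_cost(actual, predicted):
    """Cost 0 if correct, 4 if an EXPENSIVE item is missed, 1 otherwise."""
    if actual == predicted:
        return 0.0
    return PENALTY if actual == EXPENSIVE else 1.0

def min_cost_class(class_counts):
    """Pick the label that minimizes total cost over the items at a leaf."""
    def total_cost(predicted):
        return sum(n * misclassification_cost(actual, predicted)
                   for actual, n in class_counts.items())
    return min(CLASSES, key=total_cost)

# Made-up leaf: 6 versicolor items and 2 virginica items.
leaf = {"Iris-setosa": 0, "Iris-versicolor": 6, "Iris-virginica": 2}
majority = max(CLASSES, key=lambda c: leaf[c])
cost_sensitive = min_cost_class(leaf)
print(majority, cost_sensitive)
```

Labeling this leaf Iris-versicolor would cost 2 * 4 = 8, while labeling it Iris-virginica costs 6 * 1 = 6, so the cost-sensitive label differs from the majority label even though it misclassifies more items.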
For evaluation, please use two datasets. One should be the Iris dataset
from the UCI Machine Learning Repository.
The other should be a dataset of your choice.
Project Report
The project report should contain the following:
- Description of the method used (e.g., cost-based pruning - note that
this isn't identical to cost-based learning).
- Documentation for how to use your class (which should probably inherit from
weka.classifiers.trees.Id3).
- Sample run and results. This should include, for each dataset:
  - Discussion of basic ID3 results (as well as an actual run as an
  attachment). Talk about both the quality of the results and the
  complexity of the resulting model.
  - Expectations for how pruning will change this, and why.
  - Discussion of results with pruning (as well as an actual run as an
  attachment).
  - Comparison of classification accuracy from the cost-based learning vs.
  non-cost-based (you may find it easier to calculate the classification
  errors and costs manually...)
  - Discussion of differences between your expectations and actuality,
  if any.
- Summary commentary: Does it work well (e.g., accuracy, efficiency)?
What do you think are the advantages/disadvantages?
If you were to do it again, what would you do differently?
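For the comparison of cost-based and non-cost-based results, the error count and total cost can be tallied from a confusion matrix. The sketch below assumes the 4x penalty from the project description. The class names "X" and "Y" and all counts are placeholders invented to show the arithmetic, not results from any real run.

```python
def error_count_and_cost(confusion, expensive, penalty=4.0):
    """confusion[actual][predicted] = number of items.

    Returns (number of misclassified items, total misclassification cost),
    charging `penalty` for each missed item of the `expensive` class and 1
    for every other error.
    """
    errors = 0
    cost = 0.0
    for actual, row in confusion.items():
        for predicted, n in row.items():
            if actual != predicted:
                errors += n
                cost += n * (penalty if actual == expensive else 1.0)
    return errors, cost

# Placeholder confusion matrices; "X" plays the role of the expensive class.
plain = {"X": {"X": 44, "Y": 6}, "Y": {"X": 2, "Y": 48}}
cost_based = {"X": {"X": 49, "Y": 1}, "Y": {"X": 10, "Y": 40}}

plain_errors, plain_cost = error_count_and_cost(plain, expensive="X")
cb_errors, cb_cost = error_count_and_cost(cost_based, expensive="X")
print(plain_errors, plain_cost)  # 8 errors, cost 6*4 + 2*1 = 26.0
print(cb_errors, cb_cost)        # 11 errors, cost 1*4 + 10*1 = 14.0
```

With these placeholder counts the cost-based classifier makes more errors but incurs a lower total cost, which is the kind of tradeoff the report should discuss.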
Also turn in your code (obviously).
Scoring
Scoring will be based on:
- Correctness of execution (1-2 points)
- Quality/extensibility of interface defined (1 point)
- Quality of documentation (1 point)
- Quality/readability of code (1 point)
- Difficulty of pruning method used (0-1 point)
- Quality of your evaluation methodology (1 point)
- Demonstration of understanding of tradeoffs/issues (2-3 points)
Turning in the project
Electronic submission required.
You can tar/zip and email to
.
Assume I already have the WEKA package (don't include it in your tar file).
PDF is the safest format for capturing non-text content.
