Data Mining has emerged at the confluence of machine learning, statistics, and databases as a technique for discovering summary knowledge in large datasets. This course introduces students to the process and main techniques of data mining, including association rule learning; classification approaches such as inductive inference of decision trees and neural network learning; clustering techniques; and research topics such as inductive logic programming / multi-relational data mining and time series mining.
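As a small taste of one of these techniques, association rules are conventionally ranked by support and confidence. The sketch below, in Java (the course's prerequisite language), computes both measures for a single rule over a toy transaction set. The class, data, and rule here are purely illustrative assumptions, not course code; the course projects will use WEKA rather than hand-rolled implementations.

```java
// Illustrative sketch: support and confidence of one association rule
// over a toy market-basket dataset. Names and data are made up.
import java.util.*;

public class RuleMetrics {
    // support(X) = fraction of transactions containing every item in X
    static double support(List<Set<String>> txns, Set<String> items) {
        long hits = txns.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / txns.size();
    }

    public static void main(String[] args) {
        List<Set<String>> txns = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "diapers", "beer"),
            Set.of("milk", "diapers", "beer"),
            Set.of("bread", "milk", "diapers"));
        // confidence(diapers -> beer) = support({diapers,beer}) / support({diapers})
        Set<String> lhs = Set.of("diapers");
        Set<String> both = Set.of("diapers", "beer");
        double conf = support(txns, both) / support(txns, lhs);
        System.out.printf("support=%.2f confidence=%.2f%n",
                          support(txns, both), conf);
        // prints: support=0.50 confidence=0.67
    }
}
```

Algorithms such as Apriori, covered in the course, search efficiently for all rules whose support and confidence exceed user-chosen thresholds, rather than evaluating one rule at a time as above.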
The emphasis will be on algorithmic issues and data mining from a data management and machine learning viewpoint. It is anticipated that students interested in additional study of data mining will benefit from taking offerings in statistics such as Stat 598M or Stat 695A. This course is probably not appropriate for students who have taken ECE 632.
Please send questions to the course newsgroup purdue.class.cs590d. This should be used for most questions. If you have something you don't want made public, send it to .
Critical announcements will be made via the course mailing list.
We will be using WebCT Vista for recording and distributing grades.
For now, Professor Clifton will not have regular office hours. Feel free to drop by anytime, or send email with some suggested times to schedule an appointment. You can also try H.323/T.120 desktop videoconferencing (e.g., SunForum, Microsoft NetMeeting). Try opening an H.323 connection to blitz.cs.purdue.edu; send email if there is no response, and I'll start it up if I'm in.
Undergraduate-level expertise in databases, algorithms, and statistics; Java programming experience. Students without this background should discuss their preparation with the instructor.
Students from outside Computer Science should send me email explaining why they feel they meet the prerequisites, or come talk to me. Once I've confirmed that you meet the prerequisites, I'll send email; you can then follow the information on non-CS students registering for CS courses to register.
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006. ISBN 0-321-32136-7.
This will be supplemented with readings from the current research literature.
You might also find the following useful if you find on-line documentation hard to follow (it is the companion book to WEKA, which will be used for course projects):
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Publishers, June 2005. 560 pages. ISBN 0-12-088407-0.
Evaluation will be a subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in:
Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted.
Projects and written work will be evaluated on a ten-point scale:
Late work will be penalized 1 point per day (24-hour period). This penalty will apply except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet).
Each student will be expected to read and present a paper from the research literature. You should view this as if you were presenting the paper at a conference - be prepared to answer detailed technical questions. However, you do not need to be an advocate for the paper - if you feel the work has problems, feel free to critique it. You are encouraged to meet with me before the presentation to go over your preparation/materials.
Presentations should be prepared for display on a projector. If you make them web-accessible or place them in your ITAP account, they will be accessible on the built-in machine. If you choose to use your own machine, the projector works best at XGA (1024x768) resolution.
Presentations will be scored with roughly equal weight on how well you demonstrate your knowledge of the paper (not just details, but also the overall importance/contributions) and how well you communicate that knowledge to the class.
Each student will review two papers and write a report on each (as if reviewing a journal article). Read the following for suggestions on how to review a paper:
The review form is based on the IEEE Transactions on Knowledge and Data Engineering review form. The real IEEE form is an electronic submission; see here for an example of what it really looks like. I prefer that you email a text result (the "submit" button won't work). You can use the text-only version I have created.
Reviews are due at the beginning of the class when the reviewed paper is being presented. The hope is that if you review a paper, you will be ready to contribute to / enliven the discussion of the paper.
Reviews will be scored primarily on your demonstration of understanding of the material in the paper and its importance/impact on data mining. A secondary criterion will be the value of the review to an editor (in deciding if the paper is worthy of publication) and to the author (to improve it). Don't be afraid to criticize a paper: if you find a critical flaw in a published paper (and it really is a flaw), then you've demonstrated better understanding of the material than the reviewers who decided it should be published, and your review certainly would have been valuable to the editor.
Email submission of reviews is preferred (to ), but hard copy is acceptable if you prefer.
Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.
For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar assignments independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.
If you feel you may have overstepped these bounds, or are not sure, please come talk to me or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that you feel may cause what you turn in to not reflect your true ability). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material.
Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your ability in some alternate manner.
Note: Material after the break is from Spring 2005, and is representative. You can expect it to be different.
Web Mining: Information and Pattern Discovery on the World Wide Web, In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
TopCat: Data Mining for Topic Identification in a Text Corpus, Transactions on Knowledge and Data Engineering 16(8), IEEE Computer Society Press, Los Alamitos, CA, August, 2004. (Slides.)
Mining in the phrasal frontier, Principles of Knowledge Discovery in Databases Conference, Trondheim, Norway, June 1997. Lecture Notes in Computer Science, Springer Verlag, 1997.
Change Detection in Overhead Imagery using Neural Networks, International Journal of Applied Intelligence 18(2), Kluwer Academic Publishers, Dordrecht, The Netherlands, March 2003.
Quakefinder: A Scalable Data Mining System for Detecting Earthquakes from Space, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, pp. 208-213.
Evaluating the novelty of text-mined rules using lexical knowledge, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Internet Computing, 7(1):76-80, 2003.
Recommender Systems Research: A Connection-Centric Survey, J. Intell. Inf. Syst. 23(2):107-143, 2004.
Information filtering and information retrieval: two sides of the same coin?, Commun. ACM 35(12):29-37, 1992.
Combining Collaborative Filtering with Personal Agents for Better Recommendations, AAAI/IAAI 1999, pp. 439-446.
Item-based collaborative filtering recommendation algorithms, World Wide Web 2001, pp. 285-295.
E-Commerce Recommendation Applications, Data Mining and Knowledge Discovery 5(1/2):115-153, 2001.
Application of dimensionality reduction in recommender systems - a case study, ACM WebKDD Workshop 2000.
Eliminating noisy information in Web pages for data mining, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003.
Mining Web Log Sequential Patterns with Position Coded Pre-Order Linked WAP-Tree, Data Mining and Knowledge Discovery 10(1):5-38, 2005.
Web page classification: Web site mining: a new way to spot competitors, customers and suppliers in the world wide web, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 2002.
Mining of Web-Page Visiting Patterns with Continuous-Time Markov Models, Advances in Knowledge Discovery and Data Mining: 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, pp. 549-558.
Building Decision Tree Classifier on Private Data, IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, December 9, 2002, Maebashi City, Japan, pp. 1-8.
When do Data Mining Results Violate Privacy?, The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, Washington.
Privacy-Preserving Data Mining, Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, May 14-19, 2000, Dallas, TX, pp. 439-450.
Frequent Subgraph Discovery, The IEEE International Conference on Data Mining (ICDM), 2001. Available in the IEEE digital library.
A Probabilistic Approach to Fast Pattern Matching in Time Series Databases, in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, August 14-17, 1997, pp. 24-30. (Best paper runner-up.) I have hard copy of this.
Discovering similar patterns in time series, in Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Boston, MA, Aug 20-23, 2000, pp. 497-505.
Discovery of relational association rules, Luc Dehaspe and Hannu T. T. Toivonen. In N. Lavrac and S. Dzeroski, editors, Relational Data Mining, pp. 189-212. Springer-Verlag, 2001. (preliminary version)
Distribution forecasting of high frequency time series, Decision Support Systems 37(4): 501-513.
Final Project due date: April 30, 2005 (official last day of classes). If you'd like to give a demo as part of your project report, we can schedule it during the last week of classes (if you are ready), or during finals week. The report/writeup is due on 4/30.