Security Issues in Data Mining

Tuesdays and Thursdays, 9:00-10:15

Heavilon Hall 123

Chris Clifton

Email:

Data mining, the discovery of new and interesting patterns in large datasets, is an exploding field. Recently there has been a realization that data mining has an impact on security (including a workshop on Data Mining for Security Applications). One aspect is the use of data mining to improve security, e.g., for intrusion detection. A second aspect is the potential security hazards posed when an adversary has data mining capabilities.

This seminar will explore the field of data mining from a security perspective. My goal is that on completing the course you will have a solid background in the area, such that you will be ready to pursue research on some aspect of data mining security.

Course Methodology

The course will begin with a tutorial on data mining. The contents and scope of this tutorial will depend on the background and preparation of the students. The bulk of the course will concentrate on exploring recent advances in the field through investigation of the research literature.

The workload in the course will be as follows:

Presentation of papers. Each student will be expected to read and present one or more papers from the research literature. You should view this as if you were presenting the paper at a conference - be prepared to answer detailed technical questions. However, you do not need to be an advocate for the paper - if you feel the work has problems, feel free to critique it. The number of papers presented by each student will depend on the number of students in the course, but will probably be three to four over the course of the semester. You must prepare your own materials for presentation. A draft of your presentation materials is due 24 hours before the presentation is to be given. Schedule time with me to go over your presentation after the draft has been turned in.
Presentations should be prepared for display on a projector. My notebook is available to use during class, provided you make materials available to me before class so I can load them (once I leave the CS building, I won't be able to load anything.) It is a Windows 2000 machine, and has Office XP installed. If you choose to use your own machine, the projector works best at XGA (1024x768) resolution.
The person presenting first at each class is responsible for bringing the projector to class. You can pick it up before 8:45 from Candace Walters in CS-210 (Randy Bond can get it for you if she isn't in.) The person presenting second is responsible for returning it after class. If this poses problems (e.g., a tight schedule), let me know and I'll take care of picking up / returning the projector.
Written reviews. Each student will review papers and write a report on each (as if reviewing a journal article). Read the following for suggestions on how to review a paper:
- Alan Jay Smith, The Task of the Referee, IEEE, 1990.
- KAIS Editorial Board, Knowledge and Information Systems Guidelines for Reviews, 1998-2001.
Expect to write one review each week that you do not present a paper. The review form has changed to the IEEE Transactions on Knowledge and Data Engineering review form. The real IEEE form is an electronic submission - see here for an example of what it really looks like. I prefer you email a text result (the "submit" button won't work). You can use the text-only version I have created.
Reviews are due at the beginning of the class when the reviewed paper is being presented.
Email submission of reviews is preferred (to ), but hard copy is acceptable, if you prefer.

There may be additional/alternative work assigned, especially during the tutorial portion of the class.

Prerequisites

Ideally, students in the course would have a good background in data mining, some database experience, a knowledge of probability and statistics, and a good background in computer security. However, I doubt many students will have such a background. What I consider a reasonable set of prerequisites is two of the following three:

Some machine learning background (e.g., CS 471, 662 or ECE 473), or some knowledge of statistics (e.g., STAT 511)
Undergraduate level database (CS 348 or 448)
Some security background (e.g., CS 426)

Permission of instructor is of course a sufficient prerequisite. If you are interested in the course, but do not have two of the above three (or you are unsure if you have sufficient background), please email me with why you are interested and what you consider to be your relevant background.

Policy on Intellectual Honesty

Please read the above link to the policy written by Professor Spafford. This will be followed unless I provide written documentation of exceptions.

Late work will only be accepted in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to fault of the instructor (e.g., you can't do a review early because the paper hasn't been assigned yet.)

Reviews should be an independent analysis of the paper - collusion between reviewers is poor practice. Therefore I ask that reviewers of a paper not discuss the paper with the other reviewers before writing their own review. This will help bring a healthy difference of opinion into classroom discussions. One exception to this: If you are presenting a paper and have difficulty understanding it, you are encouraged to talk to the people reviewing the paper to see if they have insights that may help you in your presentation.

Evaluation/Grading:

Evaluation will be a subjective process; however, it will be based primarily on your understanding of the material as evidenced in:

Your presentations (45%)
Your written reviews (35%)
Your contribution to classroom discussions (20%)

I will evaluate presentations and reviews on a five point scale:

5: Exceptional work. So good that it makes up for substandard work elsewhere in the course. These will be rare.
4: What I'd expect of a Ph.D. candidate. This corresponds to an A grade.
3: Good enough for a Master's degree, but not what I'd like to see for a Ph.D. candidate. This corresponds to a B grade.
2: Okay for a Master's candidate who does extremely well in other courses. This corresponds to a C grade.
1: Not good enough for a graduate student. But something.
0: Missing work, or so bad that you needn't have bothered.
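
As a rough illustration of how the weights and the five-point scale above might combine, the sketch below assumes a simple weighted average. Evaluation in this course is subjective, so the function name, the averaging rule, and the use of the same 0-5 scale for classroom discussion are all assumptions made for illustration, not the instructor's actual procedure.

```python
# Weights are taken from the syllabus; everything else is an illustrative assumption.
WEIGHTS = {"presentations": 0.45, "reviews": 0.35, "discussion": 0.20}


def weighted_score(presentations, reviews, discussion):
    """Return a hypothetical overall score on the 0-5 scale.

    presentations, reviews: lists of per-item scores (0-5).
    discussion: a single 0-5 score (assumed; the syllabus does not say
    how discussion is scored).
    """
    def mean(scores):
        # An empty list counts as 0, mirroring the "missing work" level.
        return sum(scores) / len(scores) if scores else 0.0

    return (WEIGHTS["presentations"] * mean(presentations)
            + WEIGHTS["reviews"] * mean(reviews)
            + WEIGHTS["discussion"] * discussion)


# Hypothetical student: three presentations, four reviews, discussion scored 4.
example = weighted_score([4, 4, 5], [3, 4, 4, 4], 4)
```

Under this assumed rule, a single exceptional presentation (a 5) raises the presentation average and can offset a weaker review, which matches the description of level 5 above.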

If the number of students is in the right range (allowing between two and three classroom presentations for each student), you will have the option of doing a final project: a research proposal for work in this area. This will be done instead of presentations and reviews in the final two weeks of the course. Students opting not to write a proposal will have one additional presentation and additional review (during the final two weeks) giving equal opportunity to demonstrate knowledge of the material.

Note: The time may be changed to 7:30-8:45 or 4:30-5:45 if there are no conflicts. This would be done to get a room where the lectures can be videotaped - I'd like you to have a chance to see yourself presenting. This will be done later in the term, if necessary. For now, the 9:00-10:15 slot is the one we will use.

Please add yourself to the course mailing list. Send mail to mailer@cs.purdue.edu containing the line:

add your email to cs590m

List of Papers to be covered

You may want to use the Purdue Libraries proxy server to get on-line access to more papers.

Syllabus:

Week 1: Course outline, data mining overview, clustering.
Week 2: Classification, Association Rules.
Week 3: Market Basket Associations, Data Mining Process

Weeks 4-5: Classifiers for Intrusion Detection

Date	Paper	Presenter	Reviewers
9/11	Charles Elkan, KDD Cup '99	Chris Clifton (slides)	Jaideep Shrikant Vaidya, Amit J. Shirsat
	Wenke Lee, Sal Stolfo. ``Data Mining Approaches for Intrusion Detection'' In Proceedings of the Seventh USENIX Security Symposium (SECURITY '98), San Antonio, TX, January 1998.	Murat Kantarcioglu	Addam Schroll, Eirik Herskedal, Ann-Sofie Nystrom
9/13	James Cannady. The Application of Artificial Neural Networks to Misuse Detection: Initial Results, First International Workshop on the Recent Advances in Intrusion Detection (RAID98), September 14-16, 1998, Louvain-la-Neuve, Belgium.	Evimaria Dimitrios Terzi	Rajeev Gopalkrishna, Xiaodong Lin
	Wenke Lee, Sal Stolfo, and Kui Mok. ``A Data Mining Framework for Building Intrusion Detection Models'' In Proceedings of the 1999 IEEE Symposium on Security and Privacy, Oakland, CA, May 1999	Pat Gorman	Benjamin Lee, Mohamed Galal Elfeky, James Joshi
9/18	Wenke Lee, Sal Stolfo, and Kui Mok. ``Mining Audit Data to Build Intrusion Detection Models'' In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD '98), New York, NY, August 1998	Rajeev Gopalakrishna	James Joshi, Murat Kantarcioglu, Evimaria Dimitrios Terzi
	Wenke Lee, Sal Stolfo, and Kui Mok. ``Mining in a Data-flow Environment: Experience in Network Intrusion Detection'' In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '99), San Diego, CA, August, 1999	Benjamin Lee	Mohamed Galal Elfeky, Pat Gorman
9/20	Filippo Neri, Mining TCP/IP Traffic for Network Intrusion Detection by Using a Distributed Genetic Algorithm, in Proceedings of the 11th European Conference on Machine Learning, Barcelona, Catalonia, Spain, May 31-June 2, 2000. (on reserve in the Math Library)	Eirik Herskedal	Jaideep Shrikant Vaidya, Xiaodong Lin
	R. Lippman and S. Cunningham, ``Improving Intrusion Detection Performance using Keyword Selection and Neural Networks'', In Proceedings of the Second International Workshop on Recent Advances in Intrusion Detection (Raid99), September 7-9, 1999, West Lafayette, Indiana.	Ann-Sofie Nystrom	Addam Schroll, Amit J. Shirsat
9/25	Wenke Lee, Sal Stolfo, and Kui Mok, ``Adaptive Intrusion Detection: a Data Mining Approach'' Artificial Intelligence Review, Kluwer Academic Publishers, 14(6):533-567, December 2000.	Addam Schroll	Pat Gorman
	Wenke Lee and Sal Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems, ACM Transactions on Information and System Security 3(4) (November 2000).	Jaideep Shrikant Vaidya	Xiaodong Lin, Evimaria Dimitrios Terzi
9/27	Stefanos Manganaris, Marvin Christensen, Dan Zerkle, Keith Hermiz, ``A Data Mining Analysis of RTID Alarms'', First International Workshop on the Recent Advances in Intrusion Detection (RAID98), September 14-16, 1998, Louvain-la-Neuve, Belgium. Better yet, see Computer Networks, Volume 34 for a later version.	Mohamed Galal Elfeky	James Joshi, Ann-Sofie Nystrom, Murat Kantarcioglu
	Daniel Barbará, Ningning Wu, Julia Couto, and Sushil Jajodia, ``Mining Unexpected Rules in Network Audit Trails'', journal article in preparation/review.	Amit J. Shirsat	Rajeev Gopalkrishna, Eirik Herskedal, Benjamin Lee
10/2	Stefan Axelsson, ``The Base-Rate Fallacy and the Difficulty of Intrusion Detection'', In Proceedings of the 6th ACM Conference on Computer and Communications Security, pp. 1-7, November 1-4, 1999, Kent Ridge Digital Labs, Singapore. See also his licentiate of engineering thesis.	Xiaodong Lin	Amit J. Shirsat, Murat Kantarcioglu, Eirik Herskedal, Rajeev Gopalkrishna
	Wenke Lee, Wei Fan, Matt Miller, Sal Stolfo, and Erez Zadok ``Toward Cost-Sensitive Modeling for Intrusion Detection and Response'' to appear in Journal of Computer Security, 2001.	James Joshi	Evimaria Dimitrios Terzi, Jaideep Shrikant Vaidya, Benjamin Lee, Pat Gorman
10/4	Corinna Cortes, Kathleen Fisher, Daryl Pregibon and Anne Rogers, ``Hancock: a language for extracting signatures from data streams'', Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining August 20 - 23, 2000, Boston, MA USA.	Kathleen Fisher	Mohamed Galal Elfeky, Ann-Sofie Nystrom
10/9	October Break. Students presenting or turning in a review for the papers on October 11 will not be expected to do one the week of Thanksgiving vacation. I would like to have at least six students presenting/reviewing on October 11 and November 20, so if you have a preference for which week you have "off", get your requests in early.
10/11	Xinzhou Qin, Wenke Lee, Lundy Lewis and Joao B. D. Cabrera ``Using MIB II Variables For Network Anomaly Detection: A feasibility Study'' Workshop on Data Mining for Security Applications.	Pat Gorman	Benjamin Lee, Amit J. Shirsat
	Jianxiong Luo and Susan M. Bridges, ``Mining fuzzy association rules and fuzzy frequency episodes for intrusion detection'', International Journal of Intelligent Systems 15(8), 2000. Pages: 687-70	Murat Kantarcioglu	Jaideep Shrikant Vaidya, Mohamed Galal Elfeky, Ann-Sofie Nystrom
10/16	Zheng Zhang, Jun Li, Constantine Manikopoulos, Jay Jorgenson and Jose Ucles ``HIDE: a Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification'', 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop.	Robert Gwadera	Addam Schroll, Murat Kantarcioglu
	Discussion of the progress of a research project from first publication to Ph.D. For additional reading see Wenke Lee, Sal Stolfo, and Phil Chan, ``Learning Patterns from Unix Process Execution Traces for Intrusion Detection'', AAAI Workshop: AI Approaches to Fraud Detection and Risk Management, July 1997; and Wenke Lee's dissertation.	Ben Kuperman	(none)
10/18	Oliver M. Dain and Robert K. Cunningham, ``Fusing Heterogeneous Alert Streams into Scenarios'', Workshop on Data Mining for Security Applications.	Addam Schroll	Eirik Herskedal, Evimaria Dimitrios Terzi, Amit J. Shirsat, James Joshi, Pat Gorman, Rajeev Gopalkrishna
	Maloof, M.A. and Michalski, R.S., ``A Partial Memory Incremental Learning Methodology and its Application to Intrusion Detection'' Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, 1995. (For more details on the learning algorithm used, see: Maloof, M.A. and Michalski, R.S., ``Selecting examples for partial memory learning'', Machine Learning 41:27-52, 2000.)	Chris Clifton	Mohamed Galal Elfeky, Jaideep Shrikant Vaidya, Xiaodong Lin, Ann-Sofie Nystrom
10/23	Nong Ye and Xiangyang Li, ``A Scalable Clustering Technique for Intrusion Signature Recognition'' 2001 IEEE Man Systems and Cybernetics Information Assurance Workshop, West Point, NY, June 5-6, 2001.	Evimaria Dimitrios Terzi	Benjamin Lee, James Joshi
	Leonid Portnoy, Eleazar Eskin, and Sal Stolfo ``Intrusion detection with unlabeled data using clustering'' Workshop on Data Mining for Security Applications. (Leonid Portnoy's thesis is also available.)	Xiaodong Lin	(none)
10/25	Matthew G. Schultz, Eleazar Eskin, Erez Zadok, and Salvatore J. Stolfo, ``Data Mining Methods for Detection of New Malicious Executables'', IEEE Symposium on Security and Privacy, Oakland, CA, May 2001.	Mohamed Galal Elfeky	Pat Gorman, Murat Kantarcioglu, Addam Schroll, Ann-Sofie Nystrom
	Christoph Michael and Anup K. Ghosh, ``Using Finite Automata to Mine Execution Data for Intrusion Detection: A Preliminary Report'', Third International Workshop on the Recent Advances in Intrusion Detection, October 2-4, 2000, Toulouse, France. (also available here).	Jaideep Shrikant Vaidya	Eirik Herskedal, Amit J. Shirsat, Rajeev Gopalkrishna
10/30	O. de Vel, A. Anderson, M. Corney, and G. Mohay, ``Multi-Topic E-mail Authorship Attribution Forensics'' Workshop on Data Mining for Security Applications.	Ann-Sofie Nystrom	Pat Gorman, Addam Schroll
	Terran Lane and Carla E. Brodley, ``Temporal sequence learning and data reduction for anomaly detection'', ACM Transactions on Information and System Security 2(3) (Aug. 1999), Pages 295 - 331.	Benjamin Lee	(none)
11/1	J. Hale, J. Threet, S. Shenoi, ``A Practical Formalism for Imprecise Inference Control'', Proceedings of the 8th IFIP WG11.3 Workshop on Database Security.	James Joshi	Eirik Herskedal, Evimaria Dimitrios Terzi, Mohamed Galal Elfeky
	S. Rath, D. Jones, J. Hale, S. Shenoi, ``A Tool for Inference Detection and Knowledge Discovery in Databases'', in Proceedings of the 9th IFIP WG11.3 Workshop on Database Security.	Amit J. Shirsat	Murat Kantarcioglu, Jaideep Shrikant Vaidya, Xiaodong Lin
11/6	J. Hale and S. Shenoi, ``Analyzing FD Inference in Relational Databases'', Data and Knowledge Engineering Journal, vol. 18, pp. 167-183, 1996	Eirik Herskedal	Benjamin Lee, James Joshi
	Chris Clifton and Don Marks, ``Security and Privacy Implications of Data Mining'', ACM SIGMOD Workshop on Data Mining and Knowledge Discovery, Montreal, Canada, June 2, 1996.	Rajeev Gopalkrishna	Pat Gorman, Jaideep Shrikant Vaidya
11/8	M. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim, and V. S. Verykios, ``Disclosure Limitation of Sensitive Rules'', In Proceedings of 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99) pp. 45-52, November 1999, Chicago, IL.	Amit J. Shirsat	Mohamed Galal Elfeky, Ann-Sofie Nystrom, Addam Schroll, Xiaodong Lin
	Chris Clifton ``Using Sample Size to Limit Exposure to Data Mining'', Journal of Computer Security 8(4), IOS Press, November 2000.	Chris Clifton	Rajeev Gopalkrishna, Evimaria Dimitrios Terzi
11/13	T. D. Johnsten and V. V. Raghavan, ``Impact of decision-region based classification mining algorithms on database security'', In V. Atluri and J. Hale, editors, Research Advances in Database and Information Systems Security, pages 171-191. Kluwer Academic, Norwell, MA, 2000. (See also the conference preproceedings version for a slightly longer treatment: ``Impact of decision-region based classification mining algorithms on database security'', In Proc. of Thirteenth IFIP WG 11.3 Working Conference on Database Security, Seattle, WA, July 1999.)	Benjamin Lee	Pat Gorman, Eirik Herskedal
	T. D. Johnsten and V. V. Raghavan, ``Security Procedures for Classification Mining Algorithms'', Fifteenth Annual IFIP WG 11.3 Working Conference on Database and Application Security, Niagara on the Lake, Ontario, CANADA, July 15-18, 2001.	Chris Clifton	Addam Schroll
11/15	Y. Lindell and B. Pinkas, ``Privacy Preserving Data Mining'', In Crypto 2000, Springer-Verlag (LNCS 1880), pages 36-54, 2000.	Jaideep Shrikant Vaidya	Murat Kantarcioglu, James Joshi, Xiaodong Lin
	R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining", Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.	Mohamed Galal Elfeky	Ann-Sofie Nystrom, Evimaria Dimitrios Terzi, Amit J. Shirsat, Rajeev Gopalkrishna
11/20	Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey D. Ullman and Cheng Yang ``Finding Interesting Associations without Support Pruning'', in Proceedings of the 16th International Conference on Data Engineering, 28 February - 3 March, 2000, San Diego, California.	Murat Kantarcioglu	Xiaodong Lin
	Yucel Saygin, Vassilios S. Verykios, and Chris Clifton, ``Using Unknowns to Prevent Discovery of Association Rules'', Submitted to ACM SIGMOD Record special issue on Data Mining and Security.	James Joshi	Evimaria Dimitrios Terzi, Addam Schroll, Eirik Herskedal, Rajeev Gopalkrishna
11/22	Thanksgiving.
11/27	Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, ``Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns'', in Proceedings of the 1997 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX-97), November 1997.	Evimaria Dimitrios Terzi	Benjamin Lee, James Joshi
	Wai Chiu Wong and A. Fu, ``Incremental Document Clustering for Web Page Classification'', IEEE 2000 Int. Conf. on Info. Society in the 21st century: emerging technologies and new challenges (IS2000), Nov 5-8, 2000, Japan.	Amit J. Shirsat	Mohamed Galal Elfeky, Jaideep Shrikant Vaidya
11/29	S. Hofmeyr, S. Forrest, and A. Somayaji, ``Intrusion Detection Using Sequences of System Calls'', Journal of Computer Security Vol. 6, pp. 151-180 (1998).	Ann-Sofie Nystrom	Eirik Herskedal, Rajeev Gopalkrishna
	Christopher J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2(2): 121-167, June 1998.	Xiaodong Lin	(none)
Bonus session: 10:30am, LAEB B254 (not mandatory)	Privacy Preserving Association Rule Mining in Vertically Partitioned Data.	Jaideep Shrikant Vaidya	None
12/4	Data Mining applied to File Integrity.	Pat Gorman	None
	Dakshi Agrawal and Charu C. Aggarwal, ``On the design and quantification of privacy preserving data mining algorithms'', in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2001.	Eirik Herskedal	Murat Kantarcioglu, Amit J. Shirsat, Benjamin Lee
12/6	Classifying disk blocks into file type.	Addam Schroll	None
	Bing Liu, Yiming Ma, Philip S. Yu, ``Discovering Unexpected Information from your Competitors' Web Sites'' in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2001).	Rajeev Gopalkrishna	Evimaria Dimitrios Terzi, Ann-Sofie Nystrom, James Joshi, Xiaodong Lin, Mohamed Galal Elfeky