CS 373: Data Mining and Machine Learning
Semester: Spring 2019
Time: Tuesday, Thursday, 12-1:15pm
Place: Haas G066
Staff
Instructor: Dan Goldwasser
Office hours: See Piazza page. LWSN 2142A
Contact: dgoldwas AT purdue DOT edu
TA: See Piazza page
Description and Learning Objectives
Machine learning offers a new paradigm of computing
– computer systems that can learn to perform tasks by finding patterns in
data, rather than by running code specifically written by a human programmer to
accomplish the task. This class will introduce students to this field, which
sits at the intersection of statistics and computer science. We will look into
algorithms that can automatically discover patterns and learn models from large
datasets.
After taking this course, students will be able to: (1) identify
key elements of data mining and machine learning algorithms; (2) understand how
algorithmic elements interact to impact performance; (3) understand how to
choose algorithms for different analysis tasks; (4) analyze data in both an
exploratory and targeted manner; (5) implement and apply basic algorithms for
supervised and unsupervised learning; and (6) accurately evaluate the performance of
algorithms, as well as formulate and test hypotheses.
Textbook
D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press.
M. Mohri (2012). Foundations of Machine Learning. MIT Press.
T. Mitchell (1997). Machine Learning. McGraw Hill.
There will be five homework/programming assignments that will be posted on the
schedule. Homework assignments should be submitted in class, unless otherwise
noted. Programming assignments should be written in
Python, unless otherwise noted, and should be submitted on data.cs.purdue.edu
using Turnin. Details will be provided in the
assignments.
In
general, questions about the details of homework assignments should be directed
to the TA, though you should feel free to mail the instructor whenever you have
a question. Example solutions, when applicable, will be made available after
homework is returned to students.
There
will be several online quizzes as well as a midterm and comprehensive final
exam. Exams will be closed book and closed notes.
- Quizzes/participation: 10%
- Homework: 45%
- Midterm: 20%
- Final exam: 25%
Grades will be posted on Blackboard.
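As a rough illustration of how the weights above combine, the following Python sketch computes a weighted course score. The function name, the dictionary keys, and the 0-100 scale for each component are assumptions made for this example; the syllabus specifies only the percentages.

```python
# Weights taken from the grading breakdown; the key names are hypothetical.
WEIGHTS = {
    "quizzes_participation": 0.10,
    "homework": 0.45,
    "midterm": 0.20,
    "final_exam": 0.25,
}


def course_score(component_scores):
    """Return the weighted average of the component scores.

    Assumes every component is already expressed on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


example = {
    "quizzes_participation": 80,
    "homework": 90,
    "midterm": 70,
    "final_exam": 85,
}
print(course_score(example))
```

Because homework carries 45% of the total, a 10-point change in the homework average moves the course score by 4.5 points, more than the same change on either exam.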
Assignments
are to be submitted by the due date listed. Each person will be allowed four
days of extensions which can be applied to any
combination of assignments during the semester without penalty. After that, a
late penalty of 15% per day will be assigned. Use of a partial day will be
counted as a full day. Use of extension days must be stated
explicitly in the late submission (either directly in the submission
header or by accompanying email to the TA), otherwise late
penalties will apply. Extensions cannot be used after the final day of
classes (i.e., Dec 11
midnight). Extension days cannot be rearranged after they are applied to a
submission. Use them wisely!
Assignments
will NOT BE accepted if they are more than five days late. Additional
extensions will be granted only due to serious and documented medical or family
emergencies.
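The arithmetic of the late policy can be sketched in Python as follows. This is one possible reading, not an official calculator: it assumes the 15% is deducted from the raw score for each late day not covered by an extension day, and that the five-day cutoff counts all late days, including those covered by extensions. The function and parameter names are hypothetical.

```python
import math

LATE_PENALTY_PER_DAY = 0.15  # from the policy: 15% per day
MAX_EXTENSION_DAYS = 4       # from the policy: four extension days per person
MAX_DAYS_LATE = 5            # from the policy: not accepted if more than five days late


def late_days(hours_late):
    """Count late days, rounding any partial day up to a full day."""
    return max(0, math.ceil(hours_late / 24))


def penalized_score(raw_score, hours_late, extension_days_used=0):
    """Return the score after the late penalty (0.0 if submitted too late).

    Assumptions (not stated in the policy): the penalty is a fraction of the
    raw score, it is capped at 100%, and the five-day cutoff ignores
    extension days.
    """
    if not 0 <= extension_days_used <= MAX_EXTENSION_DAYS:
        raise ValueError("extension_days_used must be between 0 and 4")
    days = late_days(hours_late)
    if days > MAX_DAYS_LATE:
        return 0.0  # submission not accepted
    penalized_days = max(0, days - extension_days_used)
    penalty = min(1.0, LATE_PENALTY_PER_DAY * penalized_days)
    return raw_score * (1.0 - penalty)
```

Under these assumptions, a submission that is 30 hours late counts as two late days: with no extension days applied it keeps 70% of its score, and with two extension days applied it keeps the full score.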
Please read the departmental academic
integrity policy. This will be followed unless we provide written
documentation of exceptions. We encourage you to interact amongst yourselves:
you may discuss and obtain help with basic concepts covered in lectures or the
textbook, homework specification (but not solution), and program implementation
(but not design). However, unless otherwise noted, work turned in should
reflect your own efforts and knowledge. Sharing or copying solutions is
unacceptable and could result in failure. We use copy detection software, so do
not copy code and make changes (either from the Web or from other students).
You are expected to take reasonable precautions to prevent others from using
your work.
Please read the general course policies here.
Introduction (1 week)
What is data mining? What is machine learning? Overview of the process and associated tasks. Example
applications.
Background and basics (1 week)
Types of data: attributes, instances. Populations and
samples. Random variables and distributions. R
and Python.
Exploratory data analysis (2 weeks)
Data cleaning and preprocessing. Sampling. Feature
construction and discovery. Visualization methods. Hypothesis testing.
Predictive Modeling (3 weeks)
Classification problem formulation. Algorithmic elements:
representation, scoring functions, search, inference. Overview of basic algorithms (e.g., naive Bayes, decision trees,
nearest neighbor). Evaluation: metrics, cross-validation, learning curves.
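To give a flavor of the evaluation topics listed above, here is a minimal sketch of k-fold cross-validation wrapped around a 1-nearest-neighbor classifier, written in plain Python. It is an illustration only, not course material: the function names and the toy data are invented for this example, and the folds are contiguous slices rather than a shuffled split.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of the training point closest to x (squared Euclidean)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def cross_val_accuracy(xs, ys, k):
    """Average held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        correct = sum(
            nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Toy data: two well-separated groups, interleaved so each fold sees both labels.
xs = [(0, 0), (5, 5), (0, 1), (5, 6), (1, 0), (6, 5)]
ys = [0, 1, 0, 1, 0, 1]
print(cross_val_accuracy(xs, ys, k=3))
```

Each point is scored only by a model that never saw it during training, which is what makes the averaged accuracy an estimate of performance on new data. With real data the examples would usually be shuffled before the folds are formed.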
Understanding and Extending Model Performance (1 week)
Error analysis. Feature selection. Ensemble techniques.
Descriptive Modeling (3 weeks)
Clustering problem formulation. Algorithmic elements:
representation, scoring functions, search, inference. Overview of basic algorithms (e.g., k-means, hierarchical
clustering, spectral clustering). Evaluation: metrics, subjective
assessment.
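As a concrete instance of the clustering algorithms named above, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It assumes the caller supplies the initial centroids and a fixed iteration count; the function names and toy points are invented for this example and are not taken from the course.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

In terms of the algorithmic elements listed above, the centroids are the representation, the sum of squared distances is the scoring function, and the alternating assign/update loop is the search procedure.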
Pattern Mining (2 weeks) (subject to change)
Pattern detection formulation. Algorithmic elements: representation, scoring
functions, search, inference. Overview
of basic algorithms (e.g., association rules, anomaly detection).
Evaluation: metrics, interestingness, understandability.
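Two standard metrics for association rules, support and confidence, can be sketched in a few lines of Python. The transactions below are made-up toy data and the function names are hypothetical; this is meant only to illustrate what "metrics" can refer to in this unit.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Toy market-basket data (invented for this example).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here the rule bread -> milk holds in two of the three transactions that contain bread, so its confidence is two thirds, while the itemset {bread, milk} appears in half of all transactions.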