Purdue CS440: Large-scale Data Analytics
(Fall 2023)




Course Description

"Big data" has been a buzzword for a long time, and many disruptive techniques have been developed to address its various aspects. This course covers the key concepts, design principles, and systems for analyzing large-scale data to extract novel and transformative insights. Tentative topics include database fundamentals, big data storage (e.g., HDFS), big data computing frameworks (e.g., Hadoop and Spark), data warehouses, data lakes, graph analytics (e.g., Spark Graph), data streaming (e.g., Spark Streaming), large-scale machine learning (e.g., Spark MLlib), and cloud-native data analytics.




Instructor

  • Jianguo Wang
  • Email: csjgwang@purdue.edu (note: the subject line must include "[CS440]")



Teaching Assistants

  • Shige Liu (liu3529@purdue.edu)
  • Chenzhe Jin (jin467@purdue.edu)
  • Yunxin Sun (sun1114@purdue.edu)



Logistics

  • When: MW 2:30pm-3:20pm
  • Where: WTHR 104
  • Office hours: after class or by appointment
  • Prerequisites: CS242, CS251, and CS373



Labs and PSOs

Labs and PSOs start in the 2nd week and are held in LWSN B148.
  • L01: Thursday 11:30am-1:20pm
  • L02: Thursday 3:30pm-5:20pm
  • L03: Wednesday 5:30pm-7:20pm
  • L04: Thursday 9:30am-11:20am



Online communications

  • We'll use Piazza for announcements, discussions, and Q&A.
  • We'll NOT use Brightspace, except to send occasional emails.
  • We'll use Gradescope for submitting and grading homework.



Textbooks (Optional)

Note that textbooks are optional and the lecture slides are self-contained.



Grading

  • Homework: 20% (2 * 10%)
  • Midterm exam: 25%
  • Final exam: 35%
    • 8:00am - 10:00am, WTHR 172, Dec 13
  • Project: 20% (4 * 5%)
    • Projects are related to the labs and will be explained in the labs.
  • Extra credit: 5%
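
As a rough illustration of how the weights above combine into a course total, here is a minimal Python sketch. Only the percentages come from the grading list; the function name `course_total`, the sample scores, and the assumption that extra credit is added on top of the 100% base and capped at 5 points are hypothetical and are not stated course policy.

```python
# Component weights in percent, taken from the grading list above.
WEIGHTS = {
    "homework": 20,  # 2 assignments at 10% each
    "midterm": 25,
    "final": 35,
    "project": 20,   # 4 parts at 5% each
}
EXTRA_CREDIT_MAX = 5.0  # percentage points


def course_total(scores, extra_credit=0.0):
    """Return the weighted course percentage.

    scores: component name -> percentage earned on that component (0-100).
    extra_credit: percentage points earned. ASSUMPTION: added on top of
    the base total and clamped to the range 0..EXTRA_CREDIT_MAX.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly: " + ", ".join(WEIGHTS))
    # Weighted sum of the four graded components.
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
    bonus = min(max(extra_credit, 0.0), EXTRA_CREDIT_MAX)
    return base + bonus


# Hypothetical scores, for illustration only.
example = {"homework": 90, "midterm": 80, "final": 85, "project": 100}
print(course_total(example, extra_credit=2))  # 89.75
```

With these made-up scores, the four components contribute 18 + 20 + 29.75 + 20 = 87.75 points, and the 2 extra-credit points bring the total to 89.75.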



Academic Integrity and More




Schedule

Lecture (Date) Topic

Lec 1 (08/21) Course Introduction
Lec 2 (08/23) Relational DB
Lec 3 (08/28) SQL 1
Lec 4 (08/30) SQL 2
Lec 5 (09/04) No class due to Labor Day
Lec 6 (09/06) Database Storage
Lec 7 (09/11) Index
Lec 8 (09/13) Query Processing 1
Lec 9 (09/18) Query Processing 2
Lec 10 (09/20) Transaction
Lec 11 (09/25) Concurrency Control
Lec 12 (09/27) Crash Recovery
Lec 13 (10/02) Crash Recovery 2
Lec 14 (10/04) Distributed Databases
Lec 15 (10/09) No class due to October break
Lec 16 (10/11) Midterm Exam (In-class)
Lec 17 (10/16) Hadoop
Lec 18 (10/18) SQL-on-Hadoop
Lec 19 (10/23) Big Data Storage
Lec 20 (10/25) Big Data Storage 2
Lec 21 (10/30) Spark Core
Lec 22 (11/01) Spark Core 2
Lec 23 (11/06) Spark SQL
Lec 24 (11/08) Spark ML
Lec 25 (11/13) Spark Streaming
Lec 26 (11/15) Spark Streaming 2
Lec 27 (11/20) Spark Graph
Lec 28 (11/22) No class due to Thanksgiving
Lec 29 (11/27) Cloud-Native Data Analytics
Lec 30 (11/29) Cloud-Native Data Analytics 2
Lec 31 (12/04) Vector Data Analytics
Lec 32 (12/06) Review