Advanced Big Data Analysis
Math 389L, Spring 2019
Claremont Graduate University
Professor: Weiqing Gu
Teaching Assistant: Conner DiPaolo
Meeting Time
T 07:00-09:50PM. SHAN 3460/3485
Office Hours
T 06:15-07:00PM. SHAN 3460/3485 (with Conner)
Course Description
This graduate-level course is designed to give students a snapshot of recent techniques used to analyze extremely large datasets, both statistically and algorithmically. To accomplish this goal, the course will start with a quick, applied introduction to the necessary optimization background. From there we will introduce students to topics such as spectral graph clustering, fast kernel methods, and compressed sensing, among others. We will highlight applications of these methods to diverse areas such as genomics and recommender systems, but the bedrock of the course will be theory. To that end, students are expected to have a solid foundation in probability and analysis, as well as comfort with algorithmic thinking.
Textbooks
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
- Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.
- Woodruff, D. P. (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science.
Grading
- 35% Homework
- 30% Midterm Project
- 35% Final Project
- [Up to 5% Extra Credit]
Homework
Problem sets will be due (virtually) every week, on Tuesday in class. Problems will be discussed in class, and will often be designed to investigate or re-prove results from research in this area. Problems might require coding. For this we recommend Python (e.g., using NumPy and SciPy), MATLAB, or R.
Midterm Project
Details will be given in class, but the midterm project will resemble the final project in nature. If the final project is to be a continuation of the midterm project (which is expected), significant additional progress must be made.
Final Project
The final project is intended to give students, in groups of 2-3, the opportunity to deep-dive into a specific area of interest in linear algebra or matrix analysis. The project can be theoretical or applied, but in both cases it should originate from a single question. For example, such questions might be:
- Can we achieve as good empirical performance as deep neural networks by using Gaussian Process methods?
- Can we cluster on sketched data in metrics other than the L2 norm?
- Can we use sketched data to solve logistic regression problems?
- How can we estimate the spectral norm of a matrix using limited space?
- Which properties of matrices can we approximate only by testing a few inputs?
- How can large scale linear system solvers from numerical linear algebra be used to speed up kernel methods?
About a month after proposing their initial question, students will submit a literature review of work that attempts to answer their question. This will be at most four pages in LaTeX with one-inch margins, not including references. The review should include important definitions and discuss the body of work surrounding the question. At the top of the paper, as an abstract, the student should include a refined version of their motivating question.
By the end of the course, students are expected to continue investigating their question. In particular, students should be able to find a concrete open problem in the area or a blind spot in the body of research. (Hint: look at the end of recent papers.) Open problems can be empirical (e.g., investigating the geometry of neural network loss surfaces through the spectral information in the Hessians), applied theory (e.g., creating algorithms for robust low-rank approximation), or theoretical (e.g., lower bounds on robust low-rank approximation).
Before the end of the course, using this prior work on the project, the student will write a paper of at least 12 pages that details the background and progress of the body of research on their open problem, promising directions, and demonstrations of results (computations or proofs). If the student is able to solve, or even make concrete progress towards, the open problem, they will get an A on the final paper. Otherwise, experimental evidence towards their open problem is expected. The group will also give a presentation of at most 15 minutes detailing this adventure.
Deadlines
- (Mar 5) Motivating question. Hand in stapled to the back of the midterm project.
- (Apr 2) Literature Review. Turn in to Prof. Gu's office before 6:00pm.
- (May 7) Presentation; Final Report due in class.
Disabilities
Students who need disability-related accommodations are encouraged to discuss this with the instructor as soon as possible.