CSCI 3346, Fall 2018
Data Mining
Course Homepage and Syllabus

Note: The course syllabus may be modified slightly during the term.
Updates will be made only on the course web page, located at
http://www.cs.bc.edu/~alvarez/DataMining/.

Welcome to the CSCI 3346 homepage!

Who, where, and when

Instructor

Prof. Sergio A. Alvarez

alvarez@bc.edu

Office: St. Mary's Hall, S255

Office hours: Tu and Th, 10:30 am - noon, or by appointment

TAs (TA office hours are held in Fulton 160)

Yueming Chen: Mon 5-6, Thurs 1-2

Tiwalayo Eisape: Sun 2-4

Lectures

The class meets Tu and Th, 9-10:15, in Fulton 415.

Course web page

This web page for CSCI 3346 is located at http://www.cs.bc.edu/~alvarez/DataMining/. You should check both this page and the Canvas page regularly for updates.

References

Textbook (required)

P.-N. Tan, M. Steinbach, A. Karpatne, V. Kumar,
Introduction to Data Mining, 2nd ed., Pearson, 2018

Additional references

Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher Pal. Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, Morgan Kaufmann, 2016
Weka (Waikato Environment for Knowledge Analysis)
scikit-learn (requires Python and NumPy)

Datasets

University of California, Irvine (UCI) Machine Learning Repository
Delft University of Technology Pattern Recognition Laboratory Data
List of datasets at KDnuggets

References on MATLAB

Links to MATLAB tutorials:
http://www.mathworks.com/academia/student_center/tutorials/launchpad.html

Objective

This course provides an introduction to the field of data mining, and emphasizes an applied machine learning perspective. Data mining deals with the semi-automated analysis of collections of data, with the aim of discovering patterns that are informative and useful. Data mining analysis tasks are important in fields such as medical informatics and bioinformatics, e-commerce, and security. The course will cover fundamental data mining tasks, relevant concepts and techniques from machine learning and statistics, and data mining applications to real-world domains such as document classification, gene expression, analysis of human sleep recordings, and fraud detection.

Prerequisites / Required Background

The prerequisites for CSCI 3346 are Computer Science II (CSCI 1102) or an equivalent course in data structures and object-oriented programming, and Randomness and Computation (CSCI 2244) or a similar introduction to probability and random variables. Students taking CSCI 3346 are assumed to be comfortable with fundamental programming techniques. Some of the problem sets may require writing programs in a high-level programming language such as Python, Java, or MATLAB.

How to do well in this course

Students should expect to work hard. Keeping up with the reading assignments and starting problem sets early are two key habits that should be developed (if necessary) and maintained for the entirety of the course. Assigned reading will usually follow the in-lecture discussion fairly closely, but there may be a few exceptions to this rule, and in such cases it is especially important to make sure that you've understood the material covered in the reading assignment. When preparing problem set solutions, you should strive for maximum clarity. Your write-ups should reveal the key ingredients and the structure of your reasoning. Include detailed explanations. These comments apply equally well to exam solutions.

I'll be happy to discuss any questions during my office hours. If you can't make it to my regular office hours, we can try to schedule an appointment at a mutually convenient time.

Grading

Each student's course grade will be based on exams, the term project presentation and written report, homework, and class participation.

Problem sets will be due several times during the semester. The average score on the problem sets will contribute 20% of your course grade. Problem sets provide guided opportunities to become more familiar with the fundamental concepts and tools of the course. Therefore, it is extremely important to conscientiously work the problem sets to the best of your ability. If you have questions, ask. Late problem set submissions will receive no credit except in very unusual circumstances (at the discretion of the instructor). Problem sets will be posted on Canvas approximately one week before their due dates.
Two in-class midterm exams will be given. See the schedule below for exam dates. Exams will be open book and open notes. No make-up exams will be given. Each of the exam scores will contribute 20% of your course grade.
A term project will substitute for a final exam. The project report(s) and final oral presentation will count as 20% of your course grade.
In-class activities and participation will count as 20% of your course grade. This includes possible in-class discussion of homework problems or the current reading assignment.
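
As a rough illustration of how the weights listed above combine, here is a minimal Python sketch. The weights come from the grading items above; the function name, the component names, and the sample scores are hypothetical, and the sketch assumes every component is reported on a 0-100 scale. It says nothing about how a weighted score maps to a letter grade, which this syllabus does not specify.

```python
# Weights from the Grading section: five components at 20% each.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "problem_sets": 0.20,   # average score over all problem sets
    "midterm_1": 0.20,
    "midterm_2": 0.20,
    "term_project": 0.20,   # report(s) and final oral presentation
    "participation": 0.20,  # in-class activities and participation
}


def course_score(scores):
    """Return the weighted average of component scores (each assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, for illustration only.
example = {
    "problem_sets": 90,
    "midterm_1": 80,
    "midterm_2": 85,
    "term_project": 95,
    "participation": 100,
}
print(round(course_score(example), 1))
```

Because all five weights are equal, the result is simply the mean of the five component scores; the dictionary form just makes the weighting explicit.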

Academic Honesty

You are encouraged to discuss assigned homework problems (but not exams) with other people, but you must individually design and write your own solutions for all assignments. Furthermore, you should explicitly acknowledge any sources of ideas used that are not your own; this includes other people, books, web pages, etc. You should not discuss exam problems with anyone except the instructor. Please also read the Boston College academic integrity policy. Behavior that is not in the spirit of the above guidelines may be reported to the appropriate class dean for review by the Academic Integrity Board. If you're in doubt about what constitutes academic dishonesty in a particular situation, talk to me.

Downloads and Approximate Schedule of Topics (may change substantially)

Notes:

Deviations from the rough schedule below may occur. Updates will be made during the semester.

Chapter and section numbers in the schedule refer to the textbook (2nd edition, unless otherwise stated).

Problem sets will be available on Canvas to registered students.

The reading assignments and materials provided below complement the lectures, but are not intended to replace them. Students are responsible for topics discussed in the lectures as well as those covered in the reading assignments.


Week Reading (see Canvas for Homework) Topics

Aug. 27-31
1, 2.1-2.4 (the first two chapters are almost the same in both editions)
Download and install Weka (requires Java 1.8 or later), MATLAB, and scikit-learn (requires Python, NumPy, and SciPy)
U.D. Reichel, H.R. Pfitzinger. "Text Preprocessing for Speech Synthesis", TC-Star Workshop on Speech-to-Speech Translation, 19-21 June 2006, Barcelona, Spain.
M.A. Hall, G. Holmes. "Benchmarking Attribute Selection Techniques for Discrete Class Data Mining", IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 1-16, May-June 2003.
Introduction
Data mining tasks
Examples
Types of data
Data preprocessing
Aggregation, sampling
Discretization

Sept. 3-7
PS1 due Sept. 4 (Canvas)
2.3-2.4
Appendices B and C.2
Correlation and covariance
A. Kapur, S. Kapur, and P. Maes. AlterEgo: a Personalized Wearable Silent Speech Interface, 23rd International Conference on Intelligent User Interfaces (IUI 2018), pp 43-53, March 5, 2018.
E. Bair, T. Hastie, D. Paul, and R. Tibshirani. Prediction by Supervised Principal Components, Journal of the American Statistical Association, 101(473): 119-137, 2006
D. J. Bartholomew, F. Steele, I. Moustaki, J. Galbraith. "Multidimensional Scaling", chapter 3 of The Analysis and Interpretation of Multivariate Data for Social Scientists (2nd ed.), Chapman and Hall / CRC, 2008, pages 55-81
Attribute selection
Feature extraction
PCA
Nonlinear embedding
FFT
Summary statistics
Covariance
Similarity

Sept. 10-14
Term project preliminary proposal due
PS2 due Sept. 11

3.1-3.3 in the first edition (the 2nd ed. does not include this material)
plus 3.3 in the 2nd ed. (4.1-4.4 in first ed.)
Ron Kohavi and J. Ross Quinlan. "Decision tree discovery", in Will Klosgen and Jan M. Zytkow, editors, Handbook of Data Mining and Knowledge Discovery, chapter 16.1.3, pages 267-276. Oxford University Press, 2002.
Decision tree pruning
J48 decision tree construction / evaluation using Weka API:
TrainTestSplit.java
R2D3 Visual introduction to decision trees
R2D3 Visual introduction to the bias-variance tradeoff
Exploratory data analysis
Data visualization
Classification
Decision tree induction
Overfitting
Bias-variance tradeoff

Sept. 17-21
PS3 due Sept. 18
3.5-3.6 (4.5-4.6 in first ed.)
4.1-4.3 (5.1-5.2 in first ed.)
R.L. Cilibrasi, P.M.B. Vitanyi. "The Google Similarity Distance", IEEE Transactions on Knowledge and Data Engineering, vol 19, no 3, pages 370-383, 2007.
R.C. Holte. "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Machine Learning, vol. 11 (1993), 63-91.
R. L. Lawrence and A. Wright. "Rule-based classification systems using classification and regression tree (CART) analysis", Photogrammetric engineering and remote sensing 67.10 (2001): 1137-1142.
Sequential covering
W. W. Cohen. "Fast Effective Rule Induction", Proceedings of the Twelfth International Conference on Machine Learning, 115-123, 1995.
P. Domingos. "A Few Useful Things to Know About Machine Learning", Comm. ACM, vol. 55, no. 10, pages 78-87, 2012.
1-NN classification in Java using Weka API:
WekaAPISampleClient.java

Classifier evaluation
Comparing classifiers
Rule-based classifiers
Instance-based prediction
k-NN classifier / regressor

Sept. 24-28
PS4 due Sept. 25
First midterm exam
In class, Sept. 27
Topics for the first midterm exam
My machine learning notes

Oct. 1-5
Term project final proposal due
Appendix C
4.4 (5.3 in first ed.)
Naive Bayes notes at Georgia Tech
"Naive Bayes text classification", in Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval, Cambridge U. Press, 2008
S. Raschka. Naive Bayes and Text Classification
Probability
Bayesian classifiers
Naive Bayes
NB document categorization

Oct. 8-12
No class Tu Oct. 9
(BC fall break)

4.5 (5.3 in first ed.)
Bayes networks example
E. Klarreich. In Search of Bayesian Inference. Communications of the ACM, vol. 58, no. 1, 21-24, Jan. 2015
Remco R. Bouckaert. "Bayesian networks in Weka", Technical Report, Department of Computer Science, Waikato University, Hamilton, NZ 2005.
Bayesian classifiers
Bayesian networks

Oct. 15-19
PS5 due Oct. 16
4.6, 4.7 (5.4 in first ed.)
Logistic regression
Neural networks:
perceptrons

Oct. 22-26
PS6 due Oct. 23
4.7 (5.4 in first ed.), 4.8
Training multilayer ANN
Hidden layer representations
S.A. Alvarez, T. Kawato, C. Ruiz. "Mining over Loosely Coupled Data Sources using Neural Experts", Proc. MDM/KDD 2003, Washington, DC, Aug. 2003.
Multilayer networks
Hidden features

Oct. 29 - Nov. 2
PS7 due Oct. 30
Second midterm exam
In class, Nov. 1
4.8
S. Paisarnsrisomsuk, M. Sokolovsky, F. Guerrero, C. Ruiz, S.A. Alvarez. Deep Sleep: convolutional neural networks for predictive modeling of human sleep time-signals, KDD 2018 Deep Learning Day (in conjunction with KDD 2018), London, UK, Aug. 2018
Topics for the second midterm exam
Deep learning

Nov. 5-9
4.9 (5.5 in first ed.), 4.11 (5.7 in first ed.)
C. Cortes, V. Vapnik. "Support Vector Networks", Machine Learning, vol 20, no 3, 273-297 (1995).
T. Fawcett. "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers", Hewlett-Packard Labs Technical Report HPL-2003-4, 2003.
N. V. Chawla. Data Mining for Imbalanced Datasets: An Overview, in Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Springer, 853-867, 2005
Van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. Experimental perspectives on learning from imbalanced data, In Proceedings of the 24th International Conference on Machine Learning, pp. 935-942, ACM, June 2007.
Support Vector Machines (SVM)
maximum margin classification
nonlinear SVM, kernels
Class imbalance
Class-specific performance
Thresholded classification
ROC examples
ROC optimization
ROC
Cost-sensitive classification

Nov. 12-16
PS8 due Nov. 13
4.10 (5.6 in first ed.), 7.1-7.5 (8.1-8.5 in first ed.), 8.2 (9.2 in first ed.)
D. Pelleg, A. Moore. "X-means: extending k-means with efficient estimation of the number of clusters", ICML 2000
Demo: k-means clustering for image segmentation from U. of Washington
P. D'haeseleer. Primer: How does gene expression clustering work?, Nature Biotechnology 23, 1499 - 1501 (2005)
S. Akaho and O. Michel. Gaussian mixture model EM algorithm
M. Luo, Y.-F. Ma, H.-J. Zhang. A spatial constrained k-means approach to image segmentation, Proc. ICIS-PCM 2003
Ensemble methods
Clustering
k-Means
Hierarchical clustering

Nov. 19-23
No class on Thursday (Thanksgiving)
8.4, 9.2
T. Bertin-Mahieux, D.P.W. Ellis, B. Whitman, P. Lamere. The Million Song Dataset, Proc. 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

Probabilistic clustering (E-M)
Density-based clustering

Nov. 26-30
Term project presentations begin
10, 6.1-6.3
Y. Liao, V. R. Vemuri. "Using text categorization techniques for intrusion detection", Proceedings of the 11th USENIX Security Symposium, pp 51-59, Aug. 2002.
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo. A geometric framework for unsupervised anomaly detection, Proceedings of Applications of Data Mining in Computer Security, Kluwer, 78-100, 2002.
Y. Zhong, Y. Deng, and A. K. Jain. Keystroke dynamics for user authentication, CVPR Workshop on Biometrics, Providence, RI, June 16-21, 2012.
Anomaly detection

Dec. 3-7
Last week of classes
Term project presentations
Term project report due Dec. 7

