Welcome to the CSCI 3346 homepage!
What this course is about
This course provides an introduction to the field of data mining,
and emphasizes an applied machine learning perspective.
Data mining deals with the semi-automated analysis of collections of data, with the aim of discovering patterns that are informative and useful.
Data mining analysis tasks are important in fields such as
medical informatics and bioinformatics, e-commerce, and security.
The course will cover fundamental data mining tasks, relevant
concepts and techniques from machine learning and statistics, and data
mining applications to real-world domains such as document classification, gene
expression, analysis of human sleep recordings, and fraud detection.
Prerequisites / Required Background
The prerequisites for CSCI 3346 are Computer Science II (CSCI 1102)
or an equivalent course in data structures and object-oriented programming,
and Randomness and Computation (CSCI 2244) or a similar introduction to
probability and random variables.
Students taking CSCI 3346 are assumed to be comfortable with fundamental
programming techniques. Some of the problem sets may require writing programs
in a high-level programming language such as Python, Java, or MATLAB.
How to do well in this course
Students should expect to work hard. Keeping up with the reading
assignments and starting problem sets early are two key habits
that should be developed (if necessary) and maintained for the
entirety of the course.
Assigned reading will usually follow the in-lecture discussion
fairly closely, but there may be a few exceptions to this rule,
and in such cases it is especially important to make sure that
you've understood the material covered in the reading assignment.
When preparing problem set solutions, you should strive for
maximum clarity. Your write-ups should reveal the key ingredients
and the structure of your reasoning. Include detailed explanations.
These comments apply equally well to exam solutions.
I'll be happy to discuss any questions during my office hours. If you can't make it to my regular office hours, we can try to schedule an appointment at a mutually convenient time.
Grading
Each student's course grade will be based on exams, the term project
presentation and written report, homework, and class participation.
Academic Honesty
You are encouraged to discuss assigned homework problems (but not exams)
with other people, but you must individually design and write your own
solutions for all assignments. Furthermore, you should explicitly
acknowledge any sources of ideas that are not your own; this includes
other people, books, web pages, etc. You should not discuss exam problems
with anyone except the instructor.
Please also read the Boston College academic integrity policy.
Behavior that is not in the spirit of the above guidelines
may be reported to the appropriate class dean for review by
the Academic Integrity Board.
If you're in doubt about what constitutes academic dishonesty in a
particular situation, talk to me.
Downloads and Approximate Schedule of Topics (may change substantially)
Notes:
Deviations from the rough schedule below may occur. Updates will be made during the semester.
Chapter and section numbers in the schedule refer to the textbook (2019 edition, unless otherwise stated).
Problem sets will be available on Canvas to registered students.
The reading assignments and materials provided below complement the lectures, but are not intended to replace them. Students are responsible for topics discussed in the lectures as well as those covered in the reading assignments.
Week | Reading (see Canvas for Homework) | Topics |
Aug. 27-31 |
1, 2.1-2.4 (the first two chapters are almost the same in both editions)
Download and install Weka (requires Java 1.8 or later), MATLAB, and scikit-learn (requires Python, NumPy, and SciPy).
U.D. Reichel, H.R. Pfitzinger. "Text Preprocessing for Speech Synthesis", TC-Star Workshop on Speech-to-Speech Translation, 19-21 June 2006, Barcelona, Spain.
M.A. Hall, G. Holmes. "Benchmarking Attribute Selection Techniques for Discrete Class Data Mining", IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 1-16, May-June 2003. |
Introduction
Data mining tasks
Examples
Types of data
|
Sept. 3-7
PS1 due Sept. 4 (Canvas) |
2.3-2.4
A. Kapur, S. Kapur, and P. Maes. AlterEgo: a Personalized Wearable Silent Speech Interface, 23rd International Conference on Intelligent User Interfaces (IUI 2018), pp. 43-53, March 5, 2018.
E. Bair, T. Hastie, D. Paul, and R. Tibshirani. Prediction by Supervised Principal Components, Journal of the American Statistical Association, 101(473): 119-137, 2006.
D. J. Bartholomew, F. Steele, I. Moustaki, J. Galbraith. "Multidimensional Scaling", chapter 3 of The Analysis and Interpretation of Multivariate Data for Social Scientists (2nd ed.), Chapman and Hall / CRC, 2008, pages 55-81. |
Attribute selection
Feature extraction
PCA
Nonlinear embedding
FFT
Summary statistics
|
Sept. 10-14
Term project preliminary proposal due
PS2 due Sept. 11
|
(3.1-3.3 in first edition here - the 2nd ed. does not include this material)
plus 3.3, 2nd ed. (4.1-4.4 first ed.)
Ron Kohavi and J. Ross Quinlan. "Decision tree discovery", in Will Klosgen and Jan M. Zytkow, editors, Handbook of Data Mining and Knowledge Discovery, chapter 16.1.3, pages 267-276. Oxford University Press, 2002.
J48 decision tree construction / evaluation using Weka API: |
Exploratory data analysis
Data visualization
Classification
|
Sept. 17-21
PS3 due Sept. 18 |
3.5-3.6 (4.5-4.6 in first ed.)
4.1-4.3 (5.1-5.2 in first ed.)
R.L. Cilibrasi, P.M.B. Vitanyi. "The Google Similarity Distance", IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pages 370-383, 2007.
R.C. Holte. "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Machine Learning, vol. 11 (1993), 63-91.
R. L. Lawrence and A. Wright. "Rule-based classification systems using classification and regression tree (CART) analysis", Photogrammetric Engineering and Remote Sensing 67.10 (2001): 1137-1142.
W. W. Cohen. "Fast Effective Rule Induction", Proceedings of the Twelfth International Conference on Machine Learning, 115-123, 1995.
P. Domingos. "A Few Useful Things to Know About Machine Learning", Comm. ACM, vol. 55, no. 10, pages 78-87, 2012.
1-NN classification in Java using Weka API: |
Classifier evaluation
Comparing classifiers
Rule-based classifiers
Instance-based prediction
|
Sept. 24-28
PS4 due Sept. 25
First midterm exam
|
Topics for the first midterm exam | |
Oct. 1-5
Term project final proposal due |
Appendix C
4.4 (5.3 in first ed.)
Naive Bayes notes at Georgia Tech
"Naive Bayes text classification", in Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval, Cambridge U. Press, 2008.
S. Raschka. Naive Bayes and Text Classification |
Probability
Bayesian classifiers
NB document categorization |
Oct. 8-12
No class Tuesday, Oct. 9
|
4.5 (5.3 in first ed.)
E. Klarreich. In Search of Bayesian Inference. Communications of the ACM, vol. 58, no. 1, 21-24, Jan. 2015.
Remco R. Bouckaert. "Bayesian networks in Weka", Technical Report, Department of Computer Science, Waikato University, Hamilton, NZ, 2005. |
Bayesian classifiers
Bayesian networks |
Oct. 15-19
PS5 due Oct. 16 |
4.6, 4.7 (5.4 in first ed.) |
Logistic regression
Neural networks:
|
Oct. 22-26
PS6 due Oct. 23 |
4.7 (5.4 in first ed.), 4.8
S.A. Alvarez, T. Kawato, C. Ruiz. "Mining over Loosely Coupled Data Sources using Neural Experts", Proc. MDM/KDD 2003, Washington, DC, Aug. 2003. |
Multilayer networks
Hidden features
|
Oct. 29 - Nov. 2
PS7 due Oct. 30
Second midterm exam
|
4.8
S. Paisarnsrisomsuk, M. Sokolovsky, F. Guerrero, C. Ruiz, S.A. Alvarez. Deep Sleep: convolutional neural networks for predictive modeling of human sleep time-signals, KDD 2018 Deep Learning Day (in conjunction with KDD 2018), London, UK, Aug. 2018.
Topics for the second midterm exam |
Deep learning |
Nov. 5-9 |
4.9 (5.5 in first ed.), 4.11 (5.7 in first ed.)
C. Cortes, V. Vapnik. "Support Vector Networks", Machine Learning, vol. 20, no. 3, 273-297 (1995).
T. Fawcett. "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers", Hewlett-Packard Labs Technical Report HPL-2003-4, 2003.
N. V. Chawla. Data Mining for Imbalanced Datasets: An Overview, in Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Springer, 853-867, 2005.
J. Van Hulse, T. M. Khoshgoftaar, A. Napolitano. Experimental perspectives on learning from imbalanced data, in Proceedings of the 24th International Conference on Machine Learning, pp. 935-942, ACM, June 2007. |
Support Vector Machines (SVM)
Maximum margin classification
Nonlinear SVM, kernels
Class imbalance
ROC
Cost-sensitive classification |
Nov. 12-16
PS8 due Nov. 13 |
4.10 (5.6 in first ed.),
7.1-7.5 (8.1-8.5 in first ed.), 8.2 (9.2 in first ed.)
D. Pelleg, A. Moore. "X-means: extending k-means with efficient estimation of the number of clusters", ICML 2000.
Demo: k-means clustering for image segmentation from U. of Washington
P. D'haeseleer. Primer: How does gene expression clustering work?, Nature Biotechnology 23, 1499-1501 (2005).
S. Akaho and O. Michel. Gaussian mixture model EM algorithm
M. Luo, Y.-F. Ma, H.-J. Zhang. A spatial constrained k-means approach to image segmentation, Proc. ICIS-PCM 2003. |
Ensemble methods
Clustering
|
Nov. 19-23
No class on Thursday (Thanksgiving) |
8.4, 9.2
T. Bertin-Mahieux, D.P.W. Ellis, B. Whitman, P. Lamere. The Million Song Dataset, Proc. 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011. |
Probabilistic clustering (E-M)
Density-based clustering |
Nov. 26-30
Term project presentations begin |
10, 6.1-6.3
Y. Liao, V. R. Vemuri. "Using text categorization techniques for intrusion detection", Proceedings of the 11th USENIX Security Symposium, pp. 51-59, Aug. 2002.
E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo. A geometric framework for unsupervised anomaly detection, Proceedings of Applications of Data Mining in Computer Security, Kluwer, 78-100, 2002.
Y. Zhong, Y. Deng, and A. K. Jain. Keystroke dynamics for user authentication, CVPR Workshop on Biometrics, Providence, RI, June 16-21, 2012. |
Anomaly detection |
Dec. 3-7
Last week of classes
Term project presentations
Term project report due Dec. 7
|
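
The Sept. 17-21 row of the schedule pairs instance-based prediction (1-NN) with classifier evaluation. As a rough illustration of those two topics, and not the course's own Weka-based example, here is a minimal sketch in plain Python of a 1-nearest-neighbor classifier with leave-one-out evaluation. The toy data, the function names, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train, query):
    """Return the label of the training example closest to query.

    train is a list of (features, label) pairs; ties go to the earliest example.
    """
    _, label = min(train, key=lambda example: euclidean(example[0], query))
    return label


def leave_one_out_accuracy(data):
    """Fraction of examples classified correctly when each is held out in turn."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # training set without example i
        if predict_1nn(rest, features) == label:
            correct += 1
    return correct / len(data)


# Hypothetical toy data: two well-separated groups in the plane.
toy_data = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((4.8, 5.3), "b"),
]

if __name__ == "__main__":
    print(predict_1nn(toy_data, (1.1, 0.9)))
    print(leave_one_out_accuracy(toy_data))
```

Leave-one-out is used here only because the toy data set is tiny; on larger data sets, k-fold cross-validation is a common alternative.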