60/15 - DATA MINING
Academic Year 2021/2022
BARBARA PES (Tit.)
|Degree programme|Curriculum|Credits|Hours|
|[60/65] MATHEMATICS|[65/60 - Ord. 2020] APPLIED MATHEMATICS|6|48|
|[60/68] PHYSICS|[68/40 - Ord. 2020] MEDICAL AND APPLIED PHYSICS|6|48|
|[60/73] INFORMATICS|[73/00 - Ord. 2017] COMMON CURRICULUM|6|48|
The course aims to present the main concepts, methods and techniques in the field of data mining and knowledge discovery.
In more detail, the course objectives include:
- to provide students with a solid background on the main KDD (Knowledge Discovery in Databases) tasks, including data preparation, extraction of patterns from data using supervised data mining approaches (classification) as well as unsupervised approaches (clustering, association rule mining), and pattern evaluation [KNOWLEDGE AND UNDERSTANDING];
- to develop problem-solving skills; a number of real-world problems will be presented and discussed [APPLYING KNOWLEDGE AND UNDERSTANDING];
- to develop critical thinking and decision-making skills; the students will learn to judge which data mining technique suits a given situation [MAKING JUDGEMENTS];
- to enable students to discuss data mining issues using proper terminology [COMMUNICATION SKILLS];
- to foster the students' ability to explore the course topics independently, including research topics, and to carry out a self-directed piece of practical work that may require applying data mining techniques to new problems or contexts [LEARNING SKILLS].
Students should have background knowledge of algorithms and data structures.
Basic concepts of databases, probability and statistics are also useful.
1) INTRODUCTION
- What is Data Mining?
- Data Mining and Knowledge Discovery.
2) DATA
- Types of data and general characteristics of datasets
- Data Quality
- Data Pre-processing
- Measures of similarity and dissimilarity.
3) CLASSIFICATION
- General approach to solving a classification problem
- Classification techniques: Decision trees, Rule-based classifiers, Nearest-Neighbor classifiers, Bayesian classifiers, Artificial Neural Networks, Support Vector Machines
- The problem of model overfitting
- Evaluating classification models: methods and metrics for performance evaluation, methods for model comparison.
4) ASSOCIATION ANALYSIS
- Problem definition (market-basket model)
- Support and confidence of association rules
- Apriori algorithm: frequent itemset generation, rule generation
- Evaluation of association rules.
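As a small illustration of the support and confidence measures listed above, the following Python sketch computes both over a toy set of market baskets. It uses the standard definitions (support is the fraction of transactions containing an itemset; confidence of X -> Y is support(X and Y together) divided by support(X)). The function names and the example baskets are hypothetical and are not taken from the course materials.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support(both, transactions) / antecedent_support


# Hypothetical market-basket data, invented for illustration only.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]

# {milk, diapers} appears in 3 of the 5 baskets.
print(support({"milk", "diapers"}, baskets))
# Of those 3 baskets, 2 also contain beer.
print(confidence({"milk", "diapers"}, {"beer"}, baskets))
```

An algorithm such as Apriori would use a minimum support threshold to prune candidate itemsets before generating rules; this sketch only evaluates a single itemset and a single rule.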
5) CLUSTER ANALYSIS
- Different types of clustering
- K-means algorithm (and extensions)
- Hierarchical techniques
- Cluster evaluation.
6) THE WEKA DATA MINING WORKBENCH
- How to apply data mining techniques
The teaching activity consists of 48 hours of lectures, which include guided exercises. Additional exercises are assigned as homework and then discussed in class, giving the students the opportunity to reinforce and self-assess their knowledge and skills. The teacher provides further support and personalized assistance during office hours and by e-mail.
Pandemic conditions permitting, the course will be held mainly in person, supplemented by appropriate online learning strategies.
Verification of learning
The assessment for this course includes:
- a written exam comprising theory questions (both open-ended and closed-ended) and exercises on the topics covered during the course; the aim is to evaluate the extent to which the student knows and understands the taught concepts and is able to apply them to practical problems;
- a final project which may involve the discussion of a scientific article or the application of data mining techniques to real-world data (with a presentation/discussion of the results); the aim is to evaluate the extent to which the student can autonomously cope with new topics/case studies.
Grades range from 1/30 to a maximum of 30/30 cum laude.
In more detail:
- up to 28 points are assigned based on the written exam;
- up to 4 points are assigned based on the project;
- the points gained in the written exam and those gained in the project are added together: to pass the exam, it is necessary to obtain at least 18 points in total; if the total exceeds 30 (that is, 31 or 32), the final grade will be ‘30 cum laude’.
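The point arithmetic above can be sketched in a few lines of Python. This is only an illustrative sketch, not an official grading tool: the function name `final_grade`, the restriction to whole-number points, and the "not passed" label for totals below 18 are assumptions that the syllabus does not state.

```python
def final_grade(written_points: int, project_points: int) -> str:
    """Combine written-exam and project points into a final grade label.

    Assumes whole-number points: 0 to 28 for the written exam and
    0 to 4 for the project, as described in the list above.
    """
    if not 0 <= written_points <= 28:
        raise ValueError("the written exam awards between 0 and 28 points")
    if not 0 <= project_points <= 4:
        raise ValueError("the project awards between 0 and 4 points")

    total = written_points + project_points
    if total < 18:
        return "not passed"    # fewer than 18 points in total (label is an assumption)
    if total > 30:
        return "30 cum laude"  # a total of 31 or 32
    return f"{total}/30"


print(final_grade(26, 3))  # 29/30
print(final_grade(28, 3))  # 30 cum laude
```

Note that the written exam alone yields at most 28/30, so under this rule a final grade of 29 or higher requires project points.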
To pass the exam (final grade of at least 18/30), the student must demonstrate sufficient knowledge of the data mining techniques covered in the course (pre-processing, classification, clustering, association analysis). To achieve the highest grade (‘30 cum laude’), the student must demonstrate excellent knowledge of the course topics and must be able to apply them to solve problems. Communication skills and proper terminology also contribute to the final grade.
The assessment methods may vary depending on the evolution of the COVID-19 emergency. In particular, the written exam may be replaced by an oral exam (online). Students will receive all the necessary information during the lessons.
Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar, Introduction to Data Mining, Pearson, 2018. (primary textbook)
Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2016. (for supplementary readings)
Auxiliary learning resources:
lecture slides, exercises with solutions, scientific articles.
The course materials will be available at https://elearning.unica.it/.