Data Quality for AI

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges not only concerning the availability of data but also regarding its quality. For example, incomplete, erroneous, inappropriate, or imbalanced training data leads to unreliable models and can ultimately lead to poor decisions, a problem often referred to as "garbage in, garbage out" (GIGO). GIGO expresses the idea that in computing and other spheres, incorrect or poor-quality input will always produce faulty output¹. High-performance AI applications require high-quality training and test data.

This data can include personal information, sensitive financial details, and confidential business data. Privacy is a fundamental human right, and it is essential to protect personal information to ensure trust and maintain a fair and just society. One common approach to address these concerns is to use anonymized data in machine learning algorithms. Differential privacy and k-anonymity are the most widely used families of anonymization techniques. However, there is little substantial research that demonstrates the effect of anonymization on data quality and thus on the downstream ML application.
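To make the two families more concrete, here is a minimal Python sketch, not a reference implementation. It shows k-anonymity via generalization of an age column into intervals, and differential privacy via the Laplace mechanism on a counting query. The helper names (generalize_age, is_k_anonymous, private_count), the toy records, and the choice of age and zip as quasi-identifiers are assumptions made for illustration only.

```python
import random
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a coarse interval such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1): noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy table; 'age' and 'zip' act as quasi-identifiers.
records = [{"age": a, "zip": "14482"} for a in (31, 34, 38, 45, 47, 49)]
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

print(is_k_anonymous(records, ["age", "zip"], k=2))      # exact ages are all unique
print(is_k_anonymous(generalized, ["age", "zip"], k=3))  # two age groups of three records

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
noisy = private_count(len(records), epsilon=1.0, rng=rng)
print(noisy)  # the true count of 6 plus Laplace noise
```

Wider age intervals or a smaller epsilon give stronger protection but distort the data more, which is the trade-off the seminar investigates.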

What is the goal of the seminar?

In this seminar, we will introduce you to the field of data quality, and explore together the impact of anonymization techniques on data quality and AI model performance. To achieve that, we have the following plan:

Kickoff Phase: Each team ideally consists of two students and will be assigned a specific task (e.g., classification or regression). Your part is to choose one or more representative models (e.g., an SVM for classification) to solve this task on the respective datasets (see the Datasets section). The datasets need to contain protected features, such as age, that we will try to anonymize.
Research: Each team will explore the effect of anonymizing the data on data quality with respect to the well-known data quality dimensions. This includes (1) understanding and implementing the anonymization algorithms assigned to the team, (2) building an ML pipeline that uses anonymized data to train the ML models the team has chosen, and (3) reporting the performance of the chosen models as a function of the degree of anonymization and showing the trade-off. We will provide you with state-of-the-art papers in the fields of data quality, differential privacy, and k-anonymity. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
Deliverable: The outcome of the seminar is a paper-style technical report that the teams will write collaboratively to present the results of the conducted analysis, together with the code, models, and datasets that have been produced.
Bonus: You will learn how to read/write a research paper and how to conduct scientific experiments and present the results in a paper.
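The three research steps above (anonymize, train, report the trade-off) can be sketched as follows. This is a deliberately small illustration under stated assumptions, not the experimental setup of the seminar: the data is a synthetic toy set, the anonymization is a simple generalization of age into bins, and a one-feature decision stump stands in for a real model such as an SVM. All function names are hypothetical.

```python
def generalize(age, width):
    """Map an age to the midpoint of its width-sized bin (width=1 keeps it unchanged)."""
    low = (age // width) * width
    return low + (width - 1) / 2


def fit_stump(xs, ys):
    """Pick the threshold t that maximizes training accuracy of 'predict 1 if x >= t'."""
    def accuracy(t):
        return sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
    return max(sorted(set(xs)), key=accuracy)


def evaluate(width, train_ages, test_ages, label):
    """Anonymize train and test data with the given bin width, train, and return test accuracy."""
    train_x = [generalize(a, width) for a in train_ages]
    test_x = [generalize(a, width) for a in test_ages]
    train_y = [label(a) for a in train_ages]
    test_y = [label(a) for a in test_ages]
    t = fit_stump(train_x, train_y)
    return sum((x >= t) == bool(y) for x, y in zip(test_x, test_y)) / len(test_x)


def label(age):
    """Toy ground truth (assumption): the label depends only on the protected age feature."""
    return int(age >= 45)


# Synthetic toy data: every age from 18 to 80.
train_ages = list(range(18, 81)) * 3
test_ages = list(range(18, 81))

# Sweep the degree of anonymization (bin width) and report the trade-off.
results = {w: evaluate(w, train_ages, test_ages, label) for w in (1, 10, 30)}
for width, acc in results.items():
    print(f"bin width {width:>2}: test accuracy {acc:.3f}")
```

On this toy data, test accuracy falls from 1.000 at bin width 1 to 0.921 at width 10 and 0.762 at width 30, because wider bins merge ages on both sides of the decision boundary. A real study would replace the stump with the chosen models, the bins with the assigned anonymization algorithms, and accuracy with task-appropriate metrics.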

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks. The seminar also requires basic knowledge of machine learning algorithms.

Organization

The organizational details for this seminar are as follows:

Project seminar for master students
Language of instruction: English
6 credit points, 4 SWS
At most 6 participants (ideally, 3 teams of 2 students each)

Registration

After the introduction to the seminar on 20.04.2023 at 13:30 in FE.06, please send an e-mail to hazar.harmouch(at)hpi.de with the subject "Registration to Data Quality for AI" by Tuesday, 25.04. We will notify the selected applicants by Wednesday, the 26th of April.

In the case of more than six registrations, we might need to choose up to six participants randomly. If you would like to join as a team, you can also mention that in the email. The registered students will receive an e-mail with further details about the seminar. Please register with the Studienreferat after we acknowledge your seminar participation.

Time Table

When: Weekly on Thursday at 13:30

Where: Campus II, Building F, Room F-2.10 (Starting 4th of May).

The following timetable lists the main semester milestones and is still tentative:

Date	Topic	Slides
20.04.2023	Introduction (Open to all students)	slides
27.04.2023	Group allocation and technical setup introduction
04.05.2023	Basics of literature search and giving technical talks
11.05.2023	Technical talk to present a research paper
18.05.2023	Ascension Day
25.05.2023	Guest talk: Dr. Lisa Ehrlinger
22.06.2023	Mid-term presentation
27.07.2023	End-term presentation
01.09.2023	Final submission

Literature

To get an introduction to data quality and a better feeling for the extent to which data quality affects AI, you can start by reading the following literature, which you can find on dblp or Google Scholar:

Data quality dimensions

R.Y. Wang and D.M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–34, 1996.
F. Naumann and C. Rolker. Assessment methods for information quality criteria. In Proceedings of the International Conference on Information Quality (ICIQ), 148–162, 2000.
Sedir Mohammed, Lou Brandner, Sebastian Hallensleben, Hazar Harmouch, Andreas Hauschke, Jessica Heesen, Stefanie Hildebrandt, Simon David Hirsbrunner, Julia Keselj, Philipp Mahlow, Felix Naumann, Frauke Rostalski, Anna Wilken, & Annika Wölke. (2023). Ein Glossar zur Datenqualität (1.2). Zenodo. https://doi.org/10.5281/zenodo.7702426 (German)
Budach, Lukas, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Hazar Harmouch, and Felix Naumann. The Effects of Data Quality on Machine Learning Performance. arXiv preprint arXiv:2207.14529 (2022).

K-Anonymity privacy protection algorithm

Djordje Slijepcevic, Maximilian Henzl, Lukas Daniel Klausner, Tobias Dam, Peter Kieseberg, Matthias Zeppelzauer: k-Anonymity in practice: How generalisation and suppression affect machine learning classifiers. Comput. Secur. 111: 102488 (2021)
Sabrina De Capitani di Vimercati, Sara Foresti, Giovanni Livraga, Pierangela Samarati: k-Anonymity: From Theory to Applications. Trans. Data Priv. 16(1): 25-49 (2023).

Differential privacy algorithm

Maoguo Gong, Yu Xie, Ke Pan, Kaiyuan Feng, Alex Kai Qin: A Survey on Differentially Private Machine Learning. IEEE Comput. Intell. Mag. 15(2): 49-64 (2020)
Eugene Bagdasaryan, Omid Poursaeed, Vitaly Shmatikov: Differential Privacy Has Disparate Impact on Model Accuracy. NeurIPS 2019: 15453-15462

To be continued.

Datasets

Sources for datasets used for AI tasks include but are not limited to the following:

Kaggle: https://www.kaggle.com/datasets
OpenML: https://www.openml.org/
Google Dataset Search: https://datasetsearch.research.google.com/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php

Grading

The final grade is weighted with 6 credit points (LP) and is composed of the following:

(15%) Active participation in meetings and discussions
(15%) Technical presentation of a scientific paper
(20%) Mid-term and end-term presentations
(20%) Quality of implementation and results
(30%) Final paper-style submission