Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Quality for AI

Description

Many AI systems depend on large quantities of suitable training data. This dependency creates challenges concerning not only the availability of data but also its quality. For example, incomplete, erroneous, inappropriate, or asymmetric training data leads to unreliable models and can ultimately lead to poor decisions. This effect is often summarized as "garbage in, garbage out" (GIGO), which expresses the idea that in computing and other spheres, incorrect or poor-quality input will always produce faulty output. High-performance AI applications require high-quality training and test data.

Such data can include personal information, sensitive financial details, and confidential business data. Privacy, however, is a fundamental human right, and protecting personal information is essential to ensure trust and maintain a fair and just society. One common approach to address these concerns is to train machine learning algorithms on anonymized data. Differential privacy and k-anonymity are the most widely used families of anonymization techniques. So far, however, there is no substantial research that demonstrates the effect of anonymization on data quality and thus on the downstream ML application.
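To make the two families more concrete, the following is a minimal sketch in Python, not the seminar's reference implementation. It assumes hypothetical toy records with the quasi-identifiers `age` and `zip`; the function names `generalize_age`, `smallest_group`, `laplace_noise`, and `dp_count` are illustrative only. The first part generalizes ages into intervals and checks which k the table satisfies; the second part answers a counting query with the Laplace mechanism, assuming a sensitivity of 1 for counts.

```python
import random
from collections import Counter


def generalize_age(age, width):
    """Replace an exact age by the interval of the given width it falls into."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing all quasi-identifier values.

    A table is k-anonymous for every k up to this value.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Counting query with the Laplace mechanism (sensitivity of a count is 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data, invented for illustration only.
records = [
    {"age": 23, "zip": "14482", "income": 31000},
    {"age": 27, "zip": "14482", "income": 42000},
    {"age": 25, "zip": "14482", "income": 38000},
    {"age": 41, "zip": "14482", "income": 55000},
    {"age": 45, "zip": "14482", "income": 61000},
    {"age": 48, "zip": "14482", "income": 58000},
]

# k-anonymity by generalization: exact ages become 10-year intervals.
generalized = [{**r, "age": generalize_age(r["age"], 10)} for r in records]
print(smallest_group(records, ["age", "zip"]))      # every age is unique
print(smallest_group(generalized, ["age", "zip"]))  # two groups of three

# Differential privacy: a noisy answer to "how many incomes exceed 40000?"
rng = random.Random(0)
noisy = dp_count(generalized, lambda r: r["income"] > 40000, epsilon=1.0, rng=rng)
print(noisy)
```

Wider intervals or a smaller epsilon give stronger protection but distort the data more, which is the effect on data quality that the seminar sets out to measure.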

What is the goal of the seminar?

In this seminar, we will introduce you to the field of data quality, and explore together the impact of anonymization techniques on data quality and AI model performance. To achieve that, we have the following plan:

  • Kickoff Phase: Each team ideally consists of 2 students and will be assigned a specific task, such as classification or regression. Your part is to choose one or more representative models (e.g., an SVM for classification) to solve this task on the respective datasets (see the Datasets section). The datasets need to contain protected features, such as age, that we will try to anonymize.
  • Research: Each team will explore the effect of anonymizing the data on data quality with respect to the well-known data quality dimensions. This includes: (1) understanding and implementing the anonymization algorithms assigned to the team, (2) building an ML pipeline that uses the anonymized data to train the ML models the team has chosen, and (3) reporting the performance of the chosen models as a function of the degree of anonymization and showing the trade-off. We will provide you with state-of-the-art papers in the fields of data quality, differential privacy, and k-anonymity. More details about the dimensions and the experimental setup will be provided at the beginning of this phase.
  • Deliverable: The outcome of the seminar is a paper-style technical report that the teams will write collaboratively to present the results of the conducted analysis, in addition to the code, models, and datasets that have been produced.
  • Bonus: You will learn how to read and write a research paper, how to conduct scientific experiments, and how to present the results in a paper.
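The three research steps can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the data is synthetic and deterministic, the protected feature is age, the anonymization is a simple interval generalization, the "model" is a one-feature threshold classifier standing in for a real one such as an SVM, and the test data is generalized in the same way as the training data. All names (`generalize`, `fit_stump`, `accuracy`) are hypothetical.

```python
def generalize(age, width):
    """Map an age to the midpoint of its interval of the given width."""
    low = (age // width) * width
    return low + (width - 1) / 2


def accuracy(threshold, xs, ys):
    """Share of samples where the rule 'predict 1 if x >= threshold' is right."""
    hits = sum(int(x >= threshold) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def fit_stump(xs, ys):
    """Pick the threshold with the highest training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = accuracy(t, xs, ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Synthetic data (an assumption for illustration): label is 1 for age >= 45.
ages = list(range(18, 81))
labels = [int(a >= 45) for a in ages]
train_x, test_x = ages[0::2], ages[1::2]
train_y, test_y = labels[0::2], labels[1::2]

# Step 1: anonymize, step 2: train on anonymized data, step 3: report accuracy
# for each degree of anonymization (here: the interval width).
results = {}
for width in (1, 10, 30):
    gen_train = [generalize(a, width) for a in train_x]
    gen_test = [generalize(a, width) for a in test_x]
    threshold = fit_stump(gen_train, train_y)
    results[width] = accuracy(threshold, gen_test, test_y)

for width, acc in results.items():
    print(f"interval width {width:>2}: test accuracy {acc:.3f}")
```

In this toy setup the test accuracy drops as the intervals widen, which is the kind of trade-off curve the report should present. Real datasets and models need not behave this regularly.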

Prerequisites

For this seminar, participants need to be able to program fluently in Python and know how to use Jupyter notebooks. The seminar also requires basic knowledge of machine learning algorithms.

Organization

The organizational details for this seminar are as follows:

  • Project seminar for master's students
  • Language of instruction: English
  • 6 credit points, 4 SWS
  • At most 6 participants (ideally, 3 teams of 2 students each)

    Registration

    After the introduction to the seminar on 20.04.2023 at 13:30 in FE.06, please send an e-mail to hazar.harmouch(at)hpi.de with the subject "Registration to Data Quality for AI" by Tuesday, 25.04. We will notify the selected applicants by Wednesday, the 26th of April.

    If more than six students register, we might need to choose up to six participants randomly. If you would like to join as a team, you can mention that in the e-mail. The registered students will receive an e-mail with further details about the seminar. Please register with the Studienreferat after we acknowledge your seminar participation.

    Time Table

    When: Weekly on Thursdays at 13:30

    Where: Campus II, Building F, Room F-2.10 (starting on the 4th of May).

    The following timetable lists the main semester milestones and is still tentative:

    Date          Topic                                                     Slides
    20.04.2023    Introduction (open to all students)                       slides
    27.04.2023    Group allocation and technical setup introduction
    04.05.2023    Basics of literature search and giving technical talks
    11.05.2023    Technical talk to present a research paper
    18.05.2023    Ascension Day
    25.05.2023    Guest talk: Dr. Lisa Ehrlinger
    22.06.2023    Mid-term presentation
    27.07.2023    End-term presentation
    01.09.2023    Final submission

    Literature

    To get an introduction to data quality and a better feeling for the extent to which data quality affects AI, you can start by reading the following literature, which you can find on dblp or Google Scholar:

    Data quality dimensions

    • R.Y. Wang and D.M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4):5–34, 1996.
    • F. Naumann and C. Rolker. Assessment methods for information quality criteria. In Proceedings of the International Conference on Information Quality (ICIQ), 148–162, 2000.

    • Sedir Mohammed, Lou Brandner, Sebastian Hallensleben, Hazar Harmouch, Andreas Hauschke, Jessica Heesen, Stefanie Hildebrandt, Simon David Hirsbrunner, Julia Keselj, Philipp Mahlow, Felix Naumann, Frauke Rostalski, Anna Wilken, & Annika Wölke. (2023). Ein Glossar zur Datenqualität (1.2). Zenodo. https://doi.org/10.5281/zenodo.7702426  (German)

    • Budach, Lukas, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Hazar Harmouch, and Felix Naumann. The Effects of Data Quality on Machine Learning Performance. arXiv preprint arXiv:2207.14529 (2022).

    K-Anonymity privacy protection algorithm

    • Djordje Slijepcevic, Maximilian Henzl, Lukas Daniel Klausner, Tobias Dam, Peter Kieseberg, Matthias Zeppelzauer: k-Anonymity in practice: How generalisation and suppression affect machine learning classifiers. Comput. Secur. 111: 102488 (2021)
    • Sabrina De Capitani di Vimercati, Sara Foresti, Giovanni Livraga, Pierangela Samarati: k-Anonymity: From Theory to Applications. Trans. Data Priv. 16(1): 25-49 (2023).

    Differential privacy algorithm

    • Maoguo Gong, Yu Xie, Ke Pan, Kaiyuan Feng, Alex Kai Qin: A Survey on Differentially Private Machine Learning. IEEE Comput. Intell. Mag. 15(2): 49-64 (2020)
    • Eugene Bagdasaryan, Omid Poursaeed, Vitaly Shmatikov: Differential Privacy Has Disparate Impact on Model Accuracy. NeurIPS 2019: 15453-15462

    To be continued.

    Datasets

    Sources for datasets used for AI tasks include but are not limited to the following: 

    Grading

    The final grade is weighted by 6 credit points (LP) and considers the following:

    • (15%) Active participation in meetings and discussions
    • (15%) Technical presentation of a scientific paper
    • (20%) Mid-term and end-term presentations
    • (20%) Quality of implementation and results
    • (30%) Final paper-style submission