Thesis Topics for Students
The following list contains possible topics for BSc and MSc students enrolled at IT University of Copenhagen. Each topic usually contains a theoretical and a practical component. The balance can be adjusted to the student’s needs.
If you wish to write a thesis in algorithms: 1) Find someone to work with (we prefer thesis work as a group), 2) Write a short description of the kind of thesis you imagine writing, 3) Contact me to set up a meeting.
Notes on Supervision: I wrote up some expectations that I have of a thesis project here.
The list below contains thesis topics to which I have given some thought. If none of them interests you, but you have another algorithmic question in mind, feel free to contact me with your proposal. I am interested in supervising topics that survey existing algorithmic work, conduct a reproducibility study of algorithm engineering approaches, work on explaining the results of algorithms (e.g., of a recommender system), or explore fairness issues in machine learning and algorithms (e.g., see this book project and our paper for a concrete example).
Last Update: November 21, 2022
List Of Suggested Topics
- Hyperplane Hashing
- Participating in the Big-ANN-Challenge
- Meta-Study on Reproducibility of Algorithm Engineering Research
- Engineering LSH-based nearest neighbor search
- Privacy-preserving Similarity Search
- Outlier Detection with Fast Approximate Nearest Neighbor Search
- Influence of Local Intrinsic Dimensionality in Machine Learning and Indexing Problems
- Learned index data structures
Hyperplane Hashing
Point-to-hyperplane nearest neighbor search is the problem of identifying, among a set of high-dimensional points, the one closest to a given query hyperplane. Among many applications, solving this problem is at the core of active learning with Support Vector Machines (SVMs). In a recent paper, Huang et al. designed an efficient hashing scheme for it. In this project, we aim to understand the underlying data structures and try to improve their efficiency.
- Point-to-Hyperplane Nearest Neighbor Search Beyond the Unit Hypersphere, Qiang Huang, Yifan Lei, Anthony K. H. Tung, SIGMOD 21, link.
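For intuition, a brute-force baseline (not Huang et al.'s hashing scheme) simply scans all points and computes each one's distance |⟨w, x⟩ + b| / ‖w‖ to the hyperplane; any sublinear hashing approach has to beat this scan. A minimal numpy sketch:

```python
import numpy as np

def point_to_hyperplane_dist(X, w, b):
    """Distance of each row of X to the hyperplane {x : <w, x> + b = 0}."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

def nearest_to_hyperplane(X, w, b):
    """Index of the point in X closest to the hyperplane (linear scan)."""
    return int(np.argmin(point_to_hyperplane_dist(X, w, b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))   # 1000 points in 50 dimensions
w = rng.normal(size=50)           # hyperplane normal vector
idx = nearest_to_hyperplane(X, w, 0.0)
```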
Participating in the Big-ANN-Challenge
The goal of this project is to prepare an entry for the Billion-Scale Approximate Nearest Neighbor Search Challenge using the concept of Locality-Sensitive Hashing. If searching several hundred gigabytes of data for nearest neighbors sounds interesting to you, which is both a conceptual and an implementation challenge, this is the project for you :-) Good knowledge of C++ or Rust is a must.
Status: One Bachelor project.
Meta-Study on Reproducibility of Algorithm Engineering Research
Many conferences are adopting reproducibility tracks to increase and popularize the reproducibility of research; see for example this initiative. This project aims to analyze papers accepted at algorithm engineering conferences such as ESA and ALENEX with a focus on evaluation methodology and reproducibility. The result should be a meta-study on reproducibility in algorithms research. It can also cover re-implementations of selected algorithms published at these venues.
Engineering LSH-based nearest neighbor search
PUFFINN is an LSH-based nearest neighbor search algorithm with provable guarantees on the correctness of the result. Unfortunately, it only supports inner product and set similarity distance functions. The goal of this project is to add other LSH variants to PUFFINN, making it more generally useful. The project can be carried out at many different levels, but requires good knowledge of C++.
- Martin Aumüller, Tobias Christiani, Rasmus Pagh, Michael Vesterli, 2019, PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors
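As an illustration of the kind of LSH family one might add, here is a minimal sketch of random-hyperplane hashing (SimHash) for cosine similarity. The class and method names are illustrative and do not follow PUFFINN's actual C++ interface:

```python
import numpy as np

class SimHash:
    """Random-hyperplane LSH for cosine similarity (a toy sketch)."""

    def __init__(self, dim, num_bits, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane (through the origin) per output bit.
        self.planes = rng.normal(size=(num_bits, dim))

    def hash(self, x):
        # Each bit records on which side of a random hyperplane x falls.
        h = 0
        for side in (self.planes @ x) >= 0:
            h = (h << 1) | int(side)
        return h

sh = SimHash(dim=50, num_bits=16)
x = np.random.default_rng(1).normal(size=50)
bucket = sh.hash(x)
```

Two vectors agree on any single bit with probability 1 − θ/π, where θ is the angle between them, which is exactly what makes this family locality-sensitive for cosine similarity.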
Status: One Bachelor project.
Privacy-preserving Similarity Search
Avoiding leaks of user information is crucial for the reputation of social networks and other large-scale systems. Hence, all aspects of data processing and management should be carefully analyzed with regard to possible leaks. The thesis aims at studying privacy issues and privacy-preserving techniques for similarity search systems, a key primitive in several applications (e.g., recommender systems, machine learning, information retrieval) that often handle sensitive data like user preferences or medical records.
Work can be carried out in different directions:
- Attack scenarios on similarity search tools. Describe attack scenarios on similarity search systems and evaluate their impact. This would entail studying different paradigms for similarity search such as locality-sensitive hashing, tree-based techniques, and similarity graphs within their application areas (e.g., recommender systems or deep learning).
- Privacy-preserving similarity search. Investigate techniques for privacy-preserving similarity search and study their security guarantees. The goal is to build a similarity-search library that offers provable security guarantees.
- Empirical comparison of privacy-preserving similarity search. Compare two existing methods in terms of their privacy guarantees and running time. For example, (Pagh, Stausholm, 2020) can be used to solve the problem discussed in (Aumüller et al., 2020).
Two Master's theses (see (Aumüller et al., 2020) for an idea) have already worked towards building a privacy-preserving similarity search tool. Extending such a tool could be an option.
A Bachelor thesis/Research project could be a survey of existing technologies.
See the referenced literature for more information.
- (Aumüller et al., 2020) Martin Aumüller, Anders Bourgeat, Jana Schmurr. Locally Differential Private Sketches for Jaccard Similarity Estimation
- Dhaliwal, J., So, G., Parker-Wood, A., Beck, M.: Utility preserving secure private data release
- Homann, J., Edvardsen, M.: Privacy-preserving similarity search, Master thesis 2020, available upon request.
- Pagh, R., Stausholm, N.: Efficient Differentially Private F0 Linear Sketching
- Madsen A. S., Larsen I.: An Empirical Comparison of Differentially Private Similarity Estimation Techniques, Master thesis 2022, available upon request.
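The techniques in the references add privacy guarantees on top of similarity sketches such as MinHash. As a non-private point of reference, here is a minimal MinHash estimator for Jaccard similarity (illustrative only; the salted use of Python's built-in hash on integer elements stands in for a proper hash family):

```python
import random

def minhash_signature(s, num_hashes=256, seed=0):
    """MinHash signature of a set: one minimum per (salted) hash function."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature entries estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(100))        # true Jaccard(a, b) = 50 / 150 = 1/3
b = set(range(50, 150))
est = jaccard_estimate(minhash_signature(a), minhash_signature(b))
```

A privacy-preserving variant would perturb or encode these signatures (e.g., with randomized response) before they leave the user's device, trading estimation accuracy for a formal privacy guarantee.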
Outlier Detection with Fast Approximate Nearest Neighbor Search
Outlier detection is a classical problem, in which we want to identify points that deviate so much from other data points in an input set that we suspect them to be generated through a different mechanism.
This topic is concerned with applying fast approximate nearest neighbor search algorithms such as Facebook’s FAISS to solve (versions of) the outlier detection problem.
- (Kirner et al., 2017) E. Kirner, E. Schubert, A. Zimek: Good and Bad Neighborhood Approximations for Outlier Detection Ensembles. Similarity Search and Applications - 10th International Conference.
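A classical distance-based formulation scores each point by the distance to its k-th nearest neighbor. The sketch below computes this by brute force; the project would replace the quadratic pairwise distance computation with an approximate index such as FAISS. The function name and the choice of score are illustrative:

```python
import numpy as np

def knn_outlier_scores(X, k):
    """Score each point by the distance to its k-th nearest neighbor.

    Brute force, O(n^2) time and memory; an approximate nearest neighbor
    index would replace this pairwise computation for large datasets.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    return np.sort(d, axis=1)[:, k - 1]    # k-th NN distance per point

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))        # a Gaussian cluster
outlier = np.array([[10.0, 10.0]])         # one far-away point
X = np.vstack([inliers, outlier])
scores = knn_outlier_scores(X, k=5)
```

With approximate neighbors, the returned k-th NN distance is only an estimate, and the interesting question, following (Kirner et al., 2017), is how this approximation affects the outlier ranking.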
Status: One Master project.
Influence of Local Intrinsic Dimensionality in Machine Learning and Indexing Problems
In recent years, the notion of local intrinsic dimensionality (LID) has found use in solving many different conceptual problems. For example, it has been used in outlier classification (Houle et al., 2018), in characterizing adversarial subspaces in machine learning (Ma et al., 2018a/b), and for dataset diversification for machine learning tasks (Aumüller, Ceccarello, 2019).
Possible thesis directions include applying the LID measure to other areas, for example in work such as (Li et al., 2020), or a reproducibility study of the effects of local intrinsic dimensionality.
- (Houle et al., 2018) Michael E. Houle, Erich Schubert, Arthur Zimek, On the Correlation Between Local Intrinsic Dimensionality and Outlierness
- (Ma et al., 2018a) Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi N. R. Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, James Bailey, Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality
- (Ma et al., 2018b) Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, Shu-Tao Xia, Sudanthi N. R. Wijewickrema, James Bailey, Dimensionality-Driven Learning with Noisy Labels
- (Aumüller, Ceccarello, 2019) Martin Aumüller, Matteo Ceccarello, The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
- (Li et al., 2020) Conglong Li, Minjia Zhang, David Andersen, Yuxiong He, Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination
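One common way to estimate LID from data is the maximum-likelihood estimator based on nearest-neighbor distances, LID(x) ≈ −(1/k · Σᵢ ln(rᵢ/rₖ))⁻¹, where r₁ ≤ … ≤ rₖ are the distances from x to its k nearest neighbors. A small sketch with a synthetic sanity check (illustrative, not tied to any of the cited implementations):

```python
import numpy as np

def lid_mle(query, X, k):
    """Maximum-likelihood estimate of LID at `query` from its k NNs in X."""
    r = np.sort(np.linalg.norm(X - query, axis=1))[:k]  # k smallest distances
    r = r[r > 0]                                        # guard against duplicates
    return -1.0 / np.mean(np.log(r / r[-1]))

# Sanity check: points filling a 5-dimensional cube embedded in 50 dimensions
# should have LID close to 5 at an interior point, regardless of the
# 50-dimensional ambient space.
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(size=(5000, 5)), np.zeros((5000, 45))])
query = np.hstack([np.full(5, 0.5), np.zeros(45)])
est = lid_mle(query, X, k=100)
```

The estimate is local by construction: repeating it at different query points of the same dataset typically yields a whole distribution of LID values, which is what (Aumüller, Ceccarello, 2019) use to characterize benchmark datasets.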
Learned index data structures
In recent work, (Kraska et al., 2018) discuss replacing classical data structures, such as B-trees and hash tables, with indexes that learn a model of the data distribution. The paper shows that this can, in some cases, greatly reduce the index size and/or running time.
What the paper lacks is a theoretical foundation of the possibilities and limitations of such learned indexes. Here, (Mitzenmacher, 2018) made a proposal for the case of Bloom filters and related data structures.
Thesis work could go in two directions: 1) implement learned indexes and compare them to other indexing methods, such as Cuckoo hashing or locality-sensitive hash functions, or 2) extend the theoretical foundations provided in (Mitzenmacher, 2018) to a different context.
- (Kraska et al., 2018) Kraska, T., Beutel, A., Chi, E. H., Dean, J., Polyzotis, N.: The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pp. 489-504. ACM.
- (Mitzenmacher, 2018) Mitzenmacher, M.: A Model for Learned Bloom Filters and Related Structures, arXiv:1802.00884.
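To make direction 1) concrete, the sketch below shows the basic idea of a learned index in its simplest form: a linear model predicts a key's position in a sorted array, and the model's worst-case training error bounds the window that still has to be searched. This is a toy illustration, not the recursive model index of (Kraska et al., 2018):

```python
import numpy as np

class LearnedIndex:
    """Toy learned index: a linear model plus an error-bounded local search."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        pos = np.arange(len(self.keys))
        # Learn position ~ slope * key + intercept by least squares.
        self.slope, self.intercept = np.polyfit(self.keys, pos, 1)
        pred = self.keys * self.slope + self.intercept
        # Worst-case prediction error bounds the search window.
        self.max_err = int(np.ceil(np.abs(pred - pos).max()))

    def lookup(self, key):
        """Return the position of `key` in the sorted array, or -1 if absent."""
        n = len(self.keys)
        guess = int(round(key * self.slope + self.intercept))
        lo = max(0, guess - self.max_err)
        hi = min(n, guess + self.max_err + 1)
        # Binary search restricted to the model's error window.
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        return i if i < n and self.keys[i] == key else -1

keys = np.unique(np.random.default_rng(0).integers(0, 1_000_000, size=10_000))
index = LearnedIndex(keys)
pos = index.lookup(float(keys[1234]))
```

If the keys are close to uniformly distributed, `max_err` is small and each lookup touches only a few array positions; a comparison against Cuckoo hashing or a plain binary search would measure exactly this trade-off.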
Status: One Bachelor project.