### For students

## Thesis topics for Students

The following list contains possible topics for BSc and MSc students enrolled at IT University of Copenhagen. Each topic usually contains a theoretical and a practical component. The balance can be adjusted to the student’s needs.

If you wish to write a thesis in algorithms: 1) Find someone to work with (I prefer that you work in a group), 2) Write a short description of the kind of thesis you imagine writing, 3) Contact me to set up a meeting.

**Notes on Supervision**: I wrote up some expectations that I have from a thesis project here.

## Proposed Topics

The list below contains thesis topics that I gave some thought. If you find none of them interesting, but have another algorithmic question in mind, feel free to contact me with your proposal. I am interested in supervising topics that survey existing algorithmic work, conduct a reproducibility study of algorithm engineering approaches, work on explaining results of algorithms (e.g., for a recommender system) or fairness issues in machine
learning and algorithms (e.g., see this book project and our paper for a concrete example).
Many of the projects below consider **similarity search** or ** (approximate) nearest neighbor search**. Please check the first three sections of this overview paper to get a basic understanding of the research area. This website contains a much more in-depth description of the methods and is a suggested read as well.

*Last Update: April 17, 2024*

## List Of Suggested Topics

- Understand, implement, improve
- DBSCAN clustering using efficient ANN techniques
- Sketching for Nearest Neighbor search
- Participating in the Big-ANN-Challenge
- Meta-Study on Reproducibility of Algorithm Engineering Research
- Engineering LSH-based nearest neighbor search
- Privacy-preserving Similarity Search

### Understand, implement, improve

I have a couple of papers that propose interesting ideas. Usually, a project evolves around reading the paper, implementing it, and applying it in a different context or improving on it. I usually have some ideas for improvements, but the basis is to understand and implement a research paper.

**Paper selection:**

- BLISS: A Billion scale Index using Iterative Re-partitioning., Gaurav Gupta, Tharun Medini, Anshumali Shrivastava, Alexander J. Smola, KDD 2022.
- Falconn++: A Locality-sensitive Filtering Approach for Approximate Nearest Neighbor Search, Ninh Pham, Tao Liu, NeurIPS 2022.
- Point-to-Hyperplane Nearest Neighbor Search Beyond the Unit Hypersphere, Qiang Huang, Yifan Lei, Anthony K. H. Tung, SIGMOD 21
- Differentially Private Approximate Near Neighbor Counting in High Dimensions, Alexandr Andoni, Piotr Indyk, Sepideh Mahabadi, Shyam Narayanan, NeurIPS 2023.
- Scalable Density-based Clustering with Random Projections, Haochuan Xu, Ninh Pham, pre-print 2024.

### DBSCAN clustering using efficient ANN techniques

DBSCAN is a core clustering technique. This project will explore how efficient approximate nearest neighbor techniques, such as clustering or graph-based approaches can be used to speed up the core routines required to compute a dbscan clustering.

- Recent Approaches and Trends in Approximate Nearest Neighbor Search, with Remarks on Benchmarking, Aumüller, Ceccarello, An overview paper of ANN techniques.
- Theoretically-Efficient and Practical Parallel DBSCAN, Yiqiu Wang, Yan Gu, Julian Shun, A fast dbscan algorithm for low-dimensional data, SIGMOD 2020.
- DBSCAN revisted revisted: Why and How You Should (Still) Use DBSCAN., Erich Schubert, Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, TODS 2017.

### Sketching for Nearest Neighbor search

As you can read in the paper below, efficient distance comparisons are an important ingredient of the nearest neighbor search pipeline. This project will explore other methods for efficient distance estimation and compare it to the state-of-the-art.

- High-Dimensional Approximate Nearest Neighbor Search: with Reliable and Efficient Distance Comparison Operations, Jianyang Gao, Cheng Long, SIGMOD 2023.

### Participating in the Big-ANN-Challenge

The goal of this project is to prepare an entry for the Billion-Scale Approximate Nearest Neighbor Search Challenge using the concept of Locality-Sensitive Hashing. If you think it is interesting to search several hundred gigabytes for nearest neighbors, which is both a conceptional and an implementation challenge, this is the project for you :-) C++/Rust is a must.

### Meta-Study on Reproducibility of Algorithm Engineering Research

Many conferences are adopting a reproducibility track to increase and popularize reproducibility of research, see for example this initiative. This project aims to analyze accepted papers at algorithm engineering conferences such as ESA and ALENEX under the focus of evaluation methodology and reproducibility. The result should be a meta-study on reproducibility in algorithms research. It can also cover re-implementation of some selected algorithms that were published at these venues.

### Engineering LSH-based nearest neighbor search

PUFFINN is an LSH-based nearest neighbor search algorithm with provable guarantees on the correctness of the result. Unfortunately, it only supports inner product and set similarity distance functions. The goal of this project is to add other LSH variants to PUFFINN, making it more generally useful. This project can be carried out on all different levels, but requires good knowledge of C++.

Literature:

- Martin Aumüller, Tobias Christiani, Rasmus Pagh, Michael Vesterli, 2019, PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors

### Privacy-preserving Similarity Search

Avoiding leaks of user information is crucial for the reputation of social networks and other large-scale systems. Hence, all aspects of data processing and managing should be carefully analyzed with regard to possible leaks. The thesis aims at studying privacy issues and privacy-preserving techniques for similarity search systems, a key primitive in several applications (e.g., recommender systems, machine learning, information retrieval) that often handle sensitive data like user preferences or medical records.

Work can be carried out in different directions:

*Attack scenarios on similarity search tools*. Describe attack scenarios on similarity search systems and evaluate their impact. This would entail studying different paradigms for similarity search such as locality-sensitive hashing, tree-based techniques, and similarity graphs within their application areas (e.g., recommender systems or deep learning).*Privacy-preserving similarity search*. Investigate techniques for privacy-preserving similarity search and study their security guarantees. The goal is to build a similarity-search library that offers provable security guarantees.*Empirical comparison of privacy-preserving similarity search*. Compare two existing methods in terms of their privacy guarantees and running time. For example, (Pagh, Stausholm, 2020) can be used to solve the problem discussed in (Aumüller et al., 2020).

There were already two Master theses (see (Aumüller et al., 2020) for an idea) that worked towards building a privacy-preserving similarity search tool. Extending such a tool could be an option.

A Bachelor thesis/Research project could be a survey of existing technologies.

See the referenced literature for more information.

*Literature:*

- (Aumüller et al., 2020) Martin Aumüller, Anders Bourgeat, Jana Schmurr. Locally Differential Private Sketches for Jaccard Similarity Estimation
- Dhaliwal, J., So, G., Parker-Wood, A., Beck, M.: Utility preserving secure private data release
- Homann, J., Edvardsen, M.: Privacy-preserving similarity search, Master thesis 2020, available upon request.
- Pagh, R., Stausholm, N.: Efficient Differentially Private F0 Linear Sketching
- Madsen A. S., Larsen I.: An Empirical Comparison of Differentially Private Similarity Estimation Techniques, Master thesis 2022, available upon request.