Suggested thesis topics

by Rasmus Pagh

The following thesis topics are intended for BSc and MSc students enrolled at IT University of Copenhagen. Each one contains a theoretical and a practical component, but the balance can be adjusted to suit the interests of students. They all require a solid background in algorithms (engineering and/or theoretical aspects) or mathematics. As of 2017, most of the topics are related to the Scalable Similarity Search project; you can read introductions to the fascinating topic of similarity (aka. near neighbor) search in CACM articles by Chazelle, and Andoni and Indyk.

If you wish to pursue one of these topics: 1) Find someone to work with, since we usually do not supervise single-person projects, 2) Write a short description of the kind of thesis you imagine writing, and 3) Contact me to set up a meeting well in time before the start of the work (if I am not available, I will try to refer you to another supervisor).

If you would like to see an example of a thesis in algorithms, have a look at Johan Sivertsen's MSc thesis.

Proposals

Engineering kNN (added September 2017). Our group recently created the ANN benchmarks framework for comparing implementations of k-nearest neighbors (kNN) data structures. This project studies the best existing implementations, new techniques in the literature that have yet to be implemented, and tries to improve the state-of-the-art. Prerequisites: Experience with high-performance programming in C/C++ is an advantage but not required.
Automated search for reliable similarity search algorithms (added September 2017). It is known that in some cases the error probability associated with state-of-the-art similarity search algorithms can be eliminated. This relies on the existence of certain mathematical objects such as covering designs and sphere coverings. However, theoretical techniques such as those of Pagh and Ahle are not yet practically useful in all situations. This project deals with creating explicit, efficient coverings for spaces of medium dimension (say, 32 or 64) using a combination of brute-force search algorithms and composition techniques. This will involve making full use of our large server (28 cores, 512 GB RAM). The aim is to match the performance of the best algorithms that have nonzero error probability. Prerequisites: Good mathematical skills, fluency in dynamic programming techniques. Experience with parallel algorithms is an advantage, but not required.
Succinct estimation of Jensen-Shannon divergence (added September 2017). Jensen-Shannon divergence is a key measure of the similarity of two probability distributions. In turn, probability distributions are used in many machine learning settings, with applications for example in recommender systems. This project deals with ways of estimating Jensen-Shannon divergence that is much faster in practice than the exact brute-force computation. This will be partly based on techniques in the literature with the goal of matching the practicality of b-bit minwise hashing. Collaboration with an industry partner that needs this functionality is a possibility. Prerequisites: Good mathematical skills.