Thesis topics for Students

The following list contains possible topics for BSc and MSc students enrolled at IT University of Copenhagen. Each topic usually contains a theoretical and a practical component. The balance can be adjusted to the student’s needs.

If you wish to write a thesis in algorithms: 1) Find someone to work with (we prefer thesis work as a group), 2) Write a short description of the kind of thesis you imagine writing, 3) Contact me to set up a meeting.

Proposed Topics

The list below contains thesis topics that I gave some thought. If you find none of them interesting, but have another algorithmic question in mind, feel free to contact me with your proposal.

Last Update: September 4, 2018.

Avoiding leaks of user information is crucial for the reputation of social networks and other large-scale systems. Hence, all aspects of data processing and managing should be carefully analyzed with regard to possible leaks. The thesis aims at studying privacy issues and privacy-preserving techniques for similarity search systems, a key primitive in several applications (e.g., recommender systems, machine learning, information retrieval) that often handle sensitive data like user preferences or medical records.

Work can be carried out in different directions:

  1. Attack scenarios on similarity search tools. Describe attack scenarios on similarity search systems and evaluate their impact. This would entail studying different paradigms for similarity search such as locality-sensitive hashing, tree-based techniques, and similarity graphs within their application areas (e.g., recommender systems or deep learning).
  2. Privacy-preserving similarity search. Investigate techniques for privacy-preserving similarity search and study their security guarantees. The goal is to build a similarity-search library that offers provable security guarantees.

See the referenced literature for more information.

Literature:

Outlier-detection using LSH Variants

Outlier detection is a classical problem, in which we want to identify points that deviate so much from other data points in an input set that we suspect them to be generated through a different mechanism.

This topic is concerned with applying Locality-Sensitive Hashing to solve (versions of) the outlier detection problem. It was shown in (Kirner et al.) that for some applications, a standard LSH approach does not perform for the outlier detection problem. On starting point of the thesis is to check, how methods described in (Aumüller et al.) (“anti-LSH” and “annulus LSH”) can be applied in the context of outlier detection, and compare these approaches with existing methods.

Literature:

Learned index data structures

In very recent work, (Kraska et al., 2018) discuss replacing classical data structures, such as B-trees and hashing algorithms, with indexes that learn a model of the data from the data. The mentioned paper shows that this can, sometimes, greatly reduce the index size and/or running times.

What the paper lacks is a theoretical foundation of the possibilities and limitations of such learned indexes. Here, (Mitzenmacher, 2018) made a proposal for the case of Bloom filters and related data structures.

Thesis work could go into two directions: 1) Implement and compare learned indexes to other indexing methods, such as Cuckoo hashing or locality-sensitive hash functions. 2) Further develop the theoretical foundations provided in (Mitzenmacher, 2018) to a different context.

Literature:

  • (Kraska et al., 2018) Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018, May). The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (pp. 489-504). ACM.

  • (Mitzenmacher, 2018) M. Mitzenmacher, A Model for Learned Bloom Filters and Related Structures, eprint arXiv:1802.00884.

Nearest-Neighbor search is a key primitive in many different applications, such as recommender systems, natural language processing or machine learning. Many different techniques for building fast nearest-neighbor search algorithms exist, and (Aumüller et al., 2018) present a framework and evaluation for comparing them, available at http://ann-benchmarks.com.

As can be seen from the plots at this webpage, the relative order in which algorithms perform on given datasets stays roughly the same. A thesis in this topic should investigate why this is the case: Which properties have the datasets in common? Are there (non-artificial) datasets where algorithms would behave differently?

The thesis would contain a detailed review of existing approaches, an analysis of datasets with respect to different measures of difficulty that are found in the literature, and the design and evaluation of datasets specifically tuned to show differences between the respective nearest-neighbor search approaches.

Literature: