2023–2025 · Funded by DIREC: Digital Research Centre Denmark

Scope

Nearest neighbor search is a core building block in applications such as clustering and classification, but it is often inefficient, especially on large, high-dimensional datasets. Exact methods become impractical due to the curse of dimensionality, which makes approximate nearest neighbor (ANN) search a faster alternative despite its inexact results. This speedup comes with trade-offs in accuracy, bias, and trustworthiness that affect algorithmic decision-making and must be weighed carefully for each use case.

The project "Benefit and Bias of Approximate Nearest Neighbor Search for Machine Learning and Data Mining" significantly advanced clustering and outlier detection on massive, high-dimensional datasets through the use of approximate nearest neighbor techniques.

Publications

The core publications of the team are:

  1. High-dimensional density-based clustering using locality-sensitive hashing, Camilla Birch Okkels, Martin Aumüller, Viktor Bello Thomsen, Arthur Zimek, EDBT 2025. implementation code, benchmarking code
  2. On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor Graphs, Camilla Birch Okkels, Martin Aumüller, Arthur Zimek, SISAP 2024. code
  3. Approximate Single-Linkage Clustering Using Graph-Based Indexes: MST-Based Approaches and Incremental Searchers, Camilla Birch Okkels, Erik Thordsen, Martin Aumüller, Arthur Zimek, Erich Schubert, SISAP 2025, Invited to a Special Issue in Information Systems. code
  4. Approximate Hierarchical Density-based Clustering Using Graph-based Search Indexes, Camilla Birch Okkels, Erik Thordsen, Martin Aumüller, Arthur Zimek, Erich Schubert, Journal version of SISAP 2025 paper, under submission.
  5. Space-efficient and Adaptive K-Nearest Neighbour Search using Locality-Sensitive Filtering, Martin Aumüller, Alexander Theodor Bilde Pedersen, under submission.

The following publications were supported by the project as well:

  1. Overview of the SISAP 2024 Indexing Challenge, Eric Sadit Tellez, Martin Aumüller, Vladimir Mic, SISAP 2024.
  2. An Empirical Evaluation of Search Strategies for Locality-Sensitive Hashing: Lookup, Voting, and Natural Classifier Search, Malte Helin Johnsen, Martin Aumüller, SISAP 2024.
  3. Results of the Big ANN: NeurIPS'23 competition, Harsha Vardhan Simhadri, Martin Aumüller, et al., NeurIPS 2025. code
  4. Recent Approaches and Trends in Approximate Nearest Neighbor Search, with Remarks on Benchmarking, Martin Aumüller, Matteo Ceccarello, IEEE Data Eng. Bull. 47(3), 2023.

Software

The project significantly improved the state of the art for efficiently identifying outliers and finding density-based clusterings for massive, high-dimensional datasets.

ANN-outlier-detection

State-of-the-art library for efficient detection of local and global outliers using approximate nearest neighbor graphs.

srrdbscan

DBSCAN clustering with strong theoretical guarantees, powered by locality-sensitive hashing.

HSSL

Efficient single-linkage and HDBSCAN clustering for high-dimensional data.

Team

Martin Aumüller, PI
Arthur Zimek, co-PI
Camilla Birch Okkels, PhD student, 2023–2025
Viktor Bello Thomsen, Student programmer, 2023–2024
Alexander Theodor Bilde Pedersen, Research assistant, 2025

Supported Student Papers

The project provided travel support for Bachelor's and Master's thesis projects related to the project theme that resulted in peer-reviewed research publications.

  1. Christoffer J. W. Romild, Thomas H. Schauser, Joachim Alexander Borup: Enhancing Approximate Nearest Neighbor Search: Binary-Indexed LSH-Tries, Trie Rebuilding, and Batch Extraction, SISAP 2023.
  2. Malte Helin Johnsen, Martin Aumüller: An Empirical Evaluation of Search Strategies for Locality-Sensitive Hashing: Lookup, Voting, and Natural Classifier Search, SISAP 2024.

Camilla's PhD defense

The project supported Camilla Birch Okkels's PhD project Scalable Approximate Nearest Neighbour Algorithms for Machine Learning and Data Mining. She successfully defended her PhD on February 23, 2026. Her committee consisted of Professor Ira Assent (Aarhus University, DK), Professor Richard Connor (St. Andrews, UK), and Associate Professor Riko Jacob (IT University of Copenhagen, DK).

Camilla Birch Okkels defending her PhD

Camilla Birch Okkels with her opponents