Marcel Gregoriadis

A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

2024. Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse.

Under Review at IEEE Transactions on Parallel and Distributed Systems

Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state of the art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorit…

De-DSI: Decentralised Differentiable Search Index

2024. Petru Neague, Marcel Gregoriadis, Johan Pouwelse.

EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems

This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, particularly employing the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into…

Analysis and Comparison of Deduplication Strategies in IPFS

2022. Marcel Gregoriadis.

Master's Thesis at Humboldt University of Berlin

IPFS has recently risen in popularity, as it represents the backbone for file sharing in a decentralized web. As the number of files exchanged on IPFS grows, and both storage and network bandwidth are expensive, the discussion around deduplication strategies becomes pressing. This discussion is largely founded on the execution of chunking algorithms. To this end, we analyzed and compared FastCDC and AE, as two state-of-the-art chunking algorithms, with Rabin, Buzhash, and fixed-size chunking, w…

Analysis of Arbitrary Content on Blockchain-Based Systems using BigQuery

2022. Marcel Gregoriadis, Robert Muth, Martin Florian.

WWW '22: Companion Proceedings of the Web Conference 2022

Blockchain-based systems have gained immense popularity as enablers of independent asset transfers and smart contract functionality. They have also, since as early as the first Bitcoin blocks, been used for storing arbitrary content such as texts and images. On-chain data storage functionality is useful for a variety of legitimate use cases. It does, however, also pose a systematic risk. If abused, for example by posting illegal content on a public blockchain, data storage functionality can le…