Measuring the similarity of files using compression algorithms
Data processing often involves organizing data into groups of similar content (clustering), assigning new data to an existing group (classification), or detecting data that cannot be assigned to any of the existing groups (anomaly detection). In all these cases it is necessary to find measurement criteria for data similarity. Such similarity measures are of great importance for applications in data security (e.g., detection of malware) and for machine learning algorithms.
Project Content and Goals
The overarching aim of this project is to use data compression methods (colloquially, "packing" or "zipping") to measure differences between files. More precisely, apart from producing compressed data, compression algorithms also create compression rules, which are consulted when files are decompressed. Applying the rules derived from one file to other files provides an anchor point for measuring the similarity between those files.
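This anchor-point idea is related to the well-known normalized compression distance (NCD), which scores two files by how much better they compress together than separately. A minimal sketch using Python's standard zlib module (the sample data is invented for illustration; the project's own algorithms would replace the off-the-shelf compressor):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: smaller values mean more similar.
    Uses the compressed sizes of x, y, and their concatenation."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4  # structurally unrelated data

# Nearly identical texts share compression structure, so their NCD
# is much smaller than the NCD between unrelated inputs.
print(ncd(a, b), ncd(a, c))
```

Note that NCD only uses compressed sizes; the project goes further by inspecting the compression rules themselves.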
The central goals and tasks of the project are:
- Identifying compression algorithms that allow compressed data and compression rules to be accessed independently (or to be separated easily).
- Deriving software metrics from compression rules and applying them to different files in order to make similarities between them measurable, and determining which of these metrics work best in combination with established classification and clustering algorithms.
- Developing a working prototype that builds on the insights gained during the project.
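As an illustration of what applying compression rules across files can mean, the following hypothetical sketch builds a Huffman code-length table (one possible form of a compression rule) from one file and measures how many bits per byte that table would need on another file; all data and the escape-cost convention are invented for this example:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Code length in bits per byte value under a Huffman code for `data`.
    The length table is the 'compression rule' reused across files."""
    freq = Counter(data)
    if len(freq) == 1:  # degenerate one-symbol input
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def avg_bits(data: bytes, lengths: dict[int, int], escape: int = 16) -> float:
    """Average bits per byte when coding `data` with a given length table.
    Bytes missing from the table are charged a fixed escape cost."""
    return sum(lengths.get(v, escape) for v in data) / len(data)

rule = huffman_code_lengths(b"aaaabbbccd")
# A file's own rule fits it tightly; a file with a different byte
# distribution needs more bits per byte under the same rule.
print(avg_bits(b"aaaabbbccd", rule), avg_bits(b"ddddcccbba", rule))
```

The gap between the two averages is one candidate similarity signal of the kind the project's metrics would formalize.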
The project is divided into several phases:
- The state of the art with regard to the research question is surveyed.
- Compression algorithms that clearly separate compressed data from compression rules (e.g., Huffman, Tunstall, or arithmetic coding) are identified. This forms the foundation for deriving novel algorithms optimized for measuring similarity. Software metrics are then used to generate similarity measures from the raw data produced during compression, and various metrics (e.g., Jensen-Shannon divergence, Hellinger distance) are tested for their suitability.
- To achieve the best possible outcomes, the newly developed algorithms and metrics are tested and evaluated. Existing algorithms are modified and tested, and it is examined whether combinations of algorithms yield good results.
- It is investigated how well different classification and clustering algorithms work in tandem with the newly developed metrics and methods.
- To make the knowledge gained available to a wider audience, it is integrated into a computer program that can perform classification tasks.
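The two candidate metrics named in the phases above can, for example, be evaluated on simple byte-frequency distributions of files; a minimal self-contained sketch (byte histograms are just one possible feature choice, used here for illustration):

```python
import math
from collections import Counter

def byte_distribution(data: bytes) -> list[float]:
    """Relative frequency of each of the 256 byte values in `data`."""
    counts = Counter(data)
    return [counts.get(v, 0) / len(data) for v in range(256)]

def jensen_shannon(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence with base-2 logs, so values lie in [0, 1]."""
    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p: list[float], q: list[float]) -> float:
    """Hellinger distance, likewise bounded by [0, 1]."""
    s = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return math.sqrt(s / 2)

p = byte_distribution(b"abababab")
q = byte_distribution(b"zzzz")
# Identical distributions score 0; distributions with disjoint
# support reach the maximum of 1 under both measures.
print(jensen_shannon(p, p), jensen_shannon(p, q), hellinger(p, q))
```

Both measures are bounded and symmetric, which makes them convenient drop-in similarity scores for standard clustering and classification algorithms.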
The correct identification, assignment, and categorization of data and files are fundamental tasks in computer science. Detecting similarities is a prerequisite for these tasks, but suitable measurement criteria are often hard to find. By borrowing and adapting methods from data compression, this project pursues a new approach to measuring similarity. The knowledge gained is made publicly available through scientific publications and is integrated into a software prototype that can be used to classify data.