SpeedGene: A compression algorithm for fast and efficient storage of next-generation genetic sequencing studies

To address the large file sizes and long loading times of pedigree files in genome-wide association studies (GWAS) and next-generation sequencing studies, researchers at the Harvard School of Public Health have developed a new compression algorithm, SpeedGene, that includes an optimized "storage-and-load" algorithm for genetic data. The new algorithm outperforms currently available compression formats for pedigree files by several orders of magnitude. Depending on the minor allele frequency of each marker, the SpeedGene algorithm selects among three different compression methods to minimize disk and memory storage space. The compression factor depends on the genotype frequency distribution of the markers and the number of subjects in the dataset, and can range from 16 to several hundred, which potentially allows genomic data from thousands of people to be stored in only hundreds of megabytes. SpeedGene is currently implemented as a C++ library and could be readily incorporated into common software programs for association studies, where the genetic information could be loaded using the library and passed directly to the analysis.
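As an illustration of how a per-marker choice between storage layouts might work, the following minimal C++ sketch compares two hypothetical layouts: a dense one that spends 2 bits per genotype, and a sparse one that stores only the indices of subjects carrying a non-major genotype. The genotype codes, the function names (`pack_2bit`, `get_2bit`, `choose_layout`) and the size estimates are all assumptions made for this example; they are not the three SpeedGene methods or the library's API, which this summary does not describe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Genotype codes used in this sketch (an assumption, not the SpeedGene encoding):
// 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor, 3 = missing.
using Genotype = std::uint8_t;

// Hypothetical dense layout: 2 bits per subject, four subjects per byte.
std::vector<std::uint8_t> pack_2bit(const std::vector<Genotype>& g) {
    std::vector<std::uint8_t> out((g.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < g.size(); ++i) {
        assert(g[i] < 4);  // only four genotype codes fit in 2 bits
        out[i / 4] |= static_cast<std::uint8_t>(g[i] << (2 * (i % 4)));
    }
    return out;
}

// Read back the genotype of subject i from the dense layout.
Genotype get_2bit(const std::vector<std::uint8_t>& packed, std::size_t i) {
    return static_cast<Genotype>((packed[i / 4] >> (2 * (i % 4))) & 0x3);
}

// Bytes needed by the dense layout for n subjects.
std::size_t dense_bytes(std::size_t n) { return (n + 3) / 4; }

// Bytes needed by a hypothetical sparse layout that stores one 4-byte
// subject index for every non-major genotype (headers are ignored here).
std::size_t sparse_bytes(const std::vector<Genotype>& g) {
    std::size_t non_major = 0;
    for (Genotype x : g) {
        if (x != 0) ++non_major;
    }
    return non_major * sizeof(std::uint32_t);
}

enum class Layout { Dense2Bit, SparseIndex };

// Pick whichever layout is estimated to be smaller for this marker.
Layout choose_layout(const std::vector<Genotype>& g) {
    return sparse_bytes(g) < dense_bytes(g.size()) ? Layout::SparseIndex
                                                   : Layout::Dense2Bit;
}
```

In this sketch, a marker with a very low minor allele frequency has few non-major genotypes, so the index list is smaller than the dense array; for common variants the dense array wins. As a rough plausibility check (an inference, not a statement from the source): if a text pedigree file spends about four characters, i.e. 32 bits, per genotype, a 2-bit code would correspond to a compression factor of 16, which matches the lower bound quoted above.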

The SpeedGene format does not require any CPU time for decompression, and the storage structure of the genotypes supports fast computation for permutation methods. The library provides functions for loading the compressed files and retrieving any part of the original data. Furthermore, the implementation supports parallel processing of the dataset, i.e., loading subsets of markers. This greatly decreases the loading time when parallel jobs are dispatched on clusters.
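One way to picture loading subsets of markers in parallel is to give each cluster job a contiguous block of marker indices. The helper below is a hypothetical sketch of such a partition; the name `marker_range` and the even-split policy are assumptions for illustration and are not taken from the SpeedGene library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Return the half-open range [first, last) of marker indices assigned to
// job number `job` when `n_markers` markers are split across `n_jobs` jobs.
// The first (n_markers % n_jobs) jobs each receive one extra marker.
std::pair<std::size_t, std::size_t> marker_range(std::size_t n_markers,
                                                 std::size_t n_jobs,
                                                 std::size_t job) {
    assert(n_jobs > 0 && job < n_jobs);
    const std::size_t base = n_markers / n_jobs;
    const std::size_t extra = n_markers % n_jobs;
    const std::size_t first = job * base + std::min(job, extra);
    const std::size_t last = first + base + (job < extra ? 1 : 0);
    return {first, last};
}
```

Each job would then load only the markers in its own range, so no job has to read the whole dataset. With 10 markers and 3 jobs, for instance, the ranges are [0, 4), [4, 7) and [7, 10).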

In conclusion, the SpeedGene library enables the storage and analysis of next-generation sequencing data in existing hardware environments, making system upgrades unnecessary.

Applications

The pedigree file format is one of the most commonly used input formats for genetic data analysis software, e.g. FBAT, PBAT and PLINK. For high-throughput sequencing data and data from genome-wide association studies, the sizes of these pedigree files can reach several terabytes. Unnecessarily large files waste disk space and computation time, e.g. loading time during analysis.

Intellectual Property Status: Patent(s) Pending