MPEG-G
MPEG-G (ISO/IEC 23092) is an ISO/IEC standard for genomic information representation, developed jointly by ISO/IEC JTC 1/SC 29/WG 9 (MPEG) and ISO TC 276 "Biotechnology" Work Group 5. The goal of the standard is to provide interoperable solutions for data storage, access, and protection across different implementations, for data generated by high-throughput sequencing machines and by their subsequent processing and analysis[1][2]. The standard is composed of several parts, each addressing a specific aspect, such as compression, metadata association, Application Programming Interfaces (APIs), and reference software for data decoding. Alongside the reference decoder software, commercial and open-source[3] implementations began to appear in 2019, progressively covering more of the published parts of the standard.
Background
The advent of high-throughput sequencing (HTS) technologies has revolutionized the field of quantitative biology. Large collections of genomic information are now part of everyday practice and have become a cornerstone of a number of disciplines, ranging from biological research to personalized medicine in the clinic. At present, genomic information is mostly exchanged through a variety of data formats, such as FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. Biological studies typically also produce genomic annotation data such as mapping statistics, quantitative browser tracks, variants, genome functional annotations, gene expression data, and Hi-C contact matrices. These diverse types of downstream genomic data are currently represented in different formats such as VCF, BED, and GFF, sometimes with loosely defined semantics. The ISO/IEC 23092 (MPEG-G) standard aims to provide a unified format for the efficient representation and compression of such diverse data, both for file storage and for data transport. To that end, the standard is divided into several parts.
Structure of the standard
The MPEG-G standard uses technology and data representation architectures previously validated in the field of digital media. These make it possible to compress and transport genome sequencing data even in complex scenarios, for instance when access is needed to large amounts of possibly distributed data, or when part of the data needs to be encrypted for privacy reasons. Conceptually, such requirements lead to the definition of a number of interrelated mechanisms, which are summarized in the following list:
- Data format and compression[4]
- Data streaming[4]
- Compressed file concatenation[4]
- Incremental update of sequencing data and metadata[4]
- Selective access to compressed data, e.g. fast queries by genomic range[5]
- Metadata association[6]
- Enforcement of privacy rules[6]
- Selective encryption of data and metadata[6]
- Annotation and linkage of genomic segments[7]
Some of these topics have in turn been grouped together, in order to make the standard easier to understand and implement. As a result, the ISO/IEC 23092 standard is structured as a series of separate documents, as follows:
Part 1 - Transport and Storage of Genomic Information[4]
This part of the standard deals with data formats for both transport and storage of genomic information, together with a reference conversion process and informative annexes. The main topics covered by this part are genomic data streaming and the file format.
Part 2 - Coding of Genomic Information (Compression)[5]
This part provides specifications for the normative representation of genomic sequence read identifiers, genomic sequence reads (both unaligned and aligned), reference sequences, and quality values. This is the part where compression is specified in terms of normative bitstream syntax and decoding behaviour. The compression engine relies on context-adaptive binary arithmetic coding (CABAC)[8].
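To give an intuition for why a context-adaptive binary coder compresses well, the following is a minimal sketch, not the normative MPEG-G or CABAC procedure. It estimates the ideal code length of a binary sequence under a simple count-based adaptive model; the function name `adaptive_code_length` and its `context_order` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def adaptive_code_length(bits, context_order=1):
    """Ideal code length, in bits, of a binary sequence under a simple
    context-adaptive model (illustrative only, not the CABAC engine)."""
    # counts[context] = [zeros seen, ones seen]; starting at [1, 1]
    # (Laplace smoothing) keeps every probability strictly positive.
    counts = defaultdict(lambda: [1, 1])
    total = 0.0
    for i, bit in enumerate(bits):
        # The context is the previous `context_order` symbols.
        context = tuple(bits[max(0, i - context_order):i])
        zeros, ones = counts[context]
        p = (ones if bit else zeros) / (zeros + ones)
        total += -math.log2(p)      # cost an ideal arithmetic coder would pay
        counts[context][bit] += 1   # adapt the model after coding the symbol
    return total


# Two toy inputs of 1000 binary symbols each.
skewed = [0] * 950 + [1] * 50
alternating = [0, 1] * 500

cost_skewed = adaptive_code_length(skewed)
cost_alt_no_context = adaptive_code_length(alternating, context_order=0)
cost_alt_with_context = adaptive_code_length(alternating, context_order=1)
```

In this toy setting the alternating sequence costs roughly one bit per symbol without context, but only a few tens of bits in total once the previous symbol is used as context, which is the effect that context modelling is meant to exploit.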
Part 3 - APIs (Interfaces, Metadata and Protection)[6]
This part of the standard specifies information metadata, protection metadata, and programming interfaces to access genomic information represented in the MPEG-G format. The main goals are to provide the possibility of adding metadata to compressed genomic information, and to enable external applications to gain (controlled) access to MPEG-G data. A non-normative section also details recommended good practices to implement interoperability with the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats.
Part 4 - Reference Software[9]
This part of the standard serves as a support and guide for implementers of MPEG-G and is distributed as source code. It is normative in the sense that any conforming implementation of the decoder, given the same conformant compressed bitstreams and using the same normative output data structures, will output the same data as the reference software.
Part 5 - Conformance[10]
This part of the standard specifies a normative procedure to assess conformity of bitstreams and decoders to the standard and it is based on an exhaustive dataset of compressed data and corresponding test procedures. Conformance testing is fundamental to validate the correct implementation of the MPEG-G technology in different devices and applications and to enable interoperability among systems.
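The idea of conformance testing can be sketched as follows. This is a simplified illustration under stated assumptions, not the normative procedure of ISO/IEC 23092-5: the helper names `digest` and `check_conformance`, the layout of the test vectors, and the toy comma-separated "bitstream" are all hypothetical stand-ins.

```python
import hashlib


def digest(records):
    """SHA-256 over decoded records, one record per line."""
    h = hashlib.sha256()
    for record in records:
        h.update(record.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def check_conformance(decoder, test_vectors):
    """Return the names of test vectors whose decoded output differs
    from the expected (reference) output."""
    failures = []
    for name, (bitstream, expected_records) in test_vectors.items():
        if digest(decoder(bitstream)) != digest(expected_records):
            failures.append(name)
    return failures


# Toy stand-ins: a "bitstream" is comma-separated bytes, and a
# "decoder" is any callable that maps it to a list of records.
test_vectors = {
    "vector_1": (b"ACGT,TTGA", ["ACGT", "TTGA"]),
    "vector_2": (b"GGCC", ["GGCC"]),
}


def good_decoder(bitstream):
    return bitstream.decode("ascii").split(",")


def buggy_decoder(bitstream):
    # Drops the last record, so multi-record vectors fail.
    return bitstream.decode("ascii").split(",")[:1]


good_failures = check_conformance(good_decoder, test_vectors)
buggy_failures = check_conformance(buggy_decoder, test_vectors)
```

A decoder under test passes only if every reference bitstream decodes to exactly the expected output, which is what makes independently developed implementations interchangeable.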
Part 6 - Genomic Annotations[7]
This part of the standard series specifies a compressed representation of genomic annotations, together with methods to query and encrypt it. Under the concept of genomic annotations, i.e., information associated with a specific region of the reference genome, several types of data are unified — for instance genomic functional annotations usually represented in GFF format, genomic variants usually represented in VCF format, and more. It is possible to link such information to the compressed representation of raw sequencing data and metadata specified in the previous parts of the standard.
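As an illustration of what querying annotations by genomic region involves, the sketch below builds a small in-memory index and returns the annotations that overlap a query interval. It is an assumption-laden toy: the class `AnnotationIndex`, its methods, and the sample records are hypothetical, and the indexing structures actually specified by the standard are not reproduced here.

```python
from bisect import bisect_left
from collections import defaultdict


class AnnotationIndex:
    """Toy index over annotations given as (chrom, start, end, label),
    with half-open coordinates [start, end)."""

    def __init__(self, annotations):
        self._by_chrom = defaultdict(list)
        for chrom, start, end, label in annotations:
            self._by_chrom[chrom].append((start, end, label))
        for items in self._by_chrom.values():
            items.sort()  # sort by start position
        self._starts = {
            chrom: [start for start, _, _ in items]
            for chrom, items in self._by_chrom.items()
        }

    def query(self, chrom, start, end):
        """Labels of annotations overlapping [start, end) on chrom."""
        items = self._by_chrom.get(chrom, [])
        starts = self._starts.get(chrom, [])
        # Records starting at or after the query end cannot overlap.
        hi = bisect_left(starts, end)
        # Of the remaining records, keep those ending after the query start.
        return [label for _, e, label in items[:hi] if e > start]


index = AnnotationIndex([
    ("chr1", 100, 200, "geneA"),
    ("chr1", 150, 300, "variant1"),
    ("chr1", 500, 600, "geneB"),
    ("chr2", 100, 200, "geneC"),
])

overlapping = index.query("chr1", 180, 510)
nothing = index.query("chr1", 300, 500)
```

The binary search prunes records that start after the query, while the remaining filter is linear; a production index would typically use an interval tree or a binning scheme to avoid that linear scan.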
| Part | Number | First public release date (first edition) | Latest public release date (edition) | Latest amendment | Title | Description |
|---|---|---|---|---|---|---|
| Part 1 | ISO/IEC 23092-1 | 2019 | 2019 | | Transport and Storage of Genomic Information | Specification of file format, streaming and indexing |
| Part 2 | ISO/IEC 23092-2 | 2019 | 2019 | | Coding of Genomic Information | Compression of unmapped (raw) and aligned genome sequencing data |
| Part 3 | ISO/IEC 23092-3 | 2020 | 2020 | | APIs | Specification of standard interfaces, syntax for metadata and description of content protection mechanisms |
| Part 4 | ISO/IEC 23092-4 | (2020) | | | Reference Software | Describes the open source implementation of a normative decoder and an informative encoder. It also provides compressed bitstreams that can be used for reference purposes. Other open source implementations developed by independent groups also exist[11] |
| Part 5 | ISO/IEC 23092-5 | (2020) | | | Conformance testing | Details the testing procedure and the associated compressed reference bitstreams to be used to assess the conformance of a decoder implementation with the MPEG-G standard |
| Part 6 | ISO/IEC 23092-6 | (2021) | | | Coding of genomic annotations | Representation of genomic annotations, that is, a number of heterogeneous data types associated with intervals of the reference genome that the sequencing data has been aligned to |
Filename extensions
To be defined.
See also
References
- [1] Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ribeca, Paolo; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (October 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation". bioRxiv: 426353. doi:10.1101/426353.
- [2] Hernaez, Mikel; Pavlichin, Dmitri; Weissman, Tsachy; Ochoa, Idoia (2019-07-20). "Genomic Data Compression". Annual Review of Biomedical Data Science. 2 (1): 19–37. doi:10.1146/annurev-biodatasci-072018-021229. ISSN 2574-3414.
- [3] "Genie, Open Source MPEG-G Codec". 22 June 2021.
- [4] "ISO/IEC 23092-1 Transport and Storage of Genomic Information".
- [5] "ISO/IEC 23092-2 Coding of Genomic Information".
- [6] "ISO/IEC 23092-3 Metadata and APIs".
- [7] "ISO/IEC 23092-6 Coding of Genomic Annotations".
- [8] Voges, Jan; Paridaens, Tom; Müntefering, Fabian; Mainzer, Liudmila; Bliss, Brian; Yang, Mingyu; Ochoa, Idoia; Fostier, Jan; Ostermann, Jörn; Hernaez, Mikel (12 December 2019). "GABAC: an arithmetic coding solution for genomic data". Bioinformatics. 36 (7): 2275–2277. doi:10.1093/bioinformatics/btz922. PMC 7141842. PMID 31830243.
- [9] "ISO/IEC 23092-4 Reference Software".
- [10] "ISO/IEC 23092-5 Conformance".
- [11] Bliss, Brian; Allen, Joshua; Baheti, Saurabh; Bockol, Matthew; Delgado, Jaime; Fostier, Jan; Gelpi, Josep; Hart, Steven; Hernaez, Mikel; Hudson, Matthew; Kalmbach, Michael; Klee, Eric; Mainzer, Liudmila; Müntefering, Fabian; Naro, Daniel; Ochoa, Idoia; Ostermann, Joern; Paridaens, Tom; Ross, Christian; Voges, Jan; Wieben, Eric; Yang, Mingyu; Weissman, Tsachy; Wiepert, Mathieu (November 2019). Genie: an MPEG-G conformant software to compress genomic data. sc19.supercomputing.org. doi:10.1101/426353.
External links
- mpeg-g.org
- MPEG web site
- ISO/IEC 23092-1
- ISO/IEC 23092-2
- ISO/IEC 23092-3
- ISO/IEC 23092-4
- ISO/IEC 23092-5
- ISO/IEC 23092-6
This article "MPEG-G" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:MPEG-G. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
