MMseqs2

Developer(s)	Martin Steinegger, Milot Mirdita, Maria Hauser, Clovis Galiez, Lars von den Driesch, Johannes Söding
Written in	C++
Available in	English
Type	Bioinformatics tool
License	GPL v3
Website	mmseqs.org

MMseqs2^[1] (Many-against-Many sequence searching) is an open-source software suite, licensed under GPLv3, for fast similarity searches and clustering of protein sequences. It compares a database (a set) of query protein sequences with a database of target protein sequences and aligns each query sequence to similar target sequences. Sequence similarity searches are widely used in life science research to infer the functions and structures of query proteins from those of similar proteins in the database. Search tools are often not sensitive enough to find a similar sequence with annotated function or known structure, so the sensitivity of a search tool matters, not only its speed. MMseqs2 was reported to achieve a good combination of search sensitivity and search speed^[1] (see Figure 1). MMseqs2 is called from the command line and runs under Linux, macOS, and Windows via Cygwin (trial version). A version installable as a local app or web server is also available.

Application in genomics and metagenomics

MMseqs2 is used to speed up the sensitive sequence searches needed to build several databases of orthologous proteins: the SonicParanoid database,^[2] the Microbial Genome Database (MBGD),^[3] and OrthoDB.^[4]

In metagenomics, genetic sequences from microbes and viruses are sampled from the environment (human gut, skin, soil, oceans, sewage, etc.) and sequenced directly, without first cultivating the microbes. Because the cost of next-generation sequencing has dropped quickly, metagenomics is becoming ever more powerful, while sequence sets grow larger and more costly to analyse computationally. Owing to its favourable combination of speed and sensitivity and its ability to process billions of sequences in one go, MMseqs2 is employed to improve sequence annotations and analyses in environmental genomics and metagenomics.^[5]^[6]

Functionality

MMseqs2 is an open-source software suite to search and cluster terabyte-sized protein sequence sets. In its iterative profile search mode, MMseqs2 was reported to detect similar sequences with higher sensitivity than the popular BLAST and PSI-BLAST search tools while running at 400 times their speed.^[1]

Figure 1: Comparison of sequence search methods BLAST, UCLUST, DIAMOND, RAPsearch2, MMseqs2, LAST and MMseqs. Area under the curve sensitivity up to the first false positive (x-axis) versus speed-up factor relative to BLAST (y-axis), tested with 637,000 searches. White numbers in plot symbols: number of search iterations. This figure was published as Figure 2B in paper^[1].
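The sensitivity measure named in the caption can be illustrated with a small sketch. It assumes that "sensitivity up to the first false positive" means, for one query, the fraction of its true-positive matches that are ranked ahead of the first false-positive match. The function name `auc1` and the input format are hypothetical and chosen only for illustration; they are not part of MMseqs2.

```python
def auc1(ranked_is_true, total_true):
    """Fraction of true positives ranked before the first false positive.

    ranked_is_true: booleans for one query's hits, best hit first
                    (True = true positive, False = false positive).
    total_true:     number of true positives that exist for this query.
    """
    found = 0
    for is_true in ranked_is_true:
        if not is_true:
            break  # stop counting at the first false positive
        found += 1
    return found / total_true if total_true else 0.0


def mean_auc1(per_query):
    """Average the per-query values over many searches."""
    values = [auc1(hits, total) for hits, total in per_query]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    searches = [
        ([True, True, False, True], 4),  # two of four found before first FP
        ([True, True, True], 3),         # all found, no FP in the list
    ]
    print(mean_auc1(searches))
```

Averaging such per-query values over all searches gives a single sensitivity figure per method, which can then be plotted against the measured speed-up.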

Compared to its predecessor MMseqs^[7], MMseqs2 is more sensitive, supports iterative profile-to-sequence searches and sequence-to-profile searches, supports nucleotide sequences (for some commands), and contains many more utilities.

MMseqs2 can run on multiple cores and servers, scaling almost linearly (Supplementary Figure 2 in ^[1]). It can also split and distribute large query or target databases automatically across several compute servers using MPI. This allows users to analyse databases with billions of sequences with relatively modest computing resources.

The MMseqs2 suite contains four main tools (workflows) for common searching and clustering tasks:

search	(Iteratively) searches with sequences or profiles through the sequence database
cluster	Clusters sequences by similarity
linclust	Clusters sequences down to 50% pairwise sequence identity in linear time
clusterupdate	Updates the clustering of an old sequence DB to the clustering of a new sequence DB

These tools are bash workflows composed of some of the 90 utility tools in MMseqs2 and its three core tools for sequence prefiltering (mmseqs prefilter), local sequence alignment (mmseqs align), and clustering (mmseqs clust). This design gives expert users the flexibility to write their own customised workflows as simple bash scripts.
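How the three core tools chain into a workflow can be sketched as follows. This is a simplified illustration, not the actual cluster workflow shipped with MMseqs2: it only builds and prints the command lines (a dry run). The database names are hypothetical, and the positional argument order for each module is an assumption that should be checked against the help output of the installed version before running anything.

```python
import shlex


def clustering_workflow(seq_db, out_prefix):
    """Return the command lines of a simplified three-step clustering run.

    Each step consumes the result database written by the previous step.
    Argument order is an assumption; verify it against the module help.
    """
    pref_db = f"{out_prefix}_pref"  # hypothetical name for prefilter results
    aln_db = f"{out_prefix}_aln"    # hypothetical name for alignment results
    clu_db = f"{out_prefix}_clu"    # hypothetical name for the final clustering
    return [
        # 1. Prefilter: compare the sequence set with itself.
        ["mmseqs", "prefilter", seq_db, seq_db, pref_db],
        # 2. Align: compute local alignments for the pairs that passed step 1.
        ["mmseqs", "align", seq_db, seq_db, pref_db, aln_db],
        # 3. Clust: group sequences using the resulting similarity graph.
        ["mmseqs", "clust", seq_db, aln_db, clu_db],
    ]


if __name__ == "__main__":
    for cmd in clustering_workflow("seqDB", "run1"):
        print(shlex.join(cmd))  # dry run: print the command, do not execute it
```

Keeping each step as a separate command with its own result database is what lets a custom script swap, repeat, or skip individual stages.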

The prefilter core tool computes the similarities between all sequences in the query database and all sequences in a target database using a k-mer matching stage followed by an ungapped alignment. The align core tool implements a vectorized Smith-Waterman alignment^[8] of all sequence pairs that pass a cut-off for the ungapped alignment score in the prefilter tool.
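The two stages described above can be sketched in a few lines of Python. This is a toy illustration under strong simplifying assumptions, not the MMseqs2 implementation: it matches only exact k-mers, uses a fixed match/mismatch score instead of a substitution matrix, uses a linear gap penalty, and is not vectorized. All function names and score values are hypothetical.

```python
from collections import defaultdict

MATCH, MISMATCH, GAP = 2, -1, -2  # toy scores (assumption, not MMseqs2's)


def kmer_index(targets, k):
    """Map every k-mer to the (target name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in targets.items():
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].append((name, j))
    return index


def ungapped_score(q, t, diag):
    """Best gapless segment score on the diagonal diag = i - j."""
    best = run = 0
    for i in range(max(0, diag), min(len(q), len(t) + diag)):
        run = max(0, run + (MATCH if q[i] == t[i - diag] else MISMATCH))
        best = max(best, run)
    return best


def prefilter(query, targets, k=3, min_score=6):
    """Stage 1: k-mer matches pick diagonals, ungapped scores filter targets."""
    index = kmer_index(targets, k)
    diagonals = set()
    for i in range(len(query) - k + 1):
        for name, j in index.get(query[i:i + k], ()):
            diagonals.add((name, i - j))
    hits = {}
    for name, diag in diagonals:
        score = ungapped_score(query, targets[name], diag)
        if score >= min_score:  # cut-off on the ungapped alignment score
            hits[name] = max(score, hits.get(name, 0))
    return hits


def smith_waterman(q, t):
    """Stage 2: local alignment score with gaps (linear gap penalty)."""
    prev, best = [0] * (len(t) + 1), 0
    for i in range(1, len(q) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            s = MATCH if q[i - 1] == t[j - 1] else MISMATCH
            cur.append(max(0, prev[j - 1] + s, prev[j] + GAP, cur[j - 1] + GAP))
        best = max(best, max(cur))
        prev = cur
    return best


if __name__ == "__main__":
    targets = {"t1": "GGMKTAYIAKQRGG", "t2": "MKTAYAKQR", "t3": "WWWWWWWW"}
    query = "MKTAYIAKQR"
    for name in sorted(prefilter(query, targets)):
        print(name, smith_waterman(query, targets[name]))
```

The point of the two-stage design is that the expensive gapped alignment runs only on the few targets that survive the cheap k-mer and ungapped checks; in the example, `t3` shares no k-mer with the query and is never aligned.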

The clustering core tool groups protein sequence sets into clusters of similar sequences. It takes as input the similarity graph obtained by comparing the sequence set with itself in the prefilter and align modules. Linclust^[9] is an independent workflow that clusters protein sequences in linear time. It is less sensitive but orders of magnitude faster than the mmseqs cluster workflow. The mmseqs clusterupdate workflow can efficiently update an existing sequence clustering by adding new sequences and removing deprecated ones, without the need to compare all sequences with all others.
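As an illustration of clustering on a similarity graph, the sketch below uses a greedy set-cover strategy: repeatedly pick the sequence with the most still-unassigned neighbours as a cluster representative. Which strategy MMseqs2 applies depends on its settings and is not stated here, so treat the choice of algorithm, the function name, and the input format as assumptions made for the example.

```python
def greedy_set_cover(neighbors):
    """Cluster a similarity graph greedily.

    neighbors: dict mapping each sequence name to the set of names it is
               similar to (edges of the similarity graph).
    Returns a dict mapping each representative to its sorted member list.
    """
    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # Choose the sequence that covers the most unassigned sequences.
        # Iterating over a sorted list makes ties resolve alphabetically.
        rep = max(
            sorted(unassigned),
            key=lambda s: len((neighbors[s] | {s}) & unassigned),
        )
        members = (neighbors[rep] | {rep}) & unassigned
        clusters[rep] = sorted(members)
        unassigned -= members
    return clusters


if __name__ == "__main__":
    graph = {
        "a": {"b", "c"},
        "b": {"a"},
        "c": {"a", "d"},
        "d": {"c"},
        "e": set(),
    }
    for rep, members in greedy_set_cover(graph).items():
        print(rep, members)
```

This version rescans all unassigned sequences in every round, which is fine for a toy graph but would not scale to the database sizes discussed in this article.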

References

[1] Steinegger M.; Söding J. (2017). "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets". Nature Biotechnology. 35 (11): 1026–1028. doi:10.1038/nbt.3988. PMID 29035372.
[2] Cosentino S.; Iwasaki W. (2018-07-19). "SonicParanoid: fast, accurate, and easy orthology inference". Bioinformatics. doi:10.1093/bioinformatics/bty631. PMID 30032301.
[3] Uchiyama I.; Mihara M.; Nishide H.; Chiba H.; Kato M. (2018-11-20). "MBGD update 2018: microbial genome database based on hierarchical orthology relations covering closely related and distantly related comparisons". Nucleic Acids Res. doi:10.1093/nar/gky1054. PMID 30462302.
[4] Kriventseva E.V.; Kuznetsov D.; Tegenfeldt F.; Manni M.; Dias R.; Simão F.A.; Zdobnov E.M. (2018-11-05). "OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs". Nucleic Acids Res. doi:10.1093/nar/gky1053. PMID 30395283.
[5] Hao Y.; Pei Z.; Brown S.M. (Oct 2017). Bioinformatics in Microbiome Analysis. Methods in Microbiology. 44. pp. 1–18. doi:10.1016/bs.mim.2017.08.002. ISBN 9780128137147.
[6] Lau P.; Preynat-Seauve O.; et al. (May 2017). "Metagenomics analysis of red blood cell and fresh-frozen plasma units". Transfusion. 57 (7): 1787–1800. doi:10.1111/trf.14148. PMID 28497550.
[7] Hauser M.; Steinegger M.; Söding J. (Jan 2016). "MMseqs software suite for fast and deep clustering and searching of large protein sequence sets". Bioinformatics. 32 (9): 1323–1330. doi:10.1093/bioinformatics/btw006. PMID 26743509.
[8] Farrar M. (January 2007). "Striped Smith-Waterman speeds database searches six times over other SIMD implementations". Bioinformatics. 23 (2): 156–61. doi:10.1093/bioinformatics/btl582. PMID 17110365.
[9] Steinegger M.; Söding J. (2018-06-29). "Clustering huge protein sequence sets in linear time". Nature Communications. 9 (1): 2542. doi:10.1038/s41467-018-04964-5. PMC 6026198. PMID 29959318.
[10] Hu G.; Kurgan L. (2018-08-13). "Sequence Similarity Searching". Curr Prot Prot Sci.: e71. doi:10.1002/cpps.71. PMID 30102464.

This article "MMseqs2" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:MMseqs2. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
