
LaBSE


Developer(s): Google Research
Initial release: July 15, 2020
Repository: tfhub.dev/google/LaBSE
Operating system: Cross-platform
Type: Open-source machine learning / Natural language processing
License: Apache License 2.0

    LaBSE (Language-agnostic BERT Sentence Embedding) is an open-source sentence embedding model developed by Google Research and published in 2020.[1]

    It extends the BERT language model with a multilingual dual-encoder architecture trained on parallel translation data, producing semantically comparable sentence vectors across more than one hundred languages.[2]

    LaBSE is distributed via TensorFlow Hub and is widely used for cross-lingual information retrieval, semantic search, and machine translation evaluation.[3][4]

    Overview

    LaBSE was introduced by Google Research as part of its multilingual representation learning program. The model maps text from diverse languages into a shared 768-dimensional vector space, where semantically equivalent sentences are located close to each other.[5][6]
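
    The notion of sentences being "close" in the shared space is usually measured with cosine similarity. The following is a minimal sketch of that comparison, not LaBSE itself: the 4-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and the names `emb_en`, `emb_de`, and `emb_other` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional stand-ins for 768-dimensional sentence embeddings
# (illustrative values only, not produced by any model).
emb_en = [0.9, 0.1, 0.3, 0.0]     # "The cat sleeps."
emb_de = [0.8, 0.2, 0.35, 0.05]   # "Die Katze schläft."
emb_other = [0.0, 0.9, 0.0, 0.4]  # an unrelated sentence

sim_translation = cosine_similarity(emb_en, emb_de)
sim_unrelated = cosine_similarity(emb_en, emb_other)
```

    With real embeddings, a translation pair would be expected to score noticeably higher than an unrelated pair, as the toy values here are chosen to illustrate.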

    Unlike traditional translation-based systems, LaBSE relies on a single shared transformer encoder for all languages, allowing direct comparison between sentences without translation.[1]

    Architecture

    The system follows the structure of BERT-base (12 transformer layers, 12 attention heads) but employs a dual-encoder training setup similar to the Universal Sentence Encoder.[7][8]

    Each sentence is tokenized using a joint multilingual WordPiece vocabulary covering 109 languages. The final hidden state of the [CLS] token, L2-normalized, serves as the fixed-size sentence representation. Training uses a translation ranking loss that maximizes cosine similarity between parallel sentences and minimizes it for unrelated pairs.[9][10]
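
    Pooling is the step that turns a variable number of per-token hidden states into one fixed-size vector. The sketch below illustrates two common strategies for BERT-style encoders, first-token ([CLS]) pooling and masked mean pooling, each followed by L2 normalization. The helper names and the toy 3-dimensional states are assumptions made for illustration; which strategy a given checkpoint uses should be checked against its model card.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def first_token_pool(hidden_states):
    """Use the state at position 0 (the [CLS] token in BERT-style models)."""
    return list(hidden_states[0])


def mean_pool(hidden_states, attention_mask):
    """Average hidden states over real tokens, skipping padding (mask == 0)."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[i] for h in kept) / len(kept) for i in range(dim)]


# Toy final-layer states: 4 token positions x 3 dimensions.
# The last position is padding and is masked out.
hidden = [
    [3.0, 4.0, 0.0],
    [1.0, 0.0, 2.0],
    [2.0, 2.0, 1.0],
    [9.0, 9.0, 9.0],
]
mask = [1, 1, 1, 0]

cls_embedding = l2_normalize(first_token_pool(hidden))
mean_embedding = l2_normalize(mean_pool(hidden, mask))
```

    Normalizing to unit length means the dot product of two embeddings equals their cosine similarity, which is convenient for the ranking loss and for retrieval.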

    Training

    LaBSE was trained on large multilingual corpora combining public datasets such as OPUS with internal translation data from Google.[11][12]

    Optimization employed Adam with in-batch negatives and temperature-scaled cross-entropy. According to the authors, LaBSE achieved state-of-the-art results on cross-lingual retrieval benchmarks such as BUCC and Tatoeba at the time of its release.[1]
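
    The combination of in-batch negatives and temperature-scaled cross-entropy can be sketched as follows. This is a simplified, one-directional illustration rather than the authors' implementation: the function name, the default temperature of 0.05, and the toy vectors are assumptions, and the published loss may include further details (such as a margin or a bidirectional term) that are not reproduced here.

```python
import math


def in_batch_ranking_loss(src, tgt, temperature=0.05):
    """Mean cross-entropy where tgt[i] is the positive for src[i] and every
    other tgt[j] in the batch acts as a negative.

    src, tgt: lists of L2-normalized vectors, same length and dimension.
    temperature: illustrative value, not taken from the paper.
    """
    if len(src) != len(tgt) or not src:
        raise ValueError("src and tgt must be non-empty and the same length")
    total = 0.0
    for i, s in enumerate(src):
        # Temperature-scaled similarity of source i to every target in the batch.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in tgt]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-softmax probability of the true translation.
        total += log_z - logits[i]
    return total / len(src)


batch_src = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[1.0, 0.0], [0.0, 1.0]]   # each source matches its own target
swapped_tgt = [[0.0, 1.0], [1.0, 0.0]]   # each source matches the wrong target

loss_aligned = in_batch_ranking_loss(batch_src, aligned_tgt)
loss_swapped = in_batch_ranking_loss(batch_src, swapped_tgt)
```

    The loss is near zero when each source sentence is most similar to its own translation and grows large when another sentence in the batch scores higher, which is what pushes parallel sentences together and unrelated ones apart.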

    Applications

    The model is publicly available on TensorFlow Hub and integrated into popular frameworks such as Hugging Face Transformers and Spark NLP. Typical applications include:

    • Cross-lingual document and semantic search.
    • Automatic evaluation of machine translation quality.
    • Multilingual clustering, deduplication, and classification.
    • Serving as a universal encoder for zero-shot learning tasks.
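
    As an illustration of the first application, cross-lingual search reduces to ranking stored embeddings by cosine similarity to a query embedding. The sketch below assumes the embeddings have already been computed by some encoder; the 3-dimensional vectors are made-up stand-ins, and `search` is a hypothetical helper, not part of any library.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def search(query_vec, corpus, top_k=2):
    """Return the top_k (text, score) pairs ranked by cosine similarity.

    corpus: list of (text, embedding) pairs in any mix of languages.
    """
    q = normalize(query_vec)
    scored = []
    for text, vec in corpus:
        d = normalize(vec)
        scored.append((text, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (illustrative values only, not produced by any model).
corpus = [
    ("Wie ist das Wetter heute?", [0.9, 0.1, 0.0]),
    ("¿Dónde está la estación?", [0.1, 0.9, 0.1]),
    ("Il fait beau aujourd'hui.", [0.7, 0.0, 0.5]),
]
query = [1.0, 0.05, 0.1]  # stand-in for "What is the weather like today?"

results = search(query, corpus, top_k=2)
```

    Because the query and documents share one vector space, no translation step is needed before ranking. For large corpora, the linear scan would typically be replaced by an approximate nearest-neighbor index.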

    Reception and impact

    LaBSE has been cited extensively in academic literature on cross-lingual representation learning.[13] Independent evaluations report that it remains competitive with later multilingual embedding models such as LASER2 and multilingual Sentence-BERT.[14]

    Its introduction marked a milestone in multilingual semantic similarity research and influenced subsequent releases of multilingual encoders in the open-source ecosystem.[15][16][17]

    References

    1. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
    2. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
    3. "tfhub.dev/google/LaBSE". TensorFlow Hub. Retrieved 2025-10-10.
    4. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
    5. "Language-Agnostic BERT Sentence Embedding". Google Research Blog. 2020-08-18. Retrieved 2025-10-10.
    6. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
    7. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
    8. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
    9. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
    10. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
    11. "Samanantar: The Largest Publicly Available Parallel Corpus". MIT Press. 2022. doi:10.1162/tacl_a_00452. Retrieved 2025-10-10.
    12. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
    13. Reimers, Nils; Gurevych, Iryna (2020). "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". Transactions of the Association for Computational Linguistics. 8: 121–135. doi:10.1162/tacl_a_00343.
    14. "Notes on LaBSE". Ceshine AI Blog. 2021-02-01.
    15. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
    16. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2022). "Efficient and Effective Massively Multilingual Sentence Embedding (EMS)". arXiv:2205.15744 [cs.CL].
    17. "Comparative Study of Multilingual Sentence Embedding Models for Semantic Search". Hugging Face Blog. 2023-03-15. Retrieved 2025-10-10.
