LaBSE
LaBSE (Language-agnostic BERT Sentence Embedding) is an open-source sentence embedding model developed by Google Research and published in 2020.[1]
It extends the BERT language model with a multilingual dual-encoder architecture trained on parallel translation data, producing semantically comparable sentence vectors across more than one hundred languages.[2]
LaBSE is distributed via TensorFlow Hub and is widely used for cross-lingual information retrieval, semantic search, and machine translation evaluation.[3][4]
Overview
LaBSE was introduced by Google Research as part of its multilingual representation learning program. The model maps text from diverse languages into a shared 768-dimensional vector space, where semantically equivalent sentences are located close to each other.[5][6]
Unlike traditional translation-based systems, LaBSE relies on a single shared transformer encoder for all languages, allowing direct comparison between sentences without translation.[1]
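To make "located close to each other" concrete, the following minimal sketch compares vectors by cosine similarity, the measure used in the model's training objective. The 4-dimensional vectors are hypothetical stand-ins for real 768-dimensional embeddings, not model output, and the variable names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed values, not produced by LaBSE).
en_dog = [0.9, 0.1, 0.0, 0.2]  # "The dog is sleeping."
de_dog = [0.8, 0.2, 0.1, 0.2]  # "Der Hund schläft."
en_tax = [0.0, 0.1, 0.9, 0.3]  # "Taxes are due in April."

same_meaning = cosine_similarity(en_dog, de_dog)
different_meaning = cosine_similarity(en_dog, en_tax)

# A translation pair should score higher than an unrelated pair.
print(same_meaning > different_meaning)
```

With a real encoder, the same comparison is applied directly to the sentence vectors of two languages, with no translation step in between.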
Architecture
The system follows the structure of BERT-base (12 transformer layers, 12 attention heads) but employs a dual-encoder training setup similar to the Universal Sentence Encoder.[7][8]
Each sentence is tokenized using a joint multilingual WordPiece vocabulary covering 109 languages. The L2-normalized final hidden state of the [CLS] token serves as the fixed-size sentence representation. Training uses a translation ranking loss that maximizes cosine similarity between parallel sentences and minimizes it for unrelated pairs.[9][10]
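The reduction from per-token hidden states to a single fixed-size vector can be sketched as follows. This is an illustration under stated assumptions, not LaBSE's own code: the function names and the toy 2-dimensional states are hypothetical, and two common reductions are shown side by side (first-token and masked mean). Which one a given checkpoint applies should be checked against its documentation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def pool(hidden_states, attention_mask, mode):
    """Reduce a (seq_len x dim) list of token vectors to one unit vector.

    mode="first": use the first token's vector (the [CLS] position in BERT).
    mode="mean":  average the vectors of non-padding tokens (mask == 1).
    """
    if mode == "first":
        pooled = list(hidden_states[0])
    elif mode == "mean":
        kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
        dim = len(hidden_states[0])
        pooled = [sum(h[d] for h in kept) / len(kept) for d in range(dim)]
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    return l2_normalize(pooled)

# Hypothetical final-layer states for three positions; the last is padding.
token_states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask = [1, 1, 0]

first_vec = pool(token_states, mask, mode="first")
mean_vec = pool(token_states, mask, mode="mean")
```

Normalizing to unit length means a plain dot product between two sentence vectors equals their cosine similarity.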
Training
LaBSE was trained on large multilingual corpora combining public datasets such as OPUS with internal translation data from Google.[11][12]
The model was optimized with Adam, using a temperature-scaled cross-entropy loss computed over in-batch negatives. According to the authors, LaBSE achieved state-of-the-art results on cross-lingual retrieval benchmarks such as BUCC and Tatoeba at the time of its release.[1]
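A ranking objective with in-batch negatives can be sketched as below. This is a simplified illustration, not the authors' implementation: the function names, the temperature value, and the toy embeddings are assumptions, and the published objective may differ in details such as margins. For each source sentence, its own translation is the correct class and every other target in the batch acts as a negative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_ranking_loss(src, tgt, temperature=0.1):
    """Mean cross-entropy of picking tgt[i] for src[i] among all targets.

    src[i] and tgt[i] are unit-length embeddings of a translation pair;
    every tgt[j] with j != i serves as a negative for src[i].
    """
    n = len(src)
    total = 0.0
    for i in range(n):
        logits = [dot(src[i], tgt[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the row of scaled similarities.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / n

def bidirectional_loss(src, tgt, temperature=0.1):
    """Rank targets given sources and sources given targets (an assumption)."""
    return (in_batch_ranking_loss(src, tgt, temperature)
            + in_batch_ranking_loss(tgt, src, temperature))

# Hypothetical unit vectors: a batch of two translation pairs.
src_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[1.0, 0.0], [0.0, 1.0]]   # translations match their sources
shuffled_tgt = [[0.0, 1.0], [1.0, 0.0]]  # translations paired with the wrong source

aligned_loss = bidirectional_loss(src_batch, aligned_tgt)
shuffled_loss = bidirectional_loss(src_batch, shuffled_tgt)
```

The loss is small when each sentence is closest to its own translation and large when another sentence in the batch scores higher, which is what pushes parallel sentences together in the shared space.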
Applications
The model is publicly available on TensorFlow Hub and integrated into popular frameworks such as Hugging Face Transformers and Spark NLP. Typical applications include:
- Cross-lingual document and semantic search.
- Automatic evaluation of machine translation quality.
- Multilingual clustering, deduplication, and classification.
- Serving as a universal encoder for zero-shot learning tasks.
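As an illustration of the first use case, the sketch below ranks documents against a query in another language. It rests on several assumptions: `search` and `toy_encode` are hypothetical helpers, the toy vectors are invented, and `load_labse_encoder` assumes the third-party `sentence-transformers` package with the `sentence-transformers/LaBSE` checkpoint, which is one of several ways to load the model.

```python
def load_labse_encoder():
    """Assumed loader: requires the `sentence-transformers` package and
    downloads the `sentence-transformers/LaBSE` checkpoint on first use."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/LaBSE")
    return lambda texts: model.encode(texts, normalize_embeddings=True).tolist()

def search(query, documents, encode, top_k=3):
    """Return the top_k (score, document) pairs by dot product.

    `encode` maps a list of strings to a list of unit-length vectors,
    so the dot product equals cosine similarity.
    """
    vectors = encode([query] + documents)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [
        (sum(a * b for a, b in zip(query_vec, vec)), doc)
        for vec, doc in zip(doc_vecs, documents)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical stand-in encoder so the sketch runs without a model download.
TOY_VECTORS = {
    "Where is the train station?": [1.0, 0.0],
    "Wo ist der Bahnhof?": [0.96, 0.28],
    "Ich mag Käse.": [0.0, 1.0],
}

def toy_encode(texts):
    return [TOY_VECTORS[t] for t in texts]

results = search(
    "Where is the train station?",
    ["Ich mag Käse.", "Wo ist der Bahnhof?"],
    toy_encode,
    top_k=1,
)
```

Passing `load_labse_encoder()` in place of `toy_encode` would run the same ranking with real embeddings. Deduplication and clustering follow the same pattern, applying a similarity threshold or a clustering algorithm to the vectors instead of a top-k cut.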
Reception and impact
LaBSE has been cited extensively in academic literature on cross-lingual representation learning.[13] Independent evaluations report that it remains competitive with later multilingual embedding models such as LASER2 and multilingual Sentence-BERT.[14]
Its introduction marked a milestone in multilingual semantic similarity research and influenced subsequent releases of multilingual encoders in the open-source ecosystem.[15][16][17]
References
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- "tfhub.dev/google/LaBSE". TensorFlow Hub. Retrieved 2025-10-10.
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- "Language-Agnostic BERT Sentence Embedding". Google Research Blog. 2020-08-18. Retrieved 2025-10-10.
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
- "Samanantar: The Largest Publicly Available Parallel Corpus". MIT Press. 2022. doi:10.1162/tacl_a_00452. Retrieved 2025-10-10.
- Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
- Reimers, Nils; Gurevych, Iryna (2020). "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". Transactions of the Association for Computational Linguistics. 8: 121–135. doi:10.1162/tacl_a_00343.
- "Notes on LaBSE". Ceshine AI Blog. 2021-02-01.
- Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
- Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2022). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". arXiv:2205.15744 [cs.CL].
- "Comparative Study of Multilingual Sentence Embedding Models for Semantic Search". Hugging Face Blog. 2023-03-15. Retrieved 2025-10-10.
External links
- LaBSE repository
- Language-Agnostic BERT Sentence Embedding (by Yinfei Yang and Fangxiaoyu Feng, Software Engineers, Google Research).
- TensorFlow Hub – LaBSE
