PatentLex

The PatentLex System
Initial release	January 2016
Written in	Perl, Python, PHP, JavaScript, React, Material UI
Operating system	Cross-platform
Platform	Linux / Apache / MySQL / PHP
Standard(s)	Unicode
Available in	English, Traditional and Simplified Chinese
Type	Corpus
Website	patentlex.chilin.hk

PatentLex is a searchable platform of bilingual Chinese-English terms in science and technology. It aims to solve cross-linguistic communication problems that involve understanding words in the source language or expressing concepts in the target language. Human interlocutors and natural language processing (NLP) systems alike will sooner or later run into the major linguistic problem of lexical deficiency, known as the OOV (out-of-vocabulary) problem, a term used by researchers in statistical machine translation and common throughout NLP. The problem arises because similar or identical concepts are often encoded differently in any pair of languages, and because neither the human being nor its digital analogue can capture the differences correctly and exhaustively. Although humans have physical limitations and computers now have practically unlimited capacity, the problem persists because developments in human culture and society, especially in science and technology, constantly outpace developments in terminological research and lexicography. Bilingual patents are a good source of data for bootstrapping developments in scientific and technological terminology.

Another problem is that the conceptual spaces of different languages do not map onto each other isomorphically. A source-language term may therefore have multiple renditions in the target language. This challenges both the human interlocutor and the NLP system, since further effort is needed to check and analyse the actual usage of the alternative terms before the correct one can be selected.

PatentLex results from filtering 10 years of successfully filed Chinese and English patents, a total of 1.8 million patents (involving 5 trillion English words and 12 billion Chinese characters), to obtain over 300,000 parallel or comparable Chinese and English patents^[1]^[2], consisting of 1 billion English words and 2 million Chinese characters. These form the basis of the comparable texts for the cultivation of this platform.

The comparable corpus has undergone further language engineering for cultivation and filtering, including an iterative approach that aligns sentences to obtain statistically determined Chinese-English parallel sentence pairs. The resulting set then goes through bilingual multi-word extraction, which in turn augments the bilingual sentence alignment process and the extraction of compound words.

Cultivation

The comparable corpus of bilingual patents constitutes a sub-type of parallel texts. The patent documents are split into sentences (one kind of text segmentation), and the resulting sentences go through a sentence alignment process. An ensemble of alignment algorithms and tools is applied, including a dynamic programming technique that can match 1-to-n sentence fragments into sentence pairs on the basis of length-based and lexicon-based features, as well as bilateral translation probabilities.
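The dynamic programming step can be illustrated with a minimal sketch. It uses only a length-based feature and a handful of alignment shapes (1-to-1, 1-to-2, 2-to-1, and skips); the lexicon-based features and translation probabilities mentioned above are left out. The ratio `ZH_PER_EN_CHAR`, the penalties in `BEADS`, and the function names are illustrative assumptions, not values or code from the PatentLex pipeline.

```python
from typing import List, Tuple

# Assumed average number of Chinese characters per English character.
# A placeholder for illustration, not a figure from the PatentLex papers.
ZH_PER_EN_CHAR = 0.35

# Allowed alignment shapes (English sentences, Chinese sentences) and an
# assumed fixed penalty for each shape; 1-to-1 is preferred.
BEADS = {(1, 1): 0.0, (1, 2): 0.2, (2, 1): 0.2, (1, 0): 1.0, (0, 1): 1.0}


def length_cost(en_len: int, zh_len: int) -> float:
    """Penalty that grows as the Chinese length departs from the expected one."""
    expected_zh = en_len * ZH_PER_EN_CHAR
    return abs(zh_len - expected_zh) / (expected_zh + zh_len + 1.0)


def align(en: List[str], zh: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return the cheapest monotone alignment as (English indices, Chinese indices)."""
    n, m = len(en), len(zh)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                en_len = sum(len(s) for s in en[i:ni])
                zh_len = sum(len(s) for s in zh[j:nj])
                total = cost[i][j] + penalty + length_cost(en_len, zh_len)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(english_sentences, chinese_sentences)` returns index groups; a pair such as `([0, 1], [0])` means that the first two English sentences were matched to the first Chinese sentence.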

This creates a new corpus of Chinese-English parallel sentence pairs, which includes partial sentence components. The resulting set then goes through bilingual multi-word extraction, which in turn augments the lexicon-based bilingual sentence alignment process, forming an iterative approach that yields increasingly better results.
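One simple way to picture the extraction half of this loop is a co-occurrence score over aligned sentence pairs. The sketch below assumes that candidate terms have already been identified on each side (Chinese segmentation and multi-word candidate detection are out of scope) and scores candidate pairs with the Dice coefficient. The Dice measure, the thresholds, and the name `dice_lexicon` are assumptions chosen for illustration; the source does not say which association measure PatentLex uses.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# One aligned sentence pair: (English candidate terms, Chinese candidate terms).
Pair = Tuple[List[str], List[str]]


def dice_lexicon(
    pairs: List[Pair], threshold: float = 0.8, min_count: int = 2
) -> Dict[Tuple[str, str], float]:
    """Keep term pairs that co-occur often enough and score above the threshold."""
    en_count: Counter = Counter()
    zh_count: Counter = Counter()
    joint: Counter = Counter()
    for en_terms, zh_terms in pairs:
        en_set, zh_set = set(en_terms), set(zh_terms)
        en_count.update(en_set)
        zh_count.update(zh_set)
        joint.update(product(en_set, zh_set))
    lexicon = {}
    for (en_term, zh_term), together in joint.items():
        if together < min_count:
            continue
        # Dice coefficient: 2 * joint count / (sum of individual counts).
        score = 2.0 * together / (en_count[en_term] + zh_count[zh_term])
        if score >= threshold:
            lexicon[(en_term, zh_term)] = score
    return lexicon
```

In an iterative setup, the resulting lexicon would be fed back into the aligner as a lexicon-based feature, the corpus would be re-aligned, and the extraction repeated on the improved sentence pairs.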

Language Resources for Machine Translation and NLP

Finally, a corpus of over 30 million high-quality sentence pairs is obtained, with over 900 million English words and 1.9 billion Chinese characters. The initial bilingual sentence pairs obtained in the 2009 exercise provided the training corpora for participants, and the basis for their performance assessment, in the first two international Chinese-English patent MT competitions organised by the NII Testbeds and Community for Information Access Research (NTCIR) of the National Institute of Informatics (NII) in Tokyo in 2011^[3] and 2013^[4]. Work since then has involved automatic and semi-automatic curation, and has further yielded bilingual and linguistically well-formed compound words from this corpus of bilingual sentence pairs.

Search Platform Development

An interactive platform has been developed from this corpus to serve the needs of researchers on bilingual terminology, translators of technical subjects, and patent specialists in an age when rapid developments in science and technology outpace progress in terminology. It has been found especially useful to teachers and students in courses on technical translation. The database can also help with search engine enhancement, machine translation training, and related data analytics.^[5] The platform was demonstrated in the Game Changer Innovation Contest organised by TAUS Asia 2019 in Singapore, where it came second.^[6]

PatentLex version 2.0 offers over one million bilingual compound word entries with example sentences and distribution among major international patent classification domains.^[7]
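An entry of this kind could be modelled roughly as follows. This is a sketch of one possible data shape, not the actual PatentLex schema: the class and field names are hypothetical, and the frequencies in the usage example are made up purely to show the ranking.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rendition:
    target: str                       # target-language term
    frequency: int                    # occurrences in the patent corpus
    domains: Dict[str, int] = field(default_factory=dict)  # IPC section -> count
    examples: List[str] = field(default_factory=list)      # example sentences


@dataclass
class Entry:
    source: str                       # source-language compound word
    renditions: List[Rendition] = field(default_factory=list)

    def ranked(self) -> List[Rendition]:
        """All renditions, most frequent first."""
        return sorted(self.renditions, key=lambda r: r.frequency, reverse=True)

    def for_domain(self, ipc_section: str) -> List[Rendition]:
        """Renditions attested in one IPC section, most frequent there first."""
        hits = [r for r in self.renditions if r.domains.get(ipc_section, 0) > 0]
        return sorted(hits, key=lambda r: r.domains[ipc_section], reverse=True)


# Made-up counts for illustration only.
entry = Entry(
    source="substrate",
    renditions=[
        Rendition("底物", 120, {"C": 120}),           # chemistry sense
        Rendition("基板", 900, {"H": 850, "G": 50}),  # electronics sense
        Rendition("衬底", 400, {"H": 400}),           # semiconductor sense
    ],
)
print([r.target for r in entry.ranked()])
print([r.target for r in entry.for_domain("C")])
```

Keeping the domain counts on each rendition, rather than on the entry as a whole, is what lets a user narrow the choice of translation by technical field.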

Features

Provision of multiple alternate renditions of the translated terms according to their usage frequency in authentic technical documents such as patents;
Provision of domains of the different renditions to facilitate selection;
Provision of ample usage examples from authentic patent texts;
LexiScan to process input texts and mark all known technical compound words in the database with respect to the above 3 provisions;
Fuzzy search support;
Bilingual knowledge graph navigation to enhance bilingual appreciation and ability in technical terminology (under development);
Cross-lingual word cloud presentation of salient and comparable terms to facilitate data analytics (under development).
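Two of the features above, marking known terms in input text and fuzzy search, can be sketched in a few lines. The code below is an assumption about how such features might work, not a description of LexiScan itself: `scan` uses greedy longest-match lookup against a term dictionary, and `fuzzy_lookup` relies on `difflib.get_close_matches` from the Python standard library. Both function names and the `max_len` limit are hypothetical.

```python
import difflib
from typing import Dict, List, Tuple


def scan(text: str, lexicon: Dict[str, str], max_len: int = 8) -> List[Tuple[int, int, str]]:
    """Mark known terms in unsegmented text as (start, end, term), longest match first."""
    hits = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                hits.append((i, i + size, candidate))
                i += size
                break
        else:
            i += 1  # no known term starts here; move on one character
    return hits


def fuzzy_lookup(query: str, terms: List[str], n: int = 3, cutoff: float = 0.6) -> List[str]:
    """Return up to n known terms that are similar to a possibly misspelt query."""
    return difflib.get_close_matches(query, terms, n=n, cutoff=cutoff)


lexicon = {"散热器": "heat sink", "电路板": "circuit board", "电路": "circuit"}
for start, end, term in scan("该散热器安装在电路板上", lexicon):
    print(start, end, term, lexicon[term])
print(fuzzy_lookup("heat sinc", list(lexicon.values())))
```

Longest-match lookup ensures that 电路板 (circuit board) is marked as a whole rather than as the shorter 电路 (circuit) that it contains.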

References

[1] Lu, Bin, Benjamin K. Tsou, Tao Jiang, Jingbo Zhu, and Olivia Kwong (2011). "Mining parallel knowledge from comparable patents". Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances: 247–271 – via IGI Global.
[2] Lu, Bin, Ka-Po Chow, and Benjamin K. Tsou (2015). "Comparable Multilingual Patents as Large-scale Parallel Corpora". In: Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum, Pascale Fung (Eds.), Building and Using Comparable Corpora. Springer-Verlag: 167–187.
[3] Goto, Isao, Bin Lu, Ka-Po Chow, Eiichiro Sumita, and Benjamin K. Tsou (2012). "Overview of the Patent Translation Task at the NTCIR-9 Workshop". Proceedings of the NTCIR-9 Workshop: 559–578 – via NTCIR, Tokyo.
[4] Goto, Isao, Ka-Po Chow, Bin Lu, Eiichiro Sumita, and Benjamin K. Tsou (2013). "Overview of the Patent Translation Task at the NTCIR-10 Workshop". Proceedings of the 10th NTCIR Conference: 260–286 – via NTCIR, Tokyo.
[5] Tsou, Benjamin (2019). "A Proactive Approach to Lexical Hurdles in Technical Translation and Language Processing Involving Chinese and English". TAUS Asia Conference and Exhibits Conference Proceedings – via TAUS.
[6] van der Meer, Anne-Maj (Ed.) (October 2019). "TAUS Game Changer Innovation Contest". TAUS Keynotes Asia 2019: Global Content In and Out of Asia: 35–36 – via TAUS.
[7] Tsou, Benjamin, and Ka-Po Chow (2019). "From the cultivation of comparable corpora to harvesting from them: A quantitative and qualitative exploration". Proceedings of the 12th Workshop on Building and Using Comparable Corpora: 29–36 – via RANLP 2019, Varna, Bulgaria.

External links

Official website: patentlex.chilin.hk

