Cross-Lingual NLP and Text Mining Resources

Introduction

Most natural language processing and text mining systems have been designed primarily to operate on English text. Yet the majority of people online do not speak English, and as more people gain access to digital tools, there is a growing need for tools that operate on other languages.

In some cases, it is feasible to develop new customized systems for specific populations. For example, many tools have been developed for the Chinese market. However, it is challenging to achieve this for all of the world's over 7,000 languages.

Self-Learning with Adversarial Perturbation for Cross-Lingual NLP

Cross-lingual NLP and text mining technology is an important alternative in such situations. The idea is to develop a system using one or more languages for which many resources are available, and then apply it to many new languages by connecting their linguistic representations.

Induced Code-Switching Datasets

We are releasing the cross-lingual benchmark database from our SIGIR 2020 paper. In this benchmark, a text classification model is trained on regular English training data but evaluated on documents that contain an automatically induced form of code-switching, i.e., many of the original English words have been replaced by non-English words.

Example:

Clorox Co disait it déclarés a two-for-one stock scission et autorisée a 10.3 pourcentage accroissement in la annuel dividende rate. The société disait la trimestriels liquidités dividende était boosted to $0.64 per partagez depuis $0.58 on a pre-split basis, exigibles August 15 to actionnariat of enregistrer on July 28. The société disait ce sera be la 21st consécutive annuel accroissement in la dividend. The additionnel partages résultant depuis la scission sera be distribuées on September 2 to actionnariat of enregistrer on July 28, la société said. Clorox is a fabricant of ménage épicerie produits et automobile nettoyage produits commercialisés in la United States et internationally.
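
The word-replacement idea behind such text can be illustrated with a minimal sketch. This is not the procedure used to build the released datasets: the lexicon `TOY_LEXICON`, the function `induce_code_switching`, and the `replace_prob` parameter are all hypothetical names introduced here for illustration, and a simple word-by-word dictionary lookup is assumed.

```python
import random

# Hypothetical toy English-to-French lexicon, for illustration only.
# The released datasets were built with their own resources.
TOY_LEXICON = {
    "company": "société",
    "said": "disait",
    "and": "et",
    "the": "la",
    "annual": "annuel",
    "dividend": "dividende",
}

def induce_code_switching(text, lexicon, replace_prob=0.5, seed=0):
    """Replace each word found in `lexicon` with its translation,
    independently with probability `replace_prob`."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        # Split off trailing punctuation so "said." still matches "said".
        core = token.rstrip(".,;:!?")
        tail = token[len(core):]
        translation = lexicon.get(core.lower())
        if translation is not None and rng.random() < replace_prob:
            out.append(translation + tail)
        else:
            out.append(token)
    return " ".join(out)

sentence = "The company said the annual dividend increased."
print(induce_code_switching(sentence, TOY_LEXICON, replace_prob=1.0))
# prints: la société disait la annuel dividende increased.
```

As in the example above, words are swapped without regard to grammar or agreement, so the result mixes languages within a sentence. Lowering `replace_prob` leaves more of the original English words in place.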

Download Induced Benchmark Datasets with German, Spanish, French, Italian, Japanese, Russian words

For more information about the datasets and our cross-lingual prediction method, based on self-learning, please consult our publication:

Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification
Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, Sen Yang, Gerard de Melo (2020)
In: Proc. SIGIR 2020 (Short Paper). ACM.
Acceptance rate: 30%

Cross-Lingual Databases

Apart from the above, we have also developed a number of cross-lingual databases that facilitate cross-lingual NLP and text mining.

Universal Wordnet (UWN)

One of the largest multilingual knowledge graphs, transforming the well-known WordNet database into a massively multilingual resource covering over 1 million words and several million named entities in a single semantically organized hierarchy. It was built using machine learning, and its MENTA extension draws on Wikipedia. Our derivative project OpenWordNet-PT (GitHub) is being used by Google Translate.

Lexvo.org

Contributes information about words and other language-related entities to the Linked Data Web and Semantic Web. The British Library, the Spanish National Library, and others have linked their data to Lexvo.org, and Lexvo.org in turn connects its own data to other valuable resources.

Sentiment/Emotion

Datasets and resources for sentiment analysis and fine-grained emotion analysis, some of which are available for multiple languages.

Etymological Wordnet

A database of etymological and derivational relationships between words in different languages, mined from Wiktionary.