Indic Transformers: An Analysis of Transformer Language Models for Indian Languages

This post covers our recent work on applying various transformer-based architectures to Indian languages.

Our paper has been accepted at ML-RSA @ NeurIPS 2020. 🥳

This blog begins with an overview of modern Natural Language Processing (NLP) and how NLP has evolved and progressed in the deep-learning era. We then move on to the premise of our work, the linguistic disparity in NLP research, and highlight some relevant work in NLP for Indian languages. Finally, we delve into the details of our contributions and experimental setups, and share some key insights from our research.

This work was done jointly by Kushal Jain, Felix Laumann and Ayushman Dash, in collaboration with Reverie Language Technologies.

An Overview of Modern NLP

Natural Language Processing (NLP) has seen rapid development in recent years. Modern NLP is now synonymous with transformers and architectures like BERT, RoBERTa, ELECTRA and so on. Architectures based on transformers have improved the state of the art (SOTA) on almost every downstream task. Training task-specific architectures is quickly being replaced by the new paradigm of pre-training and fine-tuning large language models. The underlying principles of these models, however, are quite similar to previous techniques in NLP.

Let’s take a short tour of NLP research in the deep-learning era.

  • Word Embeddings: These are fixed-size, low-dimensional vectors that encode the semantics, or meaning, of a word. They are trained on a large corpus in an unsupervised way. The most common approach, the skip-gram method, trains a neural network end to end: the input is a given word in the corpus and the outputs are the words surrounding it. In short, the model predicts the context from the current word. After training, we remove the prediction layer and use the weights of the hidden layer as word vectors. Examples of word embeddings include Word2vec, GloVe and fastText.

    Refer to this blog if you need to understand it in more detail.

    Vector spaces showing the algebraic relationship between related words.
  • Language Modeling: One of the main drawbacks of the word embeddings discussed above is that they are static, or context-free: each word gets a single vector, and hence a single meaning, regardless of the sentence it appears in. ELMo overcomes this by training multilayer bi-LSTM language models that capture both the semantics and the syntax of the language. While contextualized word embeddings from ELMo improved SOTA on many tasks, the idea proposed by ULMFiT was perhaps the turning point in NLP research. It introduced pre-training language models on large text corpora and then fine-tuning them on downstream tasks like text classification.

    In their own words,

    Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets.

    Simple example depicting the task of language modeling
  • Masked Language Modeling & BERT: BERT uses the transformer encoder and is pre-trained on a huge text corpus. Its pre-training technique also differs from previous language models: it uses a masked language modeling objective. BERT reported SOTA results on almost every downstream task, including text classification, NER, QA, NLI and so on. It was considered a major breakthrough in NLP research. Since then, multiple models have been built around BERT or the transformer architecture.
    Number of mentions of key trends in NLP in ACL papers taken from here.
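To make the two training objectives above concrete, here is a minimal sketch of how their training examples can be built from raw text. It is an illustration under simplifying assumptions, not the exact procedure of Word2vec or BERT: the function names, the window size and the masking probability are our own choices, and BERT's real masking scheme has extra details that are left out here.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the model sees `center` and predicts `context`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; the model is trained to recover them."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # target the model must predict
        else:
            masked.append(tok)
            labels.append(None)   # position is ignored by the loss
    return masked, labels

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(pairs[:4])
print(masked)
```

Skip-gram turns every word into several (word, neighbour) examples, while masked language modeling keeps the whole sentence as input and asks the model to fill in the hidden positions using context from both sides.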

The State of Indic-NLP

All the advancements in NLP described in the previous section are, surprisingly, limited to English and a handful of other high-resource languages. Indian languages are highly under-represented in current NLP research, even though India has a population close to 1.3 billion. Within India itself, many different languages are spoken across different parts of the country.

Only 10% of the Indian population speak English.

This essentially means that a large number of people who speak or communicate in low-resource languages have not benefited from the recent developments in NLP.

A map showing the diversity of languages in India.

However, there have been some concerted efforts to improve NLP research in Indian languages.

  • iNLTK: Along with a toolkit to pre-process and normalize Indian languages, iNLTK provides language models trained on Indian languages and reports their performance on classification tasks. These language models are based on ULMFiT.

  • Indic-NLP: Classification datasets for 11 languages, fastText word embeddings and normalizers for Indian languages. Recently, they also released IndicBERT, a multilingual ALBERT model pre-trained on 11 Indian languages.

Our Contribution

We at NeuralSpace & Reverie believe in pursuing research that matters and benefits the community. When we set out to measure the performance of these SOTA models on Indian languages, we were surprised by how little research and how few benchmarks existed for them. So we decided to start from scratch. Some of our contributions are as follows:

  • We train 4 variants of monolingual contextual language models from scratch: DistilBERT, BERT, RoBERTa and XLM-RoBERTa. We do so for 3 different languages, namely Hindi, Telugu and Bengali, which cover more than 60% of native speakers in the country.
  • We evaluate the performance of these models on 3 downstream tasks: text classification, POS tagging and question answering (QA).
  • We present an exhaustive analysis of transformer-based models for Indian languages by comparing them directly with their multilingual counterparts.

Some challenges that we faced during this work:

While the paper explains these issues and how we overcame them in greater detail, we outline them briefly here.

  • Lack of monolingual data to train language models: BERT reported SOTA results on various downstream tasks in English, and it was trained on 3.3 billion tokens. Compare that to the number of tokens available for Indian languages: ~700M for Hindi, 300M for Bengali and 79M for Telugu. To train powerful and efficient language models, we need more data, especially for languages like Telugu.
  • Lack of labeled datasets for downstream tasks: To evaluate our language models, or any other supervised learning approach, we need annotated datasets for all major languages spoken in the country. More importantly, these datasets should be easily accessible to the research community. Universal Dependencies is a great initiative towards this aim, curating treebanks for many low-resource languages. However, we struggled to find such treebanks for Bengali. Even for QA, there are no major datasets for Indian languages.
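To put the data gap in perspective, the short calculation below expresses each corpus as a share of BERT's English pre-training corpus. It uses only the approximate token counts quoted above; the variable and function names are ours.

```python
# Approximate token counts as quoted in this post.
BERT_ENGLISH_TOKENS = 3_300_000_000
indic_tokens = {
    "Hindi": 700_000_000,
    "Bengali": 300_000_000,
    "Telugu": 79_000_000,
}

def share_of_bert_corpus(n_tokens, reference=BERT_ENGLISH_TOKENS):
    """Corpus size as a percentage of BERT's English pre-training corpus."""
    return 100.0 * n_tokens / reference

for language, n_tokens in indic_tokens.items():
    print(f"{language}: {share_of_bert_corpus(n_tokens):.1f}% of BERT's English corpus")
```

With these figures, Hindi comes to roughly 21% of the English corpus, Bengali to about 9% and Telugu to about 2.4%.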

Our Experiments and Findings

We run exhaustive experiments on the selected downstream tasks under 3 different setups.

  • Setup A: In this setup we evaluate multilingual models that are available out of the box on the Hugging Face model hub. Multilingual models are generally trained by combining the vocabularies and datasets of 100+ languages, in the hope of cross-lingual transfer of features. Multilingual variants are available for DistilBERT, BERT and XLM-R.
  • Setup B: Next we fine-tune each multilingual language model from setup A with data from a single language. Since multilingual models are trained with data from multiple languages, we believe that fine-tuning them with monolingual data would be beneficial. In our case we fine-tune mBERT with 300 MB of data each for Hindi, Bengali and Telugu, separately.
  • Setup C: Finally, in this setup we train monolingual contextual models from scratch. This allows us to directly compare our models with their multilingual counterparts.
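As a rough illustration of setup A, the sketch below loads a multilingual checkpoint from the Hugging Face model hub with the `transformers` library. The checkpoint names are an assumption on our part: this post names the model families, not the exact hub identifiers, and `load_multilingual` is a helper invented for this example.

```python
# Hub identifiers are assumptions: the post names the model families,
# not the exact checkpoints that were evaluated.
MULTILINGUAL_CHECKPOINTS = {
    "DistilBERT": "distilbert-base-multilingual-cased",
    "BERT": "bert-base-multilingual-cased",
    "XLM-R": "xlm-roberta-base",
}

def load_multilingual(family):
    """Download the tokenizer and encoder for one model family (needs network access)."""
    # Imported lazily so the mapping above can be used without the library installed.
    from transformers import AutoModel, AutoTokenizer

    name = MULTILINGUAL_CHECKPOINTS[family]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    return tokenizer, model

# Example usage (downloads weights on first call):
# tokenizer, model = load_multilingual("BERT")
# print(tokenizer.tokenize("some text in Hindi, Bengali or Telugu"))
```

The same loading code would serve setup C as well, with the checkpoint name pointing at a monolingual model instead of a multilingual one.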

For each of these setups we run experiments under 2 different training regimes:

  • Frozen: In this setting, we freeze the language model and use its output as embeddings to train BiLSTM, linear and transformer layers on top.
  • Fine-tuned: Here, we fine-tune the entire language model with a simple linear layer on top.

    Direct comparison between the best results from setup A and setup C and related work. Empty cells denote models that were either not evaluated on that particular task/language (e.g., IndicBERT on POS tagging) or evaluated on only one specific task (e.g., TyDiQA).

    The table above summarizes our key results and compares them with other recent and relevant work. We report state-of-the-art results for Hindi and Bengali text classification. Our results for POS tagging are also competitive with the latest approaches.
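The difference between the frozen and fine-tuned regimes described above can be sketched in PyTorch as follows. This is a toy illustration, not our training code: `TinyEncoder` is a hypothetical stand-in for a pre-trained language model, and the layer sizes and mean pooling are assumptions chosen to keep the example small.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained language model."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layer = nn.Linear(hidden, hidden)

    def forward(self, token_ids):
        # Output shape: (batch, sequence length, hidden)
        return torch.tanh(self.layer(self.embed(token_ids)))

class Classifier(nn.Module):
    def __init__(self, encoder, hidden=32, num_classes=3, frozen=True):
        super().__init__()
        self.encoder = encoder
        self.frozen = frozen
        if frozen:
            # Frozen: encoder weights get no gradient updates; a BiLSTM head
            # is trained on top of the encoder outputs used as embeddings.
            for param in self.encoder.parameters():
                param.requires_grad = False
            self.head = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, num_classes)
        else:
            # Fine-tuned: the whole encoder is trainable, with one linear layer on top.
            self.head = None
            self.out = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):
        states = self.encoder(token_ids)
        if self.frozen:
            states, _ = self.head(states)
        return self.out(states.mean(dim=1))  # mean-pool over the sequence (assumption)

def trainable(model):
    """Count parameters that will be updated during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

frozen_model = Classifier(TinyEncoder(), frozen=True)
tuned_model = Classifier(TinyEncoder(), frozen=False)
batch = torch.randint(0, 100, (4, 10))  # 4 sequences of 10 token ids
print(frozen_model(batch).shape, tuned_model(batch).shape)
print(trainable(frozen_model), trainable(tuned_model))
```

In the frozen model only the head on top is updated, which is what makes this regime attractive when the language model itself is too large to fine-tune on the available hardware.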

Some key insights/trends from our work:

  • We find that training a multi-layer LSTM/biLSTM using language model output as embeddings gives comparable or sometimes even better results than fine-tuning the entire model. This is particularly helpful in cases where you do not have access to high-power GPUs. In our work, LSTM layers usually outperformed other layers that we experimented with under the “frozen” setting.
  • The above point holds true for text classification and POS tagging for all 3 languages. However, for question answering, we find that fine-tuning outperforms freezing by a considerable margin.
  • Our models in setup C outperform their multilingual counterparts in case of BERT and DistilBERT for all languages and tasks.
  • RoBERTa models in setup C perform poorly for all tasks and languages. We believe this is because RoBERTa uses a byte-level BPE tokenizer, which handles morphologically rich languages poorly. This has been pointed out earlier for other languages, and we found it to be true for Indian languages as well. Hence the choice of tokenizer becomes a key factor when dealing with Indian languages.
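One property of byte-level tokenization that is easy to check is how many base symbols it starts from. The sketch below is not the analysis from our paper and does not run a real BPE tokenizer; it only counts UTF-8 bytes, which are the starting symbols before any BPE merges are learned. The helper name is ours.

```python
def utf8_byte_symbols(text):
    """Split text into the UTF-8 bytes a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))

# The word "Hindi" written in Devanagari, spelled out as code points.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
english = "Hindi"

for label, word in [("Devanagari", hindi), ("Latin", english)]:
    symbols = utf8_byte_symbols(word)
    print(f"{label}: {len(word)} code points -> {len(symbols)} byte symbols")
```

Each Devanagari code point takes 3 bytes in UTF-8, so the 6 code points become 18 byte symbols, while the 5 Latin letters stay at 5. A trained byte-level BPE vocabulary merges many of these bytes back together, but it has to learn those merges from a comparatively small corpus.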

You can read in detail about our experiments, results and analysis in the paper here.

You can find our pre-trained models over here.

At NeuralSpace, we are working towards bringing state-of-the-art NLP research to low-resource Indian languages. Feel free to reach out to us to discuss new ideas!

