Modeling Tricks For Low Resource NLP

NLP for startup engineers

Domain adaptation with limited labelled data is a recurring problem for companies. Here is a collection of the tricks I could find to help you model better in that setting.

Data augmentation

Connor’s review video

Prakhar’s review video

  • Removal

    • Character removal

    • Word removal

    • Span removal

  • Word replacement

    • WordNet

    • embeddings

    • custom embeddings

  • Word expansion and contractions

  • Swapping

    • adjacent word

    • adjacent sentence

  • Paraphrase via seq2seq and syntactic trees

  • Back-translation
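
The simplest items in the list above (word removal, span removal and adjacent-word swapping) need nothing beyond the standard library. Here is a minimal sketch; the function names, the whitespace tokenisation and the default rates are my own illustrative choices, not taken from either review video.

```python
import random

def remove_words(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def remove_span(tokens, span_len, rng):
    """Delete one contiguous span of span_len tokens (unchanged if the text is too short)."""
    if len(tokens) <= span_len:
        return list(tokens)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + tokens[start + span_len:]

def swap_adjacent(tokens, rng):
    """Swap one randomly chosen pair of neighbouring tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def augment(text, rng, p_remove=0.1, span_len=2):
    """Return one augmented copy of text per strategy."""
    tokens = text.split()  # assumption: whitespace tokenisation is good enough
    return [
        " ".join(remove_words(tokens, p_remove, rng)),
        " ".join(remove_span(tokens, span_len, rng)),
        " ".join(swap_adjacent(tokens, rng)),
    ]
```

Passing an explicit random.Random(seed) as rng keeps the augmented set reproducible between runs.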

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

“In this paper, we introduce a set of simple yet efficient data augmentation strategies dubbed cutoff, where part of the information within an input sentence is erased to yield its restricted views (during the fine-tuning stage). Notably, this process relies merely on stochastic sampling and thus adds little computational overhead.”
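
To make the idea of erasing part of the input concrete, here is a rough sketch that works on a token-embedding matrix stored as a list of rows. It assumes erasure means zeroing out a contiguous span of tokens or a subset of embedding dimensions; the function names and the ratio parameter are mine, and this is not the paper's implementation.

```python
import random

def span_cutoff(embeddings, ratio, rng):
    """Zero out one contiguous span of token rows covering about ratio of the sequence."""
    if not embeddings:
        return []
    n = len(embeddings)
    span = int(n * ratio)
    start = rng.randrange(n - span + 1)
    return [
        [0.0] * len(row) if start <= i < start + span else list(row)
        for i, row in enumerate(embeddings)
    ]

def feature_cutoff(embeddings, ratio, rng):
    """Zero out the same random subset of embedding dimensions for every token."""
    if not embeddings:
        return []
    dim = len(embeddings[0])
    dropped = set(rng.sample(range(dim), int(dim * ratio)))
    return [
        [0.0 if j in dropped else value for j, value in enumerate(row)]
        for row in embeddings
    ]
```

Each call yields a different restricted view of the same sentence, so the views can be sampled fresh for every training step.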

Zero shot classification

Use the transformers zero-shot classification pipeline to label data. It treats labelling as an entailment task and picks the label the model finds most strongly entailed by the text. You can read how it works here.*
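
A minimal sketch of pseudo-labelling with that pipeline. The model name facebook/bart-large-mnli, the 0.7 score cut-off and both function names are assumptions for illustration; pick a model and threshold that suit your domain.

```python
def build_zero_shot_classifier(model_name="facebook/bart-large-mnli"):
    """Create the Hugging Face zero-shot pipeline (downloads the model on first use)."""
    # Imported lazily so pseudo_label can be used and tested without transformers.
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)

def pseudo_label(classifier, texts, candidate_labels, min_score=0.7):
    """Return (text, label, score) for predictions at or above min_score."""
    results = classifier(texts, candidate_labels=candidate_labels)
    labelled = []
    for res in results:
        # For each text the pipeline returns labels and scores sorted best first.
        label, score = res["labels"][0], res["scores"][0]
        if score >= min_score:
            labelled.append((res["sequence"], label, score))
    return labelled
```

Typical usage would be pseudo_label(build_zero_shot_classifier(), texts, ["billing", "shipping", "other"]), where the label names are placeholders. Texts that fall below the threshold are left unlabelled rather than guessed.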


Prime the GPT-3 API with a few labelled examples and use it to label more of the dataset.*
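
The priming step is mostly string formatting. The sketch below only builds the few-shot prompt and parses the completion; the prompt layout and function names are my own, and the actual API call is left out on purpose because it depends on your client library and account.

```python
def build_few_shot_prompt(examples, text, labels):
    """Format labelled (text, label) examples followed by the text to classify."""
    lines = ["Classify each text as one of: " + ", ".join(labels) + ".", ""]
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Label: {example_label}", ""]
    # The prompt ends on an open "Label:" so the model completes it.
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)

def parse_label(completion, labels):
    """Map a raw completion to a known label, or None if it matches none."""
    answer = completion.strip().lower()
    for label in labels:
        if answer.startswith(label.lower()):
            return label
    return None
```

Returning None for completions that match no known label makes it easy to drop or hand-review those rows instead of training on them.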


Use k-nearest neighbours to propagate labels from the examples you have already labelled to the rest of the dataset.*
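
A small sketch of nearest-neighbour labelling. It assumes each example has already been turned into a vector (for instance a sentence embedding) and uses cosine similarity with a majority vote; both choices, and the function names, are mine.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_label(query, labelled, k=3):
    """Label query by majority vote over its k most similar labelled vectors.

    labelled is a non-empty list of (vector, label) pairs.
    Returns (label, share of the k votes that agreed).
    """
    neighbours = sorted(
        labelled, key=lambda item: cosine(query, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)
```

The vote share works as a rough confidence score: examples whose neighbours disagree are good candidates for manual review.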

Weak supervision

  1. Train a model on the limited labelled data you have

  2. Use it to predict labels for unlabelled data

  3. Correct the predictions that are wrong, and review those that are correct but have low probability (hard samples)

  4. Retrain the classifier with the new data*
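
The review step of this loop can be sketched as follows. It only covers picking low-confidence predictions for a human to check and merging the corrections back in; the 0.6 threshold and the function names are illustrative assumptions.

```python
def select_for_review(probabilities, threshold=0.6):
    """Return indices whose top class probability is below threshold, least confident first."""
    confidence = [(max(p), i) for i, p in enumerate(probabilities)]
    return [i for c, i in sorted(confidence) if c < threshold]

def merge_labels(predicted, corrections):
    """Overlay human corrections (index -> label) on the model's predicted labels."""
    return [corrections.get(i, label) for i, label in enumerate(predicted)]
```

For example, with probabilities [[0.9, 0.1], [0.55, 0.45], [0.5, 0.5]] the review queue is [2, 1], and the merged labels are what the next training round uses.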

Modifying optimizer/model/training

Revisiting Few-sample BERT Fine-tuning

“First, we show that the debiasing omission in BERTAdam is the main cause of degenerate models on small datasets commonly observed in previous work.

Second, we observe the top layers of the pre-trained BERT provide a detrimental initialization for fine-tuning and delay learning. Simply re-initializing top layers not only speeds up learning but also leads to better model performance.

Third, we demonstrate that the common one-size-fits-all three-epochs practice for BERT fine-tuning is sub-optimal and allocating more training time can stabilize fine-tuning.

Finally, we revisit several methods proposed for stabilizing BERT fine-tuning and observe that their positive effects are reduced with the debiased ADAM.”
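
To see what the debiasing step changes, here is a scalar version of the standard Adam update with the bias correction switchable. It is a sketch for intuition only, not the paper's code; the hyperparameter defaults are assumptions.

```python
import math

def adam_step(grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6,
              correct_bias=True):
    """One Adam update for a scalar parameter at step t (t starts at 1).

    Returns (update, m, v); the caller subtracts update from the parameter.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    if correct_bias:
        # Debiasing: undo the shrinkage caused by initialising m and v at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

With these defaults the very first update is roughly three times larger without the correction than with it, so skipping debiasing behaves like a much higher learning rate early in training. On a small dataset those early steps are a large share of the whole run.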

On the Stability of Fine-tuning BERT

“Use small learning rates combined with bias correction to avoid vanishing gradients early in training.

Increase the number of iterations considerably and train to (almost) zero training loss while making use of early stopping.”

Contrastive learning

SimCLRv2 works well for low-resource image classification. Contrastive learning can similarly help in low-resource NLP.

Supervised Contrastive Learning For Pre-trained Language Model Fine-tuning

“We obtain strong improvements on few-shot learning settings (20, 100, 1000 labeled examples), leading up to 10.7 points improvement for 20 labeled examples.

Models trained with SCL are not only robust to the noise in the training data, but also generalize better to related tasks with limited labeled data.”
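
For intuition, here is a plain-Python sketch of a supervised contrastive loss over a batch of sentence embeddings: examples with the same label are pulled together and all others are pushed apart. The temperature of 0.3 and the averaging over anchors are my assumptions, and in practice this term would be combined with the usual cross-entropy loss in a tensor library rather than computed with lists.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def supervised_contrastive_loss(embeddings, labels, temperature=0.3):
    """Average supervised contrastive loss over anchors that have a positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label example contributes nothing
        # Temperature-scaled similarity of the anchor to every other example.
        sims = {
            k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(n) if k != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The loss is small when same-label embeddings already sit close together and large when they are mixed with other classes, which is the signal that shapes the encoder during fine-tuning.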

Custom architecture

Combining BERT with Static Word Embeddings for Categorizing Social Media

Since subword models capture more information and are robust to out-of-vocabulary words, fastText can be used here in place of GloVe. Adding features from multilingual transformer models can also help, but it makes inference more compute-heavy.

Bootstrap embeddings

Tricks For Domain Adaptation


There are a lot of techniques to get better results in a low resource setting. Choose those with high impact and low complexity. Let me know in the comments if you have a trick.

If you found this useful, do share and let others know.

Notable work:

*Data labelled this way will be noisy, so give it a lower sample weight during training or train with label smoothing.

Come join Maxpool - A Data Science community to discuss real ML problems!

Connect with me on Medium, Twitter & LinkedIn.