Modeling Tricks For Low Resource NLP

NLP for startup engineers

Feb 14, 2021

Domain adaptation with limited labelled data is a recurring problem for companies. Here is a collection of the tricks I could find to help you model better in that setting.

Data augmentation

Connor’s review video

Use Masked LM to replace with synonyms
Data augmentation can be done with textual similarity if you have a large amount of unlabelled text
GANs and BART are suitable for text augmentation
Prepend label at input to do conditional data augmentation
Conditional BERT Contextual Augmentation
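As a rough illustration of the masked-LM idea above, here is a minimal sketch that masks one random word and swaps in a model suggestion. The function name `mlm_augment` and its parameters are my own, not from the videos or papers. It assumes `fill_mask` behaves like the transformers fill-mask pipeline for a single mask (a list of dicts with a `token_str` key). Prepending the label is only a stand-in for conditional augmentation: it is unlikely to help unless the masked LM has been fine-tuned on label-prefixed text.

```python
import random

def mlm_augment(text, fill_mask, mask_token="[MASK]", label=None, rng=None):
    """Replace one random word with a masked-LM suggestion (hypothetical helper).

    fill_mask: callable taking a masked string and returning a list of dicts
               that each carry a "token_str" key.
    label:     optional class label prepended to the masked input.
    """
    rng = rng or random.Random()
    words = text.split()
    if not words:
        return text
    idx = rng.randrange(len(words))
    original = words[idx]
    masked = " ".join(words[:idx] + [mask_token] + words[idx + 1:])
    if label is not None:
        masked = f"{label} : {masked}"  # assumed label-prefix format
    for candidate in fill_mask(masked):
        token = candidate["token_str"].strip()
        # Skip empty strings, the original word, and WordPiece fragments.
        if token and token.lower() != original.lower() and not token.startswith("##"):
            words[idx] = token
            return " ".join(words)
    return text  # no usable suggestion, keep the input unchanged

# Possible wiring with a real model (downloads weights, so left commented out):
# from transformers import pipeline
# fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# print(mlm_augment("the movie was good", fill_mask, label="positive"))
```

Passing the model in as a callable keeps the helper easy to try with a stub before downloading anything.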

Prakhar’s review video

Removal
- Character removal
- Word removal
- Span removal
Word replacement
- WordNet
- embeddings
- custom embeddings
Word expansion and contractions
Swapping
- adjacent word
- adjacent sentence
Paraphrase via seq2seq and syntactic trees
Back-translation
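A few of the operations in the list above (word removal, span removal, adjacent word swap) need nothing beyond the standard library. The sketch below is one possible way to write them; the function names, the `max_span` default and the whitespace tokenisation are assumptions of mine, not something taken from the video.

```python
import random

def remove_random_word(words, rng):
    """Drop a single randomly chosen word."""
    if len(words) < 2:
        return list(words)
    idx = rng.randrange(len(words))
    return words[:idx] + words[idx + 1:]

def remove_random_span(words, rng, max_span=3):
    """Drop a contiguous span of 1..max_span words, always leaving at least one."""
    if len(words) < 2:
        return list(words)
    span = rng.randint(1, min(max_span, len(words) - 1))
    start = rng.randrange(len(words) - span + 1)
    return words[:start] + words[start + span:]

def swap_adjacent_words(words, rng):
    """Swap one randomly chosen pair of neighbouring words."""
    if len(words) < 2:
        return list(words)
    idx = rng.randrange(len(words) - 1)
    swapped = list(words)
    swapped[idx], swapped[idx + 1] = swapped[idx + 1], swapped[idx]
    return swapped

def augment(text, n=3, seed=0):
    """Return n augmented copies, each produced by one randomly picked operation."""
    rng = random.Random(seed)
    ops = [remove_random_word, remove_random_span, swap_adjacent_words]
    words = text.split()
    return [" ".join(rng.choice(ops)(words, rng)) for _ in range(n)]

print(augment("the delivery arrived two days late", n=3))
```

Whether these edits preserve the label depends on the task, so it is worth eyeballing a sample of the output before training on it.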

Text Data Augmentation Made Simple By Leveraging NLP

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

“In this paper, we introduce a set of simple yet efficient data augmentation strategies dubbed cutoff, where part of the information within an input sentence is erased to yield its restricted views (during the fine-tuning stage). Notably, this process relies merely on stochastic sampling and thus adds little computational overhead.”
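To make the "erase part of the input" idea concrete, here is a small sketch of three erasing variants applied to an embedding matrix stored as a list of rows (one row per token). Only the erasing step is shown; how the restricted views are used during fine-tuning is left out. The function names and the `ratio` parameter are my assumptions for illustration.

```python
import random

def token_cutoff(embeddings, ratio, rng):
    """Zero the rows of randomly chosen tokens."""
    n = len(embeddings)
    drop = set(rng.sample(range(n), int(n * ratio)))
    return [[0.0] * len(row) if i in drop else list(row)
            for i, row in enumerate(embeddings)]

def feature_cutoff(embeddings, ratio, rng):
    """Zero randomly chosen embedding dimensions across all tokens."""
    dim = len(embeddings[0])
    drop = set(rng.sample(range(dim), int(dim * ratio)))
    return [[0.0 if j in drop else v for j, v in enumerate(row)]
            for row in embeddings]

def span_cutoff(embeddings, ratio, rng):
    """Zero one contiguous block of token rows."""
    n = len(embeddings)
    k = int(n * ratio)
    start = rng.randrange(n - k + 1)
    return [[0.0] * len(row) if start <= i < start + k else list(row)
            for i, row in enumerate(embeddings)]

rng = random.Random(0)
toy = [[1.0] * 4 for _ in range(10)]  # 10 tokens, 4 dimensions
view = span_cutoff(toy, ratio=0.3, rng=rng)
print([int(sum(row) == 0) for row in view])  # 1 marks an erased token
```

Because each view is just random sampling plus zeroing, it adds very little compute on top of the forward pass.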

Zero shot classification

Use the transformers zero-shot classification pipeline to label data. It treats classification as an entailment task and picks the candidate label the model finds most entailed by the text. You can read how it works here.
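Here is a minimal sketch of using the pipeline as a labeller. The helper `zero_shot_label` and its `min_score` threshold are my own additions: low-confidence predictions are returned as `None` so they can be sent for manual review instead of being trusted. It relies on the pipeline returning a dict with `labels` and `scores` sorted from best to worst for a single input string.

```python
def zero_shot_label(texts, candidate_labels, classifier, min_score=0.5):
    """Label texts with a zero-shot classifier (hypothetical helper).

    Returns (text, label_or_None, score) tuples; label is None when the top
    score falls below min_score.
    """
    labeled = []
    for text in texts:
        result = classifier(text, candidate_labels)
        best_label, best_score = result["labels"][0], result["scores"][0]
        labeled.append(
            (text, best_label if best_score >= min_score else None, best_score)
        )
    return labeled

def build_classifier():
    """Create the real pipeline; imported lazily because it downloads a model."""
    from transformers import pipeline
    return pipeline("zero-shot-classification")

# Example wiring (left commented out because of the model download):
# classifier = build_classifier()
# rows = zero_shot_label(["My card was charged twice"],
#                        ["billing", "shipping", "account"], classifier)
```

The threshold is worth tuning on a small hand-labelled sample, since zero-shot scores are not calibrated for your domain.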

GPT3

Give the GPT-3 API a few labelled examples as a prompt and use it to label more of your dataset.
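The priming step is mostly prompt construction, so the sketch below only builds the few-shot prompt and parses the completion. The `Text:` / `Label:` layout and both function names are assumptions of mine, and the actual API call is deliberately left out since it depends on your client and account.

```python
def build_few_shot_prompt(examples, query, labels):
    """Build a few-shot classification prompt (assumed Text/Label layout).

    examples: list of (text, label) pairs used to prime the model.
    """
    lines = [f"Classify each text as one of: {', '.join(labels)}.", ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)

def parse_label(completion, labels):
    """Map the first line of a completion back to a known label, else None."""
    stripped = completion.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].strip().lower()
    for label in labels:
        if first_line == label.lower():
            return label
    return None

labels = ["positive", "negative"]
prompt = build_few_shot_prompt(
    [("Loved the support team", "positive"), ("App keeps crashing", "negative")],
    "Checkout was quick and painless",
    labels,
)
print(prompt)
```

Rejecting completions that do not match a known label keeps free-form model output from leaking into the dataset.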

kNN

Use k-nearest neighbours over sentence embeddings to propagate labels from your labelled examples to the unlabelled dataset.
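A minimal sketch of kNN labelling, assuming each text has already been turned into a vector by some encoder (which encoder is up to you and not covered here). Cosine similarity and majority voting are my choices for the illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_label(query_vec, labeled, k=3):
    """Return the majority label among the k most similar labelled vectors.

    labeled: list of (vector, label) pairs.
    """
    neighbours = sorted(
        labeled, key=lambda item: cosine(query_vec, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-d "embeddings" just to show the call shape.
labeled = [([1.0, 0.0], "refund"), ([0.9, 0.1], "refund"),
           ([0.0, 1.0], "login"), ([0.1, 0.9], "login")]
print(knn_label([1.0, 0.05], labeled, k=3))
```

For more than a few thousand vectors an approximate nearest-neighbour index is usually a better fit than this brute-force sort.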

Weak supervision

Train a model with the limited labelled data
Use it to make predictions on unlabelled data
Correct the predictions that are wrong, and also keep those that are correct but have low probability (hard samples)
Retrain the classifier with the new data
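The selection step in this loop can be sketched as follows, assuming a human has already checked the model's predictions. The function name, the tuple layout and the `low_conf` threshold are hypothetical choices of mine.

```python
def select_for_training(predictions, human_labels, low_conf=0.7):
    """Pick reviewed samples worth adding to the training set.

    predictions:  list of (text, predicted_label, probability).
    human_labels: dict mapping text -> label confirmed by an annotator.
    Keeps samples the model got wrong, plus correct ones it was unsure about.
    """
    new_samples = []
    for text, predicted, prob in predictions:
        gold = human_labels[text]
        is_wrong = predicted != gold
        is_hard = prob < low_conf
        if is_wrong or is_hard:
            new_samples.append((text, gold))
    return new_samples

preds = [
    ("refund not received", "billing", 0.95),   # correct and confident: skipped
    ("cannot reset password", "billing", 0.80), # wrong: corrected and kept
    ("invoice has wrong VAT", "billing", 0.55), # correct but unsure: kept
]
gold = {
    "refund not received": "billing",
    "cannot reset password": "account",
    "invoice has wrong VAT": "billing",
}
print(select_for_training(preds, gold))
```

Skipping the confident, correct predictions keeps annotation effort focused on the samples that are most likely to change the model.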

Modifying optimizer/model/training

Revisiting Few-sample BERT Fine-tuning

“First, we show that the debiasing omission in BERTAdam is the main cause of degenerate models on small datasets commonly observed in previous work.

Second, we observe the top layers of the pre-trained BERT provide a detrimental initialization for fine-tuning and delay learning. Simply re-initializing top layers not only speeds up learning but also leads to better model performance.

Third, we demonstrate that the common one-size-fits-all three-epochs practice for BERT fine-tuning is sub-optimal and allocating more training time can stabilize fine-tuning.

Finally, we revisit several methods proposed for stabilizing BERT fine-tuning and observe that their positive effects are reduced with the debiased ADAM.”
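To see why the missing bias correction matters, here is a scalar toy version of the Adam update with and without it. This is my own illustration, not code from the paper: it applies the standard Adam moment estimates to a single parameter with a constant gradient and compares the size of the early steps.

```python
import math

def adam_updates(grads, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-8,
                 bias_correction=True):
    """Return the step size Adam would take for each gradient in `grads`."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        if bias_correction:
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            m_hat, v_hat = m, v                # the omission discussed above
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

grads = [1.0] * 5
print("with correction   :", [round(u, 3) for u in adam_updates(grads)])
print("without correction:", [round(u, 3) for u in
                              adam_updates(grads, bias_correction=False)])
```

With the default betas, the very first uncorrected step is roughly three times larger than the corrected one, which hurts most when there are only a few hundred update steps in total.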

On the Stability of Fine-tuning BERT

“Use small learning rates combined with bias correction to avoid vanishing gradients early in training.

Increase the number of iterations considerably and train to (almost) zero training loss while making use of early stopping.”
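If you train for many more iterations as suggested, you need some way to keep the best checkpoint. A patience-based tracker is one common option; the class below is a generic sketch with names and defaults of my own, not something prescribed by the paper.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # minimum drop that counts as improvement
        self.best = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=3)
for step, loss in enumerate([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]):
    if stopper.update(step, loss):
        print(f"stop at eval {step}, best was eval {stopper.best_step}")
        break
```

In a real run you would save a checkpoint whenever `best_step` changes and reload it once the tracker says stop.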

Contrastive learning

SimCLRv2 works well for low-resource image classification, and contrastive learning can help in low-resource NLP in a similar way.

Supervised Contrastive Learning For Pre-trained Language Model Fine-tuning

“We obtain strong improvements on few-shot learning settings (20, 100, 1000 labeled examples), leading up to 10.7 points improvement for 20 labeled examples.

Models trained with SCL are not only robust to the noise in the training data, but also generalize better to related tasks with limited labeled data.”
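For intuition, here is a pure-Python sketch of a supervised contrastive loss over a batch of sentence embeddings: examples with the same label are pulled together and all other examples are pushed apart. The averaging over anchors and the default temperature are my assumptions, and details may differ from the paper's formulation. In practice this term would be combined with the usual cross-entropy loss.

```python
import math

def supervised_contrastive_loss(embeddings, labels, temperature=0.3):
    """Average supervised contrastive loss over anchors that have a positive."""
    # L2-normalise so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n) if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

batch = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(supervised_contrastive_loss(batch, [0, 0, 1, 1]))  # classes well separated
print(supervised_contrastive_loss(batch, [0, 1, 0, 1]))  # classes mixed up
```

The first call should print a much smaller loss than the second, since there the embeddings already cluster by label.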

Custom architecture

Combining BERT with Static Word Embeddings for Categorizing Social Media

Since subword models capture more information and are robust to out-of-vocabulary (OOV) words, fastText can be used in place of GloVe in the approach above. Adding features from multilingual transformer models can also help, but it makes inference compute-heavy.
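At its simplest, combining the two kinds of embeddings means concatenating a contextual sentence vector with a pooled static vector before the classifier. The sketch below uses mean pooling over static word vectors, which is my simplification and may differ from the architecture in the paper; `static_vectors` is a plain dict standing in for GloVe or fastText lookups.

```python
def mean_static_vector(tokens, static_vectors, dim):
    """Average the static vectors of known tokens; unknown tokens are skipped."""
    found = [static_vectors[t] for t in tokens if t in static_vectors]
    if not found:
        return [0.0] * dim
    return [sum(col) / len(found) for col in zip(*found)]

def combine_features(contextual_vec, tokens, static_vectors, dim):
    """Concatenate a contextual sentence vector with the pooled static vector."""
    return list(contextual_vec) + mean_static_vector(tokens, static_vectors, dim)

# Toy values only; real vectors would come from BERT and GloVe/fastText.
static_vectors = {"late": [0.2, 0.4], "delivery": [0.6, 0.0]}
features = combine_features([0.1, 0.9, 0.3], ["delivery", "was", "late"],
                            static_vectors, dim=2)
print(features)
```

The classifier on top then sees both views of the text, at the cost of a slightly wider input layer.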

Bootstrap embeddings

Tricks For Domain Adaptation

Conclusion

There are a lot of techniques to get better results in a low resource setting. Choose those with high impact and low complexity. Let me know in the comments if you have a trick.

If you found this useful, do share and let others know.