Domain adaptation with limited labelled data is a recurring problem for companies. Here is a collection of the tricks I could find to help you model better.
Use a masked language model to replace words with synonyms
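A sketch of the idea: mask a fraction of tokens and let a masked LM propose in-context replacements. The `fill_mask` callable and the toy dictionary below are stand-ins for a real MLM (e.g. a Hugging Face `fill-mask` pipeline), not the actual API:

```python
import random

def mlm_augment(tokens, fill_mask, mask_rate=0.15, rng=None):
    """Replace a random subset of tokens with masked-LM predictions.

    `fill_mask(tokens, i)` is a stand-in for a masked language model:
    given the token list and a masked position, it returns a replacement.
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < mask_rate:
            out[i] = fill_mask(out, i)
    return out

# Toy predictor standing in for a real MLM; a real one would score
# candidates in context and often return a synonym.
toy_mlm = lambda tokens, i: {"good": "great", "film": "movie"}.get(tokens[i], tokens[i])
print(mlm_augment("a good film overall".split(), toy_mlm, mask_rate=1.0))
# → ['a', 'great', 'movie', 'overall']
```

In practice you would keep `mask_rate` low (around 0.1–0.15) so the augmented sentence stays close to the original.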
If you have a large amount of unlabelled text, you can do data augmentation via textual similarity: retrieve sentences similar to your labelled ones and assign them the same label
GANs and BART are suitable for text augmentation
Prepend the label to the input to do conditional data augmentation
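A minimal sketch of the input format fed to a conditional generator; the separator and label strings below are arbitrary choices for illustration, not a fixed convention:

```python
def to_conditional(label, text, sep=" : "):
    """Format a training example for a conditional generator: the label is
    prepended so the model learns to generate class-consistent text."""
    return f"{label}{sep}{text}"

examples = [("positive", "loved the battery life"),
            ("negative", "screen cracked in a week")]
lm_inputs = [to_conditional(lbl, txt) for lbl, txt in examples]
print(lm_inputs[0])  # → positive : loved the battery life
```

At generation time you prompt with `"positive : "` and keep the continuation as a new positive example.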
Word expansion and contraction (e.g. “do not” ↔ “don't”)
Paraphrase via seq2seq models and syntactic-tree manipulation
“In this paper, we introduce a set of simple yet efficient data augmentation strategies dubbed cutoff, where part of the information within an input sentence is erased to yield its restricted views (during the fine-tuning stage). Notably, this process relies merely on stochastic sampling and thus adds little computational overhead.”
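A rough sketch of token-level cutoff as described in the quote: erase a random contiguous span of tokens to yield a restricted view. The span length, placeholder token, and ratio below are assumptions for illustration:

```python
import random

def token_cutoff(tokens, cutoff_ratio=0.2, rng=None):
    """Token-level cutoff: erase a random contiguous span of tokens
    (replaced by a placeholder) to create a restricted view of the input."""
    rng = rng or random.Random(0)
    n = len(tokens)
    span = max(1, int(n * cutoff_ratio))
    start = rng.randrange(0, n - span + 1)
    return [t if not (start <= i < start + span) else "[PAD]"
            for i, t in enumerate(tokens)]

tokens = "the plot was thin but the acting saved it".split()
print(token_cutoff(tokens, cutoff_ratio=0.3))
```

The paper also describes feature- and span-level variants; this shows only the simplest token-level version.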
Use the transformers zero-shot pipeline to label data; it frames classification as an entailment task and picks the best-scoring label.*
Give a few examples to prime the GPT-3 API and use it to label more data.*
Use k-nearest neighbours to label data.*
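A minimal sketch using scikit-learn with TF-IDF features; the tiny corpus and the 1-NN setting are just for illustration, and in practice you would use better embeddings and more neighbours:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# A handful of labelled examples plus unlabelled text to pseudo-label.
labelled = ["great sound quality", "battery died fast",
            "love this speaker", "charger stopped working"]
labels   = ["pos", "neg", "pos", "neg"]
unlabelled = ["sound is great", "battery stopped working"]

vec = TfidfVectorizer().fit(labelled + unlabelled)
knn = KNeighborsClassifier(n_neighbors=1).fit(vec.transform(labelled), labels)
pseudo = knn.predict(vec.transform(unlabelled))
print(list(pseudo))  # → ['pos', 'neg']
```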
Train a model with the limited data
Make predictions on unlabelled data
Correct the predictions that are wrong, and also review the correct ones with low probability (hard samples)
Train the classifier with the new data*
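The loop above can be sketched like this, assuming a scikit-learn-style classifier; the synthetic data and the 0.8 confidence threshold for flagging hard samples are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on the small labelled pool, predict on the unlabelled pool,
# and flag low-confidence predictions for manual review.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(40, 5))
y_small = (X_small[:, 0] > 0).astype(int)      # toy labels
X_unlab = rng.normal(size=(200, 5))

clf = LogisticRegression().fit(X_small, y_small)
proba = clf.predict_proba(X_unlab).max(axis=1)  # confidence per prediction
needs_review = proba < 0.8      # hard samples: correct these by hand
auto_labelled = ~needs_review   # confident predictions kept as-is
print(needs_review.sum(), "to review,", auto_labelled.sum(), "auto-labelled")
```

After the manual pass, both pools are merged back into the training set and the classifier is retrained.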
“First, we show that the debiasing omission in BERTAdam is the main cause of degenerate models on small datasets commonly observed in previous work.
Second, we observe the top layers of the pre-trained BERT provide a detrimental initialization for fine-tuning and delay learning. Simply re-initializing top layers not only speeds up learning but also leads to better model performance.
Third, we demonstrate that the common one-size-fits-all three-epochs practice for BERT fine-tuning is sub-optimal and allocating more training time can stabilize fine-tuning.
Finally, we revisit several methods proposed for stabilizing BERT fine-tuning and observe that their positive effects are reduced with the debiased ADAM.”
“Use small learning rates combined with bias correction to avoid vanishing gradients early in training.
Increase the number of iterations considerably and train to (almost) zero training loss while making use of early stopping.”
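To see why the bias correction matters, here is the size of the very first Adam update for a single parameter, with and without correction (a pure-Python sketch; the gradient value is arbitrary):

```python
def adam_first_step(grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                    bias_correction=True):
    """Size of the very first Adam update for a single parameter."""
    m = (1 - b1) * grad          # first-moment estimate at t = 1
    v = (1 - b2) * grad ** 2     # second-moment estimate at t = 1
    if bias_correction:          # what debiased Adam / AdamW does
        m /= (1 - b1 ** 1)
        v /= (1 - b2 ** 1)
    return lr * m / (v ** 0.5 + eps)

corrected = adam_first_step(0.5)                           # ~ lr
uncorrected = adam_first_step(0.5, bias_correction=False)  # ~ 3.16x lr
print(corrected, uncorrected)
```

Without the correction the first updates are roughly `sqrt((1 - b2)) / (1 - b1)` times too large relative to the corrected ones, which is exactly the early-training instability the quote describes.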
SimCLRv2 works great with low resource image classification. Similarly, it can also help in low resource NLP.
“We obtain strong improvements on few-shot learning settings (20, 100, 1000 labeled examples), leading up to 10.7 points improvement for 20 labeled examples.
Models trained with SCL are not only robust to the noise in the training data, but also generalize better to related tasks with limited labeled data.”
Since subword models capture more information and are robust to OOV words, fastText can also be used instead of GloVe. Adding features from multilingual transformer models can also help, but that gets compute-heavy at inference.
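The OOV robustness comes from character n-gram features. A sketch of fastText-style n-gram extraction (the boundary markers and the 3–5 n-gram range follow fastText's defaults; the example words are arbitrary):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """fastText-style subword features: character n-grams of the word
    padded with boundary markers, so OOV words still share features
    with words seen during training."""
    padded = f"<{word}>"
    return {padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

# An OOV misspelling still overlaps heavily with the known word,
# so it gets a sensible vector instead of an unknown-word token.
seen, oov = char_ngrams("transform"), char_ngrams("transfrom")
print(len(seen & oov), "shared n-grams")
```

fastText averages the vectors of these n-grams, so the misspelling lands near the correct word in embedding space.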
There are a lot of techniques to get better results in a low resource setting. Choose those with high impact and low complexity. Let me know in the comments if you have a trick.
If you found this useful, do share and let others know.
*This data will not be perfect; give it a lower sample weight during training, or train with label smoothing.
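Label smoothing can be applied to the noisy targets like this (`eps=0.1` is a common but arbitrary choice):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: soften one-hot targets so the model is not pushed
    to full confidence on noisy, machine-generated labels."""
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]

print(smooth_labels([1.0, 0.0, 0.0]))
# → [0.9333..., 0.0333..., 0.0333...]
```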
Come join Maxpool - A Data Science community to discuss real ML problems!