Your typical day as an NLP Scientist.
You have a transformer model. You want to make it better by fine-tuning.
You plan the annotation task for it.
Given the available annotator bandwidth, it looks like a three-week effort.
The annotations arrive.
You cannot stop yourself from doing the needful.
You make the train-test split.
You evaluate the current model on the eval dataset: 85% F1.
You feel relaxed. 85% is not a high bar like 90%. You just need to beat it.
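In code, that step is roughly the sketch below, assuming an sklearn-style model with a .predict() method; all names are illustrative, not your actual pipeline.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def split_and_evaluate(texts, labels, model, test_size=0.2):
    """Split the fresh annotations, then score a model on the eval half."""
    X_train, X_eval, y_train, y_eval = train_test_split(
        texts, labels, test_size=test_size, random_state=42, stratify=labels
    )
    # Assumes an sklearn-style .predict(); swap in your transformer's
    # inference call as needed.
    f1 = f1_score(y_eval, model.predict(X_eval), average="macro")
    return (X_train, y_train), (X_eval, y_eval), f1
```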
Seems doable.
You train the model.
You get only 86% F1.
You are disappointed!
What happened here? You start looking at the mistakes.
You find three types of mistakes.
Either the annotation is wrong, the model is making silly mistakes, or the model is genuinely struggling with complex cases.
You are baffled.
You manually annotate 500 samples from the eval dataset.
You see a lot of annotation errors and silly model mistakes.
You evaluate the current model and the new model on these 500 samples.
You get 81% and 79% F1, respectively.
You scratch your head.
Did you make it worse???
Bad things happen to good models.
You are working under a time crunch.
You thought the project would soon be over.
But now you have a twist in the story.
Reannotating everything will take a lot of time.
How do you minimise the time?
You think of getting the eval dataset right.
You update the annotation guidelines, discuss the errors with the annotators, and get a second annotator to label the same 500 samples so you can measure inter-annotator agreement.
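A standard way to quantify that agreement is Cohen's kappa; a minimal sketch with sklearn, where the thresholds in the docstring are common rules of thumb rather than hard cutoffs:

```python
from sklearn.metrics import cohen_kappa_score

def annotator_agreement(labels_a, labels_b):
    """Cohen's kappa between two annotators labelling the same samples.
    Rule of thumb: above ~0.8 is strong agreement; below ~0.6 the
    guidelines are still ambiguous and the eval set cannot be trusted."""
    return cohen_kappa_score(labels_a, labels_b)
```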
You wait two days for the eval annotations.
You pull the corrected dataset and evaluate the old and new models on it again.
You get 88% and 81% F1, respectively.
Bad things happen to good models.
You are hit by a cocktail of emotions. On one side, you just saved your ass by not deploying a worse model to production. On the other, all your effort has gone to waste.
You stand up, walk around, check Twitter. Frustrated, you are on the verge of posting “Bad things happen to good models”, but you stop. Ranting on Twitter is not going to help. Nirant will jump in with something funny. You will end up in a meme thread of jokes. You will waste an hour accomplishing nothing.
You have to share the news with the team. You fucked up. The team fucked up.
You need to re-annotate the training data.
What is the lesson here?
You ponder: can you get a better model from erroneous data?
How much label error can the data carry before it stops improving the model?
Is it 10%, 20% or 30%?
What happens if your model is 90% accurate, but the data has 20% errors?
What if your model is 90% accurate, but the data has 5% errors?
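A back-of-the-envelope answer for a binary task, assuming label errors are random flips independent of the model's mistakes: measured accuracy ≈ a(1 − e) + (1 − a)e, where a is the true model accuracy and e is the label error rate. At a = 90% and e = 20% you would measure about 74%; at e = 5%, about 86%. A quick simulation to sanity-check the arithmetic (everything here is a hypothetical sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def measured_accuracy(true_acc, label_error_rate, n=100_000):
    """Binary task: the model is right with probability `true_acc`,
    and each eval label is flipped with probability `label_error_rate`."""
    truth = rng.integers(0, 2, n)
    # Model predictions: correct with probability true_acc.
    model_correct = rng.random(n) < true_acc
    preds = np.where(model_correct, truth, 1 - truth)
    # Observed labels: flipped with probability label_error_rate.
    flipped = rng.random(n) < label_error_rate
    observed = np.where(flipped, 1 - truth, truth)
    return (preds == observed).mean()

print(measured_accuracy(0.90, 0.20))  # ~0.74, though the model is 90% accurate
print(measured_accuracy(0.90, 0.05))  # ~0.86
```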
Is there research on this topic?
You do find one: “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks” (Northcutt et al., 2021).
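That paper comes from the team behind the cleanlab library, which ranks samples by how likely their label is wrong, so you can re-annotate the most suspicious ones first instead of everything. A sketch assuming the cleanlab 2.x API and out-of-sample predicted probabilities from your model:

```python
from cleanlab.filter import find_label_issues

def suspect_label_indices(labels, pred_probs):
    """Indices of eval samples whose labels are most likely wrong,
    worst offenders first. `pred_probs` must be out-of-sample
    predicted probabilities, shape (n_samples, n_classes)."""
    return find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
```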
You are tired. The day is over.
A battle for another day.
Bad Things Always Happen To Good Models.
Come join Maxpool - A Data Science community to discuss real ML problems!