Risk-takers | 100x engineers | Astronauts | From Unsplash
From Google’s 43 rules of ML
“Rule #4: Keep the first model simple and get the infrastructure right.”
With strong opinions on this rule floating around, I feel it’s a good time to spark a discussion about it. Otherwise, the most popular opinions will simply drown out other ideas.
Note: I work in NLP, so these opinions are focused on NLP applications. I cannot guarantee that they hold for tabular and image problems.
Problems with simple models
A simple model suits companies that want good-enough automation. If you want to truly win against the competition or delight your customers, you need to learn to work with complex models (when they make technical and financial sense). But don’t choose a complex model for marginal gains of around 1%. Google put BERT into production in Search in October 2019 because it expected the model to improve its understanding of about 1 in 10 English-language searches in the US.
A simple system can require heavy feature engineering and hacks. Feature engineering takes time when done manually. I will share my experience at the end.
Feature engineering analogy →
How deep learning changed Computer Vision and how the new language models put an end to statistical NLP.
Automated feature engineering libraries such as featuretools and tsfresh are also on the rise, so we no longer need to rely solely on domain experts for features.
If the 2nd iteration of the system is too different from the 1st, you have to rewrite the code. You end up wishing you had gone slowly, experimented more and built the 2nd one directly.
If the 1st iteration has good but not great metrics, it can put the whole project at risk of the question “what is the value addition of ML?”. Whenever something is automated, someone feels insecure and will look for numbers to attack.
Then there is the blackbox objection to complex models. First of all, there is no such thing as a true blackbox, because there are tools to understand any model.
Secondly, blackboxes are not necessarily bad if you test them heavily. But testing gets neglected because it happens at the end of a project, when we are tired, impatient to go to production, short of test data, or short of resources to create test data.
Mental models at play 🐗
Thinking fast and slow
We often resort to short-circuit fast thinking.
“Thinking, Fast and Slow is a best-selling book published in 2011 by Nobel Memorial Prize in Economic Sciences laureate Daniel Kahneman. The central thesis is a dichotomy between two modes of thought: “System 1” is fast, instinctive and emotional; “System 2” is slower, more deliberative, and more logical.”
Nowadays, when work pressure is high, we don’t want to think slowly, so we fall back directly on the old advice of making a simple ML system first.
Why argue when it actually makes your work easy? 😆
Goodhart’s law
“When a measure becomes a target, it ceases to be a good measure.”
If your seniors hold the ‘simple model’ philosophy, you will go along with it to avoid conflict and build a good rapport.
I see many people argue against putting transformer NLP models in production because they are supposedly heavier and more of a blackbox. Neither is necessarily true.
TinyBERT has about 15M parameters, while the AWD-LSTM used in ULMFiT has 24M. Yet the AWD-LSTM is widely used because it is easy to work with in the fastai library.
You can also interpret transformer models using Captum, the interpretability library for PyTorch.
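To put those parameter counts in perspective, here is a minimal back-of-the-envelope sketch. It assumes the counts quoted above, counts only the weights (no activations, optimizer state or tokenizer files), and the function name `weight_megabytes` is made up for this illustration.

```python
# Rough size of the model weights alone, from parameter counts.
# Assumption: every parameter is stored in a single numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Approximate parameter counts as quoted in the text above.
MODELS = {"TinyBERT": 15_000_000, "AWD-LSTM (ULMFiT)": 24_000_000}


def weight_megabytes(num_params: int, dtype: str = "fp32") -> float:
    """Approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


if __name__ == "__main__":
    for name, params in MODELS.items():
        sizes = ", ".join(
            f"{dtype}: {weight_megabytes(params, dtype):.0f} MB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name}: {sizes}")
```

Weight size is only one part of serving cost. Latency also depends on architecture and sequence length, so treat this as a sanity check rather than a benchmark.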
Conformity kills creativity
The philosophy of the group is a function of the majority.
You might have seen this while choosing a restaurant. We go with the majority, because if we took the risk of suggesting a new restaurant and pushing people to eat there, their high expectations would put immense stress on us.
Why not just chill instead of taking the risk? There is no clear-cut reward for taking a risk unless it’s tied to our individual SMART* goals!
Have you ever seen ‘take risk’ in anyone’s SMART goals? 😂
Note: SMART is a common goal-setting philosophy used in organisations.
High attrition side effects 💩
Data scientists change jobs very frequently and have difficulty building trust at a new place. Hence, they try to reduce the risk of failure on their first project, and that risk grows with the complexity of the project.
Your new ideas will hold value only when people trust you and you have a history of execution. People tend to reject the ideas of a new person talking ‘complex’ when the group is in favour of simple.
To change the group’s thinking, the team needs an established person, with a track record in the current or a previous organisation, who has an appetite for challenging the status quo.
Conformity kills creativity. 😄
My experience 👷
Let me tell you about my recent experience. I just deployed a transformer model to production for semantic search. Earlier we had a non-contextual model with a complex pipeline of processing, external APIs and rules to handle edge cases.
Now our new search system is extremely simple and scalable. It's not only better at understanding queries but also faster due to less reliance on external APIs. In fact, it costs us less now because we don’t have to use paid APIs anymore.
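As a rough illustration of why an embedding-based search can be this simple, here is a minimal sketch of the embed, compare and rank loop. It is not the production system described above. The `embed` function is a toy bag-of-words stand-in so that the example runs with the standard library alone; in a real system it would be replaced by a call to a transformer sentence encoder, which is where the contextual understanding comes from.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption for illustration only. A real system would return a dense
    vector from a transformer model here.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank documents by similarity to the query, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    docs = ["reset my password", "track my order status", "cancel my subscription"]
    for score, doc in search("how do I reset a password", docs):
        print(f"{score:.2f}  {doc}")
```

In practice the document vectors would be computed once and stored, so each query needs a single encoder call plus a similarity lookup, with no rules and no external APIs in the loop.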
Choosing a heavier model saves us current and future troubles 💰
This is what happens with many companies.
Adding a rule to the ML pipeline for an edge case always seems saner than building a new system.
But if these cases keep popping up, you might want to reconsider.
Companies like Google understand this. That’s why they make better language models as they don’t have time to make rules in the age of hyper-personalisation.
Sweet point of complexity 📈
The current problem in data science is the ‘search space’. Everything from extremely simple models to models with 17B parameters is on offer, and many of them are free to use.
In order to find the best model for production, we should ideally evaluate all of them with hyperparameter tuning and then come up with a prescription. But who has time for such experimentation when your competitor is growing fast and you need to put ‘something’ into production ASAP? 😕
To find a solution to this problem, I run experiments in my free time to understand the sweet point of complexity.
ML is quite empirical and my assumptions turn out wrong on a regular basis. Hence it’s very important to keep trying new methodologies and learn from them.
1. Don’t wait until the project starts to begin your research.
2. Be proactive. Experiment and read papers for insights.
3. Set up your own research lab on Colab.
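One way to make the ‘sweet point of complexity’ concrete is a simple selection rule: among the models you have evaluated, pick the smallest one whose score is within a tolerance of the best. The sketch below is only one possible rule, not a prescription. The `Candidate` class, the `pick_sweet_spot` name, the 1% default tolerance and the numbers in the usage example are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float            # validation metric, higher is better
    params_millions: float  # crude proxy for complexity and serving cost


def pick_sweet_spot(candidates: list[Candidate], tolerance: float = 0.01) -> Candidate:
    """Return the smallest model whose score is within `tolerance` of the best."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best_score = max(c.score for c in candidates)
    good_enough = [c for c in candidates if c.score >= best_score - tolerance]
    return min(good_enough, key=lambda c: c.params_millions)


if __name__ == "__main__":
    # Hypothetical results, not real benchmark numbers.
    results = [
        Candidate("small", 0.80, 5),
        Candidate("medium", 0.90, 15),
        Candidate("large", 0.905, 110),
    ]
    print(pick_sweet_spot(results).name)
```

With a tolerance of zero the rule always picks the top scorer. Widening the tolerance trades a little accuracy for a smaller model, which is the same trade-off as refusing a complex model for a marginal 1% gain.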
Takeaway →
Hire engineers who do research to build up a dictionary of solutions, and use their fast thinking to select the solution for production.
Conclusion 💥
My goal is not to convince you to waste time researching to deploy big models. My aim is to throw light on the importance of innovation and thinking from the basics.
Don’t use time constraints as an excuse to do simple modelling. 💩
I work in a startup myself and hence I understand all types of constraints such as time, data, manpower and compute.
Google doesn’t shy away from putting heavy models into production, be it for machine translation or YouTube’s recommendation system. They have always believed in using the state of the art. AI is their strategy. Not a part of their strategy.
If the cost of production doesn’t work out, they push the limits by doing hardware innovation like TPUs.
MapReduce was written by just 2 engineers, Sanjay Ghemawat and Jeff Dean, when they found out that distributed processing had become a major part of all projects at Google.
Jeremy Howard discovered a way to make transfer learning work for NLP while doing his personal experiments and shared it with everyone.
All these efforts took time to develop but had a global impact. Don’t stop your engineers from experimenting and open-sourcing. Let them try complex solutions and take risks. One of them will turn out to be your 100x engineer.
Originally published on Modern NLP