Come join Maxpool - A Data Science community to discuss real ML problems
In the last few years, I have answered many people's technical and career questions.
What I regret now is that the information is scattered across different places, such as LinkedIn and WhatsApp chats. Many of the questions were really good, and I wish I had done this Q&A in a more structured way.
From now on, you can ask me anything over here and I will try to answer sooner or later in the best way possible.
I am comfortable in general data science and NLP but feel free to stretch my limits :P
I also encourage others to share a better answer :)
Question Format
1 line title of the problem
ex. Classification for customer reviews
Explain the problem
Explain the data in terms of x(input) and y(output)
Hi Pratik, I'm doing a project on sentiment analysis using Twitter data. After cleaning, I use VADER to classify the tweets as positive, negative or neutral. Then I did n-gram analysis and word clouds, but I'm still not able to get insights from the analysis. It seems like the word frequencies for negative and positive tweets are the same. I made sure I did the cleaning properly. Do you have any advice?
Hey come to maxpool.club
Hey Pratik, I'm making a recommendation system similar to Quora or Stack Overflow, where they recommend the questions most similar to the one the user asks. I'm using TF-IDF + cosine similarity + linear kernel for this, but the problem comes when my dataset size is around 2 GB and saving the model to pickle gives a memory error. Any suggestions? The data is huge and contains around 300,000-500,000 texts, with each row containing more than a paragraph's worth of text.
Hey, try the hickle library.
Thanks for the reply! I read about it, but isn't that a hacky way? Isn't there a more robust solution? And is the way I am calculating similarity OK?
I checked the chat in the group and I can’t see any solution other than what the guys suggested. Try increasing min_df and reducing the n-gram range so that the vocabulary is smaller and the TF-IDF matrix shrinks with it. That means less memory consumption.
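To see why raising min_df helps, here is a minimal sketch in plain Python of how a document-frequency cut-off shrinks the vocabulary, and with it the number of TF-IDF columns. The `build_vocab` helper and the toy questions are made up for illustration. In scikit-learn the equivalent knob is the `min_df` parameter of `TfidfVectorizer`, where an integer means an absolute document count.

```python
from collections import Counter


def build_vocab(docs, min_df=1):
    """Keep only the terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term is counted once per document, not once per occurrence
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)


# Toy corpus (hypothetical); a real one would have hundreds of thousands of rows.
docs = [
    "how to merge two lists in python",
    "how to sort a list in python",
    "merge two dictionaries in python",
    "what is a segmentation fault",
]

full_vocab = build_vocab(docs, min_df=1)   # every term is kept
small_vocab = build_vocab(docs, min_df=2)  # terms seen in one document are dropped

print(len(full_vocab), len(small_vocab))
```

Each dropped term is one less column in the TF-IDF matrix, so on a large corpus with many rare words the saving can be substantial.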
The n-gram range is already (1, 1), so I cannot reduce it further, but I get the point about decreasing the parameters. One last question: is the approach I am following appropriate for the problem? I am using simpler methods because not much compute power is available and I haven't found any advanced method that works on paragraphs of text and gives similarity in a fraction of a second.
What about using SVD?
SVD on TF-IDF is a good idea! But for encoding text to a fixed-size vector, consider a neural encoder like a transformer. Many of these models only support inputs up to 512 tokens, so you need to choose only the important sentences in each paragraph.
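One rough way to "choose only the important sentences" is to rank sentences and keep the best ones that fit a length budget. The sketch below is only an illustration under loud assumptions: the importance score (sentences with rarer words rank higher) is a made-up heuristic, `select_sentences` is a hypothetical helper, and whitespace-separated words stand in for the model's real tokens.

```python
import re
from collections import Counter


def select_sentences(paragraph, max_tokens=512):
    """Keep the highest-scoring sentences that fit in `max_tokens` words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    counts = Counter(w for s in sentences for w in s.lower().split())

    def score(sentence):
        # Assumed heuristic: average rarity of the words in the sentence.
        words = sentence.lower().split()
        return sum(1.0 / counts[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    kept, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_tokens:
            kept.append(i)
            used += length
    # Restore the original sentence order before joining.
    return " ".join(sentences[i] for i in sorted(kept))


paragraph = (
    "The app crashes on login. The app is nice. "
    "Stack trace shows a null pointer in AuthService."
)
short_text = select_sentences(paragraph, max_tokens=12)
print(short_text)
```

In practice the budget should be counted with the encoder's own tokenizer, since one word can become several tokens.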
Hi Pratik, it's awesome that you are conducting a Q&A. I have a decent knowledge of machine learning but feel that I need to learn a bit more. So I would like your advice on whether I should continue with self-study or plan for a Master's instead.
Also, one more piece of advice: how do I transition from knowing a lot of theory to being able to apply all that theory in practice?
My pleasure! You can find the answer in my earlier response to Shrishti and Nikita.
Yeah, my bad, I didn't see the old comments.
Hello sir. It's great to follow someone like you and learn from you :)
Topic - Stuck on a multiclass classification problem
Details - I am working on a dataset where I am using type-2 fuzzy systems. It's a neural network architecture, but implemented as a mathematical model (not using TensorFlow or Keras).
I have just one output, computed through a formula. When I had a dataset for binary classification, it was great because I applied a sigmoid function to the output and classified a sample as 1 if the sigmoid value was greater than 0.5, else 0. But now I have 3 classes in the dataset. I am blank as to how to proceed, since for 2 classes the easy thing was that the probability of one class would be (1 - the other class).
For 3 classes I'd need to use softmax, but since I have just one formula for the output, how do I do the computations to classify samples into classes now?
If you could please help me, and maybe need to have a look at my model, please contact me at vipulv5247@gmail.com
I have tried many things, but they are not easy to explain here.
Thanks in advance for your help
Hey Vipul, maybe you can try hierarchical classification.
Sigmoid 1 will tell you between class 1 and class 2+3.
If class 2+3 gets more than 50%, you pass the sample through sigmoid 2, which will tell you between class 2 and class 3.
Another way is to fit one-vs-rest models and then take the class with the max probability across all the model outputs.
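Both suggestions above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming you already have the raw (pre-sigmoid) outputs of your formulas: `z1`, `z2` and `scores` are placeholder names, and how those outputs are trained is outside the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hierarchical_predict(z1, z2, threshold=0.5):
    """z1: raw output for 'class 1 vs classes 2+3'.
    z2: raw output for 'class 2 vs class 3' (only used when class 1 loses)."""
    if sigmoid(z1) >= threshold:
        return 1
    return 2 if sigmoid(z2) >= threshold else 3


def one_vs_rest_predict(scores):
    """scores[i]: raw output of the model trained for 'class i+1 vs the rest'."""
    probs = [sigmoid(z) for z in scores]
    # Pick the class whose one-vs-rest model is the most confident.
    return max(range(len(probs)), key=probs.__getitem__) + 1


print(hierarchical_predict(-2.0, 1.0))       # class 1 loses, sigmoid 2 picks class 2
print(one_vs_rest_predict([-1.0, 0.5, 2.0]))  # third model is the most confident
```

The hierarchical version needs two trained outputs and the one-vs-rest version needs three, so either way the single output formula has to be fitted more than once.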
Hi Pratik, great to connect. I am new to Elasticsearch and wanted to understand how I can handle synonyms in my search. How do I get started here? If you could link me to any resources, that would be great.
Hey same here! You need to define a filter in your ES config.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
For generating synonyms, I have explained the recipe in my talk
https://youtu.be/VHm6_uC4vxM?t=920
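As a rough illustration of the filter mentioned above, index settings with a synonym token filter could look like the following, written here as a Python dict that serialises to the JSON body you would send when creating the index. The names `my_synonyms` and `synonym_analyzer` and the synonym pairs are placeholders, not something from the talk.

```python
import json

# Hypothetical filter/analyzer names and synonym pairs, for illustration only.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    # Each entry is a comma-separated group of equivalent terms.
                    "synonyms": ["laptop, notebook", "tv, television"],
                }
            },
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so the synonym rules match regardless of case.
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        }
    }
}

body = json.dumps(index_settings, indent=2)
print(body)
```

For a long list, the filter also accepts a `synonyms_path` pointing to a file instead of an inline `synonyms` list.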
Thanks a lot Pratik. this helps!
Another follow-up question here: do I have to build the mappings manually, or can I use concepts like word embeddings/similarity to generate a synonym file? Is ES capable of adding synonyms itself?
There might be some plugin, but we made it manually. You can write a script to create the file, but you still need to check it for unnecessary words. The first iteration will be messy.
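A script like that can stay very small. The sketch below is one possible shape, not the script we used: it takes candidate synonym groups (from wherever you generate them), drops words on a manual blocklist, and formats the rest as comma-separated lines. `to_synonym_lines`, the groups and the blocklist are all made up for illustration.

```python
def to_synonym_lines(groups, blocklist=frozenset()):
    """Turn candidate synonym groups into 'a, b, c' lines, one group per line."""
    lines = []
    for group in groups:
        terms = []
        for term in group:
            cleaned = term.strip().lower()
            # Skip empty strings, blocked words and duplicates within a group.
            if cleaned and cleaned not in blocklist and cleaned not in terms:
                terms.append(cleaned)
        # A group with fewer than two terms is not a synonym rule.
        if len(terms) >= 2:
            lines.append(", ".join(terms))
    return lines


# Hypothetical candidates; "the" is the kind of unnecessary word to weed out.
candidate_groups = [["TV", "television", "the"], ["laptop", "notebook"], ["phone"]]
lines = to_synonym_lines(candidate_groups, blocklist={"the"})
print("\n".join(lines))
```

The printed lines can be saved to a text file and reviewed by hand before being used, since the blocklist will never catch everything on the first pass.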
Hi Pratik. Thanks for the initiative.
I have two questions:
1. I am not able to relate the Central Limit Theorem (CLT) to my day-to-day activities as a data scientist. Can you please tell me where CLT is used in machine learning?
2. How should I formulate my hypothesis for (a) a multiclass classification problem and (b) a regression problem? For example, with retail data, I am trying to predict the five fastest-selling items. What should my hypothesis be for this?
My pleasure! Sorry, I have no clear answer for either.
Can you recommend a book on natural language processing, assuming the reader has some experience in data science?
Jurafsky’s NLP book is lovely
Hi Pratik, thanks for this initiative.
I want to ask about the assumptions behind machine learning models. Do you consider assumptions when you build a model? For example, in regression there is multicollinearity, etc.
My pleasure! I don’t have a perfect answer to this, but this paper is really helpful:
https://arxiv.org/pdf/1901.10002.pdf
Firstly, I am really glad that you took such an initiative.
Coming to the other part, I am pretty perplexed about a certain decision and would like your suggestion on it:
-- Doing an MBA from a good college in India
-- Pursuing an MS in data science. Also, is it hard to get an internship out there if we are not very familiar with data science?
My pleasure! For MBA, check my answer to Arsh.
Rather than doing a paid course, I would suggest taking a 6-month study break and doing self-learning with MOOCs, blogging and Kaggle.
No need to worry about an internship. During the 6 months, spend 6 hours every day on theory and 6 hours on blogging and Kaggle. This will easily put you in the 90th percentile.
Start giving interviews for both jobs and internships after 4 months, just to understand what is asked. Make notes of the questions and then blog about them with the right answers.
Most people fear taking a dedicated self-study break, but they don’t feel the same about doing an MBA, which is actually a 2-year break where you spend a lot of money. Later they spend 2 years paying back the loan. So in effect it takes 4 years to get to zero debt.
Instead, spending 1 year on self-study gets you a great DS job, and after 3 years you are a senior data scientist, which will be a better job at better pay compared to an MBA. Not to mention you will have 3 years of savings, which will be >20 lakhs. The hypothesis holds true even when you compare with an MBA from the IIMs.
How to be a great data scientist/engineer
https://pakodas.substack.com/p/a-data-scientist-without-a-phd
(You can also read a longer version of getting in data science at guide.pratik.ai)
How do you see statistics versus machine learning and deep learning? How good is it to know about data engineering and deployment in production? Which languages and technologies should one target for the next 5 years, especially after this pandemic? There is more supply than demand of "data scientists", and yet qualified data workers are still lacking? How should one evaluate and look at a problem before jumping into implementing data science in a particular industry?
I personally don’t worry about statistics vs machine learning. I just learn what is required and in high demand. For now, Python is going to stay, and having production skills is the differentiating factor. Check my answer to Nikita on how to learn production skills.
I did a talk recently on how to think about DS solutions:
https://m.youtube.com/watch?v=f2m6Mon0VE8&t=218s
Thanks a lot, Pratik. I will surely follow.
Hi Pratik,
I am familiar with NLP, but just doing NLP projects on Kaggle or some other platform won't give me end-to-end experience. So what tools, websites or other sources can I use to get the kind of NLP experience a professional gets?
Let’s decouple the story into before and after modeling. I feel it’s very difficult to replicate an industrial setting. One thing you can do is write a good inference API, make a Streamlit demo and deploy it on either GCP or AWS.
For great demo examples, you can visit this:
https://madewithml.com/projects/search-results/?tags=streamlit
If you want to learn cleaning text data, fetch Twitter text data with twint and make a tag predictor.
https://github.com/twintproject/twint
Elasticsearch is used a lot, so try to make a search engine with it by indexing some PDF documents, like the financial reports of companies, which you can get from here:
https://www.nseindia.com/companies-listing/corporate-filings-annual-reports
For the above, get all NSE company codes from:
https://nsetools.readthedocs.io/en/latest/usage.html#list-of-traded-stock-codes-names
Hello Pratik,
Firstly, thank you so much for taking your time out to answer my question.
I am currently doing a project on building a good training dataset by checking the existing dataset's consistency. I wanted to understand what you think of my approach and see if you have any suggestions, or maybe even your thoughts on whether you would approach this differently.
Here goes the problem statement:
The existing dataset is hugely imbalanced with several classes (e.g. A, B, .... AX, AY) and the inputs are vague one-liners (text) related to product delivery, etc. The data is human-labelled and there are inconsistencies in the labelling. For instance, two identical or similar utterances may have been labelled with completely different class labels.
Here goes my solution:
I wanted to take advantage of the fact that humans labelled the utterances and most of the labels would be right. So I designed a 2-step process. As a first step, I weed out the outliers from the majority class labels using a good outlier detection algorithm. The problem with this step is that I am not sure I can repeat the procedure for the minority class labels, as the number of instances is just too low. Also, even if I manage to detect an outlier, I am not sure which class label the outlier could belong to.
So, as a second step, for both the majority and minority class labels, I run kNN for every data point in the dataset and check whether its nearest neighbours also belong to the same class label. If they don't, I simply relabel the data point with the class to which the majority of these neighbours belong. This way I ensure the most similar utterances are uniformly tagged and also (hopefully) take advantage of the fact that when humans label, they most often do it right, and that is reflected in the majority among the nearest neighbours.
But as you can see, there are definitely a few pitfalls with this approach. One of them is that if all or most of the nearest neighbours are mislabelled, it corrupts a maybe-good data point. The other challenge is, of course, the selection of k.
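For concreteness, the second step could be sketched as below in plain Python. This is only an illustration under assumptions: each utterance is assumed to be already encoded as a vector, cosine similarity is assumed as the distance, and the `min_agreement` threshold is an added safeguard (not part of the original description) that relabels only when the neighbours agree strongly, which softens the mislabelled-neighbours pitfall a little.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_relabel(vectors, labels, k=3, min_agreement=0.8):
    """Relabel a point when at least `min_agreement` of its k nearest
    neighbours share one label that differs from the point's own label."""
    new_labels = list(labels)
    for i, vec in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        neighbours = sorted(others, key=lambda j: cosine(vec, vectors[j]), reverse=True)[:k]
        if not neighbours:
            continue
        # Votes use the ORIGINAL labels so one relabel cannot cascade into the next.
        votes = Counter(labels[j] for j in neighbours)
        top_label, top_count = votes.most_common(1)[0]
        if top_label != labels[i] and top_count / len(neighbours) >= min_agreement:
            new_labels[i] = top_label
    return new_labels


# Toy 2-D "embeddings" (hypothetical); the 4th point sits in the delivery
# cluster but carries the "sim" label.
vectors = [[1, 0], [0.9, 0.1], [1, 0.1], [0.95, 0.05],
           [0, 1], [0.1, 0.9], [0.05, 1], [0.1, 1]]
labels = ["delivery", "delivery", "delivery", "sim", "sim", "sim", "sim", "sim"]
cleaned = knn_relabel(vectors, labels, k=3)
print(cleaned)
```

The brute-force neighbour search is quadratic in the number of points, so for a large dataset an approximate nearest-neighbour index would be needed instead.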
I do have certain other ideas in mind: the use of adversarial data augmentation (libs like textattack) and then maybe doing a cross-validated prediction with deep learning algorithms on all the data points to check for consistency. But I wanted the solution to be simple.
However, it would be super helpful if you can share your thoughts on this one!!! And thank you once again!!
Cheers,
Rajesh
Hi Rajesh, this seems interesting! Can you post some text samples? Are they standard English?
For simplicity's sake, let’s say it is English. The samples contain texts like “hey when is my delivery coming”, “how can i set up my sim”, “i have issues with my phone”, etc. Thanks!
Ahh, if you have at least a few thousand samples, try this one, passing class weights to model.fit and using label smoothing of 0.1.
https://github.com/bhavsarpratik/psyduck/blob/master/psy/ml/imbalance.py
https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/
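To make the two knobs concrete, here is a small plain-Python sketch of what they compute. It is not taken from the linked code: `balanced_class_weights` uses the common "balanced" formula n_samples / (n_classes * class_count) as an assumption, and `smooth_one_hot` shows what a smoothing of 0.1 does to a one-hot target.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def smooth_one_hot(class_index, num_classes, smoothing=0.1):
    """Move `smoothing` of the probability mass from the true class
    and spread it uniformly over all classes."""
    off_value = smoothing / num_classes
    return [
        1.0 - smoothing + off_value if i == class_index else off_value
        for i in range(num_classes)
    ]


# Toy labels (hypothetical): class "A" is four times as common as "B".
weights = balanced_class_weights(["A"] * 8 + ["B"] * 2)
target = smooth_one_hot(class_index=1, num_classes=4, smoothing=0.1)
print(weights)
print(target)
```

In Keras, a dict like this (keyed by class index) goes to the `class_weight` argument of `model.fit`, and the smoothing can be set through the `label_smoothing` argument of the categorical cross-entropy loss rather than by editing the targets by hand.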
I found your NLP questions very hard. Could you share some resources or give some intuition to help understand the concepts better?
https://medium.com/modern-nlp/nlp-interview-questions-f062040f32f7
Hey Prabu, try reading the original papers for each of the language models. Make notes and discuss them with others to understand more.
What advice would you give to someone who is going to graduate from college and is thinking about what to do in life? (Assuming he has a job offer as a Plan B)
Hey, the best way to deal with your 20s is to first go shallow and then go deep once you find a great field of work.
A lot of people don’t explore enough. Later, they either hate what they have or fall in weak love with something silly. I changed many fields to finally discover data science. This idea is discussed in detail in the book Range.
I have a huge personal bias against doing a master's. I suggest people join startups instead.
I find the advice below extremely valuable, equivalent to reading many books.
http://paulgraham.com/wealth.html
https://mobile.twitter.com/naval/status/1002103360646823936
https://pmarchive.com/luck_and_the_entrepreneur.html