Ask Pratik Anything!

Aug 11, 2020

Come join Maxpool - A Data Science community to discuss real ML problems

In the last few years, I have answered many people's technical and career questions.

What I regret now is that the information is scattered across different places, such as LinkedIn and WhatsApp chats. Many of the questions were really good, and I wish I had done this Q&A in a more structured way.

From now on, you can ask me anything here and I will try to answer, sooner or later, in the best way possible.

I am comfortable with general data science and NLP, but feel free to stretch my limits :P

I also encourage others to share a better answer :)

Question Format

One-line title of the problem (e.g. Classification for customer reviews)
Explain the problem
Explain the data in terms of x(input) and y(output)
Mention if any solution didn’t work out
Share any solution that comes to your mind

43 Comments

Hi Pratik, I'm doing a project on sentiment analysis using Twitter data. After cleaning, I use VADER to classify the tweets as positive, negative, or neutral. Then I built n-grams and word clouds, but I am still not able to get insights from the analysis. It seems the word frequencies for negative and positive tweets are the same. I made sure I did the cleaning properly. Do you have any advice?

Hey come to maxpool.club

Hey Pratik, I'm making a recommendation system similar to Quora or Stack Overflow, where they recommend the questions most similar to the one the user asks. I'm using TF-IDF + cosine similarity + linear kernel for this, but the problem comes when my dataset size is around 2 GB and saving the model to pickle gives a memory error. Any suggestions? The data is huge and contains around 300,000-500,000 texts, with each row containing more than a paragraph of text.

Hey try hickle library

Thanks for the reply! I read about it. Isn't that a hacky way? Isn't there a more robust solution? And is the way I am calculating similarity OK?

I checked the chat in the group and I can't see any solution other than the ones the guys suggested. Try increasing min_df and reducing the n-gram range so that the vocabulary, and hence the TF-IDF matrix, is smaller. That means less memory consumption.
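
As a rough illustration of the min_df suggestion, here is a minimal sketch using scikit-learn's TfidfVectorizer. The toy corpus, the helper names `build_index` and `most_similar`, and the choice of float32 are assumptions made for the example, not details from the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the real question corpus (hypothetical data).
questions = [
    "how do i reverse a list in python",
    "how to reverse a string in python",
    "what is the best way to learn python",
    "how do i merge two dictionaries in python",
    "why is my java program throwing a null pointer exception",
    "how to fix a null pointer exception in java",
]


def build_index(texts, min_df):
    # Unigrams only; float32 uses half the memory of the default float64.
    vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=min_df, dtype=np.float32)
    matrix = vectorizer.fit_transform(texts)  # sparse matrix, rows are L2-normalised
    return vectorizer, matrix


def most_similar(query, vectorizer, matrix, top_k=2):
    # On L2-normalised TF-IDF rows the linear kernel equals cosine similarity.
    scores = linear_kernel(vectorizer.transform([query]), matrix).ravel()
    best = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in best]


full_vec, full_matrix = build_index(questions, min_df=1)
small_vec, small_matrix = build_index(questions, min_df=2)

vocab_full = len(full_vec.vocabulary_)
vocab_small = len(small_vec.vocabulary_)
matches = most_similar("reverse a list in python", small_vec, small_matrix)
print(vocab_full, vocab_small, matches)
```

Raising min_df drops every term that appears in fewer documents than the threshold, so the matrix gets narrower at the cost of losing rare but possibly discriminative words.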

The n-gram range is already (1, 1), so I cannot reduce it further, but I get the point about tuning the parameters. One last question: is the approach I am following appropriate for the problem? I am using simpler methods because not much compute power is available, and I haven't found any advanced method that works on paragraphs of text and gives similarity in a fraction of a second.

What about using SVD?

SVD on TF-IDF is a good idea! But for encoding text into a fixed-size vector, consider a neural encoder like a transformer. Many of them only support inputs up to 512 tokens, though, so you need to keep only the important sentences in each paragraph.
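
To make the SVD idea concrete, here is a minimal sketch with scikit-learn's TruncatedSVD, which works directly on the sparse TF-IDF matrix. The sample paragraphs and the choice of 3 components are assumptions for the toy example; a real corpus would typically use a few hundred components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample texts standing in for the real paragraphs.
paragraphs = [
    "how do i reverse a list in python",
    "how to reverse a string in python",
    "how do i merge two dictionaries in python",
    "how to fix a null pointer exception in java",
    "why does java throw a null pointer exception",
]

tfidf = TfidfVectorizer().fit_transform(paragraphs)  # sparse, one column per term

# Project every document onto a small number of latent dimensions.
svd = TruncatedSVD(n_components=3, random_state=0)
dense = svd.fit_transform(tfidf)  # dense array of shape (n_docs, 3)

# Pairwise cosine similarity in the reduced space.
similarity = cosine_similarity(dense)
print(tfidf.shape, dense.shape, similarity.shape)
```

Only the small dense matrix and the fitted vectorizer and SVD objects need to be stored, which is usually far lighter than the full TF-IDF matrix.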

Hi Pratik, it's awesome that you are conducting this Q&A. I have a decent knowledge of machine learning but feel that I need to learn a bit more. So I'd like your advice on whether I should continue with self-study or plan for a master's.

One more question: how do I transition from knowing a lot of theory to being able to apply it in practice?

My pleasure! You can find the answer in my earlier response to Shrishti and Nikita.

Yeah, my bad, I didn't see the old comments.

Hello sir. It's great to follow someone like you and learn from you :)

Topic: Stuck in a multiclass classification problem

Details: I am working on a dataset with type-2 fuzzy systems. It's a neural network architecture, but implemented as a mathematical model (not using TensorFlow or Keras).

I have just one output, computed through a formula. When I had a dataset for binary classification, it was great because I applied the sigmoid function to the output and classified a sample as 1 if the sigmoid value was greater than 0.5, else 0. But now I have 3 classes in the dataset. I am blank as to how to proceed; with 2 classes, the easy thing was that the probability of one class is (1 - the other class).

For 3 classes I'd need to use softmax, but since I have just one formula for the output, how do I do the computations to classify samples into classes now?

If you could please help me, and maybe need to have a look at my model, please connect me at vipulv5247@gmail.com

I have tried many things but not easy to explain here.

Thanks in advance for your help

Hey Vipul, maybe you can try hierarchical classification.

Sigmoid 1 tells you whether it is class 1 or class 2+3.

If class 2+3 has more than 50%, you pass it through sigmoid 2, which tells you whether it is class 2 or class 3.

The other way is to fit one-vs-rest models and then take the class with the max probability across all the model outputs.
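
Both options can be sketched in a few lines of plain Python. This assumes each binary model produces one raw score that is passed through a sigmoid; the function names and the example scores are hypothetical, and the fuzzy-system formula that would produce the scores is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hierarchical_predict(score_1_vs_rest, score_2_vs_3):
    """Two chained binary decisions for three classes.

    score_1_vs_rest: raw output where a high value means "class 2 or 3".
    score_2_vs_3:    raw output where a high value means "class 3".
    """
    p_rest = sigmoid(score_1_vs_rest)  # P(class 2 or 3)
    if p_rest <= 0.5:
        return 1
    p_three = sigmoid(score_2_vs_3)  # P(class 3 | class 2 or 3)
    return 3 if p_three > 0.5 else 2


def one_vs_rest_predict(scores):
    """scores maps each class to the raw output of its "this class vs rest" model."""
    probs = {label: sigmoid(score) for label, score in scores.items()}
    return max(probs, key=probs.get)


print(hierarchical_predict(-2.0, 5.0))
print(hierarchical_predict(2.0, -1.0))
print(one_vs_rest_predict({1: -1.0, 2: 0.3, 3: 2.0}))
```

The hierarchical version needs two trained formulas and the one-vs-rest version needs three, but each of them stays a plain binary sigmoid model like the one that already worked for two classes.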

Chirasmita Mallick

Hi Pratik, great to connect. I am new to Elasticsearch and wanted to understand how I can handle synonyms in my search. How do I get started? If you could link me to any resources, that would be great.

Hey same here! You need to define a filter in your ES config.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

For generating synonyms, I have explained the recipe in my talk

https://youtu.be/VHm6_uC4vxM?t=920
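
For orientation, index settings with a synonym token filter might look roughly like the sketch below, written as a Python dict and serialised to JSON. The filter name `my_synonyms`, the analyzer name `synonym_analyzer`, and the synonym rules are made up for illustration; the JSON would be sent as the body when creating the index.

```python
import json

# Hypothetical names and rules; only the overall structure follows the ES docs.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "laptop, notebook",              # equivalent terms
                        "tv, television",
                        "cellphone, mobile => phone",    # explicit mapping
                    ],
                }
            },
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so the rules match regardless of case.
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        }
    }
}

body = json.dumps(index_settings, indent=2)
print(body)
```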

Chirasmita Mallick

Thanks a lot Pratik. this helps!

Chirasmita Mallick

Another follow-up question: do I have to build the mappings manually, or can I use something like word embeddings/similarity to generate a synonym file? Is ES capable of adding synonyms itself?

There might be a plugin, but we made ours manually. You can write a script to create the file, but you still need to check it for unnecessary words. The first iteration will be messy.
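
A script of that kind could look like the minimal sketch below. It assumes candidate synonym groups have already been produced somewhere else (for example from embeddings) and that a manual blocklist of unwanted words exists; `build_synonym_lines`, the sample groups, and the blocklist are all hypothetical. Each output line uses the comma-separated format that the synonym filter can read from a file via `synonyms_path`.

```python
def build_synonym_lines(groups, blocklist):
    """Turn candidate groups into comma-separated synonym lines."""
    lines = []
    for group in groups:
        kept = []
        for term in group:
            term = term.strip().lower()
            # Drop empty, blocked, and duplicate terms while keeping order.
            if term and term not in blocklist and term not in kept:
                kept.append(term)
        if len(kept) >= 2:  # a single word has nothing to be a synonym of
            lines.append(", ".join(kept))
    return lines


# Hypothetical candidates; the first automatic pass is usually noisy.
candidate_groups = [
    ["Laptop", "notebook", "computer "],
    ["tv", "television", "TV"],
    ["phone", "thing"],
]
blocklist = {"thing", "computer"}

lines = build_synonym_lines(candidate_groups, blocklist)
file_content = "\n".join(lines) + "\n"
print(file_content)
```

The blocklist is where the manual review ends up: every time a bad word shows up in the generated file, it goes into the set and the script is rerun.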

Hi Pratik. Thanks for the initiative.

I have two questions:

1. I am not able to relate the Central Limit Theorem (CLT) to my day-to-day activities as a data scientist. Can you please tell me where CLT is used in machine learning?

2. How should I formulate my hypothesis for (a) a multiclass classification problem and (b) a regression problem? For example, with retail data, I am trying to predict the five fastest-selling items. What should my hypothesis be?

My pleasure! Sorry, I have no clear answer for either.

Can you recommend a book on natural language processing, assuming the reader has some experience in data science?

Jurafsky’s NLP book is lovely

Hi Pratik, thanks for this initiative.

I want to ask about the assumptions of machine learning models. Do you consider assumptions when building a model? For example, in regression there is multicollinearity, etc.

My pleasure! I don’t have a perfect answer to this but this paper is really helpful

https://arxiv.org/pdf/1901.10002.pdf

Firstly, I am really glad that you took this initiative.

Coming to the other part, I am pretty perplexed about a decision and would like your suggestion on it:

-- Doing an MBA from a good college in India

-- Pursuing an MS in data science. Also, is it hard to get an internship out there if we aren't very familiar with data science?

My pleasure! For MBA, check my answer to Arsh.

Rather than doing a paid course, I would suggest taking a 6-month study break and doing self-learning with MOOCs, blogging and Kaggle.

No need to worry about an internship. During the 6 months, spend 6 hours every day on theory and 6 hours on blogging and Kaggle. This will easily put you in the 90th percentile.

Start giving interviews for both jobs and internships after 4 months, just to understand what is asked. Make notes of the questions and then blog about them with the right answers.

Most people fear taking a dedicated self-study break, but they don't feel the same about doing an MBA, which is actually a 2-year break where you spend a lot of money. Later they spend 2 years paying back the loan. So in effect it takes 4 years to get to zero debt.

Instead, spending 1 year on self-study gets you a great DS job, and after 3 more years you are a senior data scientist, which is a better job at better pay compared to an MBA. Not to mention you will have 3 years of savings, which will be >20 lakhs. The hypothesis holds true even when you compare with an MBA from the IIMs.

How to be a great data scientist/engineer

https://pakodas.substack.com/p/a-data-scientist-without-a-phd

(You can also read a longer version of getting in data science at guide.pratik.ai)

How do you see statistics versus machine learning and deep learning? How useful is it to know about data engineering and deployment in production? Which languages and technologies should one target for the next 5 years, especially after this pandemic? There seems to be more supply than demand of "data scientists", and yet a lack of qualified data workers? How should one evaluate a problem before jumping into implementing data science in a particular industry?

I personally don't worry about statistics vs machine learning. I just learn what is required and in high demand. For now, Python is going to stay, and having production skills is the differentiating factor. Check my answer to Nikita on how to learn production skills.

I did a talk recently on how to think of DS solutions

https://m.youtube.com/watch?v=f2m6Mon0VE8&t=218s

Thanks a lot, Pratik. I will surely follow.

Hi Pratik,

I am familiar with NLP. But just doing NLP projects on Kaggle or some other platform won't give me end-to-end experience. So what tools, websites, or other sources can I use to get NLP experience like a professional gets?

Let's decouple the story before and after modeling. I feel it's very difficult to replicate an industrial setting. One thing you can do is write a good inference API, make a Streamlit demo, and deploy it on either GCP or AWS.

For great demo examples you can visit this

https://madewithml.com/projects/search-results/?tags=streamlit

If you want to learn cleaning text data, fetch Twitter text data with twint and make a tag predictor.

https://github.com/twintproject/twint

Elasticsearch is used a lot, so try making a search engine with it by indexing some PDF documents, like companies' financial reports, which you can get from here

https://www.nseindia.com/companies-listing/corporate-filings-annual-reports

For the above, get all NSE company codes from

https://nsetools.readthedocs.io/en/latest/usage.html#list-of-traded-stock-codes-names
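
As a starting point for the inference API idea, here is a minimal sketch that uses only Python's standard library, chosen to keep the example dependency-free rather than as a recommendation over a proper web framework. The `/predict` route, the `serve` helper, and the hashtag-based `predict` placeholder are assumptions for illustration; a trained tag predictor would replace the placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text):
    """Placeholder model: returns hashtags found in the text as tags."""
    tags = [w[1:].lower() for w in text.split() if w.startswith("#") and len(w) > 1]
    return {"text": text, "tags": tags}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            text = payload["text"]
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a body without a "text" field.
            self.send_error(400, "expected a JSON body with a 'text' field")
            return
        body = json.dumps(predict(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks until interrupted; call this from a script to start the API.
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()


print(predict("loving #NLP and #python"))
```

Calling `serve()` starts the server, after which a POST to `/predict` with a body like `{"text": "..."}` returns the prediction as JSON. Keeping `predict` separate from the HTTP handler makes it easy to reuse the same function in a demo app.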

Hello Pratik,

Firstly, thank you so much for taking your time out to answer my question.

I am currently doing a project related to building a good training dataset by checking the existing dataset's consistency. I wanted to understand what you think of my approach and see if you have any suggestions or maybe even your thoughts on if you would approach this differently.

Here goes the problem statement:

The existing dataset is hugely imbalanced, with several classes (e.g. A, B, .... AX, AY), and the inputs are vague one-liners (text) related to product delivery, etc. The data is human-labelled and there are inconsistencies in the labelling. For instance, two identical or similar utterances may have been labelled with completely different class labels.

Here goes my solution:

I wanted to take advantage of the fact that humans labelled the utterances and most of the labels would be right. So I designed a 2-step process.

As a first step, I weed out the outliers from the majority class labels using a good outlier detection algorithm. The problem with this step is that I am not sure I can repeat it for the minority class labels, as the number of instances is just too low. Also, even if I manage to detect an outlier, I am not sure which class label it could belong to.

So, as a second step, for both the majority and minority class labels, I run kNN for every data point in the dataset and check whether its nearest neighbours belong to the same class label. If they don't, I simply relabel the data point with the class to which the majority of these neighbours belong. This way I ensure the most similar utterances are uniformly tagged and also (hopefully) take advantage of the fact that humans most often label correctly, which shows up in the majority among the nearest neighbours.

But as you can see, there are definitely a few pitfalls with this approach. One is that if all or most of the nearest neighbours are mislabelled, a possibly good data point gets corrupted. The other challenge is, of course, the selection of k.
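
The second step can be sketched in plain Python as below. Token-overlap (Jaccard) similarity is a stand-in for whatever sentence representation is actually used, and the `min_agreement` threshold is an added guard against the mislabelled-neighbours pitfall, not part of the original description; the sample utterances and labels are made up.

```python
from collections import Counter


def jaccard(a, b):
    """Token-overlap similarity between two utterances, in [0, 1]."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def knn_relabel(texts, labels, k=3, min_agreement=0.6):
    new_labels = list(labels)
    for i, text in enumerate(texts):
        # Rank every other utterance by similarity to the current one.
        neighbours = sorted(
            (j for j in range(len(texts)) if j != i),
            key=lambda j: jaccard(text, texts[j]),
            reverse=True,
        )[:k]
        # Vote with the ORIGINAL labels so earlier relabels do not cascade.
        votes = Counter(labels[j] for j in neighbours)
        top_label, count = votes.most_common(1)[0]
        if top_label != labels[i] and count / len(neighbours) >= min_agreement:
            new_labels[i] = top_label
    return new_labels


texts = [
    "when is my delivery coming",
    "when will my delivery arrive",
    "where is my delivery",
    "is my delivery coming today",  # labelled "sim" by mistake
    "how can i set up my sim",
    "how do i set up my new sim",
    "help me set up my sim card",
    "how can i activate my sim",
]
labels = ["delivery", "delivery", "delivery", "sim", "sim", "sim", "sim", "sim"]

cleaned = knn_relabel(texts, labels)
print(cleaned)
```

Raising `min_agreement` towards 1.0 makes the relabelling more conservative, which trades fewer corrections for a lower risk of corrupting a good data point.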

I do have certain other ideas in mind - the usage of adversarial data augmentation (libs like textattack) and then maybe doing a cross-validated prediction using deep learning algos on all the datapoints and check for consistency. But I wanted the solution to be simplistic.

However, it would be super helpful if you can share your thoughts on this one!!! And thank you once again!!

Cheers,

Rajesh

Hi Rajesh, this seems interesting! Can you post some text samples? Are they standard English?

Let's just say, for simplicity's sake, that it is English. The samples contain texts like "hey when is my delivery coming", "how can i set up my sim", "i have issues with my phone", etc. Thanks!

Ahh, if you have at least a few thousand samples, try this one, passing class weights to model.fit and using label smoothing of 0.1.

https://github.com/bhavsarpratik/psyduck/blob/master/psy/ml/imbalance.py

https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/
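
To show what the two ingredients compute, here is a minimal dependency-free sketch. It is not the code from the linked repository: the "balanced" weighting formula and the helper names are assumptions for illustration. In Keras, the weights would go to the `class_weight` argument of `model.fit` (keyed by integer class index) and the smoothing to the `label_smoothing` argument of the categorical cross-entropy loss.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def smooth_one_hot(class_index, n_classes, epsilon=0.1):
    """Label smoothing: (1 - epsilon) * one_hot + epsilon / n_classes."""
    off_value = epsilon / n_classes
    return [
        1.0 - epsilon + off_value if i == class_index else off_value
        for i in range(n_classes)
    ]


# Toy imbalanced labels: 8 samples of class A, 2 of class B.
labels = ["A"] * 8 + ["B"] * 2
weights = balanced_class_weights(labels)

# Smoothed target for class index 1 out of 3 classes.
target = smooth_one_hot(1, 3)
print(weights, target)
```

With the weights, each mistake on a rare class costs more in the loss; with smoothing, the model is never pushed to full confidence on a label that may itself be wrong, which suits a dataset with inconsistent human labels.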

I found your NLP questions very hard. Could you share some resources or give some intuition to understand the concepts better?

https://medium.com/modern-nlp/nlp-interview-questions-f062040f32f7

Hey Prabu, try reading the original papers of all the language models. Make notes and discuss them with others to understand more.

What advice would you give to someone who is going to graduate from college and is thinking about what to do in life? (Assuming he has a job offer as a Plan B)

Hey, the best way to deal with your 20s is to first go shallow and then go deep once you find a great field of work.

A lot of people don't explore enough. Later, they either hate what they have or fall in weak love with something silly. I changed fields many times before finally discovering data science. This idea is discussed in detail in the book Range.

I have a huge personal bias against doing a master's. I suggest people join startups instead.

I find the advice below extremely valuable, equivalent to reading many books.

http://paulgraham.com/wealth.html

https://mobile.twitter.com/naval/status/1002103360646823936

https://pmarchive.com/luck_and_the_entrepreneur.html
