Come join Maxpool - A Data Science community to discuss real ML problems
In the last few years, I have answered many people's technical and career questions.
What I regret now is that the information is scattered across different places, such as LinkedIn and WhatsApp chats. Many of the questions were really good, and I wish I had done this Q&A in a more structured way.
From now on, you can ask me anything over here and I will try to answer it, sooner or later, in the best way possible.
I am comfortable with general data science and NLP, but feel free to stretch my limits :P
I also encourage others to share a better answer :)
Question Format
One-line title of the problem
e.g. Classification for customer reviews
Explain the problem
Explain the data in terms of x (input) and y (output)
Ask Pratik Anything!
Hi Pratik, I'm doing a project on sentiment analysis using Twitter data. After cleaning, I use VADER to classify the tweets as positive, negative, or neutral. Then I built n-grams and word clouds, but I'm still not able to get insights from the analysis. It seems the word frequencies for the negative and positive tweets are the same. I made sure I did the cleaning properly. Do you have any advice?
Hey Pratik, I'm building a recommendation system similar to Quora and Stack Overflow, where they recommend the questions most similar to the one the user asks. I'm using TF-IDF + cosine similarity + linear kernel for this, but the problem comes when my dataset size is around 2 GB and saving the model to pickle gives a memory error. Any suggestions? The data is huge and contains around 300,000-500,000 texts, with each row containing more than a paragraph's worth of text.
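For readers following the question above, here is a minimal sketch of the TF-IDF plus cosine-similarity lookup it describes. It is not the poster's code: it uses only the Python standard library as a stand-in for whatever vectorizer and kernel they used, the function names (`build_tfidf`, `top_k_similar`) are hypothetical, and whitespace tokenization is an assumption. It keeps each document as a sparse dict and streams the scores through a heap, so a full document-by-document similarity matrix is never built; whether that alone would resolve the memory error in the question is not something the post lets us confirm.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real pipelines usually do more.
    return text.lower().split()


def vectorize(tokens, idf):
    """Build an L2-normalised sparse TF-IDF vector (dict: term -> weight)."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """Return (sparse vectors, idf table) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [vectorize(tokens, idf) for tokens in tokenized], idf


def top_k_similar(query, vectors, idf, k=2):
    """Return [(doc_index, cosine_similarity), ...] for the k best matches."""
    q = vectorize(tokenize(query), idf)
    # Vectors are unit length, so the sparse dot product is the cosine similarity.
    scored = (
        (sum(w * v.get(t, 0.0) for t, w in q.items()), i)
        for i, v in enumerate(vectors)
    )
    # nlargest keeps only k candidates in memory at a time.
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


docs = [
    "how to reverse a list in python",
    "best way to learn machine learning",
    "reverse a string in python",
]
vectors, idf = build_tfidf(docs)
matches = top_k_similar("python reverse list", vectors, idf, k=2)
```

Here `matches` holds the indices of the two most similar stored questions together with their scores.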
Hi Pratik, it's awesome that you are conducting this Q&A. I have a decent knowledge of machine learning but feel that I need to learn a bit more. So I'd like your advice on whether I should continue with self-study or plan for a Master's instead.
One more question: how do I transition from knowing a lot of theory to being able to apply that theory in practice?
Hello sir. It's great to follow someone like you and learn from you :)
Topic - Stuck in a multiclass classification problem
Details - I am working on a dataset on which I am using type-2 fuzzy systems. It's a neural network architecture, but built as a mathematical model (not using TensorFlow or Keras).
I have just one output, computed through a formula. When I had a dataset for binary classification, that worked well because I applied the sigmoid function to the output and classified a sample as 1 if the sigmoid value was greater than 0.5, else 0. But now I have 3 classes in the dataset. I am blank as to how to proceed; with 2 classes, the easy part was that one class would be (1 - other class).
But for 3 classes, I'd need to use softmax, and since I have just one formula for the output, how do I do the computations to assign samples to classes?
If you could please help me, and perhaps need to have a look at my model, please contact me at vipulv5247@gmail.com
I have tried many things, but they are not easy to explain here.
Thanks in advance for your help
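To make the contrast in the question above concrete, here is a minimal sketch of the two decision rules it mentions: thresholding a sigmoid at 0.5 for two classes, and taking the arg-max of a softmax for three. The softmax half assumes the model can be made to emit one raw score per class, which the post does not confirm (the poster has a single output formula), so treat `predict_multiclass` and its input as hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(z, threshold=0.5):
    """Binary rule from the post: class 1 if sigmoid(z) > 0.5, else class 0."""
    return 1 if sigmoid(z) > threshold else 0


def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_multiclass(scores):
    """Assumed setup: one raw score per class; pick the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


binary_label = predict_binary(2.0)
class_index = predict_multiclass([0.1, 2.5, -1.0])
```

With two classes the second probability is implied by the first, which is why a single output was enough; with three classes a single number no longer pins down the other two probabilities.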
Hi Pratik, great to connect. I am new to Elasticsearch and want to understand how I can handle synonyms in my search. How do I get started? If you could link me to any resources, that would be great.
Hi Pratik. Thanks for the initiative.
I have two questions:
1. I am not able to relate the Central Limit Theorem (CLT) to my day-to-day activities as a data scientist. Can you please tell me where the CLT is used in machine learning?
2. How should I formulate my hypothesis for (a) a multiclass classification problem and (b) a regression problem? For example, with retail data, I am trying to predict five fast-selling items. What should my hypothesis be for this?
Can you recommend a book on natural language processing, assuming the reader has some experience in data science?
Hi Pratik, thanks for this initiative.
I want to ask about the assumptions behind machine learning models. Do you consider assumptions when building a model? For example, in regression there is multicollinearity, etc.
Firstly, I am really glad that you took such initiative.
Coming to the other part, I am pretty perplexed about a certain decision and would like your suggestion on it:
-- Doing an MBA from a good college in India
-- Pursuing an MS in data science. Also, is it hard to get an internship out there if we are not very familiar with data science?
How do you see statistics versus machine learning and deep learning? How useful is it to know about data engineering and deployment in production? Which languages and technologies should one target for the next 5 years, especially after this pandemic? Why is there more supply than demand for "data scientists" while qualified data workers are still lacking? How should one evaluate and understand a problem before jumping into a data science implementation in a particular industry?
Hi Pratik,
I am familiar with NLP. But just doing NLP projects on Kaggle or some other platform won't give me end-to-end experience. So, what tools, websites, or other sources can I use to get the kind of NLP experience a professional gets?
Hello Pratik,
Firstly, thank you so much for taking the time to answer my question.
I am currently doing a project on building a good training dataset by checking the existing dataset's consistency. I wanted to understand what you think of my approach and see if you have any suggestions, or even your thoughts on whether you would approach this differently.
Here goes the problem statement:
The existing dataset is hugely imbalanced with several classes (e.g. A, B, .... AX, AY), and the inputs are vague one-liners (text) related to product delivery, etc. The data is human-labelled and there are inconsistencies in the labelling. For instance, two identical or similar utterances may have been labelled with completely different class labels.
Here goes my solution:
I wanted to take advantage of the fact that humans labelled the utterances and most of the labels would be right. So I designed a 2-step process.
As a first step, I weed out the outliers from the majority class labels using a good outlier detection algorithm. The problem with this step is that I am not sure I can repeat the procedure for the minority class labels, as the number of instances is just too low. Also, even if I manage to detect an outlier, I am not sure which class label the outlier could belong to.
So, as a second step, for both the majority and minority class labels, I run kNN for every data point in the dataset and check whether its nearest neighbours also belong to the same class label. If they don't, I simply relabel the data point with the class to which the majority of these neighbours belong. This way I ensure the most similar utterances are uniformly tagged and also (hopefully) take advantage of the fact that when humans label, they most often do it right, and that is reflected in the majority among the nearest neighbours.
But as you can see, there are definitely a few pitfalls with this approach. One of them is that if all or most of the nearest neighbours are mislabelled, it corrupts a possibly good data point. The other challenge is, of course, the selection of k.
I do have certain other ideas in mind: using adversarial data augmentation (libs like textattack) and then maybe doing a cross-validated prediction using deep learning algos on all the data points and checking for consistency. But I wanted the solution to be simple.
However, it would be super helpful if you could share your thoughts on this one!!! And thank you once again!!
Cheers,
Rajesh
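The second step described in the post above (relabel each point by the majority label of its k nearest neighbours) can be sketched as follows. This is an illustration rather than the poster's implementation: it assumes each utterance has already been turned into a numeric vector (the post does not say how), uses Euclidean distance, and adds a hypothetical `min_agreement` threshold, not in the post, so that a point is only relabelled when a large enough share of its neighbours agree.

```python
import math
from collections import Counter


def knn_relabel(points, labels, k=3, min_agreement=0.6):
    """Return a new label list after one pass of neighbour-majority relabelling.

    points: list of equal-length numeric tuples (assumed utterance vectors).
    labels: original human labels, one per point.
    min_agreement: hypothetical guard; minimum share of neighbours that must
        agree on a different label before the point is relabelled.
    """
    new_labels = list(labels)
    for i, p in enumerate(points):
        # Brute-force distances to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        # Votes use the ORIGINAL labels, so earlier relabels do not cascade.
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        if not neighbour_labels:
            continue
        top_label, votes = Counter(neighbour_labels).most_common(1)[0]
        if top_label != labels[i] and votes / len(neighbour_labels) >= min_agreement:
            new_labels[i] = top_label
    return new_labels


points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (10, 10), (10, 11), (11, 10)]
labels = ["A", "A", "A", "B", "B", "B", "B"]
cleaned = knn_relabel(points, labels, k=3)
```

In this toy data the fourth point sits inside the "A" cluster but carries label "B", so it is the only one relabelled. The pitfall raised in the post still applies: if most of a point's neighbours are themselves mislabelled, the vote goes the wrong way, and the threshold only reduces that risk.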
I found your NLP questions very hard. Could you share some resources or give some intuition to help understand the concepts better?
https://medium.com/modern-nlp/nlp-interview-questions-f062040f32f7
What advice would you give to someone who is going to graduate from college and is thinking about what to do in life? (Assuming he has a job offer as a Plan B)