43 Comments

Hi Pratik, I'm doing a project on sentiment analysis using Twitter data. After cleaning, I use VADER to classify the tweets as positive, negative, or neutral. Then I built n-grams and word clouds, but I'm still not able to get insights from the analysis: the word frequencies for the negative and positive tweets seem to be the same. I made sure I did the cleaning properly. Do you have any advice?

Hey Pratik, I'm building a recommendation system similar to those on Quora and Stack Overflow, which recommend the questions most similar to the one a user asks. I'm using TF-IDF + cosine similarity + linear kernel for this, but the problem comes when my dataset is around 2 GB: saving the model to pickle gives a memory error. Any suggestions? The data is huge, around 300,000-500,000 texts, with each row containing more than a paragraph's worth of text.
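The TF-IDF plus cosine-similarity setup described in this comment can be sketched without ever building the full pairwise similarity matrix, by keeping only sparse vectors and scoring one query at a time. The following is a minimal standard-library sketch, not the commenter's code: the whitespace tokenizer, the smoothed IDF formula, and the function names `build_tfidf`, `vectorize`, and `top_k_similar` are all assumptions made for illustration, and the thread does not establish whether this would resolve the memory error.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def vectorize(tokens, idf):
    # Sparse TF-IDF vector as a dict, L2-normalised; unseen terms are dropped.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    # Returns the IDF table and one sparse vector per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: smoothed IDF so every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [vectorize(toks, idf) for toks in tokenized]


def top_k_similar(query, idf, vectors, k=3):
    # On L2-normalised vectors the dot product (linear kernel) equals
    # cosine similarity, so only one row of scores exists at any time.
    q = vectorize(tokenize(query), idf)
    scores = (
        (sum(w * vec.get(t, 0.0) for t, w in q.items()), i)
        for i, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scores)  # list of (similarity, document index)


questions = [
    "how to merge two dictionaries in python",
    "what is the capital of france",
    "merge dictionaries python one line",
]
idf, vectors = build_tfidf(questions)
results = top_k_similar("python merge dictionaries", idf, vectors, k=3)
```

Only the IDF table and the sparse document vectors would need to be saved in this design; the similarity scores are recomputed per query instead of being stored.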

Hi Pratik, it's awesome that you are conducting a Q&A. I have a decent knowledge of machine learning but feel that I need to learn a bit more. So I'd like your advice on whether I should continue with self-study or rather plan for a Master's.

Also, one more request for advice: how do I transition from knowing a lot of theory to being able to apply all that theory in practice?

Hello sir. It's great to follow someone like you and learn from you :)

Topic - Stuck on a multiclass classification problem

Details - I am working on a dataset on which I am using type-2 fuzzy systems. It's a neural network architecture, but implemented as a mathematical model (not using TensorFlow or Keras).

I have just one output, computed through a formula. When I had a dataset for binary classification, this was fine because I applied the sigmoid function to the output and classified a sample as 1 if the sigmoid value was greater than 0.5, else 0. But now I have 3 classes in the dataset. I am at a loss as to how to proceed, because with 2 classes the easy thing was that one class would be (1 - other class).

For 3 classes I'd need to use softmax, but since I have just one formula for the output, how do I do the computations to assign samples to classes now?
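The binary rule described above, and the softmax rule the comment mentions for three classes, can be sketched as follows. This is a minimal illustration, not the commenter's model: it assumes the model can be made to produce one raw score per class (for example, one copy of the output formula per class with its own parameters), which the comment does not confirm. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(z, threshold=0.5):
    # The rule from the comment: class 1 if sigmoid(output) > 0.5, else 0.
    return 1 if sigmoid(z) > threshold else 0


def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_multiclass(scores):
    # Assumption: `scores` holds one raw output per class.
    # The predicted class is the index with the highest probability.
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


probs = softmax([0.2, 1.5, -0.3])
label = predict_multiclass([0.2, 1.5, -0.3])
```

With two classes, softmax over the scores `[0, z]` gives the second class the probability `sigmoid(z)`, which is why the single-output sigmoid rule is enough in the binary case.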

If you could please help me, and if you need to have a look at my model, please contact me at vipulv5247@gmail.com

I have tried many things, but they are not easy to explain here.

Thanks in advance for your help

Hi Pratik, great to connect. I am new to Elasticsearch and want to understand how I can handle synonyms in my search. How do I get started here? If you could link me to any resources, that would be great.

Hi Pratik. Thanks for the initiative.

I have two questions:

1. I am not able to relate the Central Limit Theorem (CLT) to my day-to-day activities as a data scientist. Can you please tell me where the CLT is used in machine learning?

2. How should I formulate my hypothesis for (a) a multiclass classification problem and (b) a regression problem? For example, with retail data, I am trying to predict five fast-selling items. What should my hypothesis be for this?

Can you recommend a book on natural language processing, assuming the reader has some experience in data science?

Hi Pratik, thanks for this initiative.

I want to ask about the assumptions behind machine learning models. Do you consider these assumptions when you build a model? For example, in regression there is multicollinearity, etc.

Firstly, I am really glad that you took this initiative.

Coming to the other part, I am pretty perplexed about a decision and would like your suggestion on it:

-- Doing an MBA from a good college in India

-- Pursuing an MS in data science. Also, is it hard to get an internship out there if we aren't very familiar with data science?

How do you see statistics versus machine learning and deep learning? How useful is it to know about data engineering and deployment in production? Which languages and technologies should one target for the next 5 years, especially after this pandemic? Why is there more supply than demand for "data scientists", and yet still a lack of qualified data workers? How should one evaluate and understand a problem before jumping into a data science implementation in a particular industry?

Hi Pratik,

I am familiar with NLP. But just doing NLP projects on Kaggle or some other platform won't give me end-to-end experience. So what tools, websites, or other sources can I use to get the kind of NLP experience a professional gets?

Hello Pratik,

Firstly, thank you so much for taking the time to answer my question.

I am currently doing a project related to building a good training dataset by checking the existing dataset's consistency. I wanted to understand what you think of my approach and see if you have any suggestions or maybe even your thoughts on if you would approach this differently.

Here goes the problem statement:

The existing dataset is hugely imbalanced, with several classes (e.g. A, B, .... AX, AY), and the inputs are vague one-liners (text) related to product delivery, etc. The data is human-labelled and there are inconsistencies in the labelling. For instance, two identical or similar utterances may have been given completely different class labels.

Here goes my solution:

I wanted to take advantage of the fact that humans labelled the utterances and most of the labels would be right. So I designed a 2-step process.

As a first step, I weed out the outliers from the majority class labels using a good outlier detection algorithm. The problem with this step is that I am not sure I can repeat the procedure for the minority class labels, as the number of instances is just too low. Also, even if I manage to detect an outlier, I am not sure which class label the outlier could belong to.

So, as a second step, for both the majority and minority class labels, I run kNN for every data point in the dataset and check whether its nearest neighbours also belong to the same class label. If they don't, I simply relabel the datapoint with the label to which the majority of these neighbours belong. This way I ensure that the most similar utterances are uniformly tagged, and also (hopefully) take advantage of the fact that when humans label, they most often do it right, which shows up as the majority among the nearest neighbours.

But as you can see, there are definitely a few pitfalls with this approach. One is that if all or most of the nearest neighbours are mislabelled, it corrupts a possibly good datapoint. The other challenge is, of course, the selection of k.
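The second step described above (relabelling a point when its nearest neighbours mostly carry a different label) can be sketched as below. This is a minimal standard-library illustration, not the commenter's implementation: it assumes each utterance has already been turned into a numeric vector, uses Euclidean distance, and relabels only on a strict majority of the k neighbours. The function name `knn_relabel` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_relabel(points, labels, k=3):
    # Votes always use the ORIGINAL labels, so one relabelling
    # cannot cascade into the next within a single pass.
    new_labels = []
    for i, point in enumerate(points):
        # Brute-force search: indices of the k nearest other points.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        votes = Counter(labels[j] for j in neighbours)
        top_label, top_count = votes.most_common(1)[0]
        # Assumption: relabel only when a strict majority of the
        # k neighbours agree on a label different from the current one.
        if top_label != labels[i] and top_count > k / 2:
            new_labels.append(top_label)
        else:
            new_labels.append(labels[i])
    return new_labels


# Toy data: two well-separated clusters; the fourth point is
# labelled "B" although it sits inside the "A" cluster.
points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
labels = ["A", "A", "A", "B", "B", "B", "B", "B"]
cleaned = knn_relabel(points, labels, k=3)
```

The pitfall raised in the comment is visible in this design: if most of a point's neighbours are themselves mislabelled, the vote overwrites a correct label, and the outcome depends on the choice of `k`.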

I do have certain other ideas in mind: using adversarial data augmentation (libs like textattack), and then maybe doing cross-validated prediction with deep learning algorithms on all the datapoints to check for consistency. But I wanted the solution to be simple.

However, it would be super helpful if you could share your thoughts on this one!!! And thank you once again!!

Cheers,

Rajesh

I found your NLP questions very hard. Could you give some resources or some intuition to help understand the concepts better?

https://medium.com/modern-nlp/nlp-interview-questions-f062040f32f7

What advice would you give to someone who is going to graduate from college and is thinking about what to do in life? (Assuming he has a job offer as a Plan B)
