Correlation Test with p-value? Yes! | Pandas vs. Scipy

Published: 01 January 1970
on channel: Data Science Garage

Let’s talk about the #Correlation test in Data Science with Python, using the Pandas and Scipy libraries. This video compares the pandas.DataFrame.corr() method with Scipy for calculating correlation coefficients for the features stored in your Pandas DataFrame. The advantage of Scipy is that you can evaluate the significance of each calculated correlation and decide which ones are meaningful for your business problem and which are not.

First of all, let’s get familiar with the official pd.DataFrame.corr() method documentation here: https://pandas.pydata.org/docs/refere...
As you can see, Pandas gives you limited flexibility when using correlation in your Data Science project. You can mainly change the method of calculation (pearson, kendall, or spearman) and the min_periods parameter. So, it has minimal configuration.

From my personal experience, I suggest trying another way to work with correlation. You can test correlations with a #pvalue, which can be calculated automatically for the columns of your #Pandas DataFrame.

A mandatory warning that must be mentioned when talking about correlation is “Correlation does not imply causation”. You can read more about this warning in this Medium.com article:   / correlation-vs-causation-in-data-science  

You will often use correlation during exploratory data analysis (EDA), for both supervised and unsupervised ML problems. The most straightforward way to calculate correlation is the built-in Pandas method pandas.DataFrame.corr(). You can see the example in the video and in this Github code gist: https://gist.githubusercontent.com/Ka...
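As a rough illustration of the Pandas approach (not the code from the linked gist), here is a minimal sketch. The toy data and the column names feature_a, feature_b and feature_c are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are placeholders, not from the video.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": 0.8 * x + rng.normal(scale=0.5, size=n),  # related to feature_a
    "feature_c": rng.normal(size=n),                        # independent noise
})

# Pairwise correlation matrices; "pearson" is the default method.
pearson_matrix = df.corr()
spearman_matrix = df.corr(method="spearman")

print(pearson_matrix.round(3))
print(spearman_matrix.round(3))
```

Both calls return only the coefficients. Nothing in the result tells you whether a given value could have come from sampling noise.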

THE CORE OF THIS VIDEO:
When we talk about the correlation between variables, what we ideally want to measure is the correlation between variables in the entire population. However, most data scientists work with a sample of data, so a different sample could give different correlation scores. As such, we need to assess the significance of the correlation values we calculated, which depends on the sample size.
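To make the sampling argument concrete, here is a small simulation sketch. The "population", its weak linear relationship (slope 0.2) and the sample sizes are all assumptions chosen for illustration; the helper sample_correlations is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical "population" with a weak true linear relationship.
population_size = 100_000
x = rng.normal(size=population_size)
y = 0.2 * x + rng.normal(size=population_size)


def sample_correlations(sample_size, n_samples=5):
    """Return (r, p-value) pairs for several random samples of one size."""
    results = []
    for _ in range(n_samples):
        idx = rng.choice(population_size, size=sample_size, replace=False)
        r, p = stats.pearsonr(x[idx], y[idx])
        results.append((float(r), float(p)))
    return results


small = sample_correlations(20)
large = sample_correlations(2000)

for label, results in (("n=20", small), ("n=2000", large)):
    print(label, [f"r={r:.2f}, p={p:.3f}" for r, p in results])
```

In a simulation like this, the coefficients from small samples tend to scatter widely around the population value, while those from large samples stay close to it. The p-value is what lets you tell the two situations apart.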

With the Pandas approach, you never get p-values for your correlations, so you cannot test them. For this reason, I suggest using the Scipy library. Scipy’s stats module offers all three versions of the correlation test available in pandas.DataFrame.corr():
Pearson (https://docs.scipy.org/doc/scipy/refe...)
Spearman (https://docs.scipy.org/doc/scipy/refe...)
Kendall’s tau (https://docs.scipy.org/doc/scipy/refe...)
With the Scipy approach, you pass the columns of the DataFrame you want to compare. The Github snippet is here: https://gist.githubusercontent.com/Ka...

From the output of the code linked above, we can see the p-values and hence know how significant the correlations are.
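If you want p-values for every pair of columns at once, one possible approach is to loop over the column pairs and collect the results in a DataFrame. The helper correlation_p_values below is a hypothetical name introduced for this sketch, and the 0.05 threshold is a common convention rather than something stated in the video.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def correlation_p_values(frame, test=stats.pearsonr):
    """Return a symmetric DataFrame of p-values for every pair of columns."""
    cols = frame.columns
    p_values = pd.DataFrame(np.nan, index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        pair = frame[[a, b]].dropna()  # scipy tests do not skip NaNs by default
        _, p = test(pair[a], pair[b])
        p_values.loc[a, b] = p_values.loc[b, a] = float(p)
    return p_values


# Hypothetical toy data; the column names are placeholders.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": 0.8 * x + rng.normal(scale=0.5, size=n),
    "feature_c": rng.normal(size=n),
})

p_values = correlation_p_values(df)
significant = p_values < 0.05  # assumed threshold; NaN diagonal compares as False

print(df.corr().round(3))
print(p_values)
print(significant)
```

The diagonal is left as NaN because testing a column against itself is not meaningful. Passing test=stats.spearmanr or test=stats.kendalltau switches the method.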

Conclusion:
For correlation tests on data samples (which is often what you will be working with), always calculate the p-values as well. As such, when working with sample data, go for scipy.stats over pandas.DataFrame.corr().


Read more about:
Pearson and Spearman correlation coefficients (clearly explained): https://towardsdatascience.com/clearl...
Kendall Rank Correlation (explained): https://towardsdatascience.com/kendal...
Interpreting correlations: https://towardsdatascience.com/eveyth...

