[python]How to find the correlation for categorical variables
Introduction
When analyzing data, you often look at the correlation between the variables in it. For two numerical variables you can check the correlation coefficient, but what if one or both of them are categorical? I looked into it, so I will summarize what I found here.
Number vs Number
This case is well known: you can check the (Pearson) correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations.
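For reference, the standard Pearson definition for n paired samples can be written out as below. The symbols are notation assumed here for illustration: x_i and y_i are the paired observations, and the barred symbols are their sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```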
To find the correlation coefficient in Python, use the corr() method of pandas.DataFrame.
import numpy as np
import pandas as pd
x=np.random.randint(1, 10, 100)
y=np.random.randint(1, 10, 100)
data=pd.DataFrame({'x':x, 'y': y})
data.corr()
A value near 0 means there is no linear correlation, a value close to 1 means a positive correlation, and a value close to -1 means a negative correlation.
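As a quick sanity check of the two extremes, here is a minimal sketch on made-up data. The column names `up` and `down` are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
demo = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,      # exact increasing linear function of x
    "down": -3 * x + 5,   # exact decreasing linear function of x
})

demo_corr = demo.corr()
r_up = demo_corr.loc["x", "up"]      # close to +1
r_down = demo_corr.loc["x", "down"]  # close to -1
print(r_up, r_down)
```

Random data such as the `x` and `y` generated earlier would instead give a coefficient somewhere near 0.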
Category vs Number
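This relationship is expressed with a statistic called the correlation ratio. It is defined as the between-category variation of the numerical variable (the numerator) divided by its total variation (the denominator).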
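In symbols, a definition consistent with the Python function shown later in this section is the following. The notation is assumed here for illustration: c runs over the categories, n_c is the number of samples in category c, the barred x with subscript c is the mean within that category, and the plain barred x is the overall mean.

```latex
\eta^2 = \frac{\sum_{c} n_c \,(\bar{x}_c - \bar{x})^2}
              {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Some texts reserve the name "correlation ratio" for the square root of this quantity. The function in this article returns the ratio itself, without taking the square root.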
See here for a concrete example.
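The numerator represents how far apart the categories are, measured by the distance of each category mean from the overall mean. The farther apart the categories are, the larger the numerator, and the stronger the correlation.
The correlation ratio ranges from 0 to 1: 0 means no correlation, and a value approaching 1 means a strong correlation. Unlike the correlation coefficient, it has no sign, so it does not indicate a direction.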
In Python, it can be calculated as follows (see here).
def correlation_ratio(cat_key, num_key, data):
    categorical = data[cat_key]
    numerical = data[num_key]

    # Overall mean and total sum of squared deviations (NaN is skipped)
    mean = numerical.dropna().mean()
    all_var = ((numerical - mean) ** 2).sum()

    # Numerical values split by category (missing categories are dropped)
    unique_cat = list(pd.Series(categorical.unique()).dropna())
    categorical_num = [numerical[categorical == cat] for cat in unique_cat]

    # Number of samples in the category x (category mean - overall mean)^2
    categorical_var = [len(x.dropna()) * (x.dropna().mean() - mean) ** 2
                       for x in categorical_num]

    r = sum(categorical_var) / all_var
    return r
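To see the two extremes of the correlation ratio, here is a minimal sketch on toy data. The helper `correlation_ratio_groupby` is a hypothetical groupby-based variant written for this example, not the function above. It assumes that rows with a missing value in either column are simply dropped first.

```python
import numpy as np
import pandas as pd

def correlation_ratio_groupby(cat_key, num_key, data):
    """Hypothetical helper: between-group / total sum of squares via groupby."""
    # Assumption: rows with a missing category or value are dropped up front
    df = data[[cat_key, num_key]].dropna()
    overall_mean = df[num_key].mean()
    total_ss = ((df[num_key] - overall_mean) ** 2).sum()
    groups = df.groupby(cat_key)[num_key]
    # Group size x (group mean - overall mean)^2, summed over groups
    between_ss = (groups.count() * (groups.mean() - overall_mean) ** 2).sum()
    return between_ss / total_ss

toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "separated": [1.0, 1.0, 5.0, 5.0, 9.0, 9.0],  # constant within each group
    "mixed": [1.0, 3.0, 1.0, 3.0, 1.0, 3.0],      # same mean in every group
})

ratio_separated = correlation_ratio_groupby("group", "separated", toy)
ratio_mixed = correlation_ratio_groupby("group", "mixed", toy)
print(ratio_separated, ratio_mixed)
```

When the category fully determines the value the ratio reaches 1, and when every category has the same mean it drops to 0.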
Category vs Category
For this case we use a statistic called Cramér's coefficient of association (Cramér's V). Its definition is V = sqrt(χ² / (n(k − 1))),
where χ² is the chi-square statistic of the contingency table of the two variables, n is the number of data items, and k is the smaller of the two variables' numbers of categories. Please refer to here for the chi-square distribution.
Roughly speaking, it is a quantity that expresses how different the distribution within each category is from the overall distribution. Again, a value close to 0 means there is no association, and a value close to 1 means a strong association. Like the correlation ratio, it has no sign.
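To calculate it with Python, do the following (see here).
import scipy.stats as st

def cramerV(x, y, data):
    # Contingency table of the two categorical columns
    table = pd.crosstab(data[x], data[y])
    # correction=False turns off Yates' continuity correction
    x2, p, dof, e = st.chi2_contingency(table, correction=False)
    n = table.sum().sum()
    r = np.sqrt(x2 / (n * (np.min(table.shape) - 1)))
    return r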
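The following minimal sketch checks the two extremes of Cramér's V on made-up data. The helper `cramers_v_from_columns` is hypothetical and takes two plain sequences instead of a DataFrame, so that the example is self-contained.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

def cramers_v_from_columns(a, b):
    """Hypothetical helper: Cramér's V of two categorical sequences."""
    table = pd.crosstab(pd.Series(a, name="a"), pd.Series(b, name="b"))
    # First element of the result is the chi-square statistic
    chi2 = st.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape)  # smaller of the two numbers of categories
    return np.sqrt(chi2 / (n * (k - 1)))

color = ["red", "red", "blue", "blue", "green", "green"] * 5
same = list(color)      # identical to color: perfectly associated
size = ["S", "L"] * 15  # every color has equally many S and L

v_identical = cramers_v_from_columns(color, same)
v_unrelated = cramers_v_from_columns(color, size)
print(v_identical, v_unrelated)
```

A variable paired with an exact copy of itself gives 1, while a pairing in which every category has the same distribution gives 0.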
Find each index together
On its own this would just be a rehash of earlier articles, so I created a method that calculates all of these indices for a DataFrame at once. You don't have to look them up one by one!
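def is_categorical(data, key):
    col = data[key]
    # Check the dtype family so that int32, int64, etc. are all recognized
    if pd.api.types.is_integer_dtype(col):
        # Integer columns with few distinct values are treated as categories
        return col.nunique() < 6
    elif pd.api.types.is_float_dtype(col):
        return False
    else:
        return True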
def get_corr(data, categorical_keys=None):
    keys = data.keys()
    if categorical_keys is None:
        categorical_keys = [key for key in keys if is_categorical(data, key)]

    corr = pd.DataFrame({})
    corr_ratio = pd.DataFrame({})
    corr_cramer = pd.DataFrame({})
    for key1 in keys:
        for key2 in keys:
            if (key1 in categorical_keys) and (key2 in categorical_keys):
                # Category vs category: Cramer's V
                r = cramerV(key1, key2, data)
                corr_cramer.loc[key1, key2] = r
            elif (key1 in categorical_keys) and (key2 not in categorical_keys):
                # Category vs number: correlation ratio
                r = correlation_ratio(cat_key=key1, num_key=key2, data=data)
                corr_ratio.loc[key1, key2] = r
            elif (key1 not in categorical_keys) and (key2 in categorical_keys):
                r = correlation_ratio(cat_key=key2, num_key=key1, data=data)
                corr_ratio.loc[key1, key2] = r
            else:
                # Number vs number: computed per pair, so non-numeric
                # columns are never passed to corr()
                r = data[key1].corr(data[key2])
                corr.loc[key1, key2] = r
    return corr, corr_ratio, corr_cramer
Unless categorical_keys is specified, the categorical columns are determined automatically from each column's dtype.
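To make the automatic detection concrete, here is a small sketch on a toy DataFrame. The helper `guess_categorical_columns` and the column names are hypothetical. The helper only mirrors the rule used by is_categorical: integer columns with fewer than 6 distinct values and all non-numeric columns count as categorical, and float columns never do.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def guess_categorical_columns(df, max_int_levels=5):
    """Hypothetical helper mirroring the dtype rule of is_categorical."""
    result = []
    for key in df.columns:
        col = df[key]
        if is_integer_dtype(col):
            # Integers count as categories only with few distinct values
            if col.nunique() <= max_int_levels:
                result.append(key)
        elif not is_float_dtype(col):
            # Strings, booleans, etc. are treated as categories
            result.append(key)
    return result

toy = pd.DataFrame({
    "pclass": [1, 2, 3, 1, 2, 3, 1, 2],             # integer, 3 distinct values
    "ticket_no": [11, 12, 13, 14, 15, 16, 17, 18],  # integer, 8 distinct values
    "fare": [7.25, 71.3, 8.05, 53.1, 8.46, 51.9, 21.1, 11.1],  # float
    "sex": ["m", "f", "f", "f", "m", "m", "m", "f"],           # strings
})

detected = guess_categorical_columns(toy)
print(detected)  # ['pclass', 'sex']
```

Because the rule is only a heuristic, passing an explicit list of categorical keys is the safer choice whenever you know the columns.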
Let's apply it to the Titanic data.
data=pd.read_csv(r"train.csv")
data=data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
category=["Survived", "Pclass", "Sex", "Embarked"]
corr, corr_ratio, corr_cramer=get_corr(data, category)
corr
corr_ratio
corr_cramer
In addition, it can be visualized with the seaborn heatmap.
import seaborn as sns
sns.heatmap(corr_cramer, vmin=-1, vmax=1)
Lastly
My explanation of each statistic has ended up rather rough, so please see the pages listed in the references. Even when I write things up like this, I end up forgetting them and looking them up again, so I try to create methods that automate as much as possible.
The methods used in this article are uploaded on GitHub.
Reference