
[Python] How to find the correlation for categorical variables


Introduction


When analyzing data, you often want to look at the correlation between the variables. For two numerical variables you can simply check the correlation coefficient, but what if one or both of them are categorical? I looked it up, so here is a summary.



Number vs Number


This is the well-known case: you can simply check the correlation coefficient. The standard (Pearson) definition is as follows.
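
$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the means of the two variables.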



To compute the correlation coefficient in Python, use the corr() method of pandas.DataFrame.


import numpy as np
import pandas as pd

# random integer samples in [1, 10)
x = np.random.randint(1, 10, 100)
y = np.random.randint(1, 10, 100)

data = pd.DataFrame({'x': x, 'y': y})

# correlation matrix of the numerical columns
data.corr()


If the value is close to 0 there is no correlation, if it is close to 1 there is a strong positive correlation, and if it is close to -1 there is a strong negative correlation.



Category vs Number


This is expressed with a statistic called the correlation ratio. The definition is as follows.
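
$$
\eta^2 = \frac{\sum_{j=1}^{k} n_j\,(\bar{y}_j - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$

where the data are divided into $k$ categories, $n_j$ and $\bar{y}_j$ are the number and the mean of the numerical values in category $j$, and $\bar{y}$ is the overall mean.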

See here for a concrete example.

The numerator represents how far apart the category means are from the overall mean. The farther apart the categories are, the larger the numerator and the stronger the correlation.


This correlation ratio also means no correlation when it is 0 and a strong correlation when it approaches 1. Unlike the correlation coefficient, it takes values between 0 and 1, so it has no sign.

In Python, it can be calculated as follows (see here).


def correlation_ratio(cat_key, num_key, data):

    categorical = data[cat_key]
    numerical = data[num_key]

    mean = numerical.dropna().mean()
    all_var = ((numerical - mean)**2).sum()  # total sum of squared deviations

    unique_cat = pd.Series(categorical.unique())
    unique_cat = list(unique_cat.dropna())

    categorical_num = [numerical[categorical == cat] for cat in unique_cat]
    # number of samples in the category x (category mean - overall mean)^2
    categorical_var = [len(x.dropna())*(x.dropna().mean() - mean)**2 for x in categorical_num]

    # between-category sum of squares / total sum of squares
    r = sum(categorical_var)/all_var

    return r
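
As a quick sanity check, here is a small hand-made example (the column names are only for illustration):

# toy example: the numerical values clearly depend on the category,
# so the correlation ratio should be close to 1
toy = pd.DataFrame({'grade': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'score': [90, 95, 70, 75, 40, 45]})

correlation_ratio('grade', 'score', toy)  # about 0.99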

Category vs Category


We will look at this using a statistic called Cramér's V (Cramér's coefficient of association). The definition is as follows.
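
$$
V = \sqrt{\frac{\chi^2}{n\,(k - 1)}}
$$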

where χ² is the chi-square statistic computed from the contingency table, n is the number of data items, and k is the smaller of the two numbers of categories. Please refer to here for the chi-square test.

Roughly speaking, it is a quantity that expresses how different the distribution of each category is from the overall distribution. Again, if it is close to 0 there is no association, and if it is close to 1 there is a strong association; like the correlation ratio, it takes values between 0 and 1.

To calculate it in Python, do the following (see here).


import scipy.stats as st

def cramerV(x, y, data):

    table = pd.crosstab(data[x], data[y])                            # contingency table
    x2, p, dof, e = st.chi2_contingency(table, correction=False)     # chi-square statistic

    n = table.sum().sum()                                            # total number of samples
    r = np.sqrt(x2/(n*(np.min(table.shape) - 1)))                    # Cramér's V

    return r
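
As with the correlation ratio, a small hand-made example (the column names are only for illustration):

# toy example: the two columns determine each other completely,
# so Cramér's V should be 1
toy = pd.DataFrame({'color': ['red', 'red', 'blue', 'blue'],
                    'size': ['S', 'S', 'L', 'L']})

cramerV('color', 'size', toy)  # 1.0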


Calculating all the measures together


That alone would just be a rehash of the previous article, so I also wrote a function that calculates each measure for a whole DataFrame at once. You don't have to compute them one by one!


def is_categorical(data, key):

    col_type = data[key].dtype

    if col_type == 'int':
        # integer columns with only a few unique values are treated as categorical
        nunique = data[key].nunique()
        return nunique < 6

    elif col_type == "float":
        return False

    else:
        return True


def get_corr(data, categorical_keys=None):

    keys = data.keys()

    if categorical_keys is None:

        categorical_keys = keys[[is_categorical(data, key) for key in keys]]

    corr=pd.DataFrame({})
    corr_ratio=pd.DataFrame({})
    corr_cramer=pd.DataFrame({})

    for key1 in keys:
        for key2 in keys:

            if (key1 in categorical_keys) and (key2 in categorical_keys):

                r=cramerV(key1, key2, data)
                corr_cramer.loc[key1, key2]=r                

            elif (key1 in categorical_keys) and (key2 not in categorical_keys):

                r=correlation_ratio(cat_key=key1, num_key=key2, data=data)
                corr_ratio.loc[key1, key2]=r                

            elif (key1 not in categorical_keys) and (key2 in categorical_keys):

                r=correlation_ratio(cat_key=key2, num_key=key1, data=data)
                corr_ratio.loc[key1, key2]=r                

            else:

                # correlation coefficient for two numerical columns
                r = data[key1].corr(data[key2])
                corr.loc[key1, key2] = r

    return corr, corr_ratio, corr_cramer

Unless you specify them, which columns are categorical is determined automatically from the column type.


Let's apply it to the Titanic data.

data=pd.read_csv(r"train.csv")
data=data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
category=["Survived", "Pclass", "Sex", "Embarked"]

corr, corr_ratio, corr_cramer=get_corr(data, category)
corr

corr_ratio


corr_cramer


In addition, it can be visualized with a seaborn heatmap.

import seaborn as sns
sns.heatmap(corr_cramer, vmin=-1, vmax=1)


Lastly


The explanation of each statistic ended up being rough, so please see the pages listed in the references. Even when I do look these things up, I end up forgetting them and having to look them up again, so I try to create methods that automate as much as possible.


The methods used in this article are uploaded on GitHub.


Reference

