[python]How to visualize data
Introduction
When you work on a machine learning problem such as a Kaggle competition, the first thing to do is to visualize the data, and seaborn is often used for that. But seaborn offers many types of graphs, so have you ever wondered which one to use? (I have.)
There are many explanations of which method draws which graph, but I feel there are few explanations of the circumstances in which each graph is a good choice. So here I summarize which seaborn method to use for each combination of explanatory variable type and objective variable type.
Environment: Python 3.6.6, seaborn 0.10.0
Explanatory variable: Discrete quantity (category) Objective variable: Discrete quantity
First is the case where both the explanatory variable and the objective variable are discrete quantities (categories). Use seaborn's countplot, which draws how many rows fall into each category of the objective variable. Pass the explanatory variable to the x argument of countplot and the objective variable to hue. The data is Kaggle's Titanic dataset.
import pandas as pd
import seaborn as sns
data=pd.read_csv("train.csv")
sns.countplot(x='Embarked', data=data, hue='Survived')
You can also swap x and hue (which one to use is a matter of taste).
sns.countplot(x='Survived', data=data, hue='Embarked')
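To check the numbers behind the bars, the same counts can be tabulated with pandas. This is only a minimal sketch: the small DataFrame `toy` is made-up stand-in data (not the real Titanic file) so that the snippet runs on its own.

```python
import pandas as pd

# Made-up stand-in for a few Titanic rows (assumption, not real data)
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "S", "C"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Rows: explanatory variable, columns: objective variable.
# Each cell corresponds to the height of one bar in the countplot.
counts = pd.crosstab(toy["Embarked"], toy["Survived"])
print(counts)
```

With the real data, replace `toy` with the DataFrame read from train.csv.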
Explanatory variable: Continuous quantity Objective variable: Discrete quantity
Next is the case where the explanatory variable is a continuous quantity and the objective variable is a discrete quantity. Draw the distribution of the explanatory variable for each category of the objective variable with seaborn's distplot. Because distplot has no hue argument, it is combined with FacetGrid.
g=sns.FacetGrid(data=data, hue='Survived', height=5)
g.map(sns.distplot, 'Fare')
g.add_legend()
For how to color-code with a method that does not take hue as an argument, please refer to the other article.
Explanatory variable: Discrete quantity Objective variable: Continuous quantity
Next is the case where the explanatory variable is a discrete quantity and the objective variable is a continuous quantity. Draw the distribution of the objective variable for each category of the explanatory variable with seaborn's violinplot. The data is Kaggle's House Prices dataset.
train_data=pd.read_csv("train.csv")
sns.violinplot(x="MSZoning", y="SalePrice", data=train_data)
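If you also want numbers to go with the violins, a per-category summary can be computed with pandas. This is a minimal sketch using made-up stand-in values in `toy` (not the real House Prices file).

```python
import pandas as pd

# Made-up stand-in data (assumption); replace with the House Prices DataFrame
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RL", "RM", "RM", "RM"],
    "SalePrice": [200000, 180000, 220000, 120000, 140000, 100000],
})

# Count and quartiles of the objective variable for each category
summary = toy.groupby("MSZoning")["SalePrice"].describe()
print(summary[["count", "25%", "50%", "75%"]])
```

The quartiles give a rough numeric counterpart to the box drawn inside each violin.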
Explanatory variable: Continuous quantity Objective variable: Continuous quantity
Finally is the case where both the explanatory variable and the objective variable are continuous quantities. Draw the relationship between the explanatory variable and the objective variable with seaborn's jointplot.
sns.jointplot(x="LotArea", y="SalePrice", data=train_data)
jointplot is excellent because you can see the relationship between the two variables and the distribution of each variable at the same time.
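To put a number on what the scatter part shows, the correlation coefficient can be computed with pandas. This is a minimal sketch; the values in `toy` are made up for illustration and are not taken from the House Prices file.

```python
import pandas as pd

# Made-up stand-in data (assumption); replace with the House Prices DataFrame
toy = pd.DataFrame({
    "LotArea": [5000, 7000, 9000, 11000, 13000],
    "SalePrice": [120000, 150000, 170000, 210000, 230000],
})

# Series.corr uses the Pearson correlation coefficient by default
r = toy["LotArea"].corr(toy["SalePrice"])
print(round(r, 3))
```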
Summary
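The above is summarized below.
Explanatory variable discrete, objective variable discrete: countplot
Explanatory variable continuous, objective variable discrete: distplot (with FacetGrid)
Explanatory variable discrete, objective variable continuous: violinplot
Explanatory variable continuous, objective variable continuous: jointplot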