Open Terminal or CMD and type the following commands:
pip install seaborn
pip install statsmodels
Just like pandas, you have to import seaborn before using it:
# don't forget this
%matplotlib inline
import seaborn as sns
# Now tell jupyter to use seaborn colors
sns.set(color_codes=True)
import pandas as pd
weather_df = pd.read_csv("https://raw.githubusercontent.com/vega/vega-datasets/gh-pages/data/weather.csv")
# Your turn
# load https://github.com/vega/vega-datasets/raw/gh-pages/data/cars.json
# into cars_df
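One possible way to approach this exercise is `pd.read_json`, sketched below under the assumption that cars.json is a JSON array of records. To keep the sketch runnable offline, it parses two records copied from the `cars_df.head()` output instead of downloading the file; the name `cars_sample_df` is only illustrative.

```python
import io

import pandas as pd

# Two records in "list of objects" layout (assumed to match cars.json).
sample_json = io.StringIO("""[
  {"Name": "chevrolet chevelle malibu", "Cylinders": 8, "Horsepower": 130, "Origin": "USA"},
  {"Name": "buick skylark 320", "Cylinders": 8, "Horsepower": 165, "Origin": "USA"}
]""")

# read_json accepts a file-like object, a local path, or a URL.
cars_sample_df = pd.read_json(sample_json)
print(cars_sample_df)

# With network access, the same call should work on the real file:
# cars_df = pd.read_json("https://github.com/vega/vega-datasets/raw/gh-pages/data/cars.json")
```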
# Try to plot the distribution of the precipitation column
weather_df.precipitation.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1137243c8>
# What was new about the previous plot?
# Can you make a histogram of precipitation?
So let's try to improve it.
Great for overlaying plots, making customization, and creating trellis/grid plots
Excellent for quick EDA and allows for some customization
Let's examine what weather_df and cars_df look like:
# How do we examine weather_df to know what columns exist?
# How do we examine cars_df to know what columns exist?
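As a hint for these questions, here is a minimal sketch on a three-row stand-in for `weather_df` (values taken from the `head()` output; `sample_df` is an illustrative name). It lists the column names, picks out the non-numeric columns as candidates for categorical plots, and shows that column lookup is case sensitive.

```python
import pandas as pd

# Small stand-in for weather_df, not the full dataset.
sample_df = pd.DataFrame({
    "location": ["Seattle", "Seattle", "Seattle"],
    "precipitation": [0.0, 10.9, 0.8],
    "weather": ["drizzle", "rain", "rain"],
})

# All column names, in order.
column_names = list(sample_df.columns)

# Non-numeric columns are the usual candidates for count plots.
non_numeric = list(sample_df.select_dtypes(exclude="number").columns)

# Column names are case sensitive.
has_lower = "location" in sample_df.columns
has_upper = "Location" in sample_df.columns

print(column_names, non_numeric, has_lower, has_upper)
```

On the real data frames, `weather_df.columns`, `weather_df.dtypes`, and `weather_df.head()` give the same kind of overview.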
# BEWARE of case sensitivity, "Location" will not work!
sns.countplot(x="location", data=weather_df)
<matplotlib.axes._subplots.AxesSubplot at 0x1137c2390>
# Store the plot in a variable:
cnt_plt = sns.countplot(x="location", data=weather_df)
# Use savefig and give the file a name
cnt_plt.figure.savefig("location_count1")
# For a transparent background, use
cnt_plt.figure.savefig("location_count2", transparent=True)
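The same `savefig` call works on any Matplotlib figure, including the one behind a seaborn plot. The sketch below uses a plain Matplotlib bar chart with toy counts (not the real data) and writes to a temporary folder; the `dpi` and `bbox_inches` settings are optional extras, not something the steps above require.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["Seattle", "New York"], [3, 2])  # toy counts for illustration

out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "location_count.png")

# An explicit extension selects the file format; dpi controls resolution,
# and bbox_inches="tight" trims the surrounding whitespace.
fig.savefig(png_path, dpi=150, bbox_inches="tight", transparent=True)
plt.close(fig)
print(png_path)
```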
# Your turn to examine the distribution of other categorical variables we identified
# from both weather_df and cars_df
name_plt = sns.countplot(x='Name', data=cars_df)
# But first, notice how countplot works
data = [1,2,3,3,3,4,5,5,5,6,6,6,6]
names_plt = sns.countplot(x=data)
# countplot will do the counting of the categories for you
# Look at cars_df.Name
cars_df.Name
# It is just a list of names (Pandas calls it series)
# CountPlot is able to count them but the image is not readable
0    chevrolet chevelle malibu
1    buick skylark 320
2    plymouth satellite
3    amc rebel sst
4    ford torino
5    ford galaxie 500
6    chevrolet impala
7    plymouth fury iii
8    pontiac catalina
9    amc ambassador dpl
10    citroen ds-21 pallas
11    chevrolet chevelle concours (sw)
12    ford torino (sw)
13    plymouth satellite (sw)
14    amc rebel sst (sw)
15    dodge challenger se
16    plymouth 'cuda 340
17    ford mustang boss 302
18    chevrolet monte carlo
19    buick estate wagon (sw)
20    toyota corona mark ii
21    plymouth duster
22    amc hornet
23    ford maverick
24    datsun pl510
25    volkswagen 1131 deluxe sedan
26    peugeot 504
27    audi 100 ls
28    saab 99e
29    bmw 2002
...
376    chevrolet cavalier wagon
377    chevrolet cavalier 2-door
378    pontiac j2000 se hatchback
379    dodge aries se
380    pontiac phoenix
381    ford fairmont futura
382    amc concord dl
383    volkswagen rabbit l
384    mazda glc custom l
385    mazda glc custom
386    plymouth horizon miser
387    mercury lynx l
388    nissan stanza xe
389    honda Accelerationord
390    toyota corolla
391    honda civic
392    honda civic (auto)
393    datsun 310 gx
394    buick century limited
395    oldsmobile cutlass ciera (diesel)
396    chrysler lebaron medallion
397    ford granada l
398    toyota celica gt
399    dodge charger 2.2
400    chevrolet camaro
401    ford mustang gl
402    vw pickup
403    dodge rampage
404    ford ranger
405    chevy s-10
Name: Name, Length: 406, dtype: object
# Let's get the name counts and keep the top ten
# Let's see what the data looks like
cars_df.Name.value_counts()[:10]
ford pinto            6
amc matador           5
ford maverick         5
toyota corolla        5
toyota corona         4
chevrolet chevette    4
peugeot 504           4
chevrolet impala      4
amc hornet            4
amc gremlin           4
Name: Name, dtype: int64
# countplot doesn't work well with value_counts data
# If we use it, it will count the numbers for us
# and find how many times the 6s, 5s, and 4s occurred
data = cars_df.Name.value_counts()[:10]
sns.countplot(x=data)
<matplotlib.axes._subplots.AxesSubplot at 0x11864d198>
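The counting-of-counts effect can be reproduced with pandas alone. The sketch below uses a short toy list of car names (the names appear in the dataset, but the frequencies are made up) and applies `value_counts` twice, which is roughly what happens when already-counted data is handed to a count plot.

```python
import pandas as pd

# Toy data: made-up frequencies, for illustration only.
names = pd.Series(
    ["ford pinto"] * 3
    + ["amc matador"] * 2
    + ["ford maverick"] * 2
    + ["saab 99e"]
)

# First pass: how often each name occurs (what we want to plot).
counts = names.value_counts()

# Second pass: how often each *count* occurs (what a count plot would
# show if it were given `counts` instead of `names`).
counts_of_counts = counts.value_counts()

print(counts)
print(counts_of_counts)
```

Here the count 2 occurs twice, so a count plot of `counts` would draw a bar of height 2 at the value 2 and lose the car names entirely.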
# Instead, simply use a bar plot from matplotlib
# after taking the top ten from value_counts
cars_df.Name.value_counts()[:10].plot(kind="bar")
# What can you tell from this plot?
<matplotlib.axes._subplots.AxesSubplot at 0x11862a710>
weather_df.head()
| | location | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|---|
| 0 | Seattle | 2012-01-01 00:00 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | Seattle | 2012-01-02 00:00 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | Seattle | 2012-01-03 00:00 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | Seattle | 2012-01-04 00:00 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | Seattle | 2012-01-05 00:00 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
cars_df.head()
| | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Name | Origin | Weight_in_lbs | Year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 8 | 307.0 | 130.0 | 18.0 | chevrolet chevelle malibu | USA | 3504 | 1970-01-01 |
| 1 | 11.5 | 8 | 350.0 | 165.0 | 15.0 | buick skylark 320 | USA | 3693 | 1970-01-01 |
| 2 | 11.0 | 8 | 318.0 | 150.0 | 18.0 | plymouth satellite | USA | 3436 | 1970-01-01 |
| 3 | 12.0 | 8 | 304.0 | 150.0 | 16.0 | amc rebel sst | USA | 3433 | 1970-01-01 |
| 4 | 10.5 | 8 | 302.0 | 140.0 | 17.0 | ford torino | USA | 3449 | 1970-01-01 |
sns.distplot(cars_df.Acceleration)
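# Add the argument kde=False to remove the density curve
# You can control how the value range is split into bars
# using the bins argument
# Note: distplot is deprecated since seaborn 0.11;
# histplot and displot are its newer counterparts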
<matplotlib.axes._subplots.AxesSubplot at 0x1188189b0>
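To see what the `bins` argument does to the data, the sketch below uses NumPy's `np.histogram` on ten toy values (not the real Acceleration column). With an integer `bins`, the range from the minimum to the maximum is split into that many equal-width intervals, and each bar's height is the number of values falling in its interval. Seaborn's own binning defaults may differ, so treat this as an illustration of the idea only.

```python
import numpy as np

# Toy values for illustration; not taken from cars_df.
accel = np.array([8, 10, 12, 12, 14, 16, 16, 16, 20, 24])

# Split the range [8, 24] into 4 equal-width bins.
counts, edges = np.histogram(accel, bins=4)

print(edges)   # bin boundaries: 8, 12, 16, 20, 24
print(counts)  # values per bin: 2, 3, 3, 2
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.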
# Here is the distribution of another variable
sns.distplot(cars_df.Weight_in_lbs)
<matplotlib.axes._subplots.AxesSubplot at 0x118ffa668>
# try to plot cars_df.Horsepower
# What is the problem?
# how can we fix it? (2 solutions available)
cars_df.Cylinders.value_counts().plot(kind="pie")
<matplotlib.axes._subplots.AxesSubplot at 0x11a4c14e0>
# Another way of doing it with matplotlib
import matplotlib.pyplot as plt
plt.pie(cars_df.Cylinders.value_counts(), labels=cars_df.Cylinders.value_counts().index)
([<matplotlib.patches.Wedge at 0x119c8bb38>, <matplotlib.patches.Wedge at 0x119c907f0>, <matplotlib.patches.Wedge at 0x119c96470>, <matplotlib.patches.Wedge at 0x119c9a0f0>, <matplotlib.patches.Wedge at 0x119c9ad30>], [<matplotlib.text.Text at 0x119c90358>, <matplotlib.text.Text at 0x119c90f98>, <matplotlib.text.Text at 0x119c96c18>, <matplotlib.text.Text at 0x119c9a898>, <matplotlib.text.Text at 0x119c9e518>])
# orient can be 'v' or 'h'
sns.boxplot(cars_df.Acceleration, orient='v')
<matplotlib.axes._subplots.AxesSubplot at 0x11c5edc50>
Explore the seaborn documentation and try to plot the categorical variables using:
# Your work here
# You can add cells as needed
# But first, remember to convert the date field to datetime object
weather_df.date = pd.to_datetime(weather_df.date)
# Rotate the date tick labels by 50 degrees
plt.xticks(rotation=50)
plt.plot_date(x=weather_df.date, y=weather_df.temp_max, fmt='g-')
[<matplotlib.lines.Line2D at 0x11a3244a8>]
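Recent Matplotlib releases (3.9 and later) deprecate `plot_date`, so here is a hedged alternative sketch using the ordinary `plot` method, which accepts datetime values directly. It uses the five `temp_max` rows from the `weather_df.head()` output rather than the full data, and `ax.tick_params(labelrotation=...)` as one way to rotate the tick labels.

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of weather_df, copied from the head() output.
dates = pd.to_datetime(
    ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05"]
)
temp_max = [12.8, 10.6, 11.7, 12.2, 8.9]

fig, ax = plt.subplots()

# Same format string as before: "g" for green, "-" for a solid line.
(line,) = ax.plot(dates, temp_max, "g-")

# Rotate the x tick labels by 50 degrees.
ax.tick_params(axis="x", labelrotation=50)
```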
fmt='g-' means a green solid line.
Examine the seaborn documentation on aesthetics and the matplotlib tutorials to modify the plots that we have made so far. Specifically, select 4 different plots from above and perform the following:
# Your work here
# don't forget to move the plots you will work on here