MIS 492 - Data Analysis and Visualization

Week 3

Data Preparation Using Pandas

Dr. Mohammad AlMarzouq

Introduction to Pandas

  • It is a python 3rd party library
  • Used for data analysis and visualization
  • Part of Anaconda python distribution
  • Best used with Jupyter notebook, can be used with regular python programs
  • Main feature is the Data Frame
In [1]:
# Load the pandas library to let python know you will use it
import pandas as pd

What is a Data Frame?

  • Its a data structure, like lists and dictionaries
  • Consists of rows and columns, similar to SQL tables and excel spreadsheets
    • Columns are attributes or variables
    • Rows are records or single observations
  • Operations are typically performed on columns
  • Has both numeric and named indexing

Tidy Data

  • Standard form of organizing data in dataframe such that:
    • Each variable forms a column
    • Each row forms a row
    • Each table is an observational unit (level of analysis)
  • Required reading: Tidy Data, by Hadley Wickham

Importance of Tidy Data

  • Most tools we will use assume that data is tidy
  • Collected data is likely messy or non-tidy, need to learn how to reshape it
  • We will start with tidy data in our analysis
    • Will learn how to reshape when we start data collection

Loading data into a data frame

  • Data is usually loaded/collected from an external source, like an api, website, csv, or excel file, and might not be well organized.
    • We will start with well organized and behaved data to get to know how to use Pandas
  • Download the weather data set from vega-dataset (right click and save as)
  • Place it in the same directory as the jupyter notbook you are working on
In [5]:
# load the data using pandas library
# do you remember what was pd?
pd.read_csv("weather.csv")

# Jupter notebook tip:
# type: pd.
# then hit tab, see what happens
# try also: pd.read_ (then hit tab)
Out[5]:
location date precipitation temp_max temp_min wind weather
0 Seattle 2012-01-01 00:00 0.0 12.8 5.0 4.7 drizzle
1 Seattle 2012-01-02 00:00 10.9 10.6 2.8 4.5 rain
2 Seattle 2012-01-03 00:00 0.8 11.7 7.2 2.3 rain
3 Seattle 2012-01-04 00:00 20.3 12.2 5.6 4.7 rain
4 Seattle 2012-01-05 00:00 1.3 8.9 2.8 6.1 rain
5 Seattle 2012-01-06 00:00 2.5 4.4 2.2 2.2 rain
6 Seattle 2012-01-07 00:00 0.0 7.2 2.8 2.3 rain
7 Seattle 2012-01-08 00:00 0.0 10.0 2.8 2.0 sun
8 Seattle 2012-01-09 00:00 4.3 9.4 5.0 3.4 rain
9 Seattle 2012-01-10 00:00 1.0 6.1 0.6 3.4 rain
10 Seattle 2012-01-11 00:00 0.0 6.1 -1.1 5.1 sun
11 Seattle 2012-01-12 00:00 0.0 6.1 -1.7 1.9 sun
12 Seattle 2012-01-13 00:00 0.0 5.0 -2.8 1.3 sun
13 Seattle 2012-01-14 00:00 4.1 4.4 0.6 5.3 snow
14 Seattle 2012-01-15 00:00 5.3 1.1 -3.3 3.2 snow
15 Seattle 2012-01-16 00:00 2.5 1.7 -2.8 5.0 snow
16 Seattle 2012-01-17 00:00 8.1 3.3 0.0 5.6 snow
17 Seattle 2012-01-18 00:00 19.8 0.0 -2.8 5.0 snow
18 Seattle 2012-01-19 00:00 15.2 -1.1 -2.8 1.6 snow
19 Seattle 2012-01-20 00:00 13.5 7.2 -1.1 2.3 snow
20 Seattle 2012-01-21 00:00 3.0 8.3 3.3 8.2 rain
21 Seattle 2012-01-22 00:00 6.1 6.7 2.2 4.8 rain
22 Seattle 2012-01-23 00:00 0.0 8.3 1.1 3.6 rain
23 Seattle 2012-01-24 00:00 8.6 10.0 2.2 5.1 rain
24 Seattle 2012-01-25 00:00 8.1 8.9 4.4 5.4 rain
25 Seattle 2012-01-26 00:00 4.8 8.9 1.1 4.8 rain
26 Seattle 2012-01-27 00:00 0.0 6.7 -2.2 1.4 drizzle
27 Seattle 2012-01-28 00:00 0.0 6.7 0.6 2.2 rain
28 Seattle 2012-01-29 00:00 27.7 9.4 3.9 4.5 rain
29 Seattle 2012-01-30 00:00 3.6 8.3 6.1 5.1 rain
... ... ... ... ... ... ... ...
2892 New York 2015-12-02 00:00 3.0 13.9 8.3 2.0 fog
2893 New York 2015-12-03 00:00 0.0 13.3 7.2 7.2 sun
2894 New York 2015-12-04 00:00 0.0 11.7 5.0 4.7 sun
2895 New York 2015-12-05 00:00 0.0 11.7 1.7 2.4 sun
2896 New York 2015-12-06 00:00 0.0 10.6 3.3 2.9 sun
2897 New York 2015-12-07 00:00 0.0 12.8 4.4 3.4 drizzle
2898 New York 2015-12-08 00:00 0.0 10.6 4.4 3.5 sun
2899 New York 2015-12-09 00:00 0.0 12.8 1.1 3.4 sun
2900 New York 2015-12-10 00:00 0.0 15.0 8.9 3.0 drizzle
2901 New York 2015-12-11 00:00 0.0 14.4 7.8 2.7 drizzle
2902 New York 2015-12-12 00:00 0.0 17.8 9.4 1.9 fog
2903 New York 2015-12-13 00:00 0.0 21.1 11.7 3.1 drizzle
2904 New York 2015-12-14 00:00 9.1 16.1 11.7 4.8 fog
2905 New York 2015-12-15 00:00 2.3 17.8 11.7 8.2 fog
2906 New York 2015-12-16 00:00 1.3 11.7 7.2 4.1 fog
2907 New York 2015-12-17 00:00 29.7 15.0 10.0 4.1 fog
2908 New York 2015-12-18 00:00 0.3 14.4 3.9 6.1 sun
2909 New York 2015-12-19 00:00 0.0 5.0 2.2 9.0 sun
2910 New York 2015-12-20 00:00 0.0 6.7 1.7 5.1 sun
2911 New York 2015-12-21 00:00 0.0 12.8 3.3 5.3 sun
2912 New York 2015-12-22 00:00 4.8 15.6 11.1 3.8 fog
2913 New York 2015-12-23 00:00 29.5 17.2 8.9 4.5 fog
2914 New York 2015-12-24 00:00 0.5 20.6 13.9 4.9 fog
2915 New York 2015-12-25 00:00 2.5 17.8 11.1 0.9 fog
2916 New York 2015-12-26 00:00 0.3 15.6 9.4 4.8 drizzle
2917 New York 2015-12-27 00:00 2.0 17.2 8.9 5.5 fog
2918 New York 2015-12-28 00:00 1.3 8.9 1.7 6.3 snow
2919 New York 2015-12-29 00:00 16.8 9.4 1.1 5.3 fog
2920 New York 2015-12-30 00:00 9.4 10.6 5.0 3.0 fog
2921 New York 2015-12-31 00:00 1.5 11.1 6.1 5.5 fog

2922 rows × 7 columns

Loading File From URL

If the CSV file is downloadable from a url, you can put the URL in place of the file name:

In [37]:
my_df = pd.read_csv("https://raw.githubusercontent.com/vega/vega-datasets/gh-pages/data/weather.csv")
my_df
Out[37]:
location date precipitation temp_max temp_min wind weather
0 Seattle 2012-01-01 00:00 0.0 12.8 5.0 4.7 drizzle
1 Seattle 2012-01-02 00:00 10.9 10.6 2.8 4.5 rain
2 Seattle 2012-01-03 00:00 0.8 11.7 7.2 2.3 rain
3 Seattle 2012-01-04 00:00 20.3 12.2 5.6 4.7 rain
4 Seattle 2012-01-05 00:00 1.3 8.9 2.8 6.1 rain
5 Seattle 2012-01-06 00:00 2.5 4.4 2.2 2.2 rain
6 Seattle 2012-01-07 00:00 0.0 7.2 2.8 2.3 rain
7 Seattle 2012-01-08 00:00 0.0 10.0 2.8 2.0 sun
8 Seattle 2012-01-09 00:00 4.3 9.4 5.0 3.4 rain
9 Seattle 2012-01-10 00:00 1.0 6.1 0.6 3.4 rain
10 Seattle 2012-01-11 00:00 0.0 6.1 -1.1 5.1 sun
11 Seattle 2012-01-12 00:00 0.0 6.1 -1.7 1.9 sun
12 Seattle 2012-01-13 00:00 0.0 5.0 -2.8 1.3 sun
13 Seattle 2012-01-14 00:00 4.1 4.4 0.6 5.3 snow
14 Seattle 2012-01-15 00:00 5.3 1.1 -3.3 3.2 snow
15 Seattle 2012-01-16 00:00 2.5 1.7 -2.8 5.0 snow
16 Seattle 2012-01-17 00:00 8.1 3.3 0.0 5.6 snow
17 Seattle 2012-01-18 00:00 19.8 0.0 -2.8 5.0 snow
18 Seattle 2012-01-19 00:00 15.2 -1.1 -2.8 1.6 snow
19 Seattle 2012-01-20 00:00 13.5 7.2 -1.1 2.3 snow
20 Seattle 2012-01-21 00:00 3.0 8.3 3.3 8.2 rain
21 Seattle 2012-01-22 00:00 6.1 6.7 2.2 4.8 rain
22 Seattle 2012-01-23 00:00 0.0 8.3 1.1 3.6 rain
23 Seattle 2012-01-24 00:00 8.6 10.0 2.2 5.1 rain
24 Seattle 2012-01-25 00:00 8.1 8.9 4.4 5.4 rain
25 Seattle 2012-01-26 00:00 4.8 8.9 1.1 4.8 rain
26 Seattle 2012-01-27 00:00 0.0 6.7 -2.2 1.4 drizzle
27 Seattle 2012-01-28 00:00 0.0 6.7 0.6 2.2 rain
28 Seattle 2012-01-29 00:00 27.7 9.4 3.9 4.5 rain
29 Seattle 2012-01-30 00:00 3.6 8.3 6.1 5.1 rain
... ... ... ... ... ... ... ...
2892 New York 2015-12-02 00:00 3.0 13.9 8.3 2.0 fog
2893 New York 2015-12-03 00:00 0.0 13.3 7.2 7.2 sun
2894 New York 2015-12-04 00:00 0.0 11.7 5.0 4.7 sun
2895 New York 2015-12-05 00:00 0.0 11.7 1.7 2.4 sun
2896 New York 2015-12-06 00:00 0.0 10.6 3.3 2.9 sun
2897 New York 2015-12-07 00:00 0.0 12.8 4.4 3.4 drizzle
2898 New York 2015-12-08 00:00 0.0 10.6 4.4 3.5 sun
2899 New York 2015-12-09 00:00 0.0 12.8 1.1 3.4 sun
2900 New York 2015-12-10 00:00 0.0 15.0 8.9 3.0 drizzle
2901 New York 2015-12-11 00:00 0.0 14.4 7.8 2.7 drizzle
2902 New York 2015-12-12 00:00 0.0 17.8 9.4 1.9 fog
2903 New York 2015-12-13 00:00 0.0 21.1 11.7 3.1 drizzle
2904 New York 2015-12-14 00:00 9.1 16.1 11.7 4.8 fog
2905 New York 2015-12-15 00:00 2.3 17.8 11.7 8.2 fog
2906 New York 2015-12-16 00:00 1.3 11.7 7.2 4.1 fog
2907 New York 2015-12-17 00:00 29.7 15.0 10.0 4.1 fog
2908 New York 2015-12-18 00:00 0.3 14.4 3.9 6.1 sun
2909 New York 2015-12-19 00:00 0.0 5.0 2.2 9.0 sun
2910 New York 2015-12-20 00:00 0.0 6.7 1.7 5.1 sun
2911 New York 2015-12-21 00:00 0.0 12.8 3.3 5.3 sun
2912 New York 2015-12-22 00:00 4.8 15.6 11.1 3.8 fog
2913 New York 2015-12-23 00:00 29.5 17.2 8.9 4.5 fog
2914 New York 2015-12-24 00:00 0.5 20.6 13.9 4.9 fog
2915 New York 2015-12-25 00:00 2.5 17.8 11.1 0.9 fog
2916 New York 2015-12-26 00:00 0.3 15.6 9.4 4.8 drizzle
2917 New York 2015-12-27 00:00 2.0 17.2 8.9 5.5 fog
2918 New York 2015-12-28 00:00 1.3 8.9 1.7 6.3 snow
2919 New York 2015-12-29 00:00 16.8 9.4 1.1 5.3 fog
2920 New York 2015-12-30 00:00 9.4 10.6 5.0 3.0 fog
2921 New York 2015-12-31 00:00 1.5 11.1 6.1 5.5 fog

2922 rows × 7 columns

Now it is your turn

Download airport.csv then load it into the notebook

Remember: Right click on the link and select save target as

File Types

  • CSV: Comma Separated Values example
    • Use pd.read_csv
  • JSON: Javascript Object Notation example
    • Use pd.read_json
  • Excel: Microsoft Excel File
    • Use pd.read_excel
  • Others, type pd.read_ then hit Tab to see a list
In [ ]:
# Your turn
# Load: https://github.com/vega/vega-datasets/raw/gh-pages/data/cars.json
# Into: cars_df

How to work with the data?

  • You must place it in a variable so you can refer to it
  • The current data was displayed and not assigned to a variable, so you cannot use it
  • Assign it to a variable named my_df
In [3]:
my_df = pd.read_csv("weather.csv")
In [ ]:
# Your turn: Load airports.csv into airports_df

Let us discover how the data looks like

We examine the top and bottom records of the dataframe to get an idea of what the data looks like

In [9]:
my_df.head()
Out[9]:
location date precipitation temp_max temp_min wind weather
0 Seattle 2012-01-01 00:00 0.0 12.8 5.0 4.7 drizzle
1 Seattle 2012-01-02 00:00 10.9 10.6 2.8 4.5 rain
2 Seattle 2012-01-03 00:00 0.8 11.7 7.2 2.3 rain
3 Seattle 2012-01-04 00:00 20.3 12.2 5.6 4.7 rain
4 Seattle 2012-01-05 00:00 1.3 8.9 2.8 6.1 rain
In [ ]:
# You can pass a number in the head() method to show more data
# show 10 items (try it)

# do the same for airports_df
In [13]:
# To know which columns are available use the columns attribute
my_df.columns
Out[13]:
Index(['location', 'date', 'precipitation', 'temp_max', 'temp_min', 'wind',
       'weather'],
      dtype='object')
In [ ]:
# Your turn: explore the columns for airports_df

Data types

  • Each column will have its own data type
  • Remember, variables will be in columns
  • Observations in rows
  • Use dtypes attribute of to discover columns and datatypes
    • OOP: What is the difference between a function, a method, an attribute, and a variable?
In [18]:
my_df.dtypes
Out[18]:
location          object
date              object
precipitation    float64
temp_max         float64
temp_min         float64
wind             float64
weather           object
dtype: object
In [ ]:
# Your turn: Find out the data types for the airports_df column
In [18]:
# Pandas uses data types provided by numpy
# load numpy
import numpy as np

# convert the column to datetime
my_df.date.astype(np.datetime64)
Out[18]:
0      2012-01-01
1      2012-01-02
2      2012-01-03
3      2012-01-04
4      2012-01-05
5      2012-01-06
6      2012-01-07
7      2012-01-08
8      2012-01-09
9      2012-01-10
10     2012-01-11
11     2012-01-12
12     2012-01-13
13     2012-01-14
14     2012-01-15
15     2012-01-16
16     2012-01-17
17     2012-01-18
18     2012-01-19
19     2012-01-20
20     2012-01-21
21     2012-01-22
22     2012-01-23
23     2012-01-24
24     2012-01-25
25     2012-01-26
26     2012-01-27
27     2012-01-28
28     2012-01-29
29     2012-01-30
          ...    
2892   2015-12-02
2893   2015-12-03
2894   2015-12-04
2895   2015-12-05
2896   2015-12-06
2897   2015-12-07
2898   2015-12-08
2899   2015-12-09
2900   2015-12-10
2901   2015-12-11
2902   2015-12-12
2903   2015-12-13
2904   2015-12-14
2905   2015-12-15
2906   2015-12-16
2907   2015-12-17
2908   2015-12-18
2909   2015-12-19
2910   2015-12-20
2911   2015-12-21
2912   2015-12-22
2913   2015-12-23
2914   2015-12-24
2915   2015-12-25
2916   2015-12-26
2917   2015-12-27
2918   2015-12-28
2919   2015-12-29
2920   2015-12-30
2921   2015-12-31
Name: date, dtype: datetime64[ns]
In [5]:
# an alternative way to do it is using
pd.to_datetime(my_df.date).head() # do you remember head method?
Out[5]:
0   2012-01-01
1   2012-01-02
2   2012-01-03
3   2012-01-04
4   2012-01-05
Name: date, dtype: datetime64[ns]
In [6]:
# now let us examine the date column
my_df.date.head()

# why is it still of type object?
# How to fix it?
Out[6]:
0    2012-01-01 00:00
1    2012-01-02 00:00
2    2012-01-03 00:00
3    2012-01-04 00:00
4    2012-01-05 00:00
Name: date, dtype: object
In [ ]:
# just like the dataframe, the command creates a copy
# but does not store it
# We need to replace the old date column with the new one
my_df.date = my_df.date.astype(np.datetime64)
In [23]:
# check the types
my_df.dtypes
Out[23]:
location                 object
date             datetime64[ns]
precipitation           float64
temp_max                float64
temp_min                float64
wind                    float64
weather                  object
dtype: object
In [2]:
# Your turn: examine the airports_df dataframe
# are there any date columns that you can convert?
# Check then numeric columns, what should their data type be?

Data Types and Variable Types

Variable Type Data Type
Continuous float64, datetime64[ns]
Discrete int64
Ordinal int64, category
Nominal int64, object, category
Categorical int64, object, category

Why convert an object column into a date column?

  • As you will find out later, pandas can do more fancy things if it knows the column is a date
  • For example:
    • Sort
    • Filter based on date range
    • Date arethmatic
  • Always make sure date/time columns have the correct data type

Indexing Columns

  • Using square brackets [ ]
  • Using dot notation .
In [7]:
# a single column is known as a series
my_df['location']
Out[7]:
0        Seattle
1        Seattle
2        Seattle
3        Seattle
4        Seattle
5        Seattle
6        Seattle
7        Seattle
8        Seattle
9        Seattle
10       Seattle
11       Seattle
12       Seattle
13       Seattle
14       Seattle
15       Seattle
16       Seattle
17       Seattle
18       Seattle
19       Seattle
20       Seattle
21       Seattle
22       Seattle
23       Seattle
24       Seattle
25       Seattle
26       Seattle
27       Seattle
28       Seattle
29       Seattle
          ...   
2892    New York
2893    New York
2894    New York
2895    New York
2896    New York
2897    New York
2898    New York
2899    New York
2900    New York
2901    New York
2902    New York
2903    New York
2904    New York
2905    New York
2906    New York
2907    New York
2908    New York
2909    New York
2910    New York
2911    New York
2912    New York
2913    New York
2914    New York
2915    New York
2916    New York
2917    New York
2918    New York
2919    New York
2920    New York
2921    New York
Name: location, dtype: object
In [8]:
# Some methods that work on Dataframes also work on Series
my_df['location'].head()
Out[8]:
0    Seattle
1    Seattle
2    Seattle
3    Seattle
4    Seattle
Name: location, dtype: object
In [13]:
# Dot notation to access series
my_df.location.head()
Out[13]:
0    Seattle
1    Seattle
2    Seattle
3    Seattle
4    Seattle
Name: location, dtype: object
In [ ]:
# Your turn: Try to index the columns for airports_df using square brackets and dot notation
# Use head() to get an idea of what the data is
In [14]:
# Descriptive statistics
my_df['location'].describe()
Out[14]:
count         2922
unique           2
top       New York
freq          1461
Name: location, dtype: object
In [15]:
# works also on dataframe
my_df.describe()
Out[15]:
precipitation temp_max temp_min wind
count 2922.000000 2922.000000 2922.000000 2922.000000
mean 2.944764 16.769131 8.612320 4.101129
std 7.695286 8.644596 7.511776 1.880791
min 0.000000 -7.700000 -16.000000 0.400000
25% 0.000000 10.000000 3.300000 2.700000
50% 0.000000 16.100000 8.900000 3.800000
75% 1.800000 23.900000 13.900000 5.100000
max 118.900000 37.800000 26.700000 16.200000
In [26]:
# Different data types will have different descriptives
my_df['date'].describe()
Out[26]:
count                    2922
unique                   1461
top       2013-06-05 00:00:00
freq                        2
first     2012-01-01 00:00:00
last      2015-12-31 00:00:00
Name: date, dtype: object
In [21]:
my_df.precipitation.describe()
Out[21]:
count    2922.000000
mean        2.944764
std         7.695286
min         0.000000
25%         0.000000
50%         0.000000
75%         1.800000
max       118.900000
Name: precipitation, dtype: float64
In [ ]:
# Your turn: Use describe() on airports_df

# Which columns are included in describe?

# Try it on the columns that were excluded:

# Why were these columns excluded?

You can also plot a dataframe

  • Pandas will try to show it in the best way possible
  • Plotting from dataframe is very simplistic and used for quick univariate exploration
In [ ]:
my_df.plot()
In [29]:
# You need to tell pandas that you want to display plots in the notebook
%matplotlib inline
In [31]:
# now try it
my_df.plot()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1147020b8>

Don't forget!

Always include in your notebook:

# this is the first cell in your notebook
import pandas as pd
%matplotlib inline   # dont forget this
In [32]:
# It's more meaningful to plot Series
my_df['precipitation'].plot()
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1147250f0>
In [9]:
# Remember plots show change from one observation to the next
my_df['wind'].plot()

# in some cases it might not be useful
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b33f710>
In [40]:
# you can try a histogram
my_df['wind'].hist()

# Which is useful to know distributions
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x116d89240>

How can you find out if percipitation is usually high in the year or low?

In [ ]:
# Your turn:
In [45]:
# Sometime pandas cannot plot it
my_df['location'].plot()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-e083b00ff51a> in <module>()
      1 # Sometime pandas cannot plot it
----> 2 my_df['location'].plot()

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in __call__(self, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   3564                            colormap=colormap, table=table, yerr=yerr,
   3565                            xerr=xerr, label=label, secondary_y=secondary_y,
-> 3566                            **kwds)
   3567     __call__.__doc__ = plot_series.__doc__
   3568 

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in plot_series(data, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   2643                  yerr=yerr, xerr=xerr,
   2644                  label=label, secondary_y=secondary_y,
-> 2645                  **kwds)
   2646 
   2647 

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in _plot(data, x, y, subplots, ax, kind, **kwds)
   2439         plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds)
   2440 
-> 2441     plot_obj.generate()
   2442     plot_obj.draw()
   2443     return plot_obj.result

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in generate(self)
   1024     def generate(self):
   1025         self._args_adjust()
-> 1026         self._compute_plot_data()
   1027         self._setup_subplots()
   1028         self._make_plot()

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in _compute_plot_data(self)
   1133         if is_empty:
   1134             raise TypeError('Empty {0!r}: no numeric data to '
-> 1135                             'plot'.format(numeric_data.__class__.__name__))
   1136 
   1137         self.data = numeric_data

TypeError: Empty 'DataFrame': no numeric data to plot
In [ ]:
# Your turn: try to plot the columns in airport_df using either plot() or hist()


# What can you find out about the data?
In [49]:
# Such variables are usually categorical and you can get frequencies like so
my_df['location'].value_counts()
Out[49]:
New York    1461
Seattle     1461
Name: location, dtype: int64
In [3]:
# Your turn: Examine the columns for airports_df
# what would be the best columns to check frequencies for?
# try it:


# The best columns are:

# The reason frequencies is best calculated on them is because:

# What did you find out about your data?

# try to plot value_count(), how do you do it? what do you get?

Are data frames immutable?

  • Yes, however, all operations that change values will produce a copy and not change the original
  • You have to use assignment to change columns or dataframes
  • So be careful!

Finding Missing Value (Nulls or Nans)

  • Missing values are usually represented by:
    • The Python Null if the value doesn't exists
    • Numpy nan if the value is Not a Number (like zero division)
  • Use isnull() value to find if any value is null in the DataFrame:
In [26]:
my_df.isnull()
Out[26]:
location date precipitation temp_max temp_min wind weather
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
5 False False False False False False False
6 False False False False False False False
7 False False False False False False False
8 False False False False False False False
9 False False False False False False False
10 False False False False False False False
11 False False False False False False False
12 False False False False False False False
13 False False False False False False False
14 False False False False False False False
15 False False False False False False False
16 False False False False False False False
17 False False False False False False False
18 False False False False False False False
19 False False False False False False False
20 False False False False False False False
21 False False False False False False False
22 False False False False False False False
23 False False False False False False False
24 False False False False False False False
25 False False False False False False False
26 False False False False False False False
27 False False False False False False False
28 False False False False False False False
29 False False False False False False False
... ... ... ... ... ... ... ...
2892 False False False False False False False
2893 False False False False False False False
2894 False False False False False False False
2895 False False False False False False False
2896 False False False False False False False
2897 False False False False False False False
2898 False False False False False False False
2899 False False False False False False False
2900 False False False False False False False
2901 False False False False False False False
2902 False False False False False False False
2903 False False False False False False False
2904 False False False False False False False
2905 False False False False False False False
2906 False False False False False False False
2907 False False False False False False False
2908 False False False False False False False
2909 False False False False False False False
2910 False False False False False False False
2911 False False False False False False False
2912 False False False False False False False
2913 False False False False False False False
2914 False False False False False False False
2915 False False False False False False False
2916 False False False False False False False
2917 False False False False False False False
2918 False False False False False False False
2919 False False False False False False False
2920 False False False False False False False
2921 False False False False False False False

2922 rows × 7 columns

In [ ]:
# Its a Dataframe containing boolean values
# Use describe()

# or use value_counts() on each column to count the Nulls

Call any() on isnull() to see which columns have null values

In [11]:
my_df.isnull().any()
Out[11]:
location         False
date             False
precipitation    False
temp_max         False
temp_min         False
wind             False
weather          False
dtype: bool

Call any() again to see if the dataframe has any null value

In [14]:
my_df.isnull().any().any()
Out[14]:
False

Duplicates

  • Search for any repeated values
  • use duplicated()
    • Note that this looks at all the columns in the record
    • You can pass a list of column names to check duplication based on
In [28]:
my_df.duplicated()
Out[28]:
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2892    False
2893    False
2894    False
2895    False
2896    False
2897    False
2898    False
2899    False
2900    False
2901    False
2902    False
2903    False
2904    False
2905    False
2906    False
2907    False
2908    False
2909    False
2910    False
2911    False
2912    False
2913    False
2914    False
2915    False
2916    False
2917    False
2918    False
2919    False
2920    False
2921    False
dtype: bool
In [33]:
# Try to pass location as the duplication column and see what happens
my_df.duplicated("location").value_counts()

# what is value_counts?
Out[33]:
True     2920
False       2
dtype: int64
In [34]:
# now try it with both location and weather columns
my_df.duplicated(["location", "weather"]).value_counts()
Out[34]:
True     2912
False      10
dtype: int64