Introduction to Pandas¶

It is a python 3rd party library
Used for data analysis and visualization
Part of Anaconda python distribution
Best used with Jupyter notebook, can be used with regular python programs
Main feature is the Data Frame

In [1]:

# Load the pandas library to let python know you will use it
import pandas as pd

What is a Data Frame?¶

Its a data structure, like lists and dictionaries
Consists of rows and columns, similar to SQL tables and excel spreadsheets
- Columns are attributes or variables
- Rows are records or single observations
Operations are typically performed on columns
Has both numeric and named indexing

Loading data into a data frame¶

Data is usually loaded from an external source, like a csv or excel file.
Download the weather data set from vega-dataset (right click and save as)
Place it in the same directory as the jupyter notbook you are working on

In [5]:

# load the data using pandas library
# do you remember what was pd?
pd.read_csv("weather.csv")

# Jupter notebook tip:
# type: pd.
# then hit tab, see what happens
# try also: pd.read_ (then hit tab)

Out[5]:

	location	date	precipitation	temp_max	temp_min	wind	weather
0	Seattle	2012-01-01 00:00	0.0	12.8	5.0	4.7	drizzle
1	Seattle	2012-01-02 00:00	10.9	10.6	2.8	4.5	rain
2	Seattle	2012-01-03 00:00	0.8	11.7	7.2	2.3	rain
3	Seattle	2012-01-04 00:00	20.3	12.2	5.6	4.7	rain
4	Seattle	2012-01-05 00:00	1.3	8.9	2.8	6.1	rain
5	Seattle	2012-01-06 00:00	2.5	4.4	2.2	2.2	rain
6	Seattle	2012-01-07 00:00	0.0	7.2	2.8	2.3	rain
7	Seattle	2012-01-08 00:00	0.0	10.0	2.8	2.0	sun
8	Seattle	2012-01-09 00:00	4.3	9.4	5.0	3.4	rain
9	Seattle	2012-01-10 00:00	1.0	6.1	0.6	3.4	rain
10	Seattle	2012-01-11 00:00	0.0	6.1	-1.1	5.1	sun
11	Seattle	2012-01-12 00:00	0.0	6.1	-1.7	1.9	sun
12	Seattle	2012-01-13 00:00	0.0	5.0	-2.8	1.3	sun
13	Seattle	2012-01-14 00:00	4.1	4.4	0.6	5.3	snow
14	Seattle	2012-01-15 00:00	5.3	1.1	-3.3	3.2	snow
15	Seattle	2012-01-16 00:00	2.5	1.7	-2.8	5.0	snow
16	Seattle	2012-01-17 00:00	8.1	3.3	0.0	5.6	snow
17	Seattle	2012-01-18 00:00	19.8	0.0	-2.8	5.0	snow
18	Seattle	2012-01-19 00:00	15.2	-1.1	-2.8	1.6	snow
19	Seattle	2012-01-20 00:00	13.5	7.2	-1.1	2.3	snow
20	Seattle	2012-01-21 00:00	3.0	8.3	3.3	8.2	rain
21	Seattle	2012-01-22 00:00	6.1	6.7	2.2	4.8	rain
22	Seattle	2012-01-23 00:00	0.0	8.3	1.1	3.6	rain
23	Seattle	2012-01-24 00:00	8.6	10.0	2.2	5.1	rain
24	Seattle	2012-01-25 00:00	8.1	8.9	4.4	5.4	rain
25	Seattle	2012-01-26 00:00	4.8	8.9	1.1	4.8	rain
26	Seattle	2012-01-27 00:00	0.0	6.7	-2.2	1.4	drizzle
27	Seattle	2012-01-28 00:00	0.0	6.7	0.6	2.2	rain
28	Seattle	2012-01-29 00:00	27.7	9.4	3.9	4.5	rain
29	Seattle	2012-01-30 00:00	3.6	8.3	6.1	5.1	rain
...	...	...	...	...	...	...	...
2892	New York	2015-12-02 00:00	3.0	13.9	8.3	2.0	fog
2893	New York	2015-12-03 00:00	0.0	13.3	7.2	7.2	sun
2894	New York	2015-12-04 00:00	0.0	11.7	5.0	4.7	sun
2895	New York	2015-12-05 00:00	0.0	11.7	1.7	2.4	sun
2896	New York	2015-12-06 00:00	0.0	10.6	3.3	2.9	sun
2897	New York	2015-12-07 00:00	0.0	12.8	4.4	3.4	drizzle
2898	New York	2015-12-08 00:00	0.0	10.6	4.4	3.5	sun
2899	New York	2015-12-09 00:00	0.0	12.8	1.1	3.4	sun
2900	New York	2015-12-10 00:00	0.0	15.0	8.9	3.0	drizzle
2901	New York	2015-12-11 00:00	0.0	14.4	7.8	2.7	drizzle
2902	New York	2015-12-12 00:00	0.0	17.8	9.4	1.9	fog
2903	New York	2015-12-13 00:00	0.0	21.1	11.7	3.1	drizzle
2904	New York	2015-12-14 00:00	9.1	16.1	11.7	4.8	fog
2905	New York	2015-12-15 00:00	2.3	17.8	11.7	8.2	fog
2906	New York	2015-12-16 00:00	1.3	11.7	7.2	4.1	fog
2907	New York	2015-12-17 00:00	29.7	15.0	10.0	4.1	fog
2908	New York	2015-12-18 00:00	0.3	14.4	3.9	6.1	sun
2909	New York	2015-12-19 00:00	0.0	5.0	2.2	9.0	sun
2910	New York	2015-12-20 00:00	0.0	6.7	1.7	5.1	sun
2911	New York	2015-12-21 00:00	0.0	12.8	3.3	5.3	sun
2912	New York	2015-12-22 00:00	4.8	15.6	11.1	3.8	fog
2913	New York	2015-12-23 00:00	29.5	17.2	8.9	4.5	fog
2914	New York	2015-12-24 00:00	0.5	20.6	13.9	4.9	fog
2915	New York	2015-12-25 00:00	2.5	17.8	11.1	0.9	fog
2916	New York	2015-12-26 00:00	0.3	15.6	9.4	4.8	drizzle
2917	New York	2015-12-27 00:00	2.0	17.2	8.9	5.5	fog
2918	New York	2015-12-28 00:00	1.3	8.9	1.7	6.3	snow
2919	New York	2015-12-29 00:00	16.8	9.4	1.1	5.3	fog
2920	New York	2015-12-30 00:00	9.4	10.6	5.0	3.0	fog
2921	New York	2015-12-31 00:00	1.5	11.1	6.1	5.5	fog

2922 rows × 7 columns

Now it is your turn¶

Download airport.csv then load it into the notebook

Remember: Right click on the link and select save target as

In [ ]:

How to work with the data?¶

You must place it in a variable so you can refer to it
The current data was displayed and not assigned to a variable, so you cannot use it
Assign it to a variable named my_df

In [3]:

my_df = pd.read_csv("weather.csv")

In [ ]:

# Your turn: Load airports.csv into airports_df

Let us discover how the data looks like¶

In [9]:

my_df.head()

Out[9]:

	location	date	precipitation	temp_max	temp_min	wind	weather
0	Seattle	2012-01-01 00:00	0.0	12.8	5.0	4.7	drizzle
1	Seattle	2012-01-02 00:00	10.9	10.6	2.8	4.5	rain
2	Seattle	2012-01-03 00:00	0.8	11.7	7.2	2.3	rain
3	Seattle	2012-01-04 00:00	20.3	12.2	5.6	4.7	rain
4	Seattle	2012-01-05 00:00	1.3	8.9	2.8	6.1	rain

In [ ]:

# You can pass a number in the head() method to show more data
# show 10 items (try it)

# do the same for airports_df

In [13]:

# To know which columns are available use the columns attribute
my_df.columns

Out[13]:

Index(['location', 'date', 'precipitation', 'temp_max', 'temp_min', 'wind',
       'weather'],
      dtype='object')

In [ ]:

# Your turn: explore the columns for airports_df

Data types¶

Each column will have its own data type
Remember, variables will be in columns
Observations in rows
Use dtypes attribute of to discover columns and datatypes
- OOP: What is the difference between a function, a method, an attribute, and a variable?

In [18]:

my_df.dtypes

Out[18]:

location          object
date              object
precipitation    float64
temp_max         float64
temp_min         float64
wind             float64
weather           object
dtype: object

In [ ]:

# Your turn: Find out the data types for the airports_df column

In [18]:

# Pandas uses data types provided by numpy
# load numpy
import numpy as np

# convert the column to datetime
my_df.date.astype(np.datetime64)

Out[18]:

0      2012-01-01
1      2012-01-02
2      2012-01-03
3      2012-01-04
4      2012-01-05
5      2012-01-06
6      2012-01-07
7      2012-01-08
8      2012-01-09
9      2012-01-10
10     2012-01-11
11     2012-01-12
12     2012-01-13
13     2012-01-14
14     2012-01-15
15     2012-01-16
16     2012-01-17
17     2012-01-18
18     2012-01-19
19     2012-01-20
20     2012-01-21
21     2012-01-22
22     2012-01-23
23     2012-01-24
24     2012-01-25
25     2012-01-26
26     2012-01-27
27     2012-01-28
28     2012-01-29
29     2012-01-30
          ...    
2892   2015-12-02
2893   2015-12-03
2894   2015-12-04
2895   2015-12-05
2896   2015-12-06
2897   2015-12-07
2898   2015-12-08
2899   2015-12-09
2900   2015-12-10
2901   2015-12-11
2902   2015-12-12
2903   2015-12-13
2904   2015-12-14
2905   2015-12-15
2906   2015-12-16
2907   2015-12-17
2908   2015-12-18
2909   2015-12-19
2910   2015-12-20
2911   2015-12-21
2912   2015-12-22
2913   2015-12-23
2914   2015-12-24
2915   2015-12-25
2916   2015-12-26
2917   2015-12-27
2918   2015-12-28
2919   2015-12-29
2920   2015-12-30
2921   2015-12-31
Name: date, dtype: datetime64[ns]

In [5]:

# an alternative way to do it is using
pd.to_datetime(my_df.date).head() # do you remember head method?

Out[5]:

0   2012-01-01
1   2012-01-02
2   2012-01-03
3   2012-01-04
4   2012-01-05
Name: date, dtype: datetime64[ns]

In [6]:

# now let us examine the date column
my_df.date.head()

# why is it still of type object?
# How to fix it?

Out[6]:

0    2012-01-01 00:00
1    2012-01-02 00:00
2    2012-01-03 00:00
3    2012-01-04 00:00
4    2012-01-05 00:00
Name: date, dtype: object

In [ ]:

# just like the dataframe, the command creates a copy
# but does not store it
# We need to replace the old date column with the new one
my_df.date = my_df.date.astype(np.datetime64)

In [23]:

# check the types
my_df.dtypes

Out[23]:

location                 object
date             datetime64[ns]
precipitation           float64
temp_max                float64
temp_min                float64
wind                    float64
weather                  object
dtype: object

In [ ]:

# Your turn: examine the airports_df dataframe
# are there any date columns that you can convert?
# Check then umeric columns, what should their data type be?

Why convert an object column into a date column?¶

As you will find out later, pandas can do more fancy things if it knows the column is a date
For example:
- Sort
- Filter based on date range
- Date arethmatic
Always make sure date/time columns have the correct data type

Indexing Columns¶

Using square brackets [ ]
Using dot notation .

In [7]:

# a single column is known as a series
my_df['location']

Out[7]:

0        Seattle
1        Seattle
2        Seattle
3        Seattle
4        Seattle
5        Seattle
6        Seattle
7        Seattle
8        Seattle
9        Seattle
10       Seattle
11       Seattle
12       Seattle
13       Seattle
14       Seattle
15       Seattle
16       Seattle
17       Seattle
18       Seattle
19       Seattle
20       Seattle
21       Seattle
22       Seattle
23       Seattle
24       Seattle
25       Seattle
26       Seattle
27       Seattle
28       Seattle
29       Seattle
          ...   
2892    New York
2893    New York
2894    New York
2895    New York
2896    New York
2897    New York
2898    New York
2899    New York
2900    New York
2901    New York
2902    New York
2903    New York
2904    New York
2905    New York
2906    New York
2907    New York
2908    New York
2909    New York
2910    New York
2911    New York
2912    New York
2913    New York
2914    New York
2915    New York
2916    New York
2917    New York
2918    New York
2919    New York
2920    New York
2921    New York
Name: location, dtype: object

In [8]:

# Some methods that work on Dataframes also work on Series
my_df['location'].head()

Out[8]:

0    Seattle
1    Seattle
2    Seattle
3    Seattle
4    Seattle
Name: location, dtype: object

In [13]:

# Dot notation to access series
my_df.location.head()

Out[13]:

0    Seattle
1    Seattle
2    Seattle
3    Seattle
4    Seattle
Name: location, dtype: object

In [ ]:

# Your turn: Try to index the columns for airports_df using square brackets and dot notation
# Use head() to get an idea of what the data is

In [14]:

# Descriptive statistics
my_df['location'].describe()

Out[14]:

count         2922
unique           2
top       New York
freq          1461
Name: location, dtype: object

In [15]:

# works also on dataframe
my_df.describe()

Out[15]:

	precipitation	temp_max	temp_min	wind
count	2922.000000	2922.000000	2922.000000	2922.000000
mean	2.944764	16.769131	8.612320	4.101129
std	7.695286	8.644596	7.511776	1.880791
min	0.000000	-7.700000	-16.000000	0.400000
25%	0.000000	10.000000	3.300000	2.700000
50%	0.000000	16.100000	8.900000	3.800000
75%	1.800000	23.900000	13.900000	5.100000
max	118.900000	37.800000	26.700000	16.200000

In [26]:

# Different data types will have different descriptives
my_df['date'].describe()

Out[26]:

count                    2922
unique                   1461
top       2013-06-05 00:00:00
freq                        2
first     2012-01-01 00:00:00
last      2015-12-31 00:00:00
Name: date, dtype: object

In [21]:

my_df.precipitation.describe()

Out[21]:

count    2922.000000
mean        2.944764
std         7.695286
min         0.000000
25%         0.000000
50%         0.000000
75%         1.800000
max       118.900000
Name: precipitation, dtype: float64

In [ ]:

# Your turn: Use describe() on airports_df

# Which columns are included in describe?

# Try it on the columns that were excluded:

# Why were these columns excluded?

You can also plot a dataframe¶

Pandas will try to show it in the best way possible

In [ ]:

my_df.plot()

In [29]:

# You need to tell pandas that you want to display plots in the notebook
%matplotlib inline

In [31]:

# now try it
my_df.plot()

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x1147020b8>

Don't forget!¶

Always include in your notebook:

# this is the first cell in your notebook
import pandas as pd
%matplotlib inline   # dont forget this

In [32]:

# It's more meaningful to plot Series
my_df['precipitation'].plot()

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x1147250f0>

In [9]:

# Remember plots show change from one observation to the next
my_df['wind'].plot()

# in some cases it might not be useful

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b33f710>

In [40]:

# you can try a histogram
my_df['wind'].hist()

# Which is useful to know distributions

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x116d89240>

How can you find out if percipitation is usually high in the year or low?¶

In [ ]:

# Your turn:

In [45]:

# Sometime pandas cannot plot it
my_df['location'].plot()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-e083b00ff51a> in <module>()
      1 # Sometime pandas cannot plot it
----> 2 my_df['location'].plot()

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in __call__(self, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   3564                            colormap=colormap, table=table, yerr=yerr,
   3565                            xerr=xerr, label=label, secondary_y=secondary_y,
-> 3566                            **kwds)
   3567     __call__.__doc__ = plot_series.__doc__
   3568 

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in plot_series(data, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   2643                  yerr=yerr, xerr=xerr,
   2644                  label=label, secondary_y=secondary_y,
-> 2645                  **kwds)
   2646 
   2647 

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in _plot(data, x, y, subplots, ax, kind, **kwds)
   2439         plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds)
   2440 
-> 2441     plot_obj.generate()
   2442     plot_obj.draw()
   2443     return plot_obj.result

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in generate(self)
   1024     def generate(self):
   1025         self._args_adjust()
-> 1026         self._compute_plot_data()
   1027         self._setup_subplots()
   1028         self._make_plot()

/Users/koutbo6/anaconda/lib/python3.5/site-packages/pandas/tools/plotting.py in _compute_plot_data(self)
   1133         if is_empty:
   1134             raise TypeError('Empty {0!r}: no numeric data to '
-> 1135                             'plot'.format(numeric_data.__class__.__name__))
   1136 
   1137         self.data = numeric_data

TypeError: Empty 'DataFrame': no numeric data to plot

In [ ]:

# Your turn: try to plot the columns in airport_df using either plot() or hist()


# What can you find out about the data?

In [49]:

# Such variables are usually categorical and you can get frequencies like so
my_df['location'].value_counts()

Out[49]:

New York    1461
Seattle     1461
Name: location, dtype: int64

In [ ]:

# Your turn: Examine the columns for airports_df
# what would be the best columns to check frequencies for?
# try it:


# The best columns are:

# The reason frequencies is best calculated on them is because:

# What did you find out about your data?

Are data frames immutable?¶

Yes, however, all operations that change values will produce a copy and not change the original
You have to use assignment to change columns or dataframes
So be careful!