{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# MIS 492 - Data Analysis and Visualization\n",
"## Week 4\n",
"\n",
"## Data Cleansing & Transformation\n",
"\n",
"### Dr. Mohammad AlMarzouq\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data Cleansing & Transformation\n",
"- Is the act of cleaning and preparing the data for analysis\n",
"- It is an iterative process and requires that we explore the distributional characteristics of data\n",
" - as we did last week\n",
"- This week we will learn the skills to slice, filter, and sort dataframes, which allows us discover problems in data and select the data that we will use for our analysis"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# As always, start with the following\n",
"\n",
"import pandas as pd\n",
"\n",
"# Let's load the weather data\n",
"# Notice how you can load directly from a url\n",
"weather_df = pd.read_csv(\"https://github.com/vega/vega-datasets/raw/gh-pages/data/weather.csv\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Indexing/Slicing Dataframes\n",
"\n",
"- Using square brackets\n",
"- Using loc\n",
"- Using iloc"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Using Square Brackets"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 Seattle\n",
"1 Seattle\n",
"2 Seattle\n",
"3 Seattle\n",
"4 Seattle\n",
"Name: location, dtype: object"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Labels are used to select columns\n",
"weather_df[\"location\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" wind | \n",
" location | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 4.7 | \n",
" Seattle | \n",
"
\n",
" \n",
" 1 | \n",
" 4.5 | \n",
" Seattle | \n",
"
\n",
" \n",
" 2 | \n",
" 2.3 | \n",
" Seattle | \n",
"
\n",
" \n",
" 3 | \n",
" 4.7 | \n",
" Seattle | \n",
"
\n",
" \n",
" 4 | \n",
" 6.1 | \n",
" Seattle | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" wind location\n",
"0 4.7 Seattle\n",
"1 4.5 Seattle\n",
"2 2.3 Seattle\n",
"3 4.7 Seattle\n",
"4 6.1 Seattle"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# List or tuple of labels can also be passed to select multiple columns and their location\n",
"\n",
"weather_df[[\"wind\",\"location\"]].head()"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" location | \n",
" date | \n",
" precipitation | \n",
" temp_max | \n",
" temp_min | \n",
" wind | \n",
" weather | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Seattle | \n",
" 2012-01-02 00:00 | \n",
" 10.9 | \n",
" 10.6 | \n",
" 2.8 | \n",
" 4.5 | \n",
" rain | \n",
"
\n",
" \n",
" 2 | \n",
" Seattle | \n",
" 2012-01-03 00:00 | \n",
" 0.8 | \n",
" 11.7 | \n",
" 7.2 | \n",
" 2.3 | \n",
" rain | \n",
"
\n",
" \n",
" 3 | \n",
" Seattle | \n",
" 2012-01-04 00:00 | \n",
" 20.3 | \n",
" 12.2 | \n",
" 5.6 | \n",
" 4.7 | \n",
" rain | \n",
"
\n",
" \n",
" 4 | \n",
" Seattle | \n",
" 2012-01-05 00:00 | \n",
" 1.3 | \n",
" 8.9 | \n",
" 2.8 | \n",
" 6.1 | \n",
" rain | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" location date precipitation temp_max temp_min wind weather\n",
"1 Seattle 2012-01-02 00:00 10.9 10.6 2.8 4.5 rain\n",
"2 Seattle 2012-01-03 00:00 0.8 11.7 7.2 2.3 rain\n",
"3 Seattle 2012-01-04 00:00 20.3 12.2 5.6 4.7 rain\n",
"4 Seattle 2012-01-05 00:00 1.3 8.9 2.8 6.1 rain"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# When using numeric slicing, it is for selecting/slicing rows\n",
"# Works exactly like python list slicing\n",
"weather_df[1:5]\n",
"\n",
"#Can you fetch the last 5 records?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Using loc\n",
"Used for **labeled** slicing of both rows and columns\n",
"\n",
"**NOTE:** loc is used with square brackets"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"location Seattle\n",
"date 2012-01-02 00:00\n",
"precipitation 10.9\n",
"temp_max 10.6\n",
"temp_min 2.8\n",
"wind 4.5\n",
"weather rain\n",
"Name: 1, dtype: object"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fetch rows based on index number Look at the far left column\n",
"\n",
"weather_df.loc[1]"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" location | \n",
" date | \n",
" precipitation | \n",
" temp_max | \n",
" temp_min | \n",
" wind | \n",
" weather | \n",
"
\n",
" \n",
" \n",
" \n",
" 5 | \n",
" Seattle | \n",
" 2012-01-06 00:00 | \n",
" 2.5 | \n",
" 4.4 | \n",
" 2.2 | \n",
" 2.2 | \n",
" rain | \n",
"
\n",
" \n",
" 6 | \n",
" Seattle | \n",
" 2012-01-07 00:00 | \n",
" 0.0 | \n",
" 7.2 | \n",
" 2.8 | \n",
" 2.3 | \n",
" rain | \n",
"
\n",
" \n",
" 7 | \n",
" Seattle | \n",
" 2012-01-08 00:00 | \n",
" 0.0 | \n",
" 10.0 | \n",
" 2.8 | \n",
" 2.0 | \n",
" sun | \n",
"
\n",
" \n",
" 8 | \n",
" Seattle | \n",
" 2012-01-09 00:00 | \n",
" 4.3 | \n",
" 9.4 | \n",
" 5.0 | \n",
" 3.4 | \n",
" rain | \n",
"
\n",
" \n",
" 9 | \n",
" Seattle | \n",
" 2012-01-10 00:00 | \n",
" 1.0 | \n",
" 6.1 | \n",
" 0.6 | \n",
" 3.4 | \n",
" rain | \n",
"
\n",
" \n",
" 10 | \n",
" Seattle | \n",
" 2012-01-11 00:00 | \n",
" 0.0 | \n",
" 6.1 | \n",
" -1.1 | \n",
" 5.1 | \n",
" sun | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" location date precipitation temp_max temp_min wind weather\n",
"5 Seattle 2012-01-06 00:00 2.5 4.4 2.2 2.2 rain\n",
"6 Seattle 2012-01-07 00:00 0.0 7.2 2.8 2.3 rain\n",
"7 Seattle 2012-01-08 00:00 0.0 10.0 2.8 2.0 sun\n",
"8 Seattle 2012-01-09 00:00 4.3 9.4 5.0 3.4 rain\n",
"9 Seattle 2012-01-10 00:00 1.0 6.1 0.6 3.4 rain\n",
"10 Seattle 2012-01-11 00:00 0.0 6.1 -1.1 5.1 sun"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can also use the python list slicing syntax to fetch multiple rows \n",
"weather_df.loc[5:10]"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" location | \n",
" date | \n",
" precipitation | \n",
" temp_max | \n",
" temp_min | \n",
" wind | \n",
" weather | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" Seattle | \n",
" 2012-01-02 00:00 | \n",
" 10.9 | \n",
" 10.6 | \n",
" 2.8 | \n",
" 4.5 | \n",
" rain | \n",
"
\n",
" \n",
" 5 | \n",
" Seattle | \n",
" 2012-01-06 00:00 | \n",
" 2.5 | \n",
" 4.4 | \n",
" 2.2 | \n",
" 2.2 | \n",
" rain | \n",
"
\n",
" \n",
" 7 | \n",
" Seattle | \n",
" 2012-01-08 00:00 | \n",
" 0.0 | \n",
" 10.0 | \n",
" 2.8 | \n",
" 2.0 | \n",
" sun | \n",
"
\n",
" \n",
" 10 | \n",
" Seattle | \n",
" 2012-01-11 00:00 | \n",
" 0.0 | \n",
" 6.1 | \n",
" -1.1 | \n",
" 5.1 | \n",
" sun | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" location date precipitation temp_max temp_min wind weather\n",
"1 Seattle 2012-01-02 00:00 10.9 10.6 2.8 4.5 rain\n",
"5 Seattle 2012-01-06 00:00 2.5 4.4 2.2 2.2 rain\n",
"7 Seattle 2012-01-08 00:00 0.0 10.0 2.8 2.0 sun\n",
"10 Seattle 2012-01-11 00:00 0.0 6.1 -1.1 5.1 sun"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can use a list of ids to fetch\n",
"weather_df.loc[[1,5,7,10]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Negative indecies don't work!\n",
"\n",
"weather_df.loc[-1]\n",
"\n",
"# Why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# you can also set the column you want like so\n",
"weather_df.loc[1:5, [\"location\",\"weather\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Using iloc\n",
"It is exactly like loc, but uses numeric indecies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"weather_df.loc[1:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Negative indicies work this time!\n",
"weather_df.iloc[-1]\n",
"\n",
"# why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# selecting columns is also numeric\n",
"weather_df.iloc[:,0:2].head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Sorting\n",
"\n",
"- Sort rows based on values of column(s)\n",
"- Descending or ascending order"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Sort based on temp_max\n",
"weather_df.sort_values(by=\"temp_max\").head()\n",
"\n",
"# is this ascending or descending order?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# To sort in descending order, set ascending argument to False\n",
"weather_df.sort_values(by=\"temp_max\", ascending=False).head()\n",
"\n",
"# Seems like New York has the highest and lowest tempratures!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Sort weather_df by percipitation in ascending order"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Notice how percipitation is 0 for many observations\n",
"# To sort by percipitation, then by wind speed, both ascending, do the following:\n",
"weather_df.sort_values(by=[\"precipitation\", \"wind\"]).head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# To sort by percipitation ascending, then by wind speed descending, do the following:\n",
"weather_df.sort_values(by=[\"precipitation\", \"wind\"], ascending=[True, False]).head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Filteration\n",
"\n",
"Selecting rows based on logical conditions. e.g., weather observations in New York, or observations where wind speed is higher than 10\n",
"\n",
"You use conditions very similar to Python conditions in syntax, with some slight variation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# to fetch observations for New York\n",
"weather_df[weather_df[\"location\"] == \"New York\"].head()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Perform the same filter using dot notation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Filter observations where wind is higher than 10\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Now try to find out how many observations there are using 2 different methods"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Filter all observations where temp_min is less than or equal to zero and weather is rain\n",
"weather_df[(weather_df.temp_min < 0) & (weather_df.weather == \"rain\")].head()\n",
"\n",
"# The parantheses are important!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# You can write it over multiple lines to be easier to read\n",
"\n",
"weather_df[\n",
" (weather_df.temp_min < 0) & \n",
" (weather_df.weather == \"rain\")\n",
" ].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# You can also use 'or' in the condition\n",
"weather_df[\n",
" (weather_df.temp_min < 0) & \n",
" (\n",
" (weather_df.weather == \"rain\") | \n",
" (weather_df.weather == \"snow\")\n",
" )\n",
" ].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Some useful functions used in filteration\n",
"\n",
"- isin(values)\n",
"- isnull(), notnull()\n",
"- duplicated\n",
"\n",
"You can use these in filteration conditions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Filter using method isin to find observations where whether is either rain or snow"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data Transformation\n",
"\n",
"- Operations can be performed on columns\n",
" - Used to create new columns by combining or tranforming other columns\n",
"- All values in a column will have the same operation performed on them\n",
"- When operating on two or more columns, the operations are performed on items in the same position\n",
" - Columns must match in size"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Useful methods and Operators\n",
"\n",
"- Almost all the mathematical operators are available\n",
"- Useful methods to perform calculations on columns are:\n",
" - max, min, mean, median, mode, std, var, count, sum, mod\n",
"- Method **apply** will accept a function that takes a single argument, and returns a value\n",
" - The function is applied to every item in the column and a new column is created with the results\n",
"- Useful methods to clean the dataframe are:\n",
" - dropna, drop_duplicates, fillna\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Calculating the temprature range\n",
"\n",
"\n",
"# Try to store it in a column called temp_range\n",
"# be sure to try dot and index notations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Calculate the mean range and store it in a column called mean_range\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Calculate the mean centered value of range \n",
"# mean centering = temp_range - mean_range\n",
"# tells us how much the observation is different from the mean\n",
"# name the collumn mc_range\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Caclulate the square of mc_range\n",
"# name the new column mc_range_sq\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# calculate the natural log of mc_range and use name mc_range_log\n",
"# tip: search numpy \n",
"# be sure to examine the data, what do you see?\n",
"# What should you do?\n",
"# Is fillna(0) a good idea?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# based on what you know so far,\n",
"# try to plot range, mc_range, and mc_range_sq\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# try to plot the distributions for the new range columns\n",
"# hint: search for histograms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# try to count the number of observations where the temprature change is above average\n",
"\n",
"# can you produce a scalar value instead of a column?\n",
"\n",
"# Can you calculate the ratio?\n",
"\n",
"# Can you calculate the percentage?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Calculate the average temprature for the day\n",
"# hint: use temp_max and temp_min\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# plot the average temp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# plot the distribution for average temp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# compare the distribution of average temp with mc_range"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# find the days in which the average temprature is below zero and it is snowing\n",
"# calculate the percentage of these days"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}