MIS 492 - Data Analysis and Visualization¶

Week 5¶

Advanced Data Transformation¶

Dr. Mohammad AlMarzouq¶

Joining Data¶

Analysis is typically done a single dataframe
Sometime the data might be in two different files/dataframes
joining combines the data into a single dataframe

Concatination operation¶

Easiest form of joining data
Dataframes must have identical columns
Rows from one dataframe are added to another
- End result is a dataframe containing all the rows from combined dataframes

Join Operation¶

Combines columns from two different dataframe into a single dataframe
This is what we typically mean by joining data
In pandas, you use
- join() if you are joining on dataframe indecies
- merge() if you are joining on columns

Things to consider when joining data¶

But first, let's learn how to connect to fetch data from databases¶

Dataset can be found at European Soccer Kaggle Dataset

by Hugo Mathien

You can download the sqlite db for this exercise from here

Working with Sqlite3 DBs¶

In [1]:

# import libraries
import pandas as pd
import sqlite3

# connect to database 
# database.sqlite is the name of the database
db = sqlite3.connect("database.sqlite")

Fetching data from the database connection¶

This involves writing SQL

This page describes how the data looks like.

If you look to the left of the page, you will see the following tables:

Player
Player_Attributes

Let's load 500 entries from them

In [78]:

# prepare the sql statement
sql = "SELECT * FROM Player limit 5000"

# execute it on the database
player_df = pd.read_sql(sql, db)

# let's view the data
player_df.head()

Out[78]:

	id	player_api_id	player_name	player_fifa_api_id	birthday	height	weight
0	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187
1	2	155782	Aaron Cresswell	189615	1989-12-15 00:00:00	170.18	146
2	3	162549	Aaron Doran	186170	1991-05-13 00:00:00	170.18	163
3	4	30572	Aaron Galindo	140161	1982-05-08 00:00:00	182.88	198
4	5	23780	Aaron Hughes	17725	1979-11-08 00:00:00	182.88	154

In [79]:

# Now it is your turn to fetch 500 entries from Player_Attributes
sql = "SELECT * FROM Player_Attributes limit 5000"
atts_df = pd.read_sql(sql, db)
atts_df.head()

Out[79]:

	id	player_fifa_api_id	player_api_id	date	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	...	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	218353	505942	2016-02-18 00:00:00	67	71	right	medium	medium	49	...	54	48	65	69	69	6	11	10	8	8
1	2	218353	505942	2015-11-19 00:00:00	67	71	right	medium	medium	49	...	54	48	65	69	69	6	11	10	8	8
2	3	218353	505942	2015-09-21 00:00:00	62	66	right	medium	medium	49	...	54	48	65	66	69	6	11	10	8	8
3	4	218353	505942	2015-03-20 00:00:00	61	65	right	medium	medium	48	...	53	47	62	63	66	5	10	9	7	7
4	5	218353	505942	2007-02-22 00:00:00	61	65	right	medium	medium	48	...	53	47	62	63	66	5	10	9	7	7

5 rows × 42 columns

Things to consider when joining data¶

Is there a key to combine data on? How will rows be matched to one another?

Examine the two data frames and suggest a key to use to combine

Things to consider when joining data¶

What about entries that do not have a match in the other dataframe? do we include them?
- Inner means to include in the result only the records that have matching entries in both dataframes
- Outer means to include all entries, including ones without matching entries
  - The values for columns with non-matching entries will be NaN

In [35]:

# Let's try inner join


player_df.merge(atts_df,how="inner",on="player_api_id").head()

Out[35]:

	id_x	player_api_id	player_name	player_fifa_api_id_x	birthday	height	weight	id_y	player_fifa_api_id_y	date	...	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	1	218353	2016-02-18 00:00:00	...	54	48	65	69	69	6	11	10	8	8
1	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	2	218353	2015-11-19 00:00:00	...	54	48	65	69	69	6	11	10	8	8
2	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	3	218353	2015-09-21 00:00:00	...	54	48	65	66	69	6	11	10	8	8
3	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	4	218353	2015-03-20 00:00:00	...	53	47	62	63	66	5	10	9	7	7
4	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	5	218353	2007-02-22 00:00:00	...	53	47	62	63	66	5	10	9	7	7

5 rows × 48 columns

In [37]:

# Outer join will give us an idea of where the data went

player_df.merge(atts_df,how="outer",on="player_api_id").head()

Out[37]:

	id_x	player_api_id	player_name	player_fifa_api_id_x	birthday	height	weight	id_y	player_fifa_api_id_y	date	...	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	1	218353	2016-02-18 00:00:00	...	54	48	65	69	69	6	11	10	8	8
1	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	2	218353	2015-11-19 00:00:00	...	54	48	65	69	69	6	11	10	8	8
2	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	3	218353	2015-09-21 00:00:00	...	54	48	65	66	69	6	11	10	8	8
3	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	4	218353	2015-03-20 00:00:00	...	53	47	62	63	66	5	10	9	7	7
4	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187	5	218353	2007-02-22 00:00:00	...	53	47	62	63	66	5	10	9	7	7

5 rows × 48 columns

In [38]:

# Try to compare both operations by check counts, and null values
# can you find differences?
# Can you explain why these differences exist?

Things to consider when joining data¶

The level of analysis
- Be careful in your analysis with this!
- Make sure you have the correct variable
Consider the dataframes for: Player, Team, Match, League, Country
- What are the levels of analysis and their relationship to observations in other dataframes?
- What happens when we combine them?

What is Level of Analysis?¶

Consider the Euro Soccer data:
- A league will have many teams
- A team will have many players
Analysis can be at:
- League level, where you compare leagues
- Team level, where you compare teams
- Player level, where you compare players
- This is what we mean by level of analysis (AKA Unit of Analysis)

Team Level Analysis¶

Do you include the league attributes?
Do you include the team attributes?
Do you include the player attributes?

Load Data¶

Load players.csv and teams.csv into player_df and team_df respectively

In [80]:

# Load players and teams data here
player_df = 
team_df =

In [104]:

player_df.head()

Out[104]:

	id_x	player_api_id	player_name	player_fifa_api_id_x	birthday	height	weight	id_y	player_fifa_api_id_y	date	...	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes	team_api_id
765	48	439366	Abdoulaye Toure	210450	1994-03-03 00:00:00	187.96	170	766	210450	2016-05-05 00:00:00	...	39	57	62	58	8	15	9	9	7	8674
1050	68	37422	Abella Perez Damia	159580	1982-04-15 00:00:00	187.96	174	1051	159580	2016-01-28 00:00:00	...	46	65	70	69	13	9	12	19	12	8674
2118	129	160447	Adam Smith	190885	1991-04-29 00:00:00	180.34	179	2119	190885	2015-11-12 00:00:00	...	48	71	70	68	8	9	15	9	6	8674
4180	253	32547	Alan Gow	140307	1982-10-09 00:00:00	182.88	154	4181	140307	2015-02-27 00:00:00	...	62	25	23	25	6	9	10	12	9	8674
3081	182	168047	Adrian Stoian	192072	1991-02-11 00:00:00	177.80	146	3082	192072	2016-02-04 00:00:00	...	65	28	30	32	10	11	7	11	6	8674

5 rows × 49 columns

In [106]:

team_df.head()

Out[106]:

	id_x	team_api_id	team_fifa_api_id_x	team_long_name	team_short_name	id_y	team_fifa_api_id_y	date	buildUpPlaySpeed	buildUpPlaySpeedClass	...	chanceCreationShooting	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass
317	26548	8674	1915	FC Groningen	GRO	427	1915	2010-02-22 00:00:00	41	Balanced	...	69	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover
150	11822	4087	111271	Évian Thonon Gaillard FC	ETG	411	111271	2011-02-22 00:00:00	35	Balanced	...	65	Normal	Organised	45	Medium	55	Press	65	Normal	Cover
456	43053	9906	240	Atlético Madrid	AMA	95	240	2010-02-22 00:00:00	64	Balanced	...	70	Lots	Free Form	70	High	34	Press	55	Normal	Offside Trap
374	35284	9807	1889	CF Os Belenenses	BEL	156	1889	2010-02-22 00:00:00	30	Slow	...	60	Normal	Organised	30	Deep	30	Contain	30	Narrow	Offside Trap
355	27780	10218	1971	Excelsior	EXC	416	1971	2011-02-22 00:00:00	73	Fast	...	52	Normal	Organised	25	Deep	47	Press	33	Narrow	Cover

5 rows × 29 columns

Team Level Analysis¶

Do you include the league attributes?
- Yes you can
Do you include the team attributes?
- Yes you can, this is the point of the analysis
Do you include the player attributes?
- No! unless you aggregate!

What is aggregation?¶

Combining observations from the same level of analysis into a single observation at a higher level of analysis
This follows the principle of tidy data, where each dataframe represents a single observational unit
- e.g.: team data only, or player data only in the dataframe

Match Analysis Example¶

buildUpPlaySpeed is a team attribute.
However, overall_rating is a player attribute.
- You cannot include a single player overall_rating in the analysis of a team, because the value describe a single player.
- However, if you calculate the average_overall_rating for all players in that team, you get a value that we can use to describe a team, because a team consists of players.
- Any operatino to combine the overall_rating for all the players in the team will work:
  - Count, Sum, Min, Max, Std, Var, Mean, Median ...etc.
You can include all match attributes in the analysis of matches
You must aggregate player attribute to include it in match analysis

In [110]:

team_df[["team_api_id","buildUpPlaySpeed"]].head()

Out[110]:

	team_api_id	buildUpPlaySpeed
317	8674	41
150	4087	35
456	9906	64
374	9807	30
355	10218	73

In [112]:

player_df[["player_api_id","team_api_id","overall_rating"]].head()

Out[112]:

	player_api_id	team_api_id	overall_rating
765	439366	8674	64
1050	37422	8674	71
2118	160447	8674	70
4180	32547	8674	63
3081	168047	8674	70

In [113]:

# First we have to merge based on team_api_id

merged_df = player_df.merge(team_df, how="inner", on="team_api_id")
merged_df.columns

Out[113]:

Index(['id_x_x', 'player_api_id', 'player_name', 'player_fifa_api_id_x',
       'birthday', 'height', 'weight', 'id_y_x', 'player_fifa_api_id_y',
       'date_x', 'overall_rating', 'potential', 'preferred_foot',
       'attacking_work_rate', 'defensive_work_rate', 'crossing', 'finishing',
       'heading_accuracy', 'short_passing', 'volleys', 'dribbling', 'curve',
       'free_kick_accuracy', 'long_passing', 'ball_control', 'acceleration',
       'sprint_speed', 'agility', 'reactions', 'balance', 'shot_power',
       'jumping', 'stamina', 'strength', 'long_shots', 'aggression',
       'interceptions', 'positioning', 'vision', 'penalties', 'marking',
       'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling',
       'gk_kicking', 'gk_positioning', 'gk_reflexes', 'team_api_id', 'id_x_y',
       'team_fifa_api_id_x', 'team_long_name', 'team_short_name', 'id_y_y',
       'team_fifa_api_id_y', 'date_y', 'buildUpPlaySpeed',
       'buildUpPlaySpeedClass', 'buildUpPlayDribbling',
       'buildUpPlayDribblingClass', 'buildUpPlayPassing',
       'buildUpPlayPassingClass', 'buildUpPlayPositioningClass',
       'chanceCreationPassing', 'chanceCreationPassingClass',
       'chanceCreationCrossing', 'chanceCreationCrossingClass',
       'chanceCreationShooting', 'chanceCreationShootingClass',
       'chanceCreationPositioningClass', 'defencePressure',
       'defencePressureClass', 'defenceAggression', 'defenceAggressionClass',
       'defenceTeamWidth', 'defenceTeamWidthClass',
       'defenceDefenderLineClass'],
      dtype='object')

In [120]:

# notice that team entries are duplicated
# and that we have an entry for every player
merged_df[["player_name","team_long_name","overall_rating","buildUpPlaySpeed"]].head()

Out[120]:

	player_name	team_long_name	overall_rating	buildUpPlaySpeed
0	Abdoulaye Toure	FC Groningen	64	41
1	Abella Perez Damia	FC Groningen	71	41
2	Adam Smith	FC Groningen	70	41
3	Alan Gow	FC Groningen	63	41
4	Adrian Stoian	FC Groningen	70	41

In [122]:

# To analyze teams, you must aggregate player observations if you want to include them
# otherwise, you have to filter on team attributes and remove duplicates

# let's aggregate overall_rating by calculating the average for the players
merged_df[
            ["player_name","team_api_id","team_long_name","overall_rating","buildUpPlaySpeed"]
         ].groupby("team_api_id").agg({"overall_rating":"mean"})

Out[122]:

	overall_rating
team_api_id
4087	62.090909
7788	67.909091
7819	68.000000
8262	67.818182
8322	66.000000
8342	67.090909
8526	69.636364
8535	69.000000
8559	67.545455
8674	68.000000
9789	66.727273
9807	69.363636
9810	67.363636
9825	65.454545
9826	70.636364
9880	72.363636
9906	66.272727
9987	69.000000
10218	64.909091
208931	67.454545

In [125]:

# simply merge it to team_df to start analyzing teams
# but dont forget to reset_index to convert the index into a regular column
ratings_df = merged_df[
            ["player_name","team_api_id","team_long_name","overall_rating","buildUpPlaySpeed"]
         ].groupby("team_api_id").agg({"overall_rating":"mean"}).reset_index()

team_df.merge(ratings_df, how="inner", on="team_api_id").head()

Out[125]:

	id_x	team_api_id	team_fifa_api_id_x	team_long_name	team_short_name	id_y	team_fifa_api_id_y	date	buildUpPlaySpeed	buildUpPlaySpeedClass	...	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass	overall_rating
0	26548	8674	1915	FC Groningen	GRO	427	1915	2010-02-22 00:00:00	41	Balanced	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
1	11822	4087	111271	Évian Thonon Gaillard FC	ETG	411	111271	2011-02-22 00:00:00	35	Balanced	...	Normal	Organised	45	Medium	55	Press	65	Normal	Cover	62.090909
2	43053	9906	240	Atlético Madrid	AMA	95	240	2010-02-22 00:00:00	64	Balanced	...	Lots	Free Form	70	High	34	Press	55	Normal	Offside Trap	66.272727
3	35284	9807	1889	CF Os Belenenses	BEL	156	1889	2010-02-22 00:00:00	30	Slow	...	Normal	Organised	30	Deep	30	Contain	30	Narrow	Offside Trap	69.363636
4	27780	10218	1971	Excelsior	EXC	416	1971	2011-02-22 00:00:00	73	Fast	...	Normal	Organised	25	Deep	47	Press	33	Narrow	Cover	64.909091

5 rows × 30 columns

Player Analysis Example¶

overall_rating is a player attribute
buildUpPlaySpeed is a team attribute
- While this is an attribute that describes a team, this is the team that the player is part of
- The player is affected by the overall performance of the team, and describes the team that the player is part of, so in a way, it is a player attribute
- You will notice that the value of buildUpPlaySpeed does not change for players in the same team
You can include all player attribute to analyze and compare players
You can also include team attributes without problems, because they can also be considered player attribute

In [127]:

# you can perform your analysis directly on marged_df
# because the level of analysis is the player there
merged_df[["player_name","team_api_id","team_long_name","overall_rating","buildUpPlaySpeed"]].head()

Out[127]:

	player_name	team_api_id	team_long_name	overall_rating	buildUpPlaySpeed
0	Abdoulaye Toure	8674	FC Groningen	64	41
1	Abella Perez Damia	8674	FC Groningen	71	41
2	Adam Smith	8674	FC Groningen	70	41
3	Alan Gow	8674	FC Groningen	63	41
4	Adrian Stoian	8674	FC Groningen	70	41

Aggregating With Transform¶

If you want to create a column in merged_df that contains the average overall_rating then you use transform

In [142]:

import numpy as np
merged_df.groupby("team_api_id").transform(np.mean).overall_rating.head(15)

Out[142]:

0     68.000000
1     68.000000
2     68.000000
3     68.000000
4     68.000000
5     68.000000
6     68.000000
7     68.000000
8     68.000000
9     68.000000
10    68.000000
11    62.090909
12    62.090909
13    62.090909
14    62.090909
Name: overall_rating, dtype: float64

In [141]:

# simply assign this column to merged_df and give it an appropriate name

merged_df["mean_overall_rating"] = merged_df.groupby("team_api_id").transform(np.mean).overall_rating
merged_df.head(15)

Out[141]:

	id_x_x	player_api_id	player_name	player_fifa_api_id_x	birthday	height	weight	id_y_x	player_fifa_api_id_y	date_x	...	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass	mean_overall_rating
0	48	439366	Abdoulaye Toure	210450	1994-03-03 00:00:00	187.96	170	766	210450	2016-05-05 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
1	68	37422	Abella Perez Damia	159580	1982-04-15 00:00:00	187.96	174	1051	159580	2016-01-28 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
2	129	160447	Adam Smith	190885	1991-04-29 00:00:00	180.34	179	2119	190885	2015-11-12 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
3	253	32547	Alan Gow	140307	1982-10-09 00:00:00	182.88	154	4181	140307	2015-02-27 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
4	182	168047	Adrian Stoian	192072	1991-02-11 00:00:00	177.80	146	3082	192072	2016-02-04 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
5	246	34268	Alain Nef	49939	1982-02-06 00:00:00	190.50	194	4057	49939	2016-03-10 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
6	65	302985	Abel Khaled	207541	1992-11-09 00:00:00	180.34	148	1023	207541	2015-03-13 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
7	206	213366	Afriyie Acquah	201223	1992-01-05 00:00:00	177.80	154	3487	201223	2016-05-12 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
8	73	80592	Aboubakar Oumarou	218548	1987-04-01 00:00:00	182.88	168	1129	218548	2015-04-01 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
9	275	37503	Albano Benjamin Bizzarri	14907	1977-11-09 00:00:00	193.04	196	4532	14907	2015-11-26 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
10	37	173955	Abdoul Razzagui Camara	193953	1990-02-20 00:00:00	177.80	157	555	193953	2016-04-21 00:00:00	...	Lots	Organised	30	Deep	30	Contain	30	Narrow	Cover	68.000000
11	247	182847	Alain Pierre Mendy	209352	1989-11-17 00:00:00	182.88	159	4088	209352	2013-03-15 00:00:00	...	Normal	Organised	45	Medium	55	Press	65	Normal	Cover	62.090909
12	162	121643	Adrian Chomiuk	186629	1988-06-23 00:00:00	182.88	179	2749	186629	2010-08-30 00:00:00	...	Normal	Organised	45	Medium	55	Press	65	Normal	Cover	62.090909
13	243	127255	Akwetey Mensah	198781	1983-04-15 00:00:00	172.72	163	4025	198781	2010-08-30 00:00:00	...	Normal	Organised	45	Medium	55	Press	65	Normal	Cover	62.090909
14	51	419681	Abdul Aziz Tetteh	190193	1989-02-10 00:00:00	182.88	190	803	190193	2016-05-05 00:00:00	...	Normal	Organised	45	Medium	55	Press	65	Normal	Cover	62.090909

15 rows × 78 columns

Performing Analysis¶

We combine data into single dataframe
With varying levels of analysis, we have varying degrees of variability because of duplication
- Remember how all players in the same team share the same value of the team attribute buildUpPlaySpeed
When we combine data into a single dataframe we have non-normal form data with lots of duplicated values
- Normal form is a database term, not stats
- Data stored in a data is usually in normal form
- While some values might be duplicated, the records are not

Summary¶

Level of analysis is important
You can include variables from higher levels of analysis without issues
- Be aware that you might not have variability
Including variables from lower levels of analysis requires aggregation
- You aggregate in many different ways: Sums, Counts, Min, Max, Mean, Median, Mode ..etc
- Aggregation is to produce a single scalar value from a group of values
Represent many observations at a lower level into a single value at a higher level

Final Note About Groupby¶

You don't have to have different levels of analysis to use groupby and aggregations
You can use agg() and transform() with group by to analyze subgroups
- Just group by the variable you want to create subgroups from
- groupby should be given a categorical or discrete variable
- subgroups can be created from a combination of variables

In [38]:

# At the team level of analysis
# create 4 new columns from player sprint_speed data:
# mean_sprint_speed, max_sprint_speed, min_sprint_speed, and std_sprint_speed

In [147]:

# at the player unit of analysis 
# create 4 new columns from player sprint_speed data:
# mean_sprint_speed, max_sprint_speed, min_sprint_speed, and std_sprint_speed