{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# MIS 492 - Data Analysis and Visualization\n", "## Week 5\n", "\n", "## Advanced Data Transformation\n", "\n", "### Dr. Mohammad AlMarzouq\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Joining Data\n", "\n", "- Analysis is typically done a single dataframe\n", "- Sometime the data might be in two different files/dataframes\n", "- joining combines the data into a single dataframe\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Concatination operation\n", "- Easiest form of joining data\n", "- Dataframes must have identical columns\n", "- Rows from one dataframe are added to another\n", " - End result is a dataframe containing all the rows from combined dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Join Operation\n", "- Combines columns from two different dataframe into a single dataframe\n", "- This is what we typically mean by joining data\n", "- In pandas, you use \n", " - join() if you are joining on dataframe indecies\n", " - merge() if you are joining on columns \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Things to consider when joining data\n", "\n", "## But first, let's learn how to connect to fetch data from databases\n", "Dataset can be found at [European Soccer Kaggle Dataset](https://www.kaggle.com/hugomathien/soccer)\n", "\n", "by Hugo Mathien\n", "\n", "You can download the sqlite db for this exercise from [here](https://github.com/qmisr/mis492/raw/master/database.sqlite.zip)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Working with Sqlite3 DBs" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# import libraries\n", "import pandas as pd\n", "import sqlite3\n", "\n", "# connect to database \n", "# database.sqlite is the name of the database\n", "db = sqlite3.connect(\"database.sqlite\") " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Fetching data from the database connection\n", "This involves writing SQL\n", "\n", "This [page](https://www.kaggle.com/hugomathien/soccer/data) describes how the data looks like.\n", "\n", "If you look to the left of the page, you will see the following tables:\n", "- Player\n", "- Player_Attributes\n", "\n", "Let's load 500 entries from them" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idplayer_api_idplayer_nameplayer_fifa_api_idbirthdayheightweight
01505942Aaron Appindangoye2183531992-02-29 00:00:00182.88187
12155782Aaron Cresswell1896151989-12-15 00:00:00170.18146
23162549Aaron Doran1861701991-05-13 00:00:00170.18163
3430572Aaron Galindo1401611982-05-08 00:00:00182.88198
4523780Aaron Hughes177251979-11-08 00:00:00182.88154
\n", "
" ], "text/plain": [ " id player_api_id player_name player_fifa_api_id \\\n", "0 1 505942 Aaron Appindangoye 218353 \n", "1 2 155782 Aaron Cresswell 189615 \n", "2 3 162549 Aaron Doran 186170 \n", "3 4 30572 Aaron Galindo 140161 \n", "4 5 23780 Aaron Hughes 17725 \n", "\n", " birthday height weight \n", "0 1992-02-29 00:00:00 182.88 187 \n", "1 1989-12-15 00:00:00 170.18 146 \n", "2 1991-05-13 00:00:00 170.18 163 \n", "3 1982-05-08 00:00:00 182.88 198 \n", "4 1979-11-08 00:00:00 182.88 154 " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# prepare the sql statement\n", "sql = \"SELECT * FROM Player limit 5000\"\n", "\n", "# execute it on the database\n", "player_df = pd.read_sql(sql, db)\n", "\n", "# let's view the data\n", "player_df.head()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idplayer_fifa_api_idplayer_api_iddateoverall_ratingpotentialpreferred_footattacking_work_ratedefensive_work_ratecrossing...visionpenaltiesmarkingstanding_tacklesliding_tacklegk_divinggk_handlinggk_kickinggk_positioninggk_reflexes
012183535059422016-02-18 00:00:006771rightmediummedium49...54486569696111088
122183535059422015-11-19 00:00:006771rightmediummedium49...54486569696111088
232183535059422015-09-21 00:00:006266rightmediummedium49...54486566696111088
342183535059422015-03-20 00:00:006165rightmediummedium48...5347626366510977
452183535059422007-02-22 00:00:006165rightmediummedium48...5347626366510977
\n", "

5 rows × 42 columns

\n", "
" ], "text/plain": [ " id player_fifa_api_id player_api_id date overall_rating \\\n", "0 1 218353 505942 2016-02-18 00:00:00 67 \n", "1 2 218353 505942 2015-11-19 00:00:00 67 \n", "2 3 218353 505942 2015-09-21 00:00:00 62 \n", "3 4 218353 505942 2015-03-20 00:00:00 61 \n", "4 5 218353 505942 2007-02-22 00:00:00 61 \n", "\n", " potential preferred_foot attacking_work_rate defensive_work_rate crossing \\\n", "0 71 right medium medium 49 \n", "1 71 right medium medium 49 \n", "2 66 right medium medium 49 \n", "3 65 right medium medium 48 \n", "4 65 right medium medium 48 \n", "\n", " ... vision penalties marking standing_tackle sliding_tackle \\\n", "0 ... 54 48 65 69 69 \n", "1 ... 54 48 65 69 69 \n", "2 ... 54 48 65 66 69 \n", "3 ... 53 47 62 63 66 \n", "4 ... 53 47 62 63 66 \n", "\n", " gk_diving gk_handling gk_kicking gk_positioning gk_reflexes \n", "0 6 11 10 8 8 \n", "1 6 11 10 8 8 \n", "2 6 11 10 8 8 \n", "3 5 10 9 7 7 \n", "4 5 10 9 7 7 \n", "\n", "[5 rows x 42 columns]" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now it is your turn to fetch 500 entries from Player_Attributes\n", "sql = \"SELECT * FROM Player_Attributes limit 5000\"\n", "atts_df = pd.read_sql(sql, db)\n", "atts_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Things to consider when joining data\n", "\n", "- Is there a key to combine data on? How will rows be matched to one another?\n", "\n", "Examine the two data frames and suggest a key to use to combine" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Things to consider when joining data\n", "\n", "- What about entries that do not have a match in the other dataframe? do we include them?\n", " - Inner means to include in the result only the records that have matching entries in both dataframes\n", " - Outer means to include all entries, including ones without matching entries\n", " - The values for columns with non-matching entries will be NaN" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id_xplayer_api_idplayer_nameplayer_fifa_api_id_xbirthdayheightweightid_yplayer_fifa_api_id_ydate...visionpenaltiesmarkingstanding_tacklesliding_tacklegk_divinggk_handlinggk_kickinggk_positioninggk_reflexes
01505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818712183532016-02-18 00:00:00...54486569696111088
11505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818722183532015-11-19 00:00:00...54486569696111088
21505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818732183532015-09-21 00:00:00...54486566696111088
31505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818742183532015-03-20 00:00:00...5347626366510977
41505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818752183532007-02-22 00:00:00...5347626366510977
\n", "

5 rows × 48 columns

\n", "
" ], "text/plain": [ " id_x player_api_id player_name player_fifa_api_id_x \\\n", "0 1 505942 Aaron Appindangoye 218353 \n", "1 1 505942 Aaron Appindangoye 218353 \n", "2 1 505942 Aaron Appindangoye 218353 \n", "3 1 505942 Aaron Appindangoye 218353 \n", "4 1 505942 Aaron Appindangoye 218353 \n", "\n", " birthday height weight id_y player_fifa_api_id_y \\\n", "0 1992-02-29 00:00:00 182.88 187 1 218353 \n", "1 1992-02-29 00:00:00 182.88 187 2 218353 \n", "2 1992-02-29 00:00:00 182.88 187 3 218353 \n", "3 1992-02-29 00:00:00 182.88 187 4 218353 \n", "4 1992-02-29 00:00:00 182.88 187 5 218353 \n", "\n", " date ... vision penalties marking \\\n", "0 2016-02-18 00:00:00 ... 54 48 65 \n", "1 2015-11-19 00:00:00 ... 54 48 65 \n", "2 2015-09-21 00:00:00 ... 54 48 65 \n", "3 2015-03-20 00:00:00 ... 53 47 62 \n", "4 2007-02-22 00:00:00 ... 53 47 62 \n", "\n", " standing_tackle sliding_tackle gk_diving gk_handling gk_kicking \\\n", "0 69 69 6 11 10 \n", "1 69 69 6 11 10 \n", "2 66 69 6 11 10 \n", "3 63 66 5 10 9 \n", "4 63 66 5 10 9 \n", "\n", " gk_positioning gk_reflexes \n", "0 8 8 \n", "1 8 8 \n", "2 8 8 \n", "3 7 7 \n", "4 7 7 \n", "\n", "[5 rows x 48 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's try inner join\n", "\n", "\n", "player_df.merge(atts_df,how=\"inner\",on=\"player_api_id\").head()\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id_xplayer_api_idplayer_nameplayer_fifa_api_id_xbirthdayheightweightid_yplayer_fifa_api_id_ydate...visionpenaltiesmarkingstanding_tacklesliding_tacklegk_divinggk_handlinggk_kickinggk_positioninggk_reflexes
01505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818712183532016-02-18 00:00:00...54486569696111088
11505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818722183532015-11-19 00:00:00...54486569696111088
21505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818732183532015-09-21 00:00:00...54486566696111088
31505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818742183532015-03-20 00:00:00...5347626366510977
41505942Aaron Appindangoye2183531992-02-29 00:00:00182.8818752183532007-02-22 00:00:00...5347626366510977
\n", "

5 rows × 48 columns

\n", "
" ], "text/plain": [ " id_x player_api_id player_name player_fifa_api_id_x \\\n", "0 1 505942 Aaron Appindangoye 218353 \n", "1 1 505942 Aaron Appindangoye 218353 \n", "2 1 505942 Aaron Appindangoye 218353 \n", "3 1 505942 Aaron Appindangoye 218353 \n", "4 1 505942 Aaron Appindangoye 218353 \n", "\n", " birthday height weight id_y player_fifa_api_id_y \\\n", "0 1992-02-29 00:00:00 182.88 187 1 218353 \n", "1 1992-02-29 00:00:00 182.88 187 2 218353 \n", "2 1992-02-29 00:00:00 182.88 187 3 218353 \n", "3 1992-02-29 00:00:00 182.88 187 4 218353 \n", "4 1992-02-29 00:00:00 182.88 187 5 218353 \n", "\n", " date ... vision penalties marking \\\n", "0 2016-02-18 00:00:00 ... 54 48 65 \n", "1 2015-11-19 00:00:00 ... 54 48 65 \n", "2 2015-09-21 00:00:00 ... 54 48 65 \n", "3 2015-03-20 00:00:00 ... 53 47 62 \n", "4 2007-02-22 00:00:00 ... 53 47 62 \n", "\n", " standing_tackle sliding_tackle gk_diving gk_handling gk_kicking \\\n", "0 69 69 6 11 10 \n", "1 69 69 6 11 10 \n", "2 66 69 6 11 10 \n", "3 63 66 5 10 9 \n", "4 63 66 5 10 9 \n", "\n", " gk_positioning gk_reflexes \n", "0 8 8 \n", "1 8 8 \n", "2 8 8 \n", "3 7 7 \n", "4 7 7 \n", "\n", "[5 rows x 48 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Outer join will give us an idea of where the data went\n", "\n", "player_df.merge(atts_df,how=\"outer\",on=\"player_api_id\").head()\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Try to compare both operations by check counts, and null values\n", "# can you find differences?\n", "# Can you explain why these differences exist?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Things to consider when joining data\n", "\n", "- The level of analysis\n", " - Be careful in your analysis with this!\n", " - Make sure you have the correct variable\n", "- Consider the dataframes for: Player, Team, Match, League, Country\n", " - What are the levels of analysis and their relationship to observations in other dataframes?\n", " - What happens when we combine them?\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What is Level of Analysis?\n", "\n", "- Consider the Euro Soccer data:\n", " - A league will have many teams\n", " - A team will have many players\n", "- Analysis can be at:\n", " - League level, where you compare leagues\n", " - Team level, where you compare teams\n", " - Player level, where you compare players\n", " - This is what we mean by level of analysis (AKA Unit of Analysis)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Team Level Analysis\n", "- Do you include the league attributes?\n", "- Do you include the team attributes?\n", "- Do you include the player attributes?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Load Data\n", "\n", "Load [players.csv](https://github.com/qmisr/mis492/raw/master/players.csv) and [teams.csv](https://github.com/qmisr/mis492/raw/master/teams.csv) into **player_df** and **team_df** respectively" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Load players and teams data here\n", "player_df = \n", "team_df = " ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id_xplayer_api_idplayer_nameplayer_fifa_api_id_xbirthdayheightweightid_yplayer_fifa_api_id_ydate...penaltiesmarkingstanding_tacklesliding_tacklegk_divinggk_handlinggk_kickinggk_positioninggk_reflexesteam_api_id
76548439366Abdoulaye Toure2104501994-03-03 00:00:00187.961707662104502016-05-05 00:00:00...395762588159978674
10506837422Abella Perez Damia1595801982-04-15 00:00:00187.9617410511595802016-01-28 00:00:00...466570691391219128674
2118129160447Adam Smith1908851991-04-29 00:00:00180.3417921191908852015-11-12 00:00:00...487170688915968674
418025332547Alan Gow1403071982-10-09 00:00:00182.8815441811403072015-02-27 00:00:00...6225232569101298674
3081182168047Adrian Stoian1920721991-02-11 00:00:00177.8014630821920722016-02-04 00:00:00...65283032101171168674
\n", "

5 rows × 49 columns

\n", "
" ], "text/plain": [ " id_x player_api_id player_name player_fifa_api_id_x \\\n", "765 48 439366 Abdoulaye Toure 210450 \n", "1050 68 37422 Abella Perez Damia 159580 \n", "2118 129 160447 Adam Smith 190885 \n", "4180 253 32547 Alan Gow 140307 \n", "3081 182 168047 Adrian Stoian 192072 \n", "\n", " birthday height weight id_y player_fifa_api_id_y \\\n", "765 1994-03-03 00:00:00 187.96 170 766 210450 \n", "1050 1982-04-15 00:00:00 187.96 174 1051 159580 \n", "2118 1991-04-29 00:00:00 180.34 179 2119 190885 \n", "4180 1982-10-09 00:00:00 182.88 154 4181 140307 \n", "3081 1991-02-11 00:00:00 177.80 146 3082 192072 \n", "\n", " date ... penalties marking standing_tackle \\\n", "765 2016-05-05 00:00:00 ... 39 57 62 \n", "1050 2016-01-28 00:00:00 ... 46 65 70 \n", "2118 2015-11-12 00:00:00 ... 48 71 70 \n", "4180 2015-02-27 00:00:00 ... 62 25 23 \n", "3081 2016-02-04 00:00:00 ... 65 28 30 \n", "\n", " sliding_tackle gk_diving gk_handling gk_kicking gk_positioning \\\n", "765 58 8 15 9 9 \n", "1050 69 13 9 12 19 \n", "2118 68 8 9 15 9 \n", "4180 25 6 9 10 12 \n", "3081 32 10 11 7 11 \n", "\n", " gk_reflexes team_api_id \n", "765 7 8674 \n", "1050 12 8674 \n", "2118 6 8674 \n", "4180 9 8674 \n", "3081 6 8674 \n", "\n", "[5 rows x 49 columns]" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "player_df.head()" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id_xteam_api_idteam_fifa_api_id_xteam_long_nameteam_short_nameid_yteam_fifa_api_id_ydatebuildUpPlaySpeedbuildUpPlaySpeedClass...chanceCreationShootingchanceCreationShootingClasschanceCreationPositioningClassdefencePressuredefencePressureClassdefenceAggressiondefenceAggressionClassdefenceTeamWidthdefenceTeamWidthClassdefenceDefenderLineClass
3172654886741915FC GroningenGRO42719152010-02-22 00:00:0041Balanced...69LotsOrganised30Deep30Contain30NarrowCover
150118224087111271Évian Thonon Gaillard FCETG4111112712011-02-22 00:00:0035Balanced...65NormalOrganised45Medium55Press65NormalCover
456430539906240Atlético MadridAMA952402010-02-22 00:00:0064Balanced...70LotsFree Form70High34Press55NormalOffside Trap
3743528498071889CF Os BelenensesBEL15618892010-02-22 00:00:0030Slow...60NormalOrganised30Deep30Contain30NarrowOffside Trap
35527780102181971ExcelsiorEXC41619712011-02-22 00:00:0073Fast...52NormalOrganised25Deep47Press33NarrowCover
\n", "

5 rows × 29 columns

\n", "
" ], "text/plain": [ " id_x team_api_id team_fifa_api_id_x team_long_name \\\n", "317 26548 8674 1915 FC Groningen \n", "150 11822 4087 111271 Évian Thonon Gaillard FC \n", "456 43053 9906 240 Atlético Madrid \n", "374 35284 9807 1889 CF Os Belenenses \n", "355 27780 10218 1971 Excelsior \n", "\n", " team_short_name id_y team_fifa_api_id_y date \\\n", "317 GRO 427 1915 2010-02-22 00:00:00 \n", "150 ETG 411 111271 2011-02-22 00:00:00 \n", "456 AMA 95 240 2010-02-22 00:00:00 \n", "374 BEL 156 1889 2010-02-22 00:00:00 \n", "355 EXC 416 1971 2011-02-22 00:00:00 \n", "\n", " buildUpPlaySpeed buildUpPlaySpeedClass ... \\\n", "317 41 Balanced ... \n", "150 35 Balanced ... \n", "456 64 Balanced ... \n", "374 30 Slow ... \n", "355 73 Fast ... \n", "\n", " chanceCreationShooting chanceCreationShootingClass \\\n", "317 69 Lots \n", "150 65 Normal \n", "456 70 Lots \n", "374 60 Normal \n", "355 52 Normal \n", "\n", " chanceCreationPositioningClass defencePressure defencePressureClass \\\n", "317 Organised 30 Deep \n", "150 Organised 45 Medium \n", "456 Free Form 70 High \n", "374 Organised 30 Deep \n", "355 Organised 25 Deep \n", "\n", " defenceAggression defenceAggressionClass defenceTeamWidth \\\n", "317 30 Contain 30 \n", "150 55 Press 65 \n", "456 34 Press 55 \n", "374 30 Contain 30 \n", "355 47 Press 33 \n", "\n", " defenceTeamWidthClass defenceDefenderLineClass \n", "317 Narrow Cover \n", "150 Normal Cover \n", "456 Normal Offside Trap \n", "374 Narrow Offside Trap \n", "355 Narrow Cover \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "team_df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Team Level Analysis\n", "- Do you include the league attributes?\n", " - Yes you can\n", "- Do you include the team attributes?\n", " - Yes you can, this is the point of the analysis\n", "- Do you include the player attributes?\n", " - No! **unless you aggregate!**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# What is aggregation?\n", "- Combining observations from the same level of analysis into a single observation at a higher level of analysis\n", "- This follows the principle of tidy data, where each dataframe represents a single observational unit\n", " - e.g.: team data only, or player data only in the dataframe" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Match Analysis Example\n", "- **buildUpPlaySpeed** is a team attribute.\n", "- However, **overall_rating** is a player attribute.\n", " - You cannot include a single player overall_rating in the analysis of a team, because the value describe a single player.\n", " - However, if you calculate the **average_overall_rating** for all players in that team, you get a value that we can use to describe a team, because a team consists of players.\n", " - Any operatino to combine the overall_rating for all the players in the team will work:\n", " - Count, Sum, Min, Max, Std, Var, Mean, Median ...etc.\n", "- You can include all match attributes in the analysis of matches\n", "- You must aggregate player attribute to include it in match analysis\n", " " ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team_api_idbuildUpPlaySpeed
317867441
150408735
456990664
374980730
3551021873
\n", "
" ], "text/plain": [ " team_api_id buildUpPlaySpeed\n", "317 8674 41\n", "150 4087 35\n", "456 9906 64\n", "374 9807 30\n", "355 10218 73" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "team_df[[\"team_api_id\",\"buildUpPlaySpeed\"]].head()" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
player_api_idteam_api_idoverall_rating
765439366867464
105037422867471
2118160447867470
418032547867463
3081168047867470
\n", "
" ], "text/plain": [ " player_api_id team_api_id overall_rating\n", "765 439366 8674 64\n", "1050 37422 8674 71\n", "2118 160447 8674 70\n", "4180 32547 8674 63\n", "3081 168047 8674 70" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "player_df[[\"player_api_id\",\"team_api_id\",\"overall_rating\"]].head()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['id_x_x', 'player_api_id', 'player_name', 'player_fifa_api_id_x',\n", " 'birthday', 'height', 'weight', 'id_y_x', 'player_fifa_api_id_y',\n", " 'date_x', 'overall_rating', 'potential', 'preferred_foot',\n", " 'attacking_work_rate', 'defensive_work_rate', 'crossing', 'finishing',\n", " 'heading_accuracy', 'short_passing', 'volleys', 'dribbling', 'curve',\n", " 'free_kick_accuracy', 'long_passing', 'ball_control', 'acceleration',\n", " 'sprint_speed', 'agility', 'reactions', 'balance', 'shot_power',\n", " 'jumping', 'stamina', 'strength', 'long_shots', 'aggression',\n", " 'interceptions', 'positioning', 'vision', 'penalties', 'marking',\n", " 'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling',\n", " 'gk_kicking', 'gk_positioning', 'gk_reflexes', 'team_api_id', 'id_x_y',\n", " 'team_fifa_api_id_x', 'team_long_name', 'team_short_name', 'id_y_y',\n", " 'team_fifa_api_id_y', 'date_y', 'buildUpPlaySpeed',\n", " 'buildUpPlaySpeedClass', 'buildUpPlayDribbling',\n", " 'buildUpPlayDribblingClass', 'buildUpPlayPassing',\n", " 'buildUpPlayPassingClass', 'buildUpPlayPositioningClass',\n", " 'chanceCreationPassing', 'chanceCreationPassingClass',\n", " 'chanceCreationCrossing', 'chanceCreationCrossingClass',\n", " 'chanceCreationShooting', 'chanceCreationShootingClass',\n", " 'chanceCreationPositioningClass', 'defencePressure',\n", " 'defencePressureClass', 'defenceAggression', 'defenceAggressionClass',\n", " 'defenceTeamWidth', 'defenceTeamWidthClass',\n", " 'defenceDefenderLineClass'],\n", " dtype='object')" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First we have to merge based on team_api_id\n", "\n", "merged_df = player_df.merge(team_df, how=\"inner\", on=\"team_api_id\")\n", "merged_df.columns" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
player_nameteam_long_nameoverall_ratingbuildUpPlaySpeed
0Abdoulaye ToureFC Groningen6441
1Abella Perez DamiaFC Groningen7141
2Adam SmithFC Groningen7041
3Alan GowFC Groningen6341
4Adrian StoianFC Groningen7041
\n", "
" ], "text/plain": [ " player_name team_long_name overall_rating buildUpPlaySpeed\n", "0 Abdoulaye Toure FC Groningen 64 41\n", "1 Abella Perez Damia FC Groningen 71 41\n", "2 Adam Smith FC Groningen 70 41\n", "3 Alan Gow FC Groningen 63 41\n", "4 Adrian Stoian FC Groningen 70 41" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# notice that team entries are duplicated\n", "# and that we have an entry for every player\n", "merged_df[[\"player_name\",\"team_long_name\",\"overall_rating\",\"buildUpPlaySpeed\"]].head()" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
overall_rating
team_api_id
408762.090909
778867.909091
781968.000000
826267.818182
832266.000000
834267.090909
852669.636364
853569.000000
855967.545455
867468.000000
978966.727273
980769.363636
981067.363636
982565.454545
982670.636364
988072.363636
990666.272727
998769.000000
1021864.909091
20893167.454545
\n", "
" ], "text/plain": [ " overall_rating\n", "team_api_id \n", "4087 62.090909\n", "7788 67.909091\n", "7819 68.000000\n", "8262 67.818182\n", "8322 66.000000\n", "8342 67.090909\n", "8526 69.636364\n", "8535 69.000000\n", "8559 67.545455\n", "8674 68.000000\n", "9789 66.727273\n", "9807 69.363636\n", "9810 67.363636\n", "9825 65.454545\n", "9826 70.636364\n", "9880 72.363636\n", "9906 66.272727\n", "9987 69.000000\n", "10218 64.909091\n", "208931 67.454545" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To analyze teams, you must aggregate player observations if you want to include them\n", "# otherwise, you have to filter on team attributes and remove duplicates\n", "\n", "# let's aggregate overall_rating by calculating the average for the players\n", "merged_df[\n", " [\"player_name\",\"team_api_id\",\"team_long_name\",\"overall_rating\",\"buildUpPlaySpeed\"]\n", " ].groupby(\"team_api_id\").agg({\"overall_rating\":\"mean\"})" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id_xteam_api_idteam_fifa_api_id_xteam_long_nameteam_short_nameid_yteam_fifa_api_id_ydatebuildUpPlaySpeedbuildUpPlaySpeedClass...chanceCreationShootingClasschanceCreationPositioningClassdefencePressuredefencePressureClassdefenceAggressiondefenceAggressionClassdefenceTeamWidthdefenceTeamWidthClassdefenceDefenderLineClassoverall_rating
02654886741915FC GroningenGRO42719152010-02-22 00:00:0041Balanced...LotsOrganised30Deep30Contain30NarrowCover68.000000
1118224087111271Évian Thonon Gaillard FCETG4111112712011-02-22 00:00:0035Balanced...NormalOrganised45Medium55Press65NormalCover62.090909
2430539906240Atlético MadridAMA952402010-02-22 00:00:0064Balanced...LotsFree Form70High34Press55NormalOffside Trap66.272727
33528498071889CF Os BelenensesBEL15618892010-02-22 00:00:0030Slow...NormalOrganised30Deep30Contain30NarrowOffside Trap69.363636
427780102181971ExcelsiorEXC41619712011-02-22 00:00:0073Fast...NormalOrganised25Deep47Press33NarrowCover64.909091
\n", "

5 rows × 30 columns

\n", "
" ], "text/plain": [ " id_x team_api_id team_fifa_api_id_x team_long_name \\\n", "0 26548 8674 1915 FC Groningen \n", "1 11822 4087 111271 Évian Thonon Gaillard FC \n", "2 43053 9906 240 Atlético Madrid \n", "3 35284 9807 1889 CF Os Belenenses \n", "4 27780 10218 1971 Excelsior \n", "\n", " team_short_name id_y team_fifa_api_id_y date \\\n", "0 GRO 427 1915 2010-02-22 00:00:00 \n", "1 ETG 411 111271 2011-02-22 00:00:00 \n", "2 AMA 95 240 2010-02-22 00:00:00 \n", "3 BEL 156 1889 2010-02-22 00:00:00 \n", "4 EXC 416 1971 2011-02-22 00:00:00 \n", "\n", " buildUpPlaySpeed buildUpPlaySpeedClass ... \\\n", "0 41 Balanced ... \n", "1 35 Balanced ... \n", "2 64 Balanced ... \n", "3 30 Slow ... \n", "4 73 Fast ... \n", "\n", " chanceCreationShootingClass chanceCreationPositioningClass \\\n", "0 Lots Organised \n", "1 Normal Organised \n", "2 Lots Free Form \n", "3 Normal Organised \n", "4 Normal Organised \n", "\n", " defencePressure defencePressureClass defenceAggression \\\n", "0 30 Deep 30 \n", "1 45 Medium 55 \n", "2 70 High 34 \n", "3 30 Deep 30 \n", "4 25 Deep 47 \n", "\n", " defenceAggressionClass defenceTeamWidth defenceTeamWidthClass \\\n", "0 Contain 30 Narrow \n", "1 Press 65 Normal \n", "2 Press 55 Normal \n", "3 Contain 30 Narrow \n", "4 Press 33 Narrow \n", "\n", " defenceDefenderLineClass overall_rating \n", "0 Cover 68.000000 \n", "1 Cover 62.090909 \n", "2 Offside Trap 66.272727 \n", "3 Offside Trap 69.363636 \n", "4 Cover 64.909091 \n", "\n", "[5 rows x 30 columns]" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# simply merge it to team_df to start analyzing teams\n", "# but dont forget to reset_index to convert the index into a regular column\n", "ratings_df = merged_df[\n", " [\"player_name\",\"team_api_id\",\"team_long_name\",\"overall_rating\",\"buildUpPlaySpeed\"]\n", " ].groupby(\"team_api_id\").agg({\"overall_rating\":\"mean\"}).reset_index()\n", "\n", "team_df.merge(ratings_df, how=\"inner\", on=\"team_api_id\").head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Player Analysis Example\n", "- **overall_rating** is a player attribute\n", "- **buildUpPlaySpeed** is a team attribute\n", " - While this is an attribute that describes a team, this is the team that the player is part of\n", " - The player is affected by the overall performance of the team, and describes the **team that the player is part of**, so in a way, it is a player attribute\n", " - You will notice that the value of buildUpPlaySpeed does not change for players in the same team\n", "- You can include all player attribute to analyze and compare players\n", "- You can also include team attributes without problems, because they can also be considered player attribute" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
player_nameteam_api_idteam_long_nameoverall_ratingbuildUpPlaySpeed
0Abdoulaye Toure8674FC Groningen6441
1Abella Perez Damia8674FC Groningen7141
2Adam Smith8674FC Groningen7041
3Alan Gow8674FC Groningen6341
4Adrian Stoian8674FC Groningen7041
\n", "
" ], "text/plain": [ " player_name team_api_id team_long_name overall_rating \\\n", "0 Abdoulaye Toure 8674 FC Groningen 64 \n", "1 Abella Perez Damia 8674 FC Groningen 71 \n", "2 Adam Smith 8674 FC Groningen 70 \n", "3 Alan Gow 8674 FC Groningen 63 \n", "4 Adrian Stoian 8674 FC Groningen 70 \n", "\n", " buildUpPlaySpeed \n", "0 41 \n", "1 41 \n", "2 41 \n", "3 41 \n", "4 41 " ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can perform your analysis directly on marged_df\n", "# because the level of analysis is the player there\n", "merged_df[[\"player_name\",\"team_api_id\",\"team_long_name\",\"overall_rating\",\"buildUpPlaySpeed\"]].head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Aggregating With Transform\n", "\n", "If you want to create a column in merged_df that contains the average overall_rating then you use **transform**\n" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 68.000000\n", "1 68.000000\n", "2 68.000000\n", "3 68.000000\n", "4 68.000000\n", "5 68.000000\n", "6 68.000000\n", "7 68.000000\n", "8 68.000000\n", "9 68.000000\n", "10 68.000000\n", "11 62.090909\n", "12 62.090909\n", "13 62.090909\n", "14 62.090909\n", "Name: overall_rating, dtype: float64" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "merged_df.groupby(\"team_api_id\").transform(np.mean).overall_rating.head(15)" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id_x_xplayer_api_idplayer_nameplayer_fifa_api_id_xbirthdayheightweightid_y_xplayer_fifa_api_id_ydate_x...chanceCreationShootingClasschanceCreationPositioningClassdefencePressuredefencePressureClassdefenceAggressiondefenceAggressionClassdefenceTeamWidthdefenceTeamWidthClassdefenceDefenderLineClassmean_overall_rating
048439366Abdoulaye Toure2104501994-03-03 00:00:00187.961707662104502016-05-05 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
16837422Abella Perez Damia1595801982-04-15 00:00:00187.9617410511595802016-01-28 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
2129160447Adam Smith1908851991-04-29 00:00:00180.3417921191908852015-11-12 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
325332547Alan Gow1403071982-10-09 00:00:00182.8815441811403072015-02-27 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
4182168047Adrian Stoian1920721991-02-11 00:00:00177.8014630821920722016-02-04 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
524634268Alain Nef499391982-02-06 00:00:00190.501944057499392016-03-10 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
665302985Abel Khaled2075411992-11-09 00:00:00180.3414810232075412015-03-13 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
7206213366Afriyie Acquah2012231992-01-05 00:00:00177.8015434872012232016-05-12 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
87380592Aboubakar Oumarou2185481987-04-01 00:00:00182.8816811292185482015-04-01 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
927537503Albano Benjamin Bizzarri149071977-11-09 00:00:00193.041964532149072015-11-26 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
1037173955Abdoul Razzagui Camara1939531990-02-20 00:00:00177.801575551939532016-04-21 00:00:00...LotsOrganised30Deep30Contain30NarrowCover68.000000
11247182847Alain Pierre Mendy2093521989-11-17 00:00:00182.8815940882093522013-03-15 00:00:00...NormalOrganised45Medium55Press65NormalCover62.090909
12162121643Adrian Chomiuk1866291988-06-23 00:00:00182.8817927491866292010-08-30 00:00:00...NormalOrganised45Medium55Press65NormalCover62.090909
13243127255Akwetey Mensah1987811983-04-15 00:00:00172.7216340251987812010-08-30 00:00:00...NormalOrganised45Medium55Press65NormalCover62.090909
1451419681Abdul Aziz Tetteh1901931989-02-10 00:00:00182.881908031901932016-05-05 00:00:00...NormalOrganised45Medium55Press65NormalCover62.090909
\n", "

15 rows × 78 columns

\n", "
" ], "text/plain": [ " id_x_x player_api_id player_name player_fifa_api_id_x \\\n", "0 48 439366 Abdoulaye Toure 210450 \n", "1 68 37422 Abella Perez Damia 159580 \n", "2 129 160447 Adam Smith 190885 \n", "3 253 32547 Alan Gow 140307 \n", "4 182 168047 Adrian Stoian 192072 \n", "5 246 34268 Alain Nef 49939 \n", "6 65 302985 Abel Khaled 207541 \n", "7 206 213366 Afriyie Acquah 201223 \n", "8 73 80592 Aboubakar Oumarou 218548 \n", "9 275 37503 Albano Benjamin Bizzarri 14907 \n", "10 37 173955 Abdoul Razzagui Camara 193953 \n", "11 247 182847 Alain Pierre Mendy 209352 \n", "12 162 121643 Adrian Chomiuk 186629 \n", "13 243 127255 Akwetey Mensah 198781 \n", "14 51 419681 Abdul Aziz Tetteh 190193 \n", "\n", " birthday height weight id_y_x player_fifa_api_id_y \\\n", "0 1994-03-03 00:00:00 187.96 170 766 210450 \n", "1 1982-04-15 00:00:00 187.96 174 1051 159580 \n", "2 1991-04-29 00:00:00 180.34 179 2119 190885 \n", "3 1982-10-09 00:00:00 182.88 154 4181 140307 \n", "4 1991-02-11 00:00:00 177.80 146 3082 192072 \n", "5 1982-02-06 00:00:00 190.50 194 4057 49939 \n", "6 1992-11-09 00:00:00 180.34 148 1023 207541 \n", "7 1992-01-05 00:00:00 177.80 154 3487 201223 \n", "8 1987-04-01 00:00:00 182.88 168 1129 218548 \n", "9 1977-11-09 00:00:00 193.04 196 4532 14907 \n", "10 1990-02-20 00:00:00 177.80 157 555 193953 \n", "11 1989-11-17 00:00:00 182.88 159 4088 209352 \n", "12 1988-06-23 00:00:00 182.88 179 2749 186629 \n", "13 1983-04-15 00:00:00 172.72 163 4025 198781 \n", "14 1989-02-10 00:00:00 182.88 190 803 190193 \n", "\n", " date_x ... chanceCreationShootingClass \\\n", "0 2016-05-05 00:00:00 ... Lots \n", "1 2016-01-28 00:00:00 ... Lots \n", "2 2015-11-12 00:00:00 ... Lots \n", "3 2015-02-27 00:00:00 ... Lots \n", "4 2016-02-04 00:00:00 ... Lots \n", "5 2016-03-10 00:00:00 ... Lots \n", "6 2015-03-13 00:00:00 ... Lots \n", "7 2016-05-12 00:00:00 ... Lots \n", "8 2015-04-01 00:00:00 ... Lots \n", "9 2015-11-26 00:00:00 ... Lots \n", "10 2016-04-21 00:00:00 ... Lots \n", "11 2013-03-15 00:00:00 ... Normal \n", "12 2010-08-30 00:00:00 ... Normal \n", "13 2010-08-30 00:00:00 ... Normal \n", "14 2016-05-05 00:00:00 ... Normal \n", "\n", " chanceCreationPositioningClass defencePressure defencePressureClass \\\n", "0 Organised 30 Deep \n", "1 Organised 30 Deep \n", "2 Organised 30 Deep \n", "3 Organised 30 Deep \n", "4 Organised 30 Deep \n", "5 Organised 30 Deep \n", "6 Organised 30 Deep \n", "7 Organised 30 Deep \n", "8 Organised 30 Deep \n", "9 Organised 30 Deep \n", "10 Organised 30 Deep \n", "11 Organised 45 Medium \n", "12 Organised 45 Medium \n", "13 Organised 45 Medium \n", "14 Organised 45 Medium \n", "\n", " defenceAggression defenceAggressionClass defenceTeamWidth \\\n", "0 30 Contain 30 \n", "1 30 Contain 30 \n", "2 30 Contain 30 \n", "3 30 Contain 30 \n", "4 30 Contain 30 \n", "5 30 Contain 30 \n", "6 30 Contain 30 \n", "7 30 Contain 30 \n", "8 30 Contain 30 \n", "9 30 Contain 30 \n", "10 30 Contain 30 \n", "11 55 Press 65 \n", "12 55 Press 65 \n", "13 55 Press 65 \n", "14 55 Press 65 \n", "\n", " defenceTeamWidthClass defenceDefenderLineClass mean_overall_rating \n", "0 Narrow Cover 68.000000 \n", "1 Narrow Cover 68.000000 \n", "2 Narrow Cover 68.000000 \n", "3 Narrow Cover 68.000000 \n", "4 Narrow Cover 68.000000 \n", "5 Narrow Cover 68.000000 \n", "6 Narrow Cover 68.000000 \n", "7 Narrow Cover 68.000000 \n", "8 Narrow Cover 68.000000 \n", "9 Narrow Cover 68.000000 \n", "10 Narrow Cover 68.000000 \n", "11 Normal Cover 62.090909 \n", "12 Normal Cover 62.090909 \n", "13 Normal Cover 62.090909 \n", "14 Normal Cover 62.090909 \n", "\n", "[15 rows x 78 columns]" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# simply assign this column to merged_df and give it an appropriate name\n", "\n", "merged_df[\"mean_overall_rating\"] = merged_df.groupby(\"team_api_id\").transform(np.mean).overall_rating\n", "merged_df.head(15)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Performing Analysis\n", "- We combine data into single dataframe\n", "- With varying levels of analysis, we have varying degrees of variability because of duplication\n", " - Remember how all players in the same team share the same value of the team attribute buildUpPlaySpeed\n", "- When we combine data into a single dataframe we have **non-normal form** data with lots of duplicated values\n", " - Normal form is a database term, not stats\n", " - Data stored in a data is usually in normal form\n", " - While some values might be duplicated, the records are not\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary\n", "- Level of analysis is important\n", "- You can include variables from higher levels of analysis without issues\n", " - Be aware that you might not have variability\n", "- Including variables from lower levels of analysis requires aggregation\n", " - You aggregate in many different ways: Sums, Counts, Min, Max, Mean, Median, Mode ..etc\n", " - Aggregation is to produce a single scalar value from a group of values\n", "- Represent many observations at a lower level into a single value at a higher level\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Final Note About Groupby\n", "\n", "- You don't have to have different levels of analysis to use groupby and aggregations\n", "- You can use agg() and transform() with group by to analyze subgroups\n", " - Just group by the variable you want to create subgroups from\n", " - groupby should be given a categorical or discrete variable\n", " - subgroups can be created from a combination of variables" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# At the team level of analysis\n", "# create 4 new columns from player sprint_speed data:\n", "# mean_sprint_speed, max_sprint_speed, min_sprint_speed, and std_sprint_speed\n" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# at the player unit of analysis \n", "# create 4 new columns from player sprint_speed data:\n", "# mean_sprint_speed, max_sprint_speed, min_sprint_speed, and std_sprint_speed" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }