Time series are graphs where the x-axis is a time value.
The line mark is the typical choice for time series because it shows change over time.
# Let's load our data
# using altair
import pandas as pd
import altair as alt
# you need a dataset
cars_df = pd.read_json("https://github.com/vega/vega-datasets/raw/gh-pages/data/cars.json")
# notice the year column
cars_df.head()
| | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Name | Origin | Weight_in_lbs | Year |
---|---|---|---|---|---|---|---|---|---|
0 | 12.0 | 8 | 307 | 130 | 18 | chevrolet chevelle malibu | USA | 3504 | 1970-01-01 |
1 | 11.5 | 8 | 350 | 165 | 15 | buick skylark 320 | USA | 3693 | 1970-01-01 |
2 | 11.0 | 8 | 318 | 150 | 18 | plymouth satellite | USA | 3436 | 1970-01-01 |
3 | 12.0 | 8 | 304 | 150 | 16 | amc rebel sst | USA | 3433 | 1970-01-01 |
4 | 10.5 | 8 | 302 | 140 | 17 | ford torino | USA | 3449 | 1970-01-01 |
# Let's find out how gas mileage improved over the years
alt.Chart(cars_df).mark_line().encode(x='Miles_per_Gallon', y='Year')
# where did I go wrong? The axes are swapped: time belongs on the x-axis
alt.Chart(cars_df).mark_line().encode(y='Miles_per_Gallon', x='Year')
# How can we improve? let's fix the year
# we want to convert into year only i.e. 1970
cars_df['Year'].head()
0    1970-01-01
1    1970-01-01
2    1970-01-01
3    1970-01-01
4    1970-01-01
Name: Year, dtype: object
# use apply to perform a function on every element in series
# lambda is a way to quickly define simple one line functions
cars_df['Year'].apply(lambda x:x.split("-")[0]).head()
0    1970
1    1970
2    1970
3    1970
4    1970
Name: Year, dtype: object
# let's update
cars_df['Year'] = cars_df['Year'].apply(lambda x:x.split("-")[0])
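The same year extraction can be written in a few equivalent ways. The sketch below uses a small stand-in Series (not the real dataset) to compare the `apply` version above with the vectorized `.str` accessor and with `pd.to_datetime`; the variable names are illustrative only.

```python
import pandas as pd

# Small stand-in for the 'Year' column (strings such as "1970-01-01")
years = pd.Series(["1970-01-01", "1971-01-01", "1982-01-01"], name="Year")

# Element-wise version, as used above
via_apply = years.apply(lambda x: x.split("-")[0])

# Vectorized string accessor: same strings, no lambda needed
via_str = years.str.split("-").str[0]

# Parsing to real datetimes and reading the year gives integers instead of strings
via_datetime = pd.to_datetime(years).dt.year

print(via_apply.tolist())
print(via_str.tolist())
print(via_datetime.tolist())
```

Note that the first two approaches keep the year as a string, while the datetime approach yields integers, which matters if the column is later sorted or compared numerically.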
# let's show the new graph
alt.Chart(cars_df).mark_line().encode(y='Miles_per_Gallon', x='Year')
# How can we improve? we have multiple observations per year, so let's get the average
alt.Chart(cars_df).mark_line().encode(y='mean(Miles_per_Gallon)', x='Year')
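To sanity-check what `mean(Miles_per_Gallon)` plots, the same per-year average can be computed in pandas. This is a minimal sketch on made-up numbers (the `sample` frame is hypothetical, not the cars data):

```python
import pandas as pd

# Toy data: two observations per year
sample = pd.DataFrame({
    "Year": ["1970", "1970", "1971", "1971"],
    "Miles_per_Gallon": [18.0, 15.0, 25.0, 27.0],
})

# One mean value per year, which is what the aggregated line chart draws
yearly_mean = sample.groupby("Year")["Miles_per_Gallon"].mean()
print(yearly_mean)
```

Running the same `groupby` on `cars_df` should give the values traced by the line in the chart.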
# Can we improve the x axis using altair channels?
# Let's set the year type to time
alt.Chart(cars_df).mark_line().encode(y='mean(Miles_per_Gallon)', x='Year:T')
# We are back to dates at the bottom, let's fix the time unit
# We have to use alt.X to give additional options for the x-axis
# We could have done this without changing the dataframe originally
alt.Chart(cars_df).mark_line().encode(
y='mean(Miles_per_Gallon)',
x=alt.X('Year:T', timeUnit='year'),
)
# clearly the average miles per gallon is improving over the years
# is it true for all countries? let's improve our graph
alt.Chart(cars_df).mark_line().encode(
y='mean(Miles_per_Gallon)',
x=alt.X('Year:T', timeUnit='year'),
color='Origin',
)
# All are improving, but the averages for Europe and Japan are better than for the USA
# Could it be from the type of vehicle? let's further break down based on cylinders
# Since we can only pass a single variable for color
# let's use row or column to draw graphs for cylinders
alt.Chart(cars_df).mark_line().encode(
y='mean(Miles_per_Gallon)',
x=alt.X('Year:T', timeUnit='year'),
color='Origin',
row='Cylinders',
)
# Let's filter the data and include only 4 and 6 cylinders
filtered_df = cars_df[cars_df.Cylinders.isin([4,6])]
alt.Chart(filtered_df).mark_line().encode(
y='mean(Miles_per_Gallon)',
x=alt.X('Year:T', timeUnit='year'),
color='Origin',
row='Cylinders',
)
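The filter above relies on a boolean mask built with `isin`. A small sketch on a toy frame (the `sample` data is invented for illustration) shows what the mask looks like and which rows survive:

```python
import pandas as pd

# Toy frame with a mix of cylinder counts
sample = pd.DataFrame({
    "Cylinders": [8, 4, 6, 3, 4, 5],
    "Origin": ["USA", "Japan", "USA", "Japan", "Europe", "Europe"],
})

# True where the row's cylinder count is one of the wanted values
mask = sample["Cylinders"].isin([4, 6])

# Indexing with the mask keeps only the rows marked True
filtered_sample = sample[mask]
print(filtered_sample)
```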
# Let's improve our labels
alt.Chart(filtered_df).mark_line().encode(
y=alt.Y('mean(Miles_per_Gallon)', title='Average MPG'),
x=alt.X('Year:T', timeUnit='year', title='Year'),
color='Origin',
row='Cylinders',
)
cars_df.head()
| | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Name | Origin | Weight_in_lbs | Year |
---|---|---|---|---|---|---|---|---|---|
0 | 12.0 | 8 | 307 | 130 | 18 | chevrolet chevelle malibu | USA | 3504 | 1970 |
1 | 11.5 | 8 | 350 | 165 | 15 | buick skylark 320 | USA | 3693 | 1970 |
2 | 11.0 | 8 | 318 | 150 | 18 | plymouth satellite | USA | 3436 | 1970 |
3 | 12.0 | 8 | 304 | 150 | 16 | amc rebel sst | USA | 3433 | 1970 |
4 | 10.5 | 8 | 302 | 140 | 17 | ford torino | USA | 3449 | 1970 |
# let's extract the manufacturer from the name
cars_df.Name.apply(lambda x:x.split()[0]).head()
0    chevrolet
1        buick
2     plymouth
3          amc
4         ford
Name: Name, dtype: object
# Seems to work, let's assign it to the column: Manufact
cars_df["Manufact"] = cars_df.Name.apply(lambda x:x.split()[0])
# let's find out which manufacturers are present
cars_df.Manufact.value_counts()
ford             53
chevrolet        44
plymouth         32
amc              29
dodge            28
toyota           25
datsun           23
buick            17
pontiac          16
volkswagen       16
honda            13
mercury          11
mazda            10
oldsmobile       10
peugeot           8
fiat              8
audi              7
chrysler          6
vw                6
volvo             6
saab              5
renault           5
subaru            4
opel              4
chevy             3
cadillac          2
maxda             2
mercedes-benz     2
bmw               2
toyouta           1
citroen           1
mercedes          1
vokswagen         1
chevroelt         1
hi                1
nissan            1
triumph           1
capri             1
Name: Manufact, dtype: int64
# let's visually represent this data to see who is represented the most in our data
alt.Chart(cars_df).mark_bar().encode(
x='Manufact',
y='count(*)',
)
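The `value_counts()` output includes spellings such as `chevy`, `chevroelt`, `vw`, `vokswagen`, `maxda` and `toyouta`, which look like variants of other manufacturers in the list. One possible cleanup, sketched here on invented names, is to map the variants onto a canonical spelling before counting. The `fixes` mapping is an assumption about which entries are the same manufacturer, not something the dataset states.

```python
import pandas as pd

# Illustrative car names (not taken from the dataset)
names = pd.Series([
    "chevrolet model a", "chevy model b", "vw model c",
    "vokswagen model d", "toyouta model e", "maxda model f", "ford model g",
])

# First word of the name, as in the Manufact column above
manufact = names.str.split().str[0]

# Assumed mapping from variant spellings to a canonical name
fixes = {
    "chevy": "chevrolet",
    "chevroelt": "chevrolet",
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
    "maxda": "mazda",
    "toyouta": "toyota",
}

# Values not listed in the mapping are left unchanged
cleaned = manufact.replace(fixes)
counts = cleaned.value_counts()
print(counts)
```

Whether to merge these entries is a judgment call; the report on data transformations should record any mapping that is applied.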
# Let's break down the cars by cylinders
alt.Chart(cars_df).mark_bar().encode(
x='Manufact',
y='count(*)',
color='Cylinders:N',
)
# Let's order the bars by count, largest first
alt.Chart(cars_df).mark_bar().encode(
x=alt.X('Manufact', sort=alt.SortField(field='*', op='count', order='descending')),
y='count(*)',
color='Cylinders:N',
)
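The descending order can also be derived in pandas, since `value_counts()` returns counts sorted from largest to smallest. The sketch below uses a toy Series; passing the resulting list as the `sort` argument of `alt.X` is an assumption that depends on the Altair version in use.

```python
import pandas as pd

# Toy manufacturer column with distinct counts per name
manufact = pd.Series(
    ["ford", "ford", "ford", "toyota", "toyota", "amc"], name="Manufact"
)

# value_counts sorts by count, descending, by default
counts = manufact.value_counts()

# The index, in that order, is the desired bar order
bar_order = counts.index.tolist()
print(bar_order)
```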
Prepare a report/table/list that details data definitions and characteristics, and details any transformations performed to the data.
The report should also explain whether the data is sufficient for the targeted approaches and what additional information is needed to complete the analysis
Data should be ready for the analysis after this step
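A table of data definitions and characteristics can be started from the DataFrame itself. The helper below is a hypothetical sketch (`describe_columns` is not part of pandas) that lists each column's type, missing values and number of distinct values; the descriptions of what each column means still have to be written by hand.

```python
import pandas as pd


def describe_columns(df):
    """Return one row per column with basic characteristics."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # storage type of each column
        "non_null": df.notna().sum(),     # number of present values
        "missing": df.isna().sum(),       # number of missing values
        "unique": df.nunique(),           # distinct values, ignoring missing
    })


# Toy frame with one missing value, for illustration only
sample = pd.DataFrame({
    "Horsepower": [130.0, None, 150.0],
    "Origin": ["USA", "USA", "Japan"],
})

summary = describe_columns(sample)
print(summary)
```

Calling `describe_columns(cars_df)` would give a starting point for the data section of the report.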
Write everything down
Include appropriate graphs from analysis
![]( image_url_here )
The results section should highlight the main findings/insights from the analysis