Data visualization is the effort to understand data by placing it in a visual context.

Source: Times Higher Education
Tufte, E. R. (2001). The visual display of quantitative information.
Resources for the 424 Info Vis course at the University of Washington, by Prof. Maureen Stone and Prof. Polle Zellweger.
# using altair
import pandas as pd
import altair as alt
# you need a dataset
cars_df = pd.read_json("https://github.com/vega/vega-datasets/raw/gh-pages/data/cars.json")
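# you can also load the sample data provided with altair using
# cars_df = alt.load_dataset('cars')
# (with Altair 2 and later, the vega_datasets package provides the same data:
#  from vega_datasets import data; cars_df = data.cars())
# for list of data sets, run the following command in jupyter:
# alt.datasets.list_datasets()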
# Build the chart and configure it
chart = alt.Chart(cars_df).mark_circle().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
)
# display it
chart
# Same chart on matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
# loop over the groups so that each origin is plotted in a different color
for origin, group in cars_df.groupby('Origin'):
    plt.plot(group['Horsepower'], group['Miles_per_Gallon'],
             'o', label=origin)
# set the legend and labels
plt.legend(title='Origin')
plt.xlabel('Horsepower')
plt.ylabel('Miles_per_Gallon');
# enable grid
plt.grid(True)
Chart( data ).mark_type( options ).encode( channels )
The six parts, in order: 1 Chart, 2 data, 3 mark_type, 4 options, 5 encode, 6 channels.
# alternatively you can reverse mark and encode
Chart( data ).encode( channels ).mark_type( options )
Chart: constructs the chart object (OOP).
data: tells Altair what data set to use for the plot. It can be a pandas DataFrame, like cars_df above, or a URL to a data file:
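# url also works
url = 'https://vega.github.io/vega-datasets/data/cars.json'
alt.Chart(url).mark_circle().encode(
    # with a url Altair cannot inspect the columns, so state each type explicitly
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin:N',  # a bare "Origin" fails here because its type cannot be inferred
)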
mark_type: tells Altair how to represent values on the chart. The marks used in these notes include mark_circle, mark_square, mark_area and mark_bar.
# we can use this command to display multiple charts from a single cell
chart.display()
# let's modify our chart
chart = chart.mark_area()  # reassign so that the change is kept
# try other mark_* types
chart.display()  # this will show the modified plot
# These are options that affect all the points
chart.mark_square(opacity=0.3, size=100)
encode( channels ): tells Altair how to map columns of the data onto visual properties of the marks, such as x and y position, color and size.
These properties are referred to as channels.
# plot Displacement vs Cylinders
chart.encode(x="Displacement", y="Cylinders")
# notice how previous options remain if not changed (like color)
# it's better to create a new chart object for new charts
# so that it is not affected by previous changes
alt.Chart(cars_df).mark_circle().encode(x="Displacement", y="Cylinders")
# Notice how values are no longer colored
alt.Chart(cars_df).mark_bar().encode(
    x="Cylinders",
    y="count(*)",
)
You can describe an aggregation for an axis value using the following format:
'aggregation(variable)'
Use * in place of variable to refer to any row/observation, as in 'count(*)'.
The functions include: sum, mean, median, variance, stdev, distinct, and more.
# add :N to treat Cylinders as a nominal (categorical) value
alt.Chart(cars_df).mark_bar().encode(
    x="Cylinders:N",
    y="count(*)",
)
'sales:Q' tells Altair that the sales column is a quantitative value.

| Data Type | Letter | Description |
|---|---|---|
| quantitative | Q | a continuous real-valued quantity |
| ordinal | O | a discrete ordered quantity |
| nominal | N | a discrete unordered category |
| temporal | T | a time or date value |
# you can use column or row to split the graphs based on group
# this is called a trellis plot
alt.Chart(cars_df).mark_bar().encode(
    column="Origin",
    x="Cylinders:N",
    y="count(*)",
)

# use color instead to show the groups within a single chart
alt.Chart(cars_df).mark_bar().encode(
    color="Origin",
    x="Cylinders:N",
    y="count(*)",
)
# size scales each circle by the value of a column
alt.Chart(cars_df).mark_circle().encode(
    color="Origin",
    size="Cylinders",
    x="Miles_per_Gallon",
    y="Horsepower",
)

# the same chart with size mapped to weight
alt.Chart(cars_df).mark_circle().encode(
    color="Origin",
    size="Weight_in_lbs",
    x="Miles_per_Gallon",
    y="Horsepower",
)
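# raise the limit on the number of rows a chart will accept (5000 by default)
# (with Altair 2 and later: alt.data_transformers.enable('default', max_rows=10000))
chart.max_rows = 10000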