# Labels are used to select columns
weather_df["location"].head()
0 Seattle 1 Seattle 2 Seattle 3 Seattle 4 Seattle Name: location, dtype: object
# List or tuple of labels can also be passed to select multiple columns and their location
weather_df[["wind","location"]].head()
wind | location | |
---|---|---|
0 | 4.7 | Seattle |
1 | 4.5 | Seattle |
2 | 2.3 | Seattle |
3 | 4.7 | Seattle |
4 | 6.1 | Seattle |
# When using numeric slicing, it is for selecting/slicing rows
# Works exactly like python list slicing
weather_df[1:5]
#Can you fetch the last 5 records?
location | date | precipitation | temp_max | temp_min | wind | weather | |
---|---|---|---|---|---|---|---|
1 | Seattle | 2012-01-02 00:00 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
2 | Seattle | 2012-01-03 00:00 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
3 | Seattle | 2012-01-04 00:00 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
4 | Seattle | 2012-01-05 00:00 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
# Fetch rows based on index number Look at the far left column
weather_df.loc[1]
location Seattle date 2012-01-02 00:00 precipitation 10.9 temp_max 10.6 temp_min 2.8 wind 4.5 weather rain Name: 1, dtype: object
# You can also use the python list slicing syntax to fetch multiple rows
weather_df.loc[5:10]
location | date | precipitation | temp_max | temp_min | wind | weather | |
---|---|---|---|---|---|---|---|
5 | Seattle | 2012-01-06 00:00 | 2.5 | 4.4 | 2.2 | 2.2 | rain |
6 | Seattle | 2012-01-07 00:00 | 0.0 | 7.2 | 2.8 | 2.3 | rain |
7 | Seattle | 2012-01-08 00:00 | 0.0 | 10.0 | 2.8 | 2.0 | sun |
8 | Seattle | 2012-01-09 00:00 | 4.3 | 9.4 | 5.0 | 3.4 | rain |
9 | Seattle | 2012-01-10 00:00 | 1.0 | 6.1 | 0.6 | 3.4 | rain |
10 | Seattle | 2012-01-11 00:00 | 0.0 | 6.1 | -1.1 | 5.1 | sun |
# You can use a list of ids to fetch
weather_df.loc[[1,5,7,10]]
location | date | precipitation | temp_max | temp_min | wind | weather | |
---|---|---|---|---|---|---|---|
1 | Seattle | 2012-01-02 00:00 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
5 | Seattle | 2012-01-06 00:00 | 2.5 | 4.4 | 2.2 | 2.2 | rain |
7 | Seattle | 2012-01-08 00:00 | 0.0 | 10.0 | 2.8 | 2.0 | sun |
10 | Seattle | 2012-01-11 00:00 | 0.0 | 6.1 | -1.1 | 5.1 | sun |
# Negative indecies don't work!
weather_df.loc[-1]
# Why?
# you can also set the column you want like so
weather_df.loc[1:5, ["location","weather"]]
It is exactly like loc, but uses numeric indecies
weather_df.loc[1:5]
# Negative indicies work this time!
weather_df.iloc[-1]
# why?
# selecting columns is also numeric
weather_df.iloc[:,0:2].head()
# Sort based on temp_max
weather_df.sort_values(by="temp_max").head()
# is this ascending or descending order?
# To sort in descending order, set ascending argument to False
weather_df.sort_values(by="temp_max", ascending=False).head()
# Seems like New York has the highest and lowest tempratures!
# Sort weather_df by percipitation in ascending order
# Notice how percipitation is 0 for many observations
# To sort by percipitation, then by wind speed, both ascending, do the following:
weather_df.sort_values(by=["precipitation", "wind"]).head()
# To sort by percipitation ascending, then by wind speed descending, do the following:
weather_df.sort_values(by=["precipitation", "wind"], ascending=[True, False]).head()
Selecting rows based on logical conditions. e.g., weather observations in New York, or observations where wind speed is higher than 10
You use conditions very similar to Python conditions in syntax, with some slight variation
# to fetch observations for New York
weather_df[weather_df["location"] == "New York"].head()
# Perform the same filter using dot notation
# Filter observations where wind is higher than 10
# Now try to find out how many observations there are using 2 different methods
# Filter all observations where temp_min is less than or equal to zero and weather is rain
weather_df[(weather_df.temp_min < 0) & (weather_df.weather == "rain")].head()
# The parantheses are important!
# You can write it over multiple lines to be easier to read
weather_df[
(weather_df.temp_min < 0) &
(weather_df.weather == "rain")
].head()
# You can also use 'or' in the condition
weather_df[
(weather_df.temp_min < 0) &
(
(weather_df.weather == "rain") |
(weather_df.weather == "snow")
)
].head(10)
You can use these in filteration conditions
# Filter using method isin to find observations where whether is either rain or snow
# Calculating the temprature range
# Try to store it in a column called temp_range
# be sure to try dot and index notations
# Calculate the mean range and store it in a column called mean_range
# Calculate the mean centered value of range
# mean centering = temp_range - mean_range
# tells us how much the observation is different from the mean
# name the collumn mc_range
# Caclulate the square of mc_range
# name the new column mc_range_sq
# calculate the natural log of mc_range and use name mc_range_log
# tip: search numpy
# be sure to examine the data, what do you see?
# What should you do?
# Is fillna(0) a good idea?
# based on what you know so far,
# try to plot range, mc_range, and mc_range_sq
# try to plot the distributions for the new range columns
# hint: search for histograms
# try to count the number of observations where the temprature change is above average
# can you produce a scalar value instead of a column?
# Can you calculate the ratio?
# Can you calculate the percentage?
# Calculate the average temprature for the day
# hint: use temp_max and temp_min
# plot the average temp
# plot the distribution for average temp
# compare the distribution of average temp with mc_range
# find the days in which the average temprature is below zero and it is snowing
# calculate the percentage of these days