Intro to Pandas
While built-in Python functions suffice for general programming, data analysis requires additional functions and objects. Pandas is a popular data analysis package that is simple to use from the moment you start Python.
The primary reason Pandas is popular for data science and analysis is that it introduces three useful objects that mirror similar data structures from other statistical packages and don’t add complexity to the simple Pythonic syntax. These objects are:
- The DataFrame
- The Series
- The Index
This guide is not meant to be comprehensive; the most complete documentation for Pandas is found here. The purpose of this guide is to introduce a potential user to the functionality of Pandas and give a general overview of how to accomplish basic data analysis tasks.
Getting Started
Importing Pandas
Since Pandas is not native to Python, you will have to install it. Fortunately, Pandas has grown so much in popularity that most distributions (including Anaconda) already include it, but you will still have to load the package. This code will do the trick:

import pandas
However, because Pandas is used so often, the Python community loads pandas as pd for ease of access:
import pandas as pd
It’s also useful to import a few other packages, most notably Numpy.
import numpy as np
DataFrame
The DataFrame object is one of the main contributions of Pandas. The DataFrame is a 2-dimensional labeled structure with columns and rows; the columns and the rows each represent one dimension. The DataFrame is analogous to a data frame in R, a dataset in Stata, or an Excel spreadsheet. Or, put in more technical terms, the DataFrame is a tabular data structure. The code below defines a DataFrame and then prints it out to view:
d = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})
d
 | one | two |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
The name of our DataFrame is d, and it has two columns, one and two, and four rows, 0-3. (Python is 0-indexed, meaning it counts from 0…1…2…and so on.)
Each data point can be referenced through its Index (which corresponds to its row, the leftmost value), and the column signifies what the value means. We can call a single Pandas column with this command:
d['one']
0 1.0
1 2.0
2 3.0
3 4.0
Name: one, dtype: float64
The values on the left represent the Index we saw earlier. Notice: a single column is itself a Pandas object, the Series. More on this object below.
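You can confirm this by checking the column's type; a quick sketch using the DataFrame d from above:

# A single column of a DataFrame is a Series
print(type(d['one']))  # <class 'pandas.core.series.Series'>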
Series
A Series is a one-dimensional indexed array that can contain any Python data type. To create a Series, use the pd.Series() function:
series_ex = pd.Series([1,2,3,4])
A Series in Pandas is visually similar to a list, but there are key distinctions in how they operate. As mentioned, Pandas is built for data analysis, so a Series has functions that allow data analysis to be done easily, while a list would require either a for loop or a list comprehension for the same operations. An example of this is below:
series_ex = series_ex*2
print("This is a Series multiplied by two")
print(series_ex)
This is a Series multiplied by two
0 2
1 4
2 6
3 8
dtype: int64
alist = [1,2,3,4]
alist = alist*2
print("This is a List multiplied by two")
print(alist)
This is a List multiplied by two
[1, 2, 3, 4, 1, 2, 3, 4]
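For comparison, here is a minimal sketch of what the same element-wise operation looks like with a plain list and a list comprehension:

# Element-wise doubling of a list requires a comprehension (or a for loop)
alist = [1, 2, 3, 4]
print([x * 2 for x in alist])  # [2, 4, 6, 8]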
Index
Both the Series and the DataFrame have an Index that signifies order and allows for referencing specific points. The Index itself is an object.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
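Since the Index is array-like, specific points can be referenced by position; a brief sketch using the Index above:

# Positional lookup and slicing on an Index
print(ind[1])    # 3
print(ind[1:4])  # Int64Index([3, 5, 7], dtype='int64')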
Importing Data
Pandas can read and export multiple data formats: CSV, DTA, SAS files, JSON files, SQL, and many more. Almost all data reading follows the format:
pd.read_(NAME OF DATA FORMAT)('FILE_NAME')
Let’s take a look at a few examples.
Stata Files (DTA)
# Importing data from Stata
pd.read_stata('data/State_ETO_short.dta')
 | state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat |
---|---|---|---|---|---|---|---|
0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
This is how we read in a DataFrame, but we still need to store and name it. This can be accomplished in one line:
df = pd.read_stata('data/State_ETO_short.dta')
Now, when we call df, we’ll get the DataFrame that corresponds to the data referenced in pd.read_stata('data/State_ETO_short.dta').
df
 | state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat |
---|---|---|---|---|---|---|---|
0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
Excel Files (XLSX)
(Note: You may need to pip install openpyxl for this to work!)
# Importing data from excel into pandas
df = pd.read_excel('data/Minimum_Wage_Data_Short.xlsx')
df
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25000 | 1.00000 | 34.783333 | 9.00 | 7.20 |
Other Data Types
Similar variants exist for each common data type: CSV, SAS, and so on.
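For example, a CSV read follows the same pattern (the file name here is hypothetical):

# Importing data from a CSV
df_csv = pd.read_csv('data/example.csv')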
Exporting Data
Exporting data is very simple as well and follows a pattern similar to importing.
CSV
df.to_csv('exports/Minimum_Wage_Data.csv')
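The same pattern extends to other formats. For example, a sketch of an Excel export (this output file is hypothetical, and openpyxl may be required):

# Exporting to Excel; index=False omits the row index from the file
df.to_excel('exports/Minimum_Wage_Data.xlsx', index=False)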
Converting Data
Now that we know how to load in data, it will be useful to examine ways to convert already existing data structures into DataFrames.
List
my_list = [1,2,3,4,5,6,7,8,9]
columns = ['a', 'b', 'c']

pd.DataFrame(np.array(my_list).reshape(3,3), columns=columns)
 | a | b | c |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
An important thing to note: pd.DataFrame has a multitude of options. The most important of these is the first argument, which is the data to be transformed. Here we transform the list [1,2,3,4,5,6,7,8,9] into an np.array with the shape:

[[1,2,3],
[4,5,6],
[7,8,9]]
Then we transform the data to a Pandas DataFrame which gives us:
0 1 2 3
1 4 5 6
2 7 8 9
Finally, we add a list of column names with the option columns = columns to get the final DataFrame.

a b c
0 1 2 3
1 4 5 6
2 7 8 9
Dictionary
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
 | col_1 | col_2 |
---|---|---|
0 | 3 | a |
1 | 2 | b |
2 | 1 | c |
3 | 0 | d |
Numpy Array
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(5, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

pd.DataFrame(values, index=index).head()
 | Col1 | Col2 | Col3 |
---|---|---|---|
Row1 | 0 | 0.0 | 0.0 |
Row2 | 0 | 0.0 | 0.0 |
Row3 | 0 | 0.0 | 0.0 |
Row4 | 0 | 0.0 | 0.0 |
Row5 | 0 | 0.0 | 0.0 |
Now that we know how to load in our DataFrame, we will try to view and manipulate our data before our analysis is run.
Viewing Data
We know that we can view our data by simply printing the DataFrame, but what if our data is too large? The .head() function prints out the first 5 rows (inside the parentheses you can specify the first N observations you want to see).
df.head()
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
df.head(2)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.0 | 0.0 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.1 | 2.1 | 34.783333 | 15.12 | 15.12 |
You can also do all of these methods on a single Series.
df['Year'].head()
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
Or, you can view multiple columns at one time.
df[['State', 'Table_Data']].head()
 | State | Table_Data |
---|---|---|
0 | Alabama | NaN |
1 | Alaska | 2.1 |
2 | Arizona | 18.72 - 26.40/wk(b) |
3 | Arkansas | 1.25/day(b) |
4 | California | 1.65(b) |
It’s also good to view general information on the DataFrame with .info(), and to understand the datatypes of each of the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 6 non-null int64
1 State 6 non-null object
2 Table_Data 5 non-null object
3 High_Value 6 non-null float64
4 Low_Value 6 non-null float64
5 CPI_Average 6 non-null float64
6 High_2018 6 non-null float64
7 Low_2018 6 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 512.0+ bytes
df.dtypes
Year int64
State object
Table_Data object
High_Value float64
Low_Value float64
CPI_Average float64
High_2018 float64
Low_2018 float64
dtype: object
Similar to our use of head(), we can use tail() to examine the last few data points.
df.tail(2)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
4 | 1968 | California | 1.65(b) | 1.65 | 1.65 | 34.783333 | 11.88 | 11.88 |
5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25 | 1.00 | 34.783333 | 9.00 | 7.20 |
Subsetting Data
Slicing the data by rows
We can do traditional slicing through the index with df[start:end]; note that the end point is exclusive. This will subset by rows.
df[1:4]
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
Slicing the data by columns
To subset by columns, we can use df[['columns_we_want_1', 'columns_we_want_2']].
columns_you_want = ['State', 'Table_Data']
df[columns_you_want].head()
 | State | Table_Data |
---|---|---|
0 | Alabama | NaN |
1 | Alaska | 2.1 |
2 | Arizona | 18.72 - 26.40/wk(b) |
3 | Arkansas | 1.25/day(b) |
4 | California | 1.65(b) |
Let’s bring in a larger dataset for this analysis to be meaningful. This will be a panel dataset of states, with one observation per state for every year.
df = pd.read_excel("data/Minimum_Wage_Data.xlsx")
print("These are the first five observations: \n")
print(df[['Year', 'State']].head(5))
print("\n")
print("These are the last five observations: \n")
print(df[['Year', 'State']].tail(5))
These are the first five observations:
Year State
0 1968 Alabama
1 1968 Alaska
2 1968 Arizona
3 1968 Arkansas
4 1968 California
These are the last five observations:
Year State
2745 2017 Virginia
2746 2017 Washington
2747 2017 West Virginia
2748 2017 Wisconsin
2749 2017 Wyoming
Viewing the size and columns of the data:
print(len(df))
print(df.columns)
2750
Index(['Year', 'State', 'Table_Data', 'High_Value', 'Low_Value', 'CPI_Average',
'High_2018', 'Low_2018'],
dtype='object')
We have 8 variables: Year, State, Table_Data, High_Value, Low_Value, CPI_Average, High_2018, and Low_2018. Sometimes in our analysis we only want to keep certain years. For this, traditional boolean logic mixed with Pandas slices the data into the segments we want.
Slicing based on conditions
# Rows past 2015
df[df.Year > 2015].head(3)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
2640 | 2016 | Alabama | ... | 0.00 | 0.00 | 240.007167 | 0.00 | 0.00 |
2641 | 2016 | Alaska | 9.75 | 9.75 | 9.75 | 240.007167 | 10.17 | 10.17 |
2642 | 2016 | Arizona | 8.05 | 8.05 | 8.05 | 240.007167 | 8.40 | 8.40 |
Segmenting on multiple conditions is also desirable. Most programs allow for an “AND” operator and an “OR” operator; in Pandas, these are & and |.
# California AND 2010
print(df[(df.Year == 2010) & (df.State == 'California')].head(3))
print('\n')
Year State Table_Data High_Value Low_Value CPI_Average \
2314 2010 California 8 8.0 8.0 218.0555
High_2018 Low_2018
2314 9.19 9.19
# Alabama OR before 2015
df[(df.State == "Alabama") | (df.Year < 2015)].head(3)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00 | 0.000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10 | 2.100 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66 | 0.468 | 34.783333 | 4.75 | 3.37 |
The traditional Python index slicers are also applicable to DataFrames with the .iloc[] method, which selects rows by their integer position.
print(df.iloc[99])
print('\n')
print(df.iloc[[1, 50, 300]])
Year 1969
State South Dakota
Table_Data 17.00 - 20.00/wk
High_Value 0.5
Low_Value 0.425
CPI_Average 36.683333
High_2018 3.41
Low_2018 2.9
Name: 99, dtype: object
Year State Table_Data High_Value Low_Value CPI_Average \
1 1968 Alaska 2.1 2.1 2.10 34.783333
50 1968 Virginia ... 0.0 0.00 34.783333
300 1973 Minnesota .75 - 1.60 1.6 0.75 44.400000
High_2018 Low_2018
1 15.12 15.12
50 0.00 0.00
300 9.02 4.23
The .loc[] method is also highly useful for subsetting rows and columns at the same time.
df.loc[1:3, 'State']
1 Alaska
2 Arizona
3 Arkansas
Name: State, dtype: object
df.loc[:2, ['Year', 'State']]
 | Year | State |
---|---|---|
0 | 1968 | Alabama |
1 | 1968 | Alaska |
2 | 1968 | Arizona |
Manipulating Data
Summary Statistics
Exporting summary stats in a different document type can help visualize results and understand the data. Also note the different steps we can take to print summary stats in a more visually intuitive way.
df.describe()
 | Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|
count | 2750.000000 | 2739.000000 | 2739.000000 | 2750.000000 | 2739.000000 | 2739.000000 |
mean | 1992.500000 | 3.653761 | 3.533555 | 138.828983 | 6.441486 | 6.200252 |
std | 14.433494 | 2.560308 | 2.539424 | 65.823807 | 3.025140 | 3.017818 |
min | 1968.000000 | 0.000000 | 0.000000 | 34.783333 | 0.000000 | 0.000000 |
25% | 1980.000000 | 1.600000 | 1.600000 | 82.408333 | 5.980000 | 5.230000 |
50% | 1992.500000 | 3.350000 | 3.350000 | 142.387500 | 7.370000 | 7.170000 |
75% | 2005.000000 | 5.150000 | 5.150000 | 195.291667 | 8.280000 | 8.070000 |
max | 2017.000000 | 11.500000 | 11.500000 | 245.119583 | 15.120000 | 15.120000 |
np.round(df.describe(), 2)
 | Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|
count | 2750.00 | 2739.00 | 2739.00 | 2750.00 | 2739.00 | 2739.00 |
mean | 1992.50 | 3.65 | 3.53 | 138.83 | 6.44 | 6.20 |
std | 14.43 | 2.56 | 2.54 | 65.82 | 3.03 | 3.02 |
min | 1968.00 | 0.00 | 0.00 | 34.78 | 0.00 | 0.00 |
25% | 1980.00 | 1.60 | 1.60 | 82.41 | 5.98 | 5.23 |
50% | 1992.50 | 3.35 | 3.35 | 142.39 | 7.37 | 7.17 |
75% | 2005.00 | 5.15 | 5.15 | 195.29 | 8.28 | 8.07 |
max | 2017.00 | 11.50 | 11.50 | 245.12 | 15.12 | 15.12 |
print(np.round(df.describe(), 2).T)
df.describe().transpose().to_csv('exports/summary_stats.csv', sep=',')
count mean std min 25% 50% 75% \
Year 2750.0 1992.50 14.43 1968.00 1980.00 1992.50 2005.00
High_Value 2739.0 3.65 2.56 0.00 1.60 3.35 5.15
Low_Value 2739.0 3.53 2.54 0.00 1.60 3.35 5.15
CPI_Average 2750.0 138.83 65.82 34.78 82.41 142.39 195.29
High_2018 2739.0 6.44 3.03 0.00 5.98 7.37 8.28
Low_2018 2739.0 6.20 3.02 0.00 5.23 7.17 8.07
max
Year 2017.00
High_Value 11.50
Low_Value 11.50
CPI_Average 245.12
High_2018 15.12
Low_2018 15.12
Creating a new variable
The standard way to create a new Pandas column or replace an old one is to call the Series (whether it exists or not) on the left hand side and set it equal to the expression for the value you want to create. For example:
df['7'] = 7
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 7
1 15.12 15.12 7
2 4.75 3.37 7
3 1.12 1.12 7
4 11.88 11.88 7
Or, we can replace an old variable in a similar way.
Replacing a variable
df['7'] = 8
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 8
1 15.12 15.12 8
2 4.75 3.37 8
3 1.12 1.12 8
4 11.88 11.88 8
We can also rename variables. In this case we will rename Year to Date and Table_Data to Values.
Renaming Variables
={"Year": "Date", "Table_Data": "Values"}).head() df.rename(columns
 | Date | State | Values | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | 7 |
---|---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 | 8 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 | 8 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 | 8 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 | 8 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 | 8 |
Dropping Variables
We can also drop a variable with df.drop(). This is one of many functions where you’ll need to specify an axis: axis=0 indicates that you want to apply the operation along the rows, whereas axis=1 indicates that you want to apply it to a column.
# Dropping a variable
= df.drop("7", axis=1)
df print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018
0 0.00 0.00
1 15.12 15.12
2 4.75 3.37
3 1.12 1.12
4 11.88 11.88
Arithmetic Operations
Basic math operations are also easily applied in Pandas.
df['Difference'] = df['High_Value'] - df['Low_Value']
df['High*2'] = df['High_Value']*2
df['Low*2'] = df['Low_Value']*2
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 Difference High*2 Low*2
0 0.00 0.00 0.000 0.0000 0.0000
1 15.12 15.12 0.000 4.2000 4.2000
2 4.75 3.37 0.192 1.3200 0.9360
3 1.12 1.12 0.000 0.3125 0.3125
4 11.88 11.88 0.000 3.3000 3.3000
Complex Operations with .map()
More complex operations can be handled through the .map() method. In the example below, .map() uses a dictionary to change values, which can help keep code clean when replacing a large number of values. Here we create an abbrev column for state abbreviations (we will create it and drop the variable after).
# Using Data.map
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX', 'Alabama': 'AL'}

df['abbrev'] = df['State'].map(state_2)
df[['State', 'abbrev']].head()
 | State | abbrev |
---|---|---|
0 | Alabama | AL |
1 | Alaska | NaN |
2 | Arizona | NaN |
3 | Arkansas | NaN |
4 | California | CA |
# Another simpler map example
alist = [1,2,3,4]
def function_to_be_mapped(x):
    return x+3
result = map(function_to_be_mapped, alist)
print(list(result))
[4, 5, 6, 7]
# Sort by the new column, then drop it
df.sort_values('abbrev', ascending=True).head()
df = df.drop("abbrev", axis=1)
Using Functions
You can use .apply() to apply a function to a Series. You can either specify the function or use the anonymous lambda function.

Specifying a function:
def add(x):
    x = x + 1
    return x
print(df['Year'].head())
print("\n")
print(df['Year'].apply(add).head())
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Lambda Functions
Lambda functions let you skip defining a named function when the operation is simple enough:
print(df['Year'].head())
print("\n")
print((df['Year'].apply(lambda x: x + 1).head()))
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Missing Values
Dealing with missing values is an important part of data cleaning and something to always be mindful of. Pandas codes missing values as numpy NaN values.
print(df['Table_Data'].head(10))
print("\n")
print(df['Year'].head(10))
0 NaN
1 2.1
2 18.72 - 26.40/wk(b)
3 1.25/day(b)
4 1.65(b)
5 1.00 - 1.25(b)
6 1.4
7 1.25
8 1.25 - 1.40
9 $1.15 & $1.60
Name: Table_Data, dtype: object
0 1968
1 1968
2 1968
3 1968
4 1968
5 1968
6 1968
7 1968
8 1968
9 1968
Name: Year, dtype: int64
The following syntax can help identify missing values:
print(df['Table_Data'].isnull().values.any())
print(df['Year'].isnull().values.any())
True
False
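Once identified, missing values can be dropped or filled. A minimal sketch using the standard Pandas methods (the fill value 0 is arbitrary):

# Drop every row that contains at least one missing value
df_complete = df.dropna()
# Or fill missing values with a constant instead
df_filled = df.fillna(0)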
Duplicate Values
We can use .drop_duplicates() to drop duplicate observations: keep="first" keeps only the first occurrence, while keep="last" (used below) keeps only the last.
data = {'col_1': [3, 3, 3, 3], 'col_2': ['a', 'a', 'a', 'a']}
data_dup = pd.DataFrame.from_dict(data)
print(data_dup)
print("\n")
data_dup = data_dup.drop_duplicates(keep="last")
print(data_dup)
col_1 col_2
0 3 a
1 3 a
2 3 a
3 3 a
col_1 col_2
3 3 a
Reshaping Data
Data can be reshaped using a variety of Pandas functions. Going from wide to long has a built-in function, wide_to_long(). We will load in a new dataset for this:
df_ex = pd.DataFrame({
    'Unique Family Identifier': [1000, 1000, 1000, 1001, 1001, 1001, 1002, 324, 234],
    'Order': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'az1': [28, 82, 23, 234, 234, 324, 546, 546, 5464],
    'az2': [2342, 2342, 54645, 56765, 65765, 65756, 3453, 56756, 3453]})
reshape = pd.wide_to_long(df_ex, stubnames='az', i=['Unique Family Identifier', 'Order'], j='age')

df_ex.head()
 | Unique Family Identifier | Order | az1 | az2 |
---|---|---|---|---|
0 | 1000 | 1 | 28 | 2342 |
1 | 1000 | 2 | 82 | 2342 |
2 | 1000 | 3 | 23 | 54645 |
3 | 1001 | 1 | 234 | 56765 |
4 | 1001 | 2 | 234 | 65765 |
reshape
Unique Family Identifier | Order | age | az |
---|---|---|---|
1000 | 1 | 1 | 28 |
 | | 2 | 2342 |
 | 2 | 1 | 82 |
 | | 2 | 2342 |
 | 3 | 1 | 23 |
 | | 2 | 54645 |
1001 | 1 | 1 | 234 |
 | | 2 | 56765 |
 | 2 | 1 | 234 |
 | | 2 | 65765 |
 | 3 | 1 | 324 |
 | | 2 | 65756 |
1002 | 1 | 1 | 546 |
 | | 2 | 3453 |
324 | 2 | 1 | 546 |
 | | 2 | 56756 |
234 | 3 | 1 | 5464 |
 | | 2 | 3453 |
Going from long to wide requires the use of unstack().
normal = reshape.unstack()
normal.columns = normal.columns.map('{0[0]}{0[1]}'.format)

normal.reset_index()
normal.head()
Unique Family Identifier | Order | az1 | az2 |
---|---|---|---|
234 | 3 | 5464 | 3453 |
324 | 2 | 546 | 56756 |
1000 | 1 | 28 | 2342 |
 | 2 | 82 | 2342 |
 | 3 | 23 | 54645 |
Merging Data
Merging data uses pd.merge(data_1, data_2, on=identifier).
left_frame = pd.DataFrame({'key': range(5),
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
                            'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)
key left_value
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
key right_value
0 2 f
1 3 g
2 4 h
3 5 i
4 6 j
pd.merge(left_frame, right_frame, on='key', how='inner')
 | key | left_value | right_value |
---|---|---|---|
0 | 2 | c | f |
1 | 3 | d | g |
2 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='left')
 | key | left_value | right_value |
---|---|---|---|
0 | 0 | a | NaN |
1 | 1 | b | NaN |
2 | 2 | c | f |
3 | 3 | d | g |
4 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='right')
 | key | left_value | right_value |
---|---|---|---|
0 | 2 | c | f |
1 | 3 | d | g |
2 | 4 | e | h |
3 | 5 | NaN | i |
4 | 6 | NaN | j |
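For completeness, an outer join keeps the union of keys from both frames and fills the gaps with NaN; a quick sketch:

# Outer join: keep all keys from both frames
pd.merge(left_frame, right_frame, on='key', how='outer')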
Appending Data
Note: Here, axis=0 is implied as the default, so it will append along the rows. But columns or Series can also be appended to each other by specifying axis=1 (provided that they are the same length).
pd.concat([left_frame, right_frame], sort=True)
 | key | left_value | right_value |
---|---|---|---|
0 | 0 | a | NaN |
1 | 1 | b | NaN |
2 | 2 | c | NaN |
3 | 3 | d | NaN |
4 | 4 | e | NaN |
0 | 2 | NaN | f |
1 | 3 | NaN | g |
2 | 4 | NaN | h |
3 | 5 | NaN | i |
4 | 6 | NaN | j |
Collapsing Data
Collapsing data is accomplished with the .groupby() function. We will collapse by Year.
by_year = df.groupby('Year')
print(by_year.count().head())  # non-null records within each column
State Table_Data High_Value Low_Value CPI_Average High_2018 \
Year
1968 55 53 54 54 55 54
1969 55 54 54 54 55 54
1970 55 54 54 54 55 54
1971 55 54 54 54 55 54
1972 55 54 54 54 55 54
Low_2018 Difference High*2 Low*2
Year
1968 54 54 54 54
1969 54 54 54 54
1970 54 54 54 54
1971 54 54 54 54
1972 54 54 54 54
Note: This only generates counts for each year. If we want averages, sums, or any other summary statistics, we have to specify that:
print(by_year.sum()[20:25])
print('\n')
print(by_year.mean()[20:25])
print('\n')
print(by_year.median()[20:25])
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 146.75 142.75 6504.208331 310.65 302.19 4.0
1989 146.75 142.75 6818.166669 296.49 288.41 4.0
1990 146.75 142.75 7186.208332 281.27 273.61 4.0
1991 179.20 175.00 7490.541669 329.52 321.80 4.2
1992 194.17 189.57 7717.416669 346.40 338.19 4.6
High*2 Low*2
Year
1988 293.50 285.50
1989 293.50 285.50
1990 293.50 285.50
1991 358.40 350.00
1992 388.34 379.14
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 2.668182 2.595455 118.258333 5.648182 5.494364 0.072727
1989 2.668182 2.595455 123.966667 5.390727 5.243818 0.072727
1990 2.668182 2.595455 130.658333 5.114000 4.974727 0.072727
1991 3.258182 3.181818 136.191667 5.991273 5.850909 0.076364
1992 3.530364 3.446727 140.316667 6.298182 6.148909 0.083636
High*2 Low*2
Year
1988 5.336364 5.190909
1989 5.336364 5.190909
1990 5.336364 5.190909
1991 6.516364 6.363636
1992 7.060727 6.893455
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 3.35 3.35 118.258333 7.09 7.09 0.0
1989 3.35 3.35 123.966667 6.77 6.77 0.0
1990 3.35 3.35 130.658333 6.42 6.42 0.0
1991 3.80 3.80 136.191667 6.99 6.99 0.0
1992 4.25 4.25 140.316667 7.58 7.58 0.0
High*2 Low*2
Year
1988 6.7 6.7
1989 6.7 6.7
1990 6.7 6.7
1991 7.6 7.6
1992 8.5 8.5
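A grouped object can also compute several of these statistics in one pass with .agg(); a brief sketch using two of the columns above:

# Mean and median of two columns for each year, in a single call
print(by_year[['High_2018', 'Low_2018']].agg(['mean', 'median']).head())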