Intro to Pandas#
While built-in Python functions suffice for general programming, data analysis requires additional functions and objects. Pandas is a popular data analysis package that is simple to pick up as soon as you start using Python.
The primary reason Pandas is popular for data science and analysis is that it introduces three useful objects that mirror similar data structures from other statistical packages and do not add complexity to Python's simple syntax. These objects are:
The DataFrame
The Series
The Index
This guide is not meant to be comprehensive; the most comprehensive documentation for Pandas is found here. The purpose of this guide is to give a potential user an overview of Pandas' functionality and of how to accomplish basic data analysis tasks.
Getting Started#
Importing Pandas#
Since Pandas is not part of Python's standard library, you may have to install it (for example, with pip install pandas). Fortunately, Pandas has grown so much in popularity that most distributions (including Anaconda) already include it, but you will still have to load the package. This code will do the trick:
import pandas
However, because Pandas is used so often, the Python community conventionally imports pandas as pd for ease of access:
import pandas as pd
It’s also useful to import a few other packages, most notably NumPy.
import numpy as np
DataFrame#
The DataFrame object is one of the main contributions of Pandas. The DataFrame is a 2-dimensional labeled structure with columns and rows; the columns and the rows each represent one dimension. The DataFrame is analogous to R's data frame, a Stata dataset, and an Excel spreadsheet. Put in more technical terms, the DataFrame is a tabular data structure. The code below defines a DataFrame and then prints it out to view:
d = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})
d
one | two | |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
The name of our DataFrame is d
, and it has two columns, one
and two
, and four rows, 0-3. (Python is 0 indexed, meaning it counts from 0…1…2…and so on.)
Each data point can be referenced through its Index (which corresponds to its row and is the leftmost value), while the column signifies what the value means. We can call a single Pandas column with this command:
d['one']
0 1.0
1 2.0
2 3.0
3 4.0
Name: one, dtype: float64
The values on the left represent the Index we saw earlier. Notice that a single column is itself a Pandas object: the Series. More on this object below.
Series#
A Series is a one-dimensional indexed array that can contain any Python data type. To create a Series, you use the pd.Series() function:
series_ex = pd.Series([1,2,3,4])
A Series in Pandas looks similar to a list, but there are key distinctions in how they operate. As mentioned, Pandas is designed for data analysis, so a Series has functions and element-wise behavior that make analysis easy, while a list would require either a for loop or a list comprehension for the same operations. An example of this is below:
series_ex = series_ex*2
print("This is a Series multiplied by two")
print(series_ex)
This is a Series multiplied by two
0 2
1 4
2 6
3 8
dtype: int64
alist = [1,2,3,4]
alist = alist*2
print("This is a List multiplied by two")
print(alist)
This is a List multiplied by two
[1, 2, 3, 4, 1, 2, 3, 4]
Index#
Both the Series and the DataFrame have an Index that signifies order and allows for referencing specific points. The Index itself is an object.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Index([2, 3, 5, 7, 11], dtype='int64')
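An Index can also label the rows of a Series or DataFrame. As a small sketch (the labels here are made up for illustration), we can pass an index when constructing a Series and then select values by label:
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(labeled['b'])   # selects by label and prints 20
print(labeled.index)  # Index(['a', 'b', 'c'], dtype='object')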
Importing Data#
Pandas has the ability to read and export multiple data formats: CSV, DTA (Stata), SAS, JSON, SQL, and many more. Almost all data reading follows the format:
pd.read_(NAME OF DATA FORMAT)('FILE_NAME')
Let’s take a look at a few examples.
Stata Files (DTA)#
# Importing data from Stata
pd.read_stata('data/State_ETO_short.dta')
state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat | |
---|---|---|---|---|---|---|---|
0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
This is how we read in a DataFrame, but we still need to store and name it. This can be accomplished in one line:
df = pd.read_stata('data/State_ETO_short.dta')
Now, when we call df
, we’ll get the DataFrame that corresponds to the data referenced in pd.read_stata('data/State_ETO_short.dta')
.
df
state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat | |
---|---|---|---|---|---|---|---|
0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
Excel Files (XLSX)#
(Note: May need to pip install openpyxl
for this to work!)
# Importing data from excel into pandas
df = pd.read_excel('data/Minimum_Wage_Data_Short.xlsx')
df
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25000 | 1.00000 | 34.783333 | 9.00 | 7.20 |
Other Data Types#
Similar read functions exist for each common data type - CSV, SAS, JSON, and so on - as sketched below.
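For instance, a quick sketch (the file names here are placeholders rather than files that ship with this guide):
# Hypothetical examples of other readers
df_csv = pd.read_csv('data/some_file.csv')        # comma-separated values
df_json = pd.read_json('data/some_file.json')     # JSON records
df_sas = pd.read_sas('data/some_file.sas7bdat')   # SAS dataset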
Exporting Data#
Exporting data is very simple as well and follows a pattern similar to importing.
CSV#
df.to_csv('exports/Minimum_Wage_Data.csv')
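Other writers follow the same pattern. As a sketch (the export paths are illustrative, and writing Excel files may also require openpyxl):
df.to_excel('exports/Minimum_Wage_Data.xlsx')  # Excel workbook
df.to_stata('exports/Minimum_Wage_Data.dta')   # Stata dataset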
Converting Data#
Now that we know how to load in data, it will be useful to examine ways to convert already existing data structures into DataFrames.
List#
my_list = [1,2,3,4,5,6,7,8,9]
columns = ['a', 'b', 'c']
pd.DataFrame(np.array(my_list).reshape(3,3), columns = columns)
a | b | c | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
An important thing to note: pd.DataFrame has a multitude of options. The most important of these is the first argument, which is the data to be transformed. Here we transform the list [1,2,3,4,5,6,7,8,9] into an np.array with the shape:
[[1,2,3],
 [4,5,6],
 [7,8,9]]
Then we transform the array into a Pandas DataFrame, which gives us:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
Finally, we add a list of column names with the option columns = columns to get the final DataFrame:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Dictionary#
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
col_1 | col_2 | |
---|---|---|
0 | 3 | a |
1 | 2 | b |
2 | 1 | c |
3 | 0 | d |
Numpy Array#
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(5, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
pd.DataFrame(values, index=index).head()
Col1 | Col2 | Col3 | |
---|---|---|---|
Row1 | 0 | 0.0 | 0.0 |
Row2 | 0 | 0.0 | 0.0 |
Row3 | 0 | 0.0 | 0.0 |
Row4 | 0 | 0.0 | 0.0 |
Row5 | 0 | 0.0 | 0.0 |
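Going the other way is just as easy. As a quick sketch, .to_numpy() returns the underlying NumPy array of a DataFrame (the small frame here is made up for illustration):
arr = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_numpy()
print(arr)  # [[1 3]
            #  [2 4]]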
Now that we know how to load in our DataFrame, we will try to view and manipulate our data before our analysis is run.
Viewing Data#
We know that we can view our data by simply printing the dataframe, but what if our data is too large?
The .head() method displays the first 5 rows by default (inside the parentheses you can specify the first N observations you want to see).
df.head()
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
df.head(2)
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.0 | 0.0 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.1 | 2.1 | 34.783333 | 15.12 | 15.12 |
You can also use all of these methods on a single Series.
df['Year'].head()
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
Or, you can do multiple columns at one time.
df[['State', 'Table_Data']].head()
State | Table_Data | |
---|---|---|
0 | Alabama | NaN |
1 | Alaska | 2.1 |
2 | Arizona | 18.72 - 26.40/wk(b) |
3 | Arkansas | 1.25/day(b) |
4 | California | 1.65(b) |
It’s also good to view general information on the DataFrame with .info() and to understand the data types of each of the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 6 non-null int64
1 State 6 non-null object
2 Table_Data 5 non-null object
3 High_Value 6 non-null float64
4 Low_Value 6 non-null float64
5 CPI_Average 6 non-null float64
6 High_2018 6 non-null float64
7 Low_2018 6 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 512.0+ bytes
df.dtypes
Year int64
State object
Table_Data object
High_Value float64
Low_Value float64
CPI_Average float64
High_2018 float64
Low_2018 float64
dtype: object
Similar to our use of head()
, we can use tail()
to examine the last few data points.
df.tail(2)
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
4 | 1968 | California | 1.65(b) | 1.65 | 1.65 | 34.783333 | 11.88 | 11.88 |
5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25 | 1.00 | 34.783333 | 9.00 | 7.20 |
Subsetting Data#
Slicing the data by rows#
We can do traditional slicing through the index with [start:end], where the end point is excluded. This will subset by rows.
df[1:4]
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
Slicing the data by columns#
To subset by columns, we can use df[['columns_we_want_1', 'columns_we_want_2']]
.
columns_you_want = ['State', 'Table_Data']
df[columns_you_want].head()
State | Table_Data | |
---|---|---|
0 | Alabama | NaN |
1 | Alaska | 2.1 |
2 | Arizona | 18.72 - 26.40/wk(b) |
3 | Arkansas | 1.25/day(b) |
4 | California | 1.65(b) |
Let’s bring in a larger dataset for this analysis to be meaningful. This will be a panel dataset of states, with a row for each state in every year.
df = pd.read_excel("data/Minimum_Wage_Data.xlsx")
print("These are the first five observations: \n")
print(df[['Year', 'State']].head(5))
print("\n")
print("These are the last five observations: \n")
print(df[['Year', 'State']].tail(5))
These are the first five observations:
Year State
0 1968 Alabama
1 1968 Alaska
2 1968 Arizona
3 1968 Arkansas
4 1968 California
These are the last five observations:
Year State
2745 2017 Virginia
2746 2017 Washington
2747 2017 West Virginia
2748 2017 Wisconsin
2749 2017 Wyoming
Viewing the size and columns of the data:
print(len(df))
print(df.columns)
2750
Index(['Year', 'State', 'Table_Data', 'High_Value', 'Low_Value', 'CPI_Average',
'High_2018', 'Low_2018'],
dtype='object')
We have 8 variables: Year
, State
, Table_Data
, High_Value
, Low_Value
, CPI_Average
, High_2018
, and Low_2018
. Sometimes in our analysis we only want to keep certain years. For this, traditional boolean logic combined with Pandas slicing segments the data the way we want.
Slicing based on conditions#
# Rows past 2015
df[df.Year > 2015].head(3)
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
2640 | 2016 | Alabama | ... | 0.00 | 0.00 | 240.007167 | 0.00 | 0.00 |
2641 | 2016 | Alaska | 9.75 | 9.75 | 9.75 | 240.007167 | 10.17 | 10.17 |
2642 | 2016 | Arizona | 8.05 | 8.05 | 8.05 | 240.007167 | 8.40 | 8.40 |
Segmenting on multiple conditions is also often needed. Most programs allow for an “AND” operator and an “OR” operator; in Pandas these are & and |, with each condition wrapped in parentheses.
# California AND 2010
print(df[(df.Year == 2010) & (df.State == 'California')].head(3))
print('\n')
Year State Table_Data High_Value Low_Value CPI_Average \
2314 2010 California 8 8.0 8.0 218.0555
High_2018 Low_2018
2314 9.19 9.19
# Alabama OR before 2015
df[(df.State == "Alabama") | (df.Year < 2015)].head(3)
Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00 | 0.000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10 | 2.100 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66 | 0.468 | 34.783333 | 4.75 | 3.37 |
The traditional Python index slicers are also applicable to DataFrames with the .iloc[]
method, which selects rows by integer position.
print(df.iloc[99])
print('\n')
print(df.iloc[[1, 50, 300]])
Year 1969
State South Dakota
Table_Data 17.00 - 20.00/wk
High_Value 0.5
Low_Value 0.425
CPI_Average 36.683333
High_2018 3.41
Low_2018 2.9
Name: 99, dtype: object
Year State Table_Data High_Value Low_Value CPI_Average \
1 1968 Alaska 2.1 2.1 2.10 34.783333
50 1968 Virginia ... 0.0 0.00 34.783333
300 1973 Minnesota .75 - 1.60 1.6 0.75 44.400000
High_2018 Low_2018
1 15.12 15.12
50 0.00 0.00
300 9.02 4.23
The .loc[]
method, which selects by label, is also highly useful for subsetting rows and columns at the same time.
df.loc[1:3, 'State']
1 Alaska
2 Arizona
3 Arkansas
Name: State, dtype: object
df.loc[:2, ['Year', 'State']]
Year | State | |
---|---|---|
0 | 1968 | Alabama |
1 | 1968 | Alaska |
2 | 1968 | Arizona |
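.loc[] also accepts a boolean condition for the rows, which makes it easy to filter rows and pick columns in one step. A quick sketch using the minimum wage DataFrame loaded above:
df.loc[df.Year > 2015, ['State', 'High_2018']].head()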
Manipulating Data#
Summary Statistics#
Exporting summary statistics in a different document type can help you visualize results and understand the data. Also note the different steps we can take to print summary statistics in a more visually intuitive way.
df.describe()
Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|
count | 2750.000000 | 2739.000000 | 2739.000000 | 2750.000000 | 2739.000000 | 2739.000000 |
mean | 1992.500000 | 3.653761 | 3.533555 | 138.828983 | 6.441486 | 6.200252 |
std | 14.433494 | 2.560308 | 2.539424 | 65.823807 | 3.025140 | 3.017818 |
min | 1968.000000 | 0.000000 | 0.000000 | 34.783333 | 0.000000 | 0.000000 |
25% | 1980.000000 | 1.600000 | 1.600000 | 82.408333 | 5.980000 | 5.230000 |
50% | 1992.500000 | 3.350000 | 3.350000 | 142.387500 | 7.370000 | 7.170000 |
75% | 2005.000000 | 5.150000 | 5.150000 | 195.291667 | 8.280000 | 8.070000 |
max | 2017.000000 | 11.500000 | 11.500000 | 245.119583 | 15.120000 | 15.120000 |
np.round(df.describe(), 2)
Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | |
---|---|---|---|---|---|---|
count | 2750.00 | 2739.00 | 2739.00 | 2750.00 | 2739.00 | 2739.00 |
mean | 1992.50 | 3.65 | 3.53 | 138.83 | 6.44 | 6.20 |
std | 14.43 | 2.56 | 2.54 | 65.82 | 3.03 | 3.02 |
min | 1968.00 | 0.00 | 0.00 | 34.78 | 0.00 | 0.00 |
25% | 1980.00 | 1.60 | 1.60 | 82.41 | 5.98 | 5.23 |
50% | 1992.50 | 3.35 | 3.35 | 142.39 | 7.37 | 7.17 |
75% | 2005.00 | 5.15 | 5.15 | 195.29 | 8.28 | 8.07 |
max | 2017.00 | 11.50 | 11.50 | 245.12 | 15.12 | 15.12 |
print(np.round(df.describe(), 2).T)
df.describe().transpose().to_csv('exports/summary_stats.csv', sep=',')
count mean std min 25% 50% 75% \
Year 2750.0 1992.50 14.43 1968.00 1980.00 1992.50 2005.00
High_Value 2739.0 3.65 2.56 0.00 1.60 3.35 5.15
Low_Value 2739.0 3.53 2.54 0.00 1.60 3.35 5.15
CPI_Average 2750.0 138.83 65.82 34.78 82.41 142.39 195.29
High_2018 2739.0 6.44 3.03 0.00 5.98 7.37 8.28
Low_2018 2739.0 6.20 3.02 0.00 5.23 7.17 8.07
max
Year 2017.00
High_Value 11.50
Low_Value 11.50
CPI_Average 245.12
High_2018 15.12
Low_2018 15.12
Creating a new variable#
The standard way to create a new Pandas column, or to replace an old one, is to refer to the Series on the left-hand side (whether it exists yet or not) and set it equal to the expression that produces the values you want. For example:
df['7'] = 7
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 7
1 15.12 15.12 7
2 4.75 3.37 7
3 1.12 1.12 7
4 11.88 11.88 7
Or, we can replace an old variable in a similar way.
Replacing a variable#
df['7'] = 8
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 8
1 15.12 15.12 8
2 4.75 3.37 8
3 1.12 1.12 8
4 11.88 11.88 8
We can also rename variables. In this case we will rename Year
to Date
and Table_Data
to Values.
Renaming Variables#
df.rename(columns={"Year": "Date", "Table_Data": "Values"}).head()
Date | State | Values | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | 7 | |
---|---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 | 8 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 | 8 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 | 8 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 | 8 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 | 8 |
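One thing to keep in mind: .rename() returns a new DataFrame rather than modifying df in place. To keep the renamed columns you would assign the result; a small sketch (stored under the new name df_renamed so the rest of this guide can keep using the original column names):
df_renamed = df.rename(columns={"Year": "Date", "Table_Data": "Values"})
print(df_renamed.columns)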
Dropping Variables#
We can also drop a variable with df.drop()
. This is one of many functions where you’ll need to specify an axis. axis=0
would indicate that you want to apply the operation along the rows, whereas axis=1
indicates that we want to apply this operation to a column.
# Dropping a variable
df = df.drop("7", axis=1)
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018
0 0.00 0.00
1 15.12 15.12
2 4.75 3.37
3 1.12 1.12
4 11.88 11.88
Arithmetic Operations#
Basic math operations are also easily applied in Pandas.
df['Difference'] = df['High_Value'] - df['Low_Value']
df['High*2'] = df['High_Value']*2
df['Low*2'] = df['Low_Value']*2
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 Difference High*2 Low*2
0 0.00 0.00 0.000 0.0000 0.0000
1 15.12 15.12 0.000 4.2000 4.2000
2 4.75 3.37 0.192 1.3200 0.9360
3 1.12 1.12 0.000 0.3125 0.3125
4 11.88 11.88 0.000 3.3000 3.3000
Complex Operations with .map()
#
More complex operations can be solved through the .map()
method.
In the example below, .map()
uses a dictionary to change results, which can help keep code clean when replacing a large amount of values. Here we create an abbrev
column for state abbreviations (we will create it and drop the variable after).
# Using Data.map
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX', 'Alabama': 'AL'}
df['abbrev'] = df['State'].map(state_2)
df[['State', 'abbrev']].head()
State | abbrev | |
---|---|---|
0 | Alabama | AL |
1 | Alaska | NaN |
2 | Arizona | NaN |
3 | Arkansas | NaN |
4 | California | CA |
# Another simpler example using Python's built-in map
alist = [1,2,3,4]
def function_to_be_mapped(x):
    return x + 3
result = map(function_to_be_mapped, alist)
print(list(result))
[4, 5, 6, 7]
df.sort_values('abbrev', ascending=True).head()
df = df.drop("abbrev", axis=1)
Using Functions#
You can use .apply()
to apply a function to a Series. You can either specify a named function or use an anonymous lambda function:
Specifying a function:
def add(x):
    x = x + 1
    return x
print(df['Year'].head())
print("\n")
print(df['Year'].apply(add).head())
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Lambda Functions#
Lambda functions let you skip defining a named function when the operation is simple enough:
print(df['Year'].head())
print("\n")
print((df['Year'].apply(lambda x: x + 1).head()))
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
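.apply() also works on a whole DataFrame; with axis=1 the function receives one row at a time, which is handy when a calculation uses several columns. A brief sketch that recomputes the Difference column from the arithmetic section above (and then drops the temporary column):
df['Difference_check'] = df.apply(lambda row: row['High_Value'] - row['Low_Value'], axis=1)
print(df[['High_Value', 'Low_Value', 'Difference_check']].head())
df = df.drop('Difference_check', axis=1)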
Missing Values#
Dealing with missing values is an important part of data cleaning and something to always be mindful of. Pandas codes missing values as numpy NaN
values.
print(df['Table_Data'].head(10))
print("\n")
print(df['Year'].head(10))
0 NaN
1 2.1
2 18.72 - 26.40/wk(b)
3 1.25/day(b)
4 1.65(b)
5 1.00 - 1.25(b)
6 1.4
7 1.25
8 1.25 - 1.40
9 $1.15 & $1.60
Name: Table_Data, dtype: object
0 1968
1 1968
2 1968
3 1968
4 1968
5 1968
6 1968
7 1968
8 1968
9 1968
Name: Year, dtype: int64
The following syntax can help identify missing values:
print(df['Table_Data'].isnull().values.any())
print(df['Year'].isnull().values.any())
True
False
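Once missing values are identified, common next steps are to count them, drop the affected rows, or fill them in. A short sketch (none of these lines reassign df, so the original data is left untouched):
print(df['Table_Data'].isnull().sum())            # number of missing values in the column
print(df.dropna().shape)                          # shape after dropping rows with any NaN
print(df['Table_Data'].fillna('missing').head())  # fill NaNs with a placeholder string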
Duplicate Values#
We can use .drop_duplicates() to drop duplicate rows. The keep argument controls which observation is retained: keep="first" keeps the first occurrence of each duplicate, while keep="last" (used in the example below) keeps the last.
data = {'col_1': [3, 3, 3, 3], 'col_2': ['a', 'a', 'a', 'a']}
data_dup = pd.DataFrame.from_dict(data)
print(data_dup)
print("\n")
data_dup = data_dup.drop_duplicates(keep="last")
print(data_dup)
col_1 col_2
0 3 a
1 3 a
2 3 a
3 3 a
col_1 col_2
3 3 a
Reshaping Data#
Data can be reshaped using a variety of Pandas functions. Pandas has a built-in function, wide_to_long(), for going from wide format to long format. We will load in a new dataset for this:
df_ex = pd.DataFrame({
'Unique Family Identifier': [1000, 1000, 1000, 1001, 1001, 1001, 1002, 324, 234],
'Order': [1, 2, 3, 1, 2, 3, 1, 2, 3],
'az1': [28, 82, 23, 234, 234, 324, 546, 546, 5464],
'az2': [2342, 2342, 54645, 56765, 65765, 65756, 3453, 56756, 3453]})
reshape = pd.wide_to_long(df_ex, stubnames='az', i=['Unique Family Identifier', 'Order'], j='age')
df_ex.head()
Unique Family Identifier | Order | az1 | az2 | |
---|---|---|---|---|
0 | 1000 | 1 | 28 | 2342 |
1 | 1000 | 2 | 82 | 2342 |
2 | 1000 | 3 | 23 | 54645 |
3 | 1001 | 1 | 234 | 56765 |
4 | 1001 | 2 | 234 | 65765 |
reshape
Unique Family Identifier | Order | age | az |
---|---|---|---|
1000 | 1 | 1 | 28 |
1000 | 1 | 2 | 2342 |
1000 | 2 | 1 | 82 |
1000 | 2 | 2 | 2342 |
1000 | 3 | 1 | 23 |
1000 | 3 | 2 | 54645 |
1001 | 1 | 1 | 234 |
1001 | 1 | 2 | 56765 |
1001 | 2 | 1 | 234 |
1001 | 2 | 2 | 65765 |
1001 | 3 | 1 | 324 |
1001 | 3 | 2 | 65756 |
1002 | 1 | 1 | 546 |
1002 | 1 | 2 | 3453 |
324 | 2 | 1 | 546 |
324 | 2 | 2 | 56756 |
234 | 3 | 1 | 5464 |
234 | 3 | 2 | 3453 |
Going from long to wide requires the use of unstack()
.
normal = reshape.unstack()
normal.columns = normal.columns.map('{0[0]}{0[1]}'.format)
normal.reset_index()
normal.head()
Unique Family Identifier | Order | az1 | az2 |
---|---|---|---|
234 | 3 | 5464 | 3453 |
324 | 2 | 546 | 56756 |
1000 | 1 | 28 | 2342 |
1000 | 2 | 82 | 2342 |
1000 | 3 | 23 | 54645 |
Merging Data#
Merging data uses pd.merge(data_1, data_2, on=identifier, how=...), where the how argument specifies the type of join (inner, left, right, or outer).
left_frame = pd.DataFrame({'key': range(5),
'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)
key left_value
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
key right_value
0 2 f
1 3 g
2 4 h
3 5 i
4 6 j
pd.merge(left_frame, right_frame, on='key', how='inner')
key | left_value | right_value | |
---|---|---|---|
0 | 2 | c | f |
1 | 3 | d | g |
2 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='left')
key | left_value | right_value | |
---|---|---|---|
0 | 0 | a | NaN |
1 | 1 | b | NaN |
2 | 2 | c | f |
3 | 3 | d | g |
4 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='right')
key | left_value | right_value | |
---|---|---|---|
0 | 2 | c | f |
1 | 3 | d | g |
2 | 4 | e | h |
3 | 5 | NaN | i |
4 | 6 | NaN | j |
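For completeness, an outer merge keeps the keys from both frames; any row without a match on the other side is filled with NaN. A quick sketch:
pd.merge(left_frame, right_frame, on='key', how='outer')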
Appending Data#
Note: Here, axis=0
is implied as the default, so it will append along the rows. But columns or Series can also be appended to each other by specifying axis=1
(provided that they are the same length).
pd.concat([left_frame, right_frame], sort=True)
key | left_value | right_value | |
---|---|---|---|
0 | 0 | a | NaN |
1 | 1 | b | NaN |
2 | 2 | c | NaN |
3 | 3 | d | NaN |
4 | 4 | e | NaN |
0 | 2 | NaN | f |
1 | 3 | NaN | g |
2 | 4 | NaN | h |
3 | 5 | NaN | i |
4 | 6 | NaN | j |
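To append side by side instead, pass axis=1. A small sketch: the frames are aligned on their index (here both are 0-4, so the columns simply line up; index values present in only one frame would produce NaN):
pd.concat([left_frame, right_frame], axis=1)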
Collapsing Data#
Collapsing data is accomplished by the .groupby()
function. We will collapse by Year
.
by_year = df.groupby('Year')
print(by_year.count().head()) # NOT NULL records within each column
State Table_Data High_Value Low_Value CPI_Average High_2018 \
Year
1968 55 53 54 54 55 54
1969 55 54 54 54 55 54
1970 55 54 54 54 55 54
1971 55 54 54 54 55 54
1972 55 54 54 54 55 54
Low_2018 Difference High*2 Low*2
Year
1968 54 54 54 54
1969 54 54 54 54
1970 54 54 54 54
1971 54 54 54 54
1972 54 54 54 54
Note: This only generates counts for each year. If we want averages, sums, or any other summary statistics, we have to specify that. Because the State and Table_Data columns contain strings, recent versions of Pandas raise a TypeError when asked to sum them, so we pass numeric_only=True to aggregate only the numeric columns:
print(by_year.sum(numeric_only=True)[20:25])
print('\n')
print(by_year.mean(numeric_only=True)[20:25])
print('\n')
print(by_year.median(numeric_only=True)[20:25])
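Alternatively, a brief sketch: select the numeric columns of interest before aggregating.
print(by_year[['High_2018', 'Low_2018']].mean()[20:25])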