Intro to Pandas
While built-in Python functions suffice for general programming, data analysis requires additional functions and objects. Pandas is a popular data analysis package that is simple to use from the moment you start writing Python.
The primary reason Pandas is popular for data science and analysis is that it introduces three useful objects that mirror similar data structures from other statistical packages and don’t add complexity to the simple Pythonic syntax. These objects are:
- The DataFrame
- The Series
- The Index
This guide is not meant to be comprehensive; the most complete documentation for Pandas is found here. The purpose of this guide is to introduce a potential user to the functionality of Pandas and give a general overview of how to accomplish basic data analysis tasks.
Getting Started
Importing Pandas
Since Pandas is not native to Python, you will have to install it. Fortunately, Pandas has grown so much in popularity that most distributions (including Anaconda) will contain Pandas, but you will still have to load the package. This code will do the trick:
import pandas
However, because Pandas is used so often, the Python community loads pandas as pd for ease of access:
import pandas as pd
It's also useful to import a few other packages, most notably NumPy.
import numpy as np
DataFrame
The DataFrame object is one of the main contributions of Pandas. The DataFrame is a 2-dimensional labeled structure with columns and rows; the columns and the rows each represent one dimension. The DataFrame is analogous to the R and Stata data frame and the Excel spreadsheet. Put in more technical terms, the DataFrame is a tabular data structure. The code below defines a DataFrame and then prints it out to view:
d = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})
d
| | one | two |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |
The name of our DataFrame is d, and it has two columns, one and two, and four rows, 0-3. (Python is 0 indexed, meaning it counts from 0…1…2…and so on.)
Each data point can be referenced through its Index (which corresponds to its row, the leftmost value), while the column signifies what the value means. We can call a single Pandas column with this command:
d['one']
0    1.0
1 2.0
2 3.0
3 4.0
Name: one, dtype: float64
The values on the left represent the Index we saw earlier. Notice: a single column is itself a Pandas object, the Series. More on this object below.
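We can confirm this by inspecting the column's type; a minimal sketch using the d DataFrame defined above:
# A single column is a Series, and individual values can be referenced by Index label
print(type(d['one']))  # <class 'pandas.core.series.Series'>
print(d['one'][2])     # 3.0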
Series
A Series is a one-dimensional indexed array that can contain any Python data type. To create a Series, you use the function:
series_ex = pd.Series([1,2,3,4])
A Series in Pandas is visually similar to a list, but there are key distinctions in how they operate. As mentioned, Pandas is built for data analysis, so a Series has functions that allow analysis to be done easily, while a list would require either a for loop or a list comprehension for the same operations. An example of this below:
series_ex = series_ex*2
print("This is a Series multiplied by two")
print(series_ex)
This is a Series multiplied by two
0 2
1 4
2 6
3 8
dtype: int64
alist = [1,2,3,4]
alist = alist*2
print("This is a List multiplied by two")
print(alist)
This is a List multiplied by two
[1, 2, 3, 4, 1, 2, 3, 4]
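A Series can also carry a custom Index, which enables label-based access; a minimal sketch (the labels here are invented for illustration):
# A Series with string labels instead of the default 0, 1, 2, ...
labeled = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(labeled['c'])  # 3 -- accessed by label rather than position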
Index
Both the Series and the DataFrame have an Index that signifies order and allows for referencing specific points. The Index itself is an object.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
Importing Data
Pandas can read and export multiple data formats: CSV, DTA, SAS files, JSON files, SQL, and many more. Almost all data reading will be in the format:
pd.read_(NAME OF DATA FORMAT)('FILE_NAME')
Let’s take a look at a few examples.
Stata Files (DTA)
# Importing data from Stata
pd.read_stata('data/State_ETO_short.dta')
| | state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat |
|---|---|---|---|---|---|---|---|
| 0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
| 1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
| 2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
| 3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
| 4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
This is how we read in a DataFrame, but we still need to store and name it. This can be accomplished in one line:
df = pd.read_stata('data/State_ETO_short.dta')
Now, when we call df, we'll get the DataFrame that corresponds to the data referenced in pd.read_stata('data/State_ETO_short.dta').
df
| | state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat |
|---|---|---|---|---|---|---|---|
| 0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
| 1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
| 2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
| 3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
| 4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
Excel Files (XLSX)
(Note: you may need to pip install openpyxl for this to work!)
# Importing data from excel into pandas
df = pd.read_excel('data/Minimum_Wage_Data_Short.xlsx')
df
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
| 1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
| 2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
| 3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
| 4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
| 5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25000 | 1.00000 | 34.783333 | 9.00 | 7.20 |
Other Data Types
Similar variants exist for reading each common data type: CSV, SAS, and so on; for example:
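A short sketch of the CSV and SAS readers, which follow the same pattern (the file names below are hypothetical):
# Hypothetical file paths -- substitute your own
df_csv = pd.read_csv('data/example.csv')
df_sas = pd.read_sas('data/example.sas7bdat')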
Exporting Data
Exporting data is very simple as well and follows a pattern similar to importing, using methods of the form df.to_(NAME OF DATA FORMAT); for example:
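A sketch of the same pattern for Excel and Stata (the output paths are hypothetical):
# Hypothetical output paths -- substitute your own
df.to_excel('exports/Minimum_Wage_Data.xlsx')  # may require openpyxl
df.to_stata('exports/Minimum_Wage_Data.dta')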
CSV
df.to_csv('exports/Minimum_Wage_Data.csv')
Converting Data
Now that we know how to load in data, it will be useful to examine ways to convert already existing data structures into DataFrames.
List
my_list = [1,2,3,4,5,6,7,8,9]
columns = ['a', 'b', 'c']
pd.DataFrame(np.array(my_list).reshape(3,3), columns = columns)
| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
An important thing to note: pd.DataFrame has a multitude of options, the most important of which is the first argument, the data to be transformed. Here we transform the list [1,2,3,4,5,6,7,8,9] into an np.array with the shape:
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
Then we transform the data to a Pandas DataFrame which gives us:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
Finally, we add a list of column names with the option columns = columns to get the final DataFrame.
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
Dictionary
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
| | col_1 | col_2 |
|---|---|---|
| 0 | 3 | a |
| 1 | 2 | b |
| 2 | 1 | c |
| 3 | 0 | d |
Numpy Array
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(5, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
pd.DataFrame(values, index=index).head()
| | Col1 | Col2 | Col3 |
|---|---|---|---|
| Row1 | 0 | 0.0 | 0.0 |
| Row2 | 0 | 0.0 | 0.0 |
| Row3 | 0 | 0.0 | 0.0 |
| Row4 | 0 | 0.0 | 0.0 |
| Row5 | 0 | 0.0 | 0.0 |
Now that we know how to load in our DataFrame, we will try to view and manipulate our data before our analysis is run.
Viewing Data
We know that we can view our data by simply printing the dataframe, but what if our data is too large?
The .head() function prints out the first 5 rows (inside the parentheses you can specify the first N observations you want to see).
df.head()
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
| 1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
| 2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
| 3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
| 4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
df.head(2)
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1968 | Alabama | NaN | 0.0 | 0.0 | 34.783333 | 0.00 | 0.00 |
| 1 | 1968 | Alaska | 2.1 | 2.1 | 2.1 | 34.783333 | 15.12 | 15.12 |
You can also apply all of these methods to a single Series.
df['Year'].head()
0    1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
Or you can view multiple columns at one time.
df[['State', 'Table_Data']].head()
| | State | Table_Data |
|---|---|---|
| 0 | Alabama | NaN |
| 1 | Alaska | 2.1 |
| 2 | Arizona | 18.72 - 26.40/wk(b) |
| 3 | Arkansas | 1.25/day(b) |
| 4 | California | 1.65(b) |
It's also good to view general information on the DataFrame with .info() and to understand the data types of each of the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 6 non-null int64
1 State 6 non-null object
2 Table_Data 5 non-null object
3 High_Value 6 non-null float64
4 Low_Value 6 non-null float64
5 CPI_Average 6 non-null float64
6 High_2018 6 non-null float64
7 Low_2018 6 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 512.0+ bytes
df.dtypes
Year            int64
State object
Table_Data object
High_Value float64
Low_Value float64
CPI_Average float64
High_2018 float64
Low_2018 float64
dtype: object
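If a column's dtype is not what you need, .astype() converts it; a minimal sketch:
# Convert the integer Year column to floats (returns a new Series)
print(df['Year'].astype(float).head())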
Similar to our use of head(), we can use tail() to examine the last few data points.
df.tail(2)
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 4 | 1968 | California | 1.65(b) | 1.65 | 1.65 | 34.783333 | 11.88 | 11.88 |
| 5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25 | 1.00 | 34.783333 | 9.00 | 7.20 |
Subsetting Data
Slicing the data by rows
We can do traditional slicing through the index [start:end]. This will subset by rows.
df[1:4]
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
| 2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
| 3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
Slicing the data by columns
To subset by columns, we can use df[['columns_we_want_1', 'columns_we_want_2']].
columns_you_want = ['State', 'Table_Data']
df[columns_you_want].head()
| | State | Table_Data |
|---|---|---|
| 0 | Alabama | NaN |
| 1 | Alaska | 2.1 |
| 2 | Arizona | 18.72 - 26.40/wk(b) |
| 3 | Arkansas | 1.25/day(b) |
| 4 | California | 1.65(b) |
Let's bring in a larger dataset for this analysis to be meaningful. This will be a panel dataset of states, so there will be a row for each state in every year.
df = pd.read_excel("data/Minimum_Wage_Data.xlsx")
print("These are the first five observations: \n")
print(df[['Year', 'State']].head(5))
print("\n")
print("These are the last five observations: \n")
print(df[['Year', 'State']].tail(5))
These are the first five observations:
Year State
0 1968 Alabama
1 1968 Alaska
2 1968 Arizona
3 1968 Arkansas
4 1968 California
These are the last five observations:
Year State
2745 2017 Virginia
2746 2017 Washington
2747 2017 West Virginia
2748 2017 Wisconsin
2749 2017 Wyoming
Viewing the size and columns of the data:
print(len(df))
print(df.columns)
2750
Index(['Year', 'State', 'Table_Data', 'High_Value', 'Low_Value', 'CPI_Average',
'High_2018', 'Low_2018'],
dtype='object')
We have 8 variables: Year, State, Table_Data, High_Value, Low_Value, CPI_Average, High_2018, and Low_2018. Sometimes in our analysis we only want to keep certain years. For this, traditional boolean logic mixed with Pandas slicing segments the data into the subsets we want.
Slicing based on conditions
# Rows past 2015
df[df.Year > 2015].head(3)
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 2640 | 2016 | Alabama | ... | 0.00 | 0.00 | 240.007167 | 0.00 | 0.00 |
| 2641 | 2016 | Alaska | 9.75 | 9.75 | 9.75 | 240.007167 | 10.17 | 10.17 |
| 2642 | 2016 | Arizona | 8.05 | 8.05 | 8.05 | 240.007167 | 8.40 | 8.40 |
Segmenting on multiple conditions is also desirable. Most programs allow for an "AND" operator and an "OR" operator; in Pandas these are & and |.
# California AND 2010
print(df[(df.Year == 2010) & (df.State == 'California')].head(3))
print('\n')
      Year       State Table_Data  High_Value  Low_Value  CPI_Average  \
2314 2010 California 8 8.0 8.0 218.0555
High_2018 Low_2018
2314 9.19 9.19
# Alabama OR before 2015
df[(df.State == "Alabama") | (df.Year < 2015)].head(3)
| | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1968 | Alabama | NaN | 0.00 | 0.000 | 34.783333 | 0.00 | 0.00 |
| 1 | 1968 | Alaska | 2.1 | 2.10 | 2.100 | 34.783333 | 15.12 | 15.12 |
| 2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66 | 0.468 | 34.783333 | 4.75 | 3.37 |
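When testing a single column against many values, .isin() is often cleaner than chaining several OR conditions; a small sketch:
# Equivalent to (df.State == 'Alabama') | (df.State == 'Alaska')
df[df.State.isin(['Alabama', 'Alaska'])].head(3)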
The traditional Python index slicers are also applicable for DataFrames with the .iloc[] method, which selects rows by integer position.
print(df.iloc[99])
print('\n')
print(df.iloc[[1, 50, 300]])
Year                 1969
State South Dakota
Table_Data 17.00 - 20.00/wk
High_Value 0.5
Low_Value 0.425
CPI_Average 36.683333
High_2018 3.41
Low_2018 2.9
Name: 99, dtype: object
Year State Table_Data High_Value Low_Value CPI_Average \
1 1968 Alaska 2.1 2.1 2.10 34.783333
50 1968 Virginia ... 0.0 0.00 34.783333
300 1973 Minnesota .75 - 1.60 1.6 0.75 44.400000
High_2018 Low_2018
1 15.12 15.12
50 0.00 0.00
300 9.02 4.23
The .loc[] method is also highly useful for subsetting rows and columns at the same time:
df.loc[1:3, 'State']
1     Alaska
2 Arizona
3 Arkansas
Name: State, dtype: object
df.loc[:2, ['Year', 'State']]
| | Year | State |
|---|---|---|
| 0 | 1968 | Alabama |
| 1 | 1968 | Alaska |
| 2 | 1968 | Arizona |
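Keep in mind the distinction between the two methods: .loc[] selects by label and its slices include the endpoint, while .iloc[] selects by integer position and excludes the endpoint. A quick sketch:
print(df.loc[1:3, 'State'])    # labels 1 through 3 inclusive: three rows
print(df.iloc[1:3]['State'])   # positions 1 and 2 only: two rows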
Manipulating Data
Summary Statistics
Exporting summary stats in a different document type can help you visualize results and understand the data. Also note the different steps we can take to print summary stats in a more visually intuitive way.
df.describe()
| | Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|
| count | 2750.000000 | 2739.000000 | 2739.000000 | 2750.000000 | 2739.000000 | 2739.000000 |
| mean | 1992.500000 | 3.653761 | 3.533555 | 138.828983 | 6.441486 | 6.200252 |
| std | 14.433494 | 2.560308 | 2.539424 | 65.823807 | 3.025140 | 3.017818 |
| min | 1968.000000 | 0.000000 | 0.000000 | 34.783333 | 0.000000 | 0.000000 |
| 25% | 1980.000000 | 1.600000 | 1.600000 | 82.408333 | 5.980000 | 5.230000 |
| 50% | 1992.500000 | 3.350000 | 3.350000 | 142.387500 | 7.370000 | 7.170000 |
| 75% | 2005.000000 | 5.150000 | 5.150000 | 195.291667 | 8.280000 | 8.070000 |
| max | 2017.000000 | 11.500000 | 11.500000 | 245.119583 | 15.120000 | 15.120000 |
np.round(df.describe(), 2)
| | Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
|---|---|---|---|---|---|---|
| count | 2750.00 | 2739.00 | 2739.00 | 2750.00 | 2739.00 | 2739.00 |
| mean | 1992.50 | 3.65 | 3.53 | 138.83 | 6.44 | 6.20 |
| std | 14.43 | 2.56 | 2.54 | 65.82 | 3.03 | 3.02 |
| min | 1968.00 | 0.00 | 0.00 | 34.78 | 0.00 | 0.00 |
| 25% | 1980.00 | 1.60 | 1.60 | 82.41 | 5.98 | 5.23 |
| 50% | 1992.50 | 3.35 | 3.35 | 142.39 | 7.37 | 7.17 |
| 75% | 2005.00 | 5.15 | 5.15 | 195.29 | 8.28 | 8.07 |
| max | 2017.00 | 11.50 | 11.50 | 245.12 | 15.12 | 15.12 |
print(np.round(df.describe(), 2).T)
df.describe().transpose().to_csv('exports/summary_stats.csv', sep=',')
              count     mean    std      min      25%      50%      75%  \
Year 2750.0 1992.50 14.43 1968.00 1980.00 1992.50 2005.00
High_Value 2739.0 3.65 2.56 0.00 1.60 3.35 5.15
Low_Value 2739.0 3.53 2.54 0.00 1.60 3.35 5.15
CPI_Average 2750.0 138.83 65.82 34.78 82.41 142.39 195.29
High_2018 2739.0 6.44 3.03 0.00 5.98 7.37 8.28
Low_2018 2739.0 6.20 3.02 0.00 5.23 7.17 8.07
max
Year 2017.00
High_Value 11.50
Low_Value 11.50
CPI_Average 245.12
High_2018 15.12
Low_2018 15.12
Creating a new variable
The standard way to create a new Pandas column or replace an old one is to call the Series (whether it exists or not) on the left-hand side and set it equal to the expression that produces the values you want. For example:
df['7'] = 7
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 7
1 15.12 15.12 7
2 4.75 3.37 7
3 1.12 1.12 7
4 11.88 11.88 7
Or, we can replace an old variable in a similar way.
Replacing a variable
df['7'] = 8
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 8
1 15.12 15.12 8
2 4.75 3.37 8
3 1.12 1.12 8
4 11.88 11.88 8
We can also rename variables. In this case we will rename Year to Date and Table_Data to Values.
Renaming Variables
df.rename(columns={"Year": "Date", "Table_Data": "Values"}).head()
| | Date | State | Values | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 | 8 |
| 1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 | 8 |
| 2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 | 8 |
| 3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 | 8 |
| 4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 | 8 |
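Note that .rename() returns a new DataFrame rather than modifying df in place, so reassign the result if you want to keep the change; a minimal sketch (df_renamed is an illustrative name):
# Persist the renaming by reassigning (or pass inplace=True)
df_renamed = df.rename(columns={"Year": "Date", "Table_Data": "Values"})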
Dropping Variables
We can also drop a variable with df.drop(). This is one of many functions where you'll need to specify an axis: axis=0 indicates that you want to apply the operation along the rows, whereas axis=1 applies it along the columns.
# Dropping a variable
df = df.drop("7", axis=1)
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018
0 0.00 0.00
1 15.12 15.12
2 4.75 3.37
3 1.12 1.12
4 11.88 11.88
Arithmetic Operations
Basic math operations are also easily applied in Pandas.
df['Difference'] = df['High_Value'] - df['Low_Value']
df['High*2'] = df['High_Value']*2
df['Low*2'] = df['Low_Value']*2
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 Difference High*2 Low*2
0 0.00 0.00 0.000 0.0000 0.0000
1 15.12 15.12 0.000 4.2000 4.2000
2 4.75 3.37 0.192 1.3200 0.9360
3 1.12 1.12 0.000 0.3125 0.3125
4 11.88 11.88 0.000 3.3000 3.3000
Complex Operations with .map()
More complex operations can be solved through the .map() method.
In the example below, .map() uses a dictionary to change results, which can help keep code clean when replacing a large number of values. Here we create an abbrev column for state abbreviations (we will create it and drop the variable after).
# Using Series.map
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX', 'Alabama': 'AL'}
df['abbrev'] = df['State'].map(state_2)
df[['State', 'abbrev']].head()
| | State | abbrev |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | NaN |
| 2 | Arizona | NaN |
| 3 | Arkansas | NaN |
| 4 | California | CA |
# A simpler example using Python's built-in map
alist = [1,2,3,4]
def function_to_be_mapped(x):
return x+3
result = map(function_to_be_mapped, alist)
print(list(result))
[4, 5, 6, 7]
df.sort_values('abbrev', ascending=True).head()
df = df.drop("abbrev", axis=1)
Using Functions
You can use .apply() to apply a function to a Series. You can either specify the function or use the lambda anonymous function:
Specifying a function:
def add(x):
x = x + 1
return x
print(df['Year'].head())
print("\n")
print(df['Year'].apply(add).head())
0    1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Lambda Functions
Lambda functions let you skip defining a named function when the operation is simple enough:
print(df['Year'].head())
print("\n")
print((df['Year'].apply(lambda x: x + 1).head()))
0    1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Missing Values
Dealing with missing values is an important part of data cleaning and something to always be mindful of. Pandas codes missing values as NumPy NaN values.
print(df['Table_Data'].head(10))
print("\n")
print(df['Year'].head(10))
0                    NaN
1 2.1
2 18.72 - 26.40/wk(b)
3 1.25/day(b)
4 1.65(b)
5 1.00 - 1.25(b)
6 1.4
7 1.25
8 1.25 - 1.40
9 $1.15 & $1.60
Name: Table_Data, dtype: object
0 1968
1 1968
2 1968
3 1968
4 1968
5 1968
6 1968
7 1968
8 1968
9 1968
Name: Year, dtype: int64
The following syntax can help identify missing values:
print(df['Table_Data'].isnull().values.any())
print(df['Year'].isnull().values.any())
True
False
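Once identified, missing values can be dropped or filled; a minimal sketch of the two most common approaches:
# Drop any row containing a missing value (returns a new DataFrame)
df_complete = df.dropna()
# Or fill missing values with a placeholder instead
df_filled = df.fillna(0)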
Duplicate Values
We can use .drop_duplicates() to drop duplicate rows. keep="first" retains only the first occurrence of each duplicate, while keep="last" (used below) retains only the last.
data = {'col_1': [3, 3, 3, 3], 'col_2': ['a', 'a', 'a', 'a']}
data_dup = pd.DataFrame.from_dict(data)
print(data_dup)
print("\n")
data_dup = data_dup.drop_duplicates(keep="last")
print(data_dup)
   col_1 col_2
0 3 a
1 3 a
2 3 a
3 3 a
col_1 col_2
3 3 a
Reshaping Data
Data can be reshaped using a variety of Pandas functions. Going from wide to long has a built-in function, wide_to_long(). We will load in a new dataset for this:
df_ex = pd.DataFrame({
'Unique Family Identifier': [1000, 1000, 1000, 1001, 1001, 1001, 1002, 324, 234],
'Order': [1, 2, 3, 1, 2, 3, 1, 2, 3],
'az1': [28, 82, 23, 234, 234, 324, 546, 546, 5464],
'az2': [2342, 2342, 54645, 56765, 65765, 65756, 3453, 56756, 3453]})
reshape = pd.wide_to_long(df_ex, stubnames='az', i=['Unique Family Identifier', 'Order'], j='age')
df_ex.head()
| | Unique Family Identifier | Order | az1 | az2 |
|---|---|---|---|---|
| 0 | 1000 | 1 | 28 | 2342 |
| 1 | 1000 | 2 | 82 | 2342 |
| 2 | 1000 | 3 | 23 | 54645 |
| 3 | 1001 | 1 | 234 | 56765 |
| 4 | 1001 | 2 | 234 | 65765 |
reshape
| Unique Family Identifier | Order | age | az |
|---|---|---|---|
| 1000 | 1 | 1 | 28 |
| | | 2 | 2342 |
| | 2 | 1 | 82 |
| | | 2 | 2342 |
| | 3 | 1 | 23 |
| | | 2 | 54645 |
| 1001 | 1 | 1 | 234 |
| | | 2 | 56765 |
| | 2 | 1 | 234 |
| | | 2 | 65765 |
| | 3 | 1 | 324 |
| | | 2 | 65756 |
| 1002 | 1 | 1 | 546 |
| | | 2 | 3453 |
| 324 | 2 | 1 | 546 |
| | | 2 | 56756 |
| 234 | 3 | 1 | 5464 |
| | | 2 | 3453 |
Going from long to wide requires the use of unstack().
normal = reshape.unstack()
normal.columns = normal.columns.map('{0[0]}{0[1]}'.format)
normal.reset_index()
normal.head()
| Unique Family Identifier | Order | az1 | az2 |
|---|---|---|---|
| 234 | 3 | 5464 | 3453 |
| 324 | 2 | 546 | 56756 |
| 1000 | 1 | 28 | 2342 |
| | 2 | 82 | 2342 |
| | 3 | 23 | 54645 |
Merging Data
Merging data uses pd.merge(data_1, data_2, on = identifier).
left_frame = pd.DataFrame({'key': range(5),
'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)
   key left_value
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
key right_value
0 2 f
1 3 g
2 4 h
3 5 i
4 6 j
pd.merge(left_frame, right_frame, on='key', how='inner')
| | key | left_value | right_value |
|---|---|---|---|
| 0 | 2 | c | f |
| 1 | 3 | d | g |
| 2 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='left')
| | key | left_value | right_value |
|---|---|---|---|
| 0 | 0 | a | NaN |
| 1 | 1 | b | NaN |
| 2 | 2 | c | f |
| 3 | 3 | d | g |
| 4 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='right')
| | key | left_value | right_value |
|---|---|---|---|
| 0 | 2 | c | f |
| 1 | 3 | d | g |
| 2 | 4 | e | h |
| 3 | 5 | NaN | i |
| 4 | 6 | NaN | j |
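A fourth option, how='outer', keeps all keys from both frames and fills NaN wherever one side is missing; a quick sketch:
# Keys 0 through 6 all appear; unmatched sides show NaN
pd.merge(left_frame, right_frame, on='key', how='outer')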
Appending Data
Note: Here, axis=0 is implied as the default, so it will append along the rows. But columns or Series can also be appended to each other by specifying axis=1 (provided that they are the same length).
pd.concat([left_frame, right_frame], sort=True)
| | key | left_value | right_value |
|---|---|---|---|
| 0 | 0 | a | NaN |
| 1 | 1 | b | NaN |
| 2 | 2 | c | NaN |
| 3 | 3 | d | NaN |
| 4 | 4 | e | NaN |
| 0 | 2 | NaN | f |
| 1 | 3 | NaN | g |
| 2 | 4 | NaN | h |
| 3 | 5 | NaN | i |
| 4 | 6 | NaN | j |
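As the note above says, specifying axis=1 concatenates along the columns instead, aligning rows on the index; a small sketch:
# Places the frames side by side, aligned on the row index
pd.concat([left_frame, right_frame], axis=1)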
Collapsing Data
Collapsing data is accomplished by the .groupby() function. We will collapse by Year.
by_year = df.groupby('Year')
print(by_year.count().head())  # NOT NULL records within each column
      State  Table_Data  High_Value  Low_Value  CPI_Average  High_2018  \
Year
1968 55 53 54 54 55 54
1969 55 54 54 54 55 54
1970 55 54 54 54 55 54
1971 55 54 54 54 55 54
1972 55 54 54 54 55 54
Low_2018 Difference High*2 Low*2
Year
1968 54 54 54 54
1969 54 54 54 54
1970 54 54 54 54
1971 54 54 54 54
1972 54 54 54 54
Note: This only generates counts for each year. If we want averages, sums, or any other summary statistics, we have to specify that:
print(by_year.sum()[20:25])
print('\n')
print(by_year.mean()[20:25])
print('\n')
print(by_year.median()[20:25])
      High_Value  Low_Value  CPI_Average  High_2018  Low_2018  Difference  \
Year
1988 146.75 142.75 6504.208331 310.65 302.19 4.0
1989 146.75 142.75 6818.166669 296.49 288.41 4.0
1990 146.75 142.75 7186.208332 281.27 273.61 4.0
1991 179.20 175.00 7490.541669 329.52 321.80 4.2
1992 194.17 189.57 7717.416669 346.40 338.19 4.6
High*2 Low*2
Year
1988 293.50 285.50
1989 293.50 285.50
1990 293.50 285.50
1991 358.40 350.00
1992 388.34 379.14
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 2.668182 2.595455 118.258333 5.648182 5.494364 0.072727
1989 2.668182 2.595455 123.966667 5.390727 5.243818 0.072727
1990 2.668182 2.595455 130.658333 5.114000 4.974727 0.072727
1991 3.258182 3.181818 136.191667 5.991273 5.850909 0.076364
1992 3.530364 3.446727 140.316667 6.298182 6.148909 0.083636
High*2 Low*2
Year
1988 5.336364 5.190909
1989 5.336364 5.190909
1990 5.336364 5.190909
1991 6.516364 6.363636
1992 7.060727 6.893455
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 3.35 3.35 118.258333 7.09 7.09 0.0
1989 3.35 3.35 123.966667 6.77 6.77 0.0
1990 3.35 3.35 130.658333 6.42 6.42 0.0
1991 3.80 3.80 136.191667 6.99 6.99 0.0
1992 4.25 4.25 140.316667 7.58 7.58 0.0
High*2 Low*2
Year
1988 6.7 6.7
1989 6.7 6.7
1990 6.7 6.7
1991 7.6 7.6
1992 8.5 8.5
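If you want different statistics for different columns in a single pass, .agg() accepts a dictionary mapping column names to functions; a short sketch:
# One aggregation per column in one call
print(by_year.agg({'High_Value': 'mean', 'Low_Value': 'median', 'CPI_Average': 'max'}).head())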