Intro to Pandas#

While built-in Python functions suffice for general programming, data analysis requires additional functions and objects. Pandas is a popular data analysis package that is simple to pick up from the moment you start using Python.

The primary reason Pandas is popular for data science and analysis is that it introduces three useful objects that mirror similar data structures from other statistical packages without adding complexity to Python's simple syntax. These objects are:

  1. The DataFrame

  2. The Series

  3. The Index

This guide is not meant to be comprehensive; the most comprehensive reference for Pandas is the official documentation. The purpose of this guide is to give a potential user an overview of Pandas' functionality and of how to accomplish basic data analysis tasks.

Getting Started#

Importing Pandas#

Since Pandas is not part of Python's standard library, you will have to install it. Fortunately, Pandas has grown so much in popularity that most distributions (including Anaconda) already include it (otherwise, pip install pandas will do it), but you still have to load the package. This code will do the trick:

import pandas 

However, because Pandas is used so often, the Python community loads pandas as pd for ease of access:

import pandas as pd

It’s also useful to import a few other packages, most notably NumPy.

import numpy as np

DataFrame#

The DataFrame object is one of the main contributions of Pandas. The DataFrame is a 2-dimensional labeled structure with columns and rows; the rows and the columns each represent one dimension. The DataFrame is analogous to the data frame in R, the dataset in Stata, and an Excel spreadsheet. Put in more technical terms, the DataFrame is a tabular data structure. The code below defines a DataFrame and then prints it out to view:

d = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})

d
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

The name of our DataFrame is d, and it has two columns, one and two, and four rows, 0-3. (Python is 0-indexed, meaning it counts from 0, 1, 2, and so on.)

Each data point can be referenced through its Index (which corresponds to its row, the left-most value), while the column signifies what the value means. We can call a single Pandas column with this command:

d['one']
0    1.0
1    2.0
2    3.0
3    4.0
Name: one, dtype: float64

The values on the left represent the Index we saw earlier. Notice that a single column is itself a Pandas object, the Series. More on this object below.
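A quick check confirms this (a minimal sketch using the d DataFrame defined above, with the output Pandas would display):

type(d['one'])
pandas.core.series.Series

A single data point can also be referenced by pairing an Index label with a column name through .loc:

print(d.loc[0, 'one'])
1.0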

Series#

A Series is a one-dimensional indexed array that can contain any Python data type. To create a Series, you use the function:

series_ex = pd.Series([1,2,3,4])

A Series in Pandas looks similar to a list, but there are key distinctions in how they operate. As mentioned, Pandas is built for data analysis, so a Series has functions that allow analysis to be done easily, while a list would require either a for loop or a list comprehension for the same operations. An example of this is below:

series_ex = series_ex*2
print("This is a Series multiplied by two")
print(series_ex)
This is a Series multiplied by two
0    2
1    4
2    6
3    8
dtype: int64
alist = [1,2,3,4]
alist = alist*2
print("This is a List multiplied by two")
print(alist)
This is a List multiplied by two
[1, 2, 3, 4, 1, 2, 3, 4]
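For comparison, getting the element-wise result from a plain list requires a loop or list comprehension, as in this quick sketch (blist is a fresh list defined just for the illustration):

blist = [1, 2, 3, 4]
print([x * 2 for x in blist])
[2, 4, 6, 8]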

Index#

Both the Series and the DataFrame have an Index that signifies order and allows for referencing specific points. The Index itself is an object.

ind = pd.Index([2, 3, 5, 7, 11])
ind
Index([2, 3, 5, 7, 11], dtype='int64')
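The Index of an existing Series or DataFrame can be accessed through its .index attribute, and, unlike a list, an Index is immutable. A small sketch using the objects defined above:

series_ex.index
RangeIndex(start=0, stop=4, step=1)

# Trying to assign into an Index, e.g. ind[0] = 1, raises a TypeError because Index objects are immutable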

Importing Data#

Pandas has the ability to read and export multiple data formats: CSV, DTA, SAS, JSON, SQL, and many more. Almost all data reading takes the form:

pd.read_(NAME OF DATA FORMAT)('FILE_NAME')

Let’s take a look at a few examples.

Stata Files (DTA)#

# Importing data from Stata 
pd.read_stata('data/State_ETO_short.dta')
state_abbv ETO ETW ETOW ET_cat ET_Work_cat ET_ET_Work_cat
0 VA 2.0 62.5 64.5 0 - 5% 30% < 30% <
1 TN 5.5 20.8 26.3 5 - 10% 10 - 30% 20 - 30%
2 VT 19.9 0.7 20.6 15 - 20% 0 - 5% 20 - 30%
3 ID 6.0 11.4 17.4 5 - 10% 10 - 30% 10 - 20%
4 OH 1.9 15.0 16.9 0 - 5% 10 - 30% 10 - 20%

This is how we read in a DataFrame, but we still need to store and name it. This can be accomplished in one line:

df = pd.read_stata('data/State_ETO_short.dta')

Now, when we call df, we’ll get the DataFrame that corresponds to the data referenced in pd.read_stata('data/State_ETO_short.dta').

df
state_abbv ETO ETW ETOW ET_cat ET_Work_cat ET_ET_Work_cat
0 VA 2.0 62.5 64.5 0 - 5% 30% < 30% <
1 TN 5.5 20.8 26.3 5 - 10% 10 - 30% 20 - 30%
2 VT 19.9 0.7 20.6 15 - 20% 0 - 5% 20 - 30%
3 ID 6.0 11.4 17.4 5 - 10% 10 - 30% 10 - 20%
4 OH 1.9 15.0 16.9 0 - 5% 10 - 30% 10 - 20%

Excel Files (XLSX)#

(Note: you may need to pip install openpyxl for this to work!)

# Importing data from excel into pandas
df = pd.read_excel('data/Minimum_Wage_Data_Short.xlsx')
df
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
0 1968 Alabama NaN 0.00000 0.00000 34.783333 0.00 0.00
1 1968 Alaska 2.1 2.10000 2.10000 34.783333 15.12 15.12
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333 4.75 3.37
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333 1.12 1.12
4 1968 California 1.65(b) 1.65000 1.65000 34.783333 11.88 11.88
5 1968 Colorado 1.00 - 1.25(b) 1.25000 1.00000 34.783333 9.00 7.20

Other Data Types#

Similar read functions exist for each common data format: CSV, SAS, and so on.
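For example (the file names below are placeholders, not files that ship with this guide):

pd.read_csv('data/my_data.csv')
pd.read_sas('data/my_data.sas7bdat')
pd.read_json('data/my_data.json')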

Exporting Data#

Exporting data is very simple as well and follows a pattern similar to importing.

CSV#

df.to_csv('exports/Minimum_Wage_Data.csv')
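Other Formats#

The other to_ methods work the same way (the file names below are placeholders; to_excel may also require openpyxl):

df.to_excel('exports/Minimum_Wage_Data.xlsx')
df.to_json('exports/Minimum_Wage_Data.json')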

Converting Data#

Now that we know how to load in data, it will be useful to examine ways to convert already existing data structures into DataFrames.

List#

my_list = [1,2,3,4,5,6,7,8,9]
columns = ['a', 'b', 'c']

pd.DataFrame(np.array(my_list).reshape(3,3), columns = columns)
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

An important thing to note: pd.DataFrame has a multitude of options. The most important is the first argument, the data to be transformed. Here we transform the list [1,2,3,4,5,6,7,8,9] into an np.array with the shape:

[[1,2,3],
[4,5,6],
[7,8,9]]

Then we transform the data to a Pandas DataFrame, which (before column names are assigned) gives us:

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

Finally, we add a list of column names with the option columns = columns to get the final DataFrame:

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Dictionary#

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Numpy Array#

dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(5, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

pd.DataFrame(values, index=index).head()
      Col1  Col2  Col3
Row1     0   0.0   0.0
Row2     0   0.0   0.0
Row3     0   0.0   0.0
Row4     0   0.0   0.0
Row5     0   0.0   0.0

Now that we know how to load in our DataFrame, we will try to view and manipulate our data before our analysis is run.

Viewing Data#

We know that we can view our data by simply printing the dataframe, but what if our data is too large?

The .head() function prints out the first 5 rows (inside the parentheses you can specify the first N observations you want to see).

df.head()
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
0 1968 Alabama NaN 0.00000 0.00000 34.783333 0.00 0.00
1 1968 Alaska 2.1 2.10000 2.10000 34.783333 15.12 15.12
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333 4.75 3.37
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333 1.12 1.12
4 1968 California 1.65(b) 1.65000 1.65000 34.783333 11.88 11.88
df.head(2)
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
0 1968 Alabama NaN 0.0 0.0 34.783333 0.00 0.00
1 1968 Alaska 2.1 2.1 2.1 34.783333 15.12 15.12

You can also apply all of these methods to a single Series.

df['Year'].head()
0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64

Or, you can select multiple columns at one time.

df[['State', 'Table_Data']].head()
State Table_Data
0 Alabama NaN
1 Alaska 2.1
2 Arizona 18.72 - 26.40/wk(b)
3 Arkansas 1.25/day(b)
4 California 1.65(b)

It’s also good to view general information on the DataFrame with .info() and to check the data types of each of the columns.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Year         6 non-null      int64  
 1   State        6 non-null      object 
 2   Table_Data   5 non-null      object 
 3   High_Value   6 non-null      float64
 4   Low_Value    6 non-null      float64
 5   CPI_Average  6 non-null      float64
 6   High_2018    6 non-null      float64
 7   Low_2018     6 non-null      float64
dtypes: float64(5), int64(1), object(2)
memory usage: 512.0+ bytes
df.dtypes
Year             int64
State           object
Table_Data      object
High_Value     float64
Low_Value      float64
CPI_Average    float64
High_2018      float64
Low_2018       float64
dtype: object

Similar to our use of head(), we can use tail() to examine the last few data points.

df.tail(2)
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
4 1968 California 1.65(b) 1.65 1.65 34.783333 11.88 11.88
5 1968 Colorado 1.00 - 1.25(b) 1.25 1.00 34.783333 9.00 7.20

Subsetting Data#

Slicing the data by rows#

We can do traditional slicing through the index with [start:end] (the end point is excluded). This will subset by rows.

df[1:4]
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
1 1968 Alaska 2.1 2.10000 2.10000 34.783333 15.12 15.12
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333 4.75 3.37
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333 1.12 1.12

Slicing the data by columns#

To subset by columns, we can use df[['columns_we_want_1', 'columns_we_want_2']].

columns_you_want = ['State', 'Table_Data'] 
df[columns_you_want].head()
State Table_Data
0 Alabama NaN
1 Alaska 2.1
2 Arizona 18.72 - 26.40/wk(b)
3 Arkansas 1.25/day(b)
4 California 1.65(b)

Let’s bring in a larger dataset for this analysis to be meaningful. This is a panel dataset of states, so there will be a row for each state in every year.

df = pd.read_excel("data/Minimum_Wage_Data.xlsx")
print("These are the first five observations: \n")
print(df[['Year', 'State']].head(5))
print("\n")

print("These are the last five observations: \n")
print(df[['Year', 'State']].tail(5))
These are the first five observations: 

   Year       State
0  1968     Alabama
1  1968      Alaska
2  1968     Arizona
3  1968    Arkansas
4  1968  California


These are the last five observations: 

      Year          State
2745  2017       Virginia
2746  2017     Washington
2747  2017  West Virginia
2748  2017      Wisconsin
2749  2017        Wyoming

Viewing the size and columns of the data:

print(len(df))
print(df.columns)
2750
Index(['Year', 'State', 'Table_Data', 'High_Value', 'Low_Value', 'CPI_Average',
       'High_2018', 'Low_2018'],
      dtype='object')

We have 8 variables: Year, State, Table_Data, High_Value, Low_Value, CPI_Average, High_2018, and Low_2018. Sometimes in our analysis we only want to keep certain years. For this, the traditional boolean logic mixed with Pandas slices the data into the segments we want.

Slicing based on conditions#

# Rows past 2015
df[df.Year > 2015].head(3)
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
2640 2016 Alabama ... 0.00 0.00 240.007167 0.00 0.00
2641 2016 Alaska 9.75 9.75 9.75 240.007167 10.17 10.17
2642 2016 Arizona 8.05 8.05 8.05 240.007167 8.40 8.40

Segmenting on multiple conditions is also desirable. Most programs allow for an “AND” operator and an “OR” operator; in Pandas these are & and |, and each condition must be wrapped in parentheses.

# California AND 2010
print(df[(df.Year == 2010) & (df.State == 'California')].head(3))
print('\n')
      Year       State Table_Data  High_Value  Low_Value  CPI_Average  \
2314  2010  California          8         8.0        8.0     218.0555   

      High_2018  Low_2018  
2314       9.19      9.19  
# Alabama OR before 2015
df[(df.State == "Alabama") | (df.Year < 2015)].head(3)
Year State Table_Data High_Value Low_Value CPI_Average High_2018 Low_2018
0 1968 Alabama NaN 0.00 0.000 34.783333 0.00 0.00
1 1968 Alaska 2.1 2.10 2.100 34.783333 15.12 15.12
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66 0.468 34.783333 4.75 3.37

The traditional Python index slicers are also applicable to DataFrames with the .iloc[] method.

print(df.iloc[99])
print('\n')
print(df.iloc[[1, 50, 300]])
Year                       1969
State              South Dakota
Table_Data     17.00 - 20.00/wk
High_Value                  0.5
Low_Value                 0.425
CPI_Average           36.683333
High_2018                  3.41
Low_2018                    2.9
Name: 99, dtype: object


     Year      State  Table_Data  High_Value  Low_Value  CPI_Average  \
1    1968     Alaska         2.1         2.1       2.10    34.783333   
50   1968   Virginia         ...         0.0       0.00    34.783333   
300  1973  Minnesota  .75 - 1.60         1.6       0.75    44.400000   

     High_2018  Low_2018  
1        15.12     15.12  
50        0.00      0.00  
300       9.02      4.23  

The .loc[] method is also highly useful for subsetting rows and columns at the same time. (Note that label-based slicing with .loc[] includes the end point.)

df.loc[1:3, 'State']
1      Alaska
2     Arizona
3    Arkansas
Name: State, dtype: object
df.loc[:2, ['Year', 'State']]
Year State
0 1968 Alabama
1 1968 Alaska
2 1968 Arizona

Manipulating Data#

Summary Statistics#

Summary statistics help us understand the data, and exporting them to a different document type can make results easier to visualize and share. Also note the different steps we can take below to print out summary stats in a more visually intuitive way.

df.describe()
Year High_Value Low_Value CPI_Average High_2018 Low_2018
count 2750.000000 2739.000000 2739.000000 2750.000000 2739.000000 2739.000000
mean 1992.500000 3.653761 3.533555 138.828983 6.441486 6.200252
std 14.433494 2.560308 2.539424 65.823807 3.025140 3.017818
min 1968.000000 0.000000 0.000000 34.783333 0.000000 0.000000
25% 1980.000000 1.600000 1.600000 82.408333 5.980000 5.230000
50% 1992.500000 3.350000 3.350000 142.387500 7.370000 7.170000
75% 2005.000000 5.150000 5.150000 195.291667 8.280000 8.070000
max 2017.000000 11.500000 11.500000 245.119583 15.120000 15.120000
np.round(df.describe(), 2)
Year High_Value Low_Value CPI_Average High_2018 Low_2018
count 2750.00 2739.00 2739.00 2750.00 2739.00 2739.00
mean 1992.50 3.65 3.53 138.83 6.44 6.20
std 14.43 2.56 2.54 65.82 3.03 3.02
min 1968.00 0.00 0.00 34.78 0.00 0.00
25% 1980.00 1.60 1.60 82.41 5.98 5.23
50% 1992.50 3.35 3.35 142.39 7.37 7.17
75% 2005.00 5.15 5.15 195.29 8.28 8.07
max 2017.00 11.50 11.50 245.12 15.12 15.12
print(np.round(df.describe(), 2).T)
df.describe().transpose().to_csv('exports/summary_stats.csv', sep=',')
              count     mean    std      min      25%      50%      75%  \
Year         2750.0  1992.50  14.43  1968.00  1980.00  1992.50  2005.00   
High_Value   2739.0     3.65   2.56     0.00     1.60     3.35     5.15   
Low_Value    2739.0     3.53   2.54     0.00     1.60     3.35     5.15   
CPI_Average  2750.0   138.83  65.82    34.78    82.41   142.39   195.29   
High_2018    2739.0     6.44   3.03     0.00     5.98     7.37     8.28   
Low_2018     2739.0     6.20   3.02     0.00     5.23     7.17     8.07   

                 max  
Year         2017.00  
High_Value     11.50  
Low_Value      11.50  
CPI_Average   245.12  
High_2018      15.12  
Low_2018       15.12  

Creating a new variable#

The standard way to create a new Pandas column, or replace an old one, is to reference the Series on the left-hand side (whether it exists yet or not) and set it equal to the expression that produces the values you want. For example:

df['7'] = 7
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High_2018  Low_2018  7  
0       0.00      0.00  7  
1      15.12     15.12  7  
2       4.75      3.37  7  
3       1.12      1.12  7  
4      11.88     11.88  7  

Or, we can replace an old variable in a similar way.

Replacing a variable#

df['7'] = 8
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High_2018  Low_2018  7  
0       0.00      0.00  8  
1      15.12     15.12  8  
2       4.75      3.37  8  
3       1.12      1.12  8  
4      11.88     11.88  8  

We can also rename variables. In this case we will rename Year to Date and Table_Data to Values.

Renaming Variables#

df.rename(columns={"Year": "Date", "Table_Data": "Values"}).head()
Date State Values High_Value Low_Value CPI_Average High_2018 Low_2018 7
0 1968 Alabama NaN 0.00000 0.00000 34.783333 0.00 0.00 8
1 1968 Alaska 2.1 2.10000 2.10000 34.783333 15.12 15.12 8
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333 4.75 3.37 8
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333 1.12 1.12 8
4 1968 California 1.65(b) 1.65000 1.65000 34.783333 11.88 11.88 8
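Note that .rename() returns a new DataFrame rather than changing df in place, which is why the columns keep their original names in the steps that follow. To keep the new names you would assign the result back, e.g. (df_renamed is a name introduced here just for illustration):

df_renamed = df.rename(columns={"Year": "Date", "Table_Data": "Values"})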

Dropping Variables#

We can also drop a variable with df.drop(). This is one of many functions where you’ll need to specify an axis: axis=0 indicates that you want to apply the operation along the rows, whereas axis=1 indicates that you want to apply it to columns.

# Dropping a variable
df = df.drop("7", axis=1)
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High_2018  Low_2018  
0       0.00      0.00  
1      15.12     15.12  
2       4.75      3.37  
3       1.12      1.12  
4      11.88     11.88  
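Because axis=0 is the default, the same method drops rows when given index labels instead. A small sketch (not applied to df in the rest of the guide):

df.drop(0, axis=0).head()   # drops the row whose index label is 0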

Arithmetic Operations#

Basic math operations are also easily applied in Pandas.

df['Difference'] = df['High_Value'] - df['Low_Value']
df['High*2'] = df['High_Value']*2
df['Low*2'] = df['Low_Value']*2
print(df.head())
   Year       State           Table_Data  High_Value  Low_Value  CPI_Average  \
0  1968     Alabama                  NaN     0.00000    0.00000    34.783333   
1  1968      Alaska                  2.1     2.10000    2.10000    34.783333   
2  1968     Arizona  18.72 - 26.40/wk(b)     0.66000    0.46800    34.783333   
3  1968    Arkansas          1.25/day(b)     0.15625    0.15625    34.783333   
4  1968  California              1.65(b)     1.65000    1.65000    34.783333   

   High_2018  Low_2018  Difference  High*2   Low*2  
0       0.00      0.00       0.000  0.0000  0.0000  
1      15.12     15.12       0.000  4.2000  4.2000  
2       4.75      3.37       0.192  1.3200  0.9360  
3       1.12      1.12       0.000  0.3125  0.3125  
4      11.88     11.88       0.000  3.3000  3.3000  

Complex Operations with .map()#

More complex operations can be solved through the .map() method.

In the example below, .map() uses a dictionary to replace values, which can help keep code clean when recoding a large number of values. Here we create an abbrev column with state abbreviations (we will create it and drop the variable after).

# Using Series.map with a dictionary
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX', 'Alabama': 'AL'}

df['abbrev'] = df['State'].map(state_2)
df[['State', 'abbrev']].head()
State abbrev
0 Alabama AL
1 Alaska NaN
2 Arizona NaN
3 Arkansas NaN
4 California CA
# The built-in Python map() works similarly on a plain list
alist = [1,2,3,4]
def function_to_be_mapped(x):
    return x+3

result = map(function_to_be_mapped, alist)
print(list(result))
[4, 5, 6, 7]
df.sort_values('abbrev', ascending=True).head()
df = df.drop("abbrev", axis=1)

Using Functions#

You can use .apply() to apply a function to a Series. You can either specify the function or use an anonymous lambda function:

Specifying a function:

def add(x):
    x = x + 1
    return x

print(df['Year'].head())
print("\n")
print(df['Year'].apply(add).head())
0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64


0    1969
1    1969
2    1969
3    1969
4    1969
Name: Year, dtype: int64

Lambda Functions#

Lambda functions let you skip defining a named function when the operation is simple enough:

print(df['Year'].head())
print("\n")
print((df['Year'].apply(lambda x: x + 1).head()))
0    1968
1    1968
2    1968
3    1968
4    1968
Name: Year, dtype: int64


0    1969
1    1969
2    1969
3    1969
4    1969
Name: Year, dtype: int64

Missing Values#

Dealing with missing values is an important part of data cleaning and something to always be mindful of. Pandas codes missing values as NumPy NaN values.

print(df['Table_Data'].head(10))
print("\n")
print(df['Year'].head(10))
0                    NaN
1                    2.1
2    18.72 - 26.40/wk(b)
3            1.25/day(b)
4                1.65(b)
5         1.00 - 1.25(b)
6                    1.4
7                   1.25
8            1.25 - 1.40
9          $1.15 & $1.60
Name: Table_Data, dtype: object


0    1968
1    1968
2    1968
3    1968
4    1968
5    1968
6    1968
7    1968
8    1968
9    1968
Name: Year, dtype: int64

The following syntax can help identify missing values:

print(df['Table_Data'].isnull().values.any())
print(df['Year'].isnull().values.any())
True
False
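Once missing values are identified, they can be dropped or filled with the dropna() and fillna() methods. A brief sketch (these changes are not applied to df in the rest of the guide):

df.dropna(subset=['Table_Data']).head()      # drop rows where Table_Data is missing
df['Table_Data'].fillna('no data').head()    # or fill missing entries with a placeholder value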

Duplicate Values#

We can use .drop_duplicates(keep="first") to drop all but the first occurrence of duplicated rows. In the example below we use keep="last" instead, which keeps only the last occurrence.

data = {'col_1': [3, 3, 3, 3], 'col_2': ['a', 'a', 'a', 'a']}
data_dup = pd.DataFrame.from_dict(data)
print(data_dup)

print("\n")

data_dup = data_dup.drop_duplicates(keep="last")

print(data_dup)
   col_1 col_2
0      3     a
1      3     a
2      3     a
3      3     a


   col_1 col_2
3      3     a

Reshaping Data#

Data can be reshaped using a variety of Pandas functions. Going from wide to long format has a built-in function, wide_to_long(). We will load in a new dataset for this:

df_ex = pd.DataFrame({
    'Unique Family Identifier': [1000, 1000, 1000, 1001, 1001, 1001, 1002, 324, 234],
    'Order': [1, 2, 3, 1, 2, 3, 1, 2, 3], 
    'az1': [28, 82, 23, 234, 234, 324, 546, 546, 5464],
    'az2': [2342, 2342, 54645, 56765, 65765, 65756, 3453, 56756, 3453]})


reshape = pd.wide_to_long(df_ex, stubnames='az', i=['Unique Family Identifier', 'Order'], j='age')

df_ex.head()
Unique Family Identifier Order az1 az2
0 1000 1 28 2342
1 1000 2 82 2342
2 1000 3 23 54645
3 1001 1 234 56765
4 1001 2 234 65765
reshape
az
Unique Family Identifier Order age
1000 1 1 28
2 2342
2 1 82
2 2342
3 1 23
2 54645
1001 1 1 234
2 56765
2 1 234
2 65765
3 1 324
2 65756
1002 1 1 546
2 3453
324 2 1 546
2 56756
234 3 1 5464
2 3453

Going from long to wide requires the use of unstack().

normal = reshape.unstack()
normal.columns = normal.columns.map('{0[0]}{0[1]}'.format)
normal.reset_index()
normal.head()
az1 az2
Unique Family Identifier Order
234 3 5464 3453
324 2 546 56756
1000 1 28 2342
2 82 2342
3 23 54645

Merging Data#

Merging data uses pd.merge(data_1, data_2, on = identifier).

left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)
   key left_value
0    0          a
1    1          b
2    2          c
3    3          d
4    4          e


   key right_value
0    2           f
1    3           g
2    4           h
3    5           i
4    6           j
pd.merge(left_frame, right_frame, on='key', how='inner')
   key left_value right_value
0    2          c           f
1    3          d           g
2    4          e           h
pd.merge(left_frame, right_frame, on='key', how='left')
   key left_value right_value
0    0          a         NaN
1    1          b         NaN
2    2          c           f
3    3          d           g
4    4          e           h
pd.merge(left_frame, right_frame, on='key', how='right')
   key left_value right_value
0    2          c           f
1    3          d           g
2    4          e           h
3    5        NaN           i
4    6        NaN           j

Appending Data#

Note: Here, axis=0 is implied as the default, so pd.concat will append along the rows. But columns or Series can also be appended to each other by specifying axis=1 (provided that they are the same length), as shown after the example below.

pd.concat([left_frame, right_frame], sort=True)
   key left_value right_value
0    0          a         NaN
1    1          b         NaN
2    2          c         NaN
3    3          d         NaN
4    4          e         NaN
0    2        NaN           f
1    3        NaN           g
2    4        NaN           h
3    5        NaN           i
4    6        NaN           j
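Appending along the columns instead uses axis=1; here the two frames are aligned on their index labels (0-4), not on key:

pd.concat([left_frame, right_frame], axis=1)
   key left_value  key right_value
0    0          a    2           f
1    1          b    3           g
2    2          c    4           h
3    3          d    5           i
4    4          e    6           j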

Collapsing Data#

Collapsing (aggregating) data is accomplished with the .groupby() function. We will collapse by Year.

by_year = df.groupby('Year')
print(by_year.count().head()) # NOT NULL records within each column
      State  Table_Data  High_Value  Low_Value  CPI_Average  High_2018  \
Year                                                                     
1968     55          53          54         54           55         54   
1969     55          54          54         54           55         54   
1970     55          54          54         54           55         54   
1971     55          54          54         54           55         54   
1972     55          54          54         54           55         54   

      Low_2018  Difference  High*2  Low*2  
Year                                       
1968        54          54      54     54  
1969        54          54      54     54  
1970        54          54      54     54  
1971        54          54      54     54  
1972        54          54      54     54  

Note: This only generates counts for each year. If we want averages, sums, or any other summary statistics, we have to specify that:

print(by_year.sum(numeric_only=True)[20:25])
print('\n')
print(by_year.mean(numeric_only=True)[20:25])
print('\n')
print(by_year.median(numeric_only=True)[20:25])

Note that numeric_only=True is needed here: the State and Table_Data columns hold strings, and asking for their sum or mean raises TypeError: unsupported operand type(s) for +: 'float' and 'str'.