Intro to Pandas
While built-in Python functions suffice for general programming, data analysis requires additional functions and objects. Pandas is a popular data analysis package that is simple to use from the moment you start Python.
The primary reason Pandas is popular for data science and analysis is that it introduces three useful objects that mirror similar data structures from other statistical packages and don’t add complexity to the simple Pythonic syntax. These objects are:
- The DataFrame
- The Series
- The Index
This guide is not meant to be comprehensive; the most complete documentation for Pandas is found here. The purpose of this guide is to introduce a potential user to the functionality of Pandas and give a general overview of how to accomplish basic data analysis tasks.
Getting Started
Importing Pandas
Since Pandas is not native to Python, you will have to install it. Fortunately, Pandas has grown so much in popularity that most distributions (including Anaconda) already include it, but you will still have to load the package. This code will do the trick:

import pandas
However, because Pandas is used so often, the Python community loads pandas as pd for ease of access:
import pandas as pd
It’s also useful to import a few other packages, most notably Numpy.
import numpy as np
DataFrame
The DataFrame object is one of the main contributions of Pandas. The DataFrame is a 2-dimensional labeled structure with columns and rows; the columns and the rows each represent one dimension. The DataFrame is analogous to a data frame in R, a dataset in Stata, or an Excel spreadsheet. Or, put in more technical terms, the DataFrame is a tabular data structure. The code below defines a DataFrame and then prints it out to view:
d = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]})
d
 | one | two |
---|---|---|
0 | 1.0 | 4.0 |
1 | 2.0 | 3.0 |
2 | 3.0 | 2.0 |
3 | 4.0 | 1.0 |
The name of our DataFrame is d, and it has two columns, one and two, and four rows, 0-3. (Python is 0-indexed, meaning it counts from 0…1…2…and so on.)
Each data point can be referenced through its Index (which corresponds to its row, the leftmost value), and the column signifies what the value means. We can call a single Pandas column with this command:
d['one']
0 1.0
1 2.0
2 3.0
3 4.0
Name: one, dtype: float64
The values on the left represent the Index we saw earlier. Notice: a single column is itself a Pandas object, the Series. More on this object below.
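You can confirm this by checking the column's type; a quick sketch using the DataFrame d from above:

# A single column of a DataFrame is a Series
print(type(d['one']))  # <class 'pandas.core.series.Series'>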
Series
A Series is a one-dimensional indexed array that can contain any Python data type. To create a Series, use the pd.Series() function:
series_ex = pd.Series([1,2,3,4])
A Series in Pandas is visually similar to a list, but there are key distinctions in how they operate. As mentioned, Pandas is built for data analysis, so a Series has functions that allow data analysis to be done easily, while a list would require either a for loop or a list comprehension for the same operations. An example of this is below:
series_ex = series_ex*2
print("This is a Series multiplied by two")
print(series_ex)
This is a Series multiplied by two
0 2
1 4
2 6
3 8
dtype: int64
alist = [1,2,3,4]
alist = alist*2
print("This is a List multiplied by two")
print(alist)
This is a List multiplied by two
[1, 2, 3, 4, 1, 2, 3, 4]
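For comparison, here is a minimal sketch of what the same element-wise operation looks like with a plain list and a list comprehension:

# Element-wise doubling of a list requires a comprehension (or a for loop)
alist = [1, 2, 3, 4]
print([x * 2 for x in alist])  # [2, 4, 6, 8]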
Index
Both the Series and the DataFrame have an Index that signifies order and allows for referencing specific points. The Index itself is an object.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
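Since the Index is array-like, specific points can be referenced by position; a brief sketch using the Index above:

# Positional lookup and slicing on an Index
print(ind[1])    # 3
print(ind[1:4])  # Int64Index([3, 5, 7], dtype='int64')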
Importing Data
Pandas can read and export multiple data formats: CSV, DTA, SAS files, JSON files, SQL, and many more. Almost all data reading follows the format:
pd.read_(NAME OF DATA FORMAT)('FILE_NAME')
Let’s take a look at a few examples.
Stata Files (DTA)
# Importing data from Stata
pd.read_stata('data/State_ETO_short.dta')
 | state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat |
---|---|---|---|---|---|---|---|
0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
This is how we read in a DataFrame, but we still need to store and name it. This can be accomplished in one line:
df = pd.read_stata('data/State_ETO_short.dta')
Now, when we call df, we’ll get the DataFrame that corresponds to the data referenced in pd.read_stata('data/State_ETO_short.dta').
df
 | state_abbv | ETO | ETW | ETOW | ET_cat | ET_Work_cat | ET_ET_Work_cat |
---|---|---|---|---|---|---|---|
0 | VA | 2.0 | 62.5 | 64.5 | 0 - 5% | 30% < | 30% < |
1 | TN | 5.5 | 20.8 | 26.3 | 5 - 10% | 10 - 30% | 20 - 30% |
2 | VT | 19.9 | 0.7 | 20.6 | 15 - 20% | 0 - 5% | 20 - 30% |
3 | ID | 6.0 | 11.4 | 17.4 | 5 - 10% | 10 - 30% | 10 - 20% |
4 | OH | 1.9 | 15.0 | 16.9 | 0 - 5% | 10 - 30% | 10 - 20% |
Excel Files (XLSX)
(Note: You may need to pip install openpyxl for this to work!)
# Importing data from excel into pandas
df = pd.read_excel('data/Minimum_Wage_Data_Short.xlsx')
df
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25000 | 1.00000 | 34.783333 | 9.00 | 7.20 |
Other Data Types
Similar variants exist for each common data type: CSV, SAS, and so on.
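For example, a CSV read follows the same pattern (the file name here is hypothetical):

# Importing data from a CSV
df_csv = pd.read_csv('data/example.csv')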
Exporting Data
Exporting data is very simple as well and follows a pattern similar to importing.
CSV
df.to_csv('exports/Minimum_Wage_Data.csv')
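The same pattern extends to other formats. For example, a sketch of an Excel export (this output file is hypothetical, and openpyxl may be required):

# Exporting to Excel; index=False omits the row index from the file
df.to_excel('exports/Minimum_Wage_Data.xlsx', index=False)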
Converting Data
Now that we know how to load in data, it will be useful to examine ways to convert already existing data structures into DataFrames.
List
my_list = [1,2,3,4,5,6,7,8,9]
columns = ['a', 'b', 'c']

pd.DataFrame(np.array(my_list).reshape(3,3), columns=columns)
 | a | b | c |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
An important thing to note: pd.DataFrame has a multitude of options. The most important of these is the first argument, which is the data to be transformed. Here we transform the list [1,2,3,4,5,6,7,8,9] into an np.array with the shape:

[[1,2,3],
[4,5,6],
[7,8,9]]
Then we transform the data to a Pandas DataFrame which gives us:
0 1 2 3
1 4 5 6
2 7 8 9
Finally, we add a list of column names with the option columns = columns to get the final DataFrame.

a b c
0 1 2 3
1 4 5 6
2 7 8 9
Dictionary
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
 | col_1 | col_2 |
---|---|---|
0 | 3 | a |
1 | 2 | b |
2 | 1 | c |
3 | 0 | d |
Numpy Array
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = np.zeros(5, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]

pd.DataFrame(values, index=index).head()
 | Col1 | Col2 | Col3 |
---|---|---|---|
Row1 | 0 | 0.0 | 0.0 |
Row2 | 0 | 0.0 | 0.0 |
Row3 | 0 | 0.0 | 0.0 |
Row4 | 0 | 0.0 | 0.0 |
Row5 | 0 | 0.0 | 0.0 |
Now that we know how to load in our DataFrame, we will try to view and manipulate our data before our analysis is run.
Viewing Data
We know that we can view our data by simply printing the DataFrame, but what if our data is too large? The .head() function prints out the first 5 rows (inside the parentheses you can specify the first N observations you want to see).
df.head()
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 |
df.head(2)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.0 | 0.0 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.1 | 2.1 | 34.783333 | 15.12 | 15.12 |
You can also do all of these methods on a single Series.
df['Year'].head()
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
Or, you can view multiple columns at one time.
df[['State', 'Table_Data']].head()
 | State | Table_Data |
---|---|---|
0 | Alabama | NaN |
1 | Alaska | 2.1 |
2 | Arizona | 18.72 - 26.40/wk(b) |
3 | Arkansas | 1.25/day(b) |
4 | California | 1.65(b) |
It’s also good to view general information on the DataFrame with .info(), and to understand the datatypes of each of the columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 6 non-null int64
1 State 6 non-null object
2 Table_Data 5 non-null object
3 High_Value 6 non-null float64
4 Low_Value 6 non-null float64
5 CPI_Average 6 non-null float64
6 High_2018 6 non-null float64
7 Low_2018 6 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 512.0+ bytes
df.dtypes
Year int64
State object
Table_Data object
High_Value float64
Low_Value float64
CPI_Average float64
High_2018 float64
Low_2018 float64
dtype: object
Similar to our use of head(), we can use tail() to examine the last few data points.
df.tail(2)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
4 | 1968 | California | 1.65(b) | 1.65 | 1.65 | 34.783333 | 11.88 | 11.88 |
5 | 1968 | Colorado | 1.00 - 1.25(b) | 1.25 | 1.00 | 34.783333 | 9.00 | 7.20 |
Subsetting Data
Slicing the data by rows
We can do traditional slicing through the index with df[start:end]; note that the end point is exclusive. This will subset by rows.
df[1:4]
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 |
Slicing the data by columns
To subset by columns, we can use df[['columns_we_want_1', 'columns_we_want_2']].
columns_you_want = ['State', 'Table_Data']
df[columns_you_want].head()
 | State | Table_Data |
---|---|---|
0 | Alabama | NaN |
1 | Alaska | 2.1 |
2 | Arizona | 18.72 - 26.40/wk(b) |
3 | Arkansas | 1.25/day(b) |
4 | California | 1.65(b) |
Let’s bring in a larger dataset for this analysis to be meaningful. This will be a panel dataset of states, with one observation per state for every year.
df = pd.read_excel("data/Minimum_Wage_Data.xlsx")
print("These are the first five observations: \n")
print(df[['Year', 'State']].head(5))
print("\n")
print("These are the last five observations: \n")
print(df[['Year', 'State']].tail(5))
These are the first five observations:
Year State
0 1968 Alabama
1 1968 Alaska
2 1968 Arizona
3 1968 Arkansas
4 1968 California
These are the last five observations:
Year State
2745 2017 Virginia
2746 2017 Washington
2747 2017 West Virginia
2748 2017 Wisconsin
2749 2017 Wyoming
Viewing the size and columns of the data:
print(len(df))
print(df.columns)
2750
Index(['Year', 'State', 'Table_Data', 'High_Value', 'Low_Value', 'CPI_Average',
'High_2018', 'Low_2018'],
dtype='object')
We have 8 variables: Year, State, Table_Data, High_Value, Low_Value, CPI_Average, High_2018, and Low_2018. Sometimes in our analysis we only want to keep certain years. For this, traditional boolean logic mixed with Pandas slices the data into the segments we want.
Slicing based on conditions
# Rows past 2015
df[df.Year > 2015].head(3)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
2640 | 2016 | Alabama | ... | 0.00 | 0.00 | 240.007167 | 0.00 | 0.00 |
2641 | 2016 | Alaska | 9.75 | 9.75 | 9.75 | 240.007167 | 10.17 | 10.17 |
2642 | 2016 | Arizona | 8.05 | 8.05 | 8.05 | 240.007167 | 8.40 | 8.40 |
Segmenting on multiple conditions is also desirable. Most programs allow for an “AND” operator and an “OR” operator; in Pandas, these are & and |.
# California AND 2010
print(df[(df.Year == 2010) & (df.State == 'California')].head(3))
print('\n')
Year State Table_Data High_Value Low_Value CPI_Average \
2314 2010 California 8 8.0 8.0 218.0555
High_2018 Low_2018
2314 9.19 9.19
# Alabama OR before 2015
df[(df.State == "Alabama") | (df.Year < 2015)].head(3)
 | Year | State | Table_Data | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00 | 0.000 | 34.783333 | 0.00 | 0.00 |
1 | 1968 | Alaska | 2.1 | 2.10 | 2.100 | 34.783333 | 15.12 | 15.12 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66 | 0.468 | 34.783333 | 4.75 | 3.37 |
The traditional Python index slicers are also applicable to DataFrames with the .iloc[] method, which selects rows by their integer position.
print(df.iloc[99])
print('\n')
print(df.iloc[[1, 50, 300]])
Year 1969
State South Dakota
Table_Data 17.00 - 20.00/wk
High_Value 0.5
Low_Value 0.425
CPI_Average 36.683333
High_2018 3.41
Low_2018 2.9
Name: 99, dtype: object
Year State Table_Data High_Value Low_Value CPI_Average \
1 1968 Alaska 2.1 2.1 2.10 34.783333
50 1968 Virginia ... 0.0 0.00 34.783333
300 1973 Minnesota .75 - 1.60 1.6 0.75 44.400000
High_2018 Low_2018
1 15.12 15.12
50 0.00 0.00
300 9.02 4.23
The .loc[] method is also highly useful for subsetting rows and columns at the same time.
df.loc[1:3, 'State']
1 Alaska
2 Arizona
3 Arkansas
Name: State, dtype: object
df.loc[:2, ['Year', 'State']]
 | Year | State |
---|---|---|
0 | 1968 | Alabama |
1 | 1968 | Alaska |
2 | 1968 | Arizona |
Manipulating Data
Summary Statistics
Exporting summary stats in a different document type can help visualize results and understand the data. Also note the different steps we can take to print summary stats in a more visually intuitive way.
df.describe()
 | Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|
count | 2750.000000 | 2739.000000 | 2739.000000 | 2750.000000 | 2739.000000 | 2739.000000 |
mean | 1992.500000 | 3.653761 | 3.533555 | 138.828983 | 6.441486 | 6.200252 |
std | 14.433494 | 2.560308 | 2.539424 | 65.823807 | 3.025140 | 3.017818 |
min | 1968.000000 | 0.000000 | 0.000000 | 34.783333 | 0.000000 | 0.000000 |
25% | 1980.000000 | 1.600000 | 1.600000 | 82.408333 | 5.980000 | 5.230000 |
50% | 1992.500000 | 3.350000 | 3.350000 | 142.387500 | 7.370000 | 7.170000 |
75% | 2005.000000 | 5.150000 | 5.150000 | 195.291667 | 8.280000 | 8.070000 |
max | 2017.000000 | 11.500000 | 11.500000 | 245.119583 | 15.120000 | 15.120000 |
np.round(df.describe(), 2)
 | Year | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 |
---|---|---|---|---|---|---|
count | 2750.00 | 2739.00 | 2739.00 | 2750.00 | 2739.00 | 2739.00 |
mean | 1992.50 | 3.65 | 3.53 | 138.83 | 6.44 | 6.20 |
std | 14.43 | 2.56 | 2.54 | 65.82 | 3.03 | 3.02 |
min | 1968.00 | 0.00 | 0.00 | 34.78 | 0.00 | 0.00 |
25% | 1980.00 | 1.60 | 1.60 | 82.41 | 5.98 | 5.23 |
50% | 1992.50 | 3.35 | 3.35 | 142.39 | 7.37 | 7.17 |
75% | 2005.00 | 5.15 | 5.15 | 195.29 | 8.28 | 8.07 |
max | 2017.00 | 11.50 | 11.50 | 245.12 | 15.12 | 15.12 |
print(np.round(df.describe(), 2).T)
df.describe().transpose().to_csv('exports/summary_stats.csv', sep=',')
count mean std min 25% 50% 75% \
Year 2750.0 1992.50 14.43 1968.00 1980.00 1992.50 2005.00
High_Value 2739.0 3.65 2.56 0.00 1.60 3.35 5.15
Low_Value 2739.0 3.53 2.54 0.00 1.60 3.35 5.15
CPI_Average 2750.0 138.83 65.82 34.78 82.41 142.39 195.29
High_2018 2739.0 6.44 3.03 0.00 5.98 7.37 8.28
Low_2018 2739.0 6.20 3.02 0.00 5.23 7.17 8.07
max
Year 2017.00
High_Value 11.50
Low_Value 11.50
CPI_Average 245.12
High_2018 15.12
Low_2018 15.12
Creating a new variable
The standard way to create a new Pandas column or replace an old one is to call the Series (whether it exists or not) on the left hand side and set it equal to the expression for the value you want to create. For example:
df['7'] = 7
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 7
1 15.12 15.12 7
2 4.75 3.37 7
3 1.12 1.12 7
4 11.88 11.88 7
Or, we can replace an old variable in a similar way.
Replacing a variable
df['7'] = 8
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 7
0 0.00 0.00 8
1 15.12 15.12 8
2 4.75 3.37 8
3 1.12 1.12 8
4 11.88 11.88 8
We can also rename variables. In this case we will rename Year to Date and Table_Data to Values.
Renaming Variables
={"Year": "Date", "Table_Data": "Values"}).head() df.rename(columns
 | Date | State | Values | High_Value | Low_Value | CPI_Average | High_2018 | Low_2018 | 7 |
---|---|---|---|---|---|---|---|---|---|
0 | 1968 | Alabama | NaN | 0.00000 | 0.00000 | 34.783333 | 0.00 | 0.00 | 8 |
1 | 1968 | Alaska | 2.1 | 2.10000 | 2.10000 | 34.783333 | 15.12 | 15.12 | 8 |
2 | 1968 | Arizona | 18.72 - 26.40/wk(b) | 0.66000 | 0.46800 | 34.783333 | 4.75 | 3.37 | 8 |
3 | 1968 | Arkansas | 1.25/day(b) | 0.15625 | 0.15625 | 34.783333 | 1.12 | 1.12 | 8 |
4 | 1968 | California | 1.65(b) | 1.65000 | 1.65000 | 34.783333 | 11.88 | 11.88 | 8 |
Dropping Variables
We can also drop a variable with df.drop(). This is one of many functions where you’ll need to specify an axis: axis=0 indicates that you want to apply the operation along the rows, whereas axis=1 indicates that you want to apply it to a column.
# Dropping a variable
= df.drop("7", axis=1)
df print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018
0 0.00 0.00
1 15.12 15.12
2 4.75 3.37
3 1.12 1.12
4 11.88 11.88
Arithmetic Operations
Basic math operations are also easily applied in Pandas.
df['Difference'] = df['High_Value'] - df['Low_Value']
df['High*2'] = df['High_Value']*2
df['Low*2'] = df['Low_Value']*2
print(df.head())
Year State Table_Data High_Value Low_Value CPI_Average \
0 1968 Alabama NaN 0.00000 0.00000 34.783333
1 1968 Alaska 2.1 2.10000 2.10000 34.783333
2 1968 Arizona 18.72 - 26.40/wk(b) 0.66000 0.46800 34.783333
3 1968 Arkansas 1.25/day(b) 0.15625 0.15625 34.783333
4 1968 California 1.65(b) 1.65000 1.65000 34.783333
High_2018 Low_2018 Difference High*2 Low*2
0 0.00 0.00 0.000 0.0000 0.0000
1 15.12 15.12 0.000 4.2000 4.2000
2 4.75 3.37 0.192 1.3200 0.9360
3 1.12 1.12 0.000 0.3125 0.3125
4 11.88 11.88 0.000 3.3000 3.3000
Complex Operations with .map()
More complex operations can be handled through the .map() method. In the example below, .map() uses a dictionary to change values, which can help keep code clean when replacing a large number of values. Here we create an abbrev column for state abbreviations (we will create it and drop the variable after).
# Using Data.map
state_2 = {'Ohio': 'OH', 'Illinois': 'IL', 'California': 'CA', 'Texas': 'TX', 'Alabama': 'AL'}

df['abbrev'] = df['State'].map(state_2)
df[['State', 'abbrev']].head()
 | State | abbrev |
---|---|---|
0 | Alabama | AL |
1 | Alaska | NaN |
2 | Arizona | NaN |
3 | Arkansas | NaN |
4 | California | CA |
# Another simpler map example
alist = [1,2,3,4]
def function_to_be_mapped(x):
    return x+3
result = map(function_to_be_mapped, alist)
print(list(result))
[4, 5, 6, 7]
# Sort by the new column, then drop it
df.sort_values('abbrev', ascending=True).head()
df = df.drop("abbrev", axis=1)
Using Functions
You can use .apply() to apply a function to a Series. You can either specify the function or use the anonymous lambda function.

Specifying a function:
def add(x):
    x = x + 1
    return x
print(df['Year'].head())
print("\n")
print(df['Year'].apply(add).head())
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Lambda Functions
Lambda functions let you skip defining a named function when the operation is simple enough:
print(df['Year'].head())
print("\n")
print((df['Year'].apply(lambda x: x + 1).head()))
0 1968
1 1968
2 1968
3 1968
4 1968
Name: Year, dtype: int64
0 1969
1 1969
2 1969
3 1969
4 1969
Name: Year, dtype: int64
Missing Values
Dealing with missing values is an important part of data cleaning and something to always be mindful of. Pandas codes missing values as numpy NaN values.
print(df['Table_Data'].head(10))
print("\n")
print(df['Year'].head(10))
0 NaN
1 2.1
2 18.72 - 26.40/wk(b)
3 1.25/day(b)
4 1.65(b)
5 1.00 - 1.25(b)
6 1.4
7 1.25
8 1.25 - 1.40
9 $1.15 & $1.60
Name: Table_Data, dtype: object
0 1968
1 1968
2 1968
3 1968
4 1968
5 1968
6 1968
7 1968
8 1968
9 1968
Name: Year, dtype: int64
The following syntax can help identify missing values:
print(df['Table_Data'].isnull().values.any())
print(df['Year'].isnull().values.any())
True
False
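Once identified, missing values can be dropped or filled. A minimal sketch using the standard Pandas methods (the fill value 0 is arbitrary):

# Drop every row that contains at least one missing value
df_complete = df.dropna()
# Or fill missing values with a constant instead
df_filled = df.fillna(0)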
Duplicate Values
We can use .drop_duplicates() to drop duplicate observations: keep="first" keeps only the first occurrence, while keep="last" (used below) keeps only the last.
data = {'col_1': [3, 3, 3, 3], 'col_2': ['a', 'a', 'a', 'a']}
data_dup = pd.DataFrame.from_dict(data)
print(data_dup)
print("\n")
data_dup = data_dup.drop_duplicates(keep="last")
print(data_dup)
col_1 col_2
0 3 a
1 3 a
2 3 a
3 3 a
col_1 col_2
3 3 a
Reshaping Data
Data can be reshaped using a variety of Pandas functions. Going from wide to long has a built-in function, wide_to_long(). We will load in a new dataset for this:
df_ex = pd.DataFrame({
    'Unique Family Identifier': [1000, 1000, 1000, 1001, 1001, 1001, 1002, 324, 234],
    'Order': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'az1': [28, 82, 23, 234, 234, 324, 546, 546, 5464],
    'az2': [2342, 2342, 54645, 56765, 65765, 65756, 3453, 56756, 3453]})
reshape = pd.wide_to_long(df_ex, stubnames='az', i=['Unique Family Identifier', 'Order'], j='age')

df_ex.head()
 | Unique Family Identifier | Order | az1 | az2 |
---|---|---|---|---|
0 | 1000 | 1 | 28 | 2342 |
1 | 1000 | 2 | 82 | 2342 |
2 | 1000 | 3 | 23 | 54645 |
3 | 1001 | 1 | 234 | 56765 |
4 | 1001 | 2 | 234 | 65765 |
reshape
Unique Family Identifier | Order | age | az |
---|---|---|---|
1000 | 1 | 1 | 28 |
 | | 2 | 2342 |
 | 2 | 1 | 82 |
 | | 2 | 2342 |
 | 3 | 1 | 23 |
 | | 2 | 54645 |
1001 | 1 | 1 | 234 |
 | | 2 | 56765 |
 | 2 | 1 | 234 |
 | | 2 | 65765 |
 | 3 | 1 | 324 |
 | | 2 | 65756 |
1002 | 1 | 1 | 546 |
 | | 2 | 3453 |
324 | 2 | 1 | 546 |
 | | 2 | 56756 |
234 | 3 | 1 | 5464 |
 | | 2 | 3453 |
Going from long to wide requires the use of unstack().
normal = reshape.unstack()
normal.columns = normal.columns.map('{0[0]}{0[1]}'.format)

normal.reset_index()
normal.head()
Unique Family Identifier | Order | az1 | az2 |
---|---|---|---|
234 | 3 | 5464 | 3453 |
324 | 2 | 546 | 56756 |
1000 | 1 | 28 | 2342 |
 | 2 | 82 | 2342 |
 | 3 | 23 | 54645 |
Merging Data
Merging data uses pd.merge(data_1, data_2, on=identifier).
left_frame = pd.DataFrame({'key': range(5),
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
                            'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)
key left_value
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
key right_value
0 2 f
1 3 g
2 4 h
3 5 i
4 6 j
pd.merge(left_frame, right_frame, on='key', how='inner')
 | key | left_value | right_value |
---|---|---|---|
0 | 2 | c | f |
1 | 3 | d | g |
2 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='left')
 | key | left_value | right_value |
---|---|---|---|
0 | 0 | a | NaN |
1 | 1 | b | NaN |
2 | 2 | c | f |
3 | 3 | d | g |
4 | 4 | e | h |
pd.merge(left_frame, right_frame, on='key', how='right')
 | key | left_value | right_value |
---|---|---|---|
0 | 2 | c | f |
1 | 3 | d | g |
2 | 4 | e | h |
3 | 5 | NaN | i |
4 | 6 | NaN | j |
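For completeness, an outer join keeps the union of keys from both frames and fills the gaps with NaN; a quick sketch:

# Outer join: keep all keys from both frames
pd.merge(left_frame, right_frame, on='key', how='outer')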
Appending Data
Note: Here, axis=0 is implied as the default, so it will append along the rows. But columns or Series can also be appended to each other by specifying axis=1 (provided that they are the same length).
pd.concat([left_frame, right_frame], sort=True)
 | key | left_value | right_value |
---|---|---|---|
0 | 0 | a | NaN |
1 | 1 | b | NaN |
2 | 2 | c | NaN |
3 | 3 | d | NaN |
4 | 4 | e | NaN |
0 | 2 | NaN | f |
1 | 3 | NaN | g |
2 | 4 | NaN | h |
3 | 5 | NaN | i |
4 | 6 | NaN | j |
Collapsing Data
Collapsing data is accomplished with the .groupby() function. We will collapse by Year.
by_year = df.groupby('Year')
print(by_year.count().head())  # non-null records within each column
State Table_Data High_Value Low_Value CPI_Average High_2018 \
Year
1968 55 53 54 54 55 54
1969 55 54 54 54 55 54
1970 55 54 54 54 55 54
1971 55 54 54 54 55 54
1972 55 54 54 54 55 54
Low_2018 Difference High*2 Low*2
Year
1968 54 54 54 54
1969 54 54 54 54
1970 54 54 54 54
1971 54 54 54 54
1972 54 54 54 54
Note: This only generates counts for each year. If we want averages, sums, or any other summary statistics, we have to specify that:
print(by_year.sum()[20:25])
print('\n')
print(by_year.mean()[20:25])
print('\n')
print(by_year.median()[20:25])
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 146.75 142.75 6504.208331 310.65 302.19 4.0
1989 146.75 142.75 6818.166669 296.49 288.41 4.0
1990 146.75 142.75 7186.208332 281.27 273.61 4.0
1991 179.20 175.00 7490.541669 329.52 321.80 4.2
1992 194.17 189.57 7717.416669 346.40 338.19 4.6
High*2 Low*2
Year
1988 293.50 285.50
1989 293.50 285.50
1990 293.50 285.50
1991 358.40 350.00
1992 388.34 379.14
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 2.668182 2.595455 118.258333 5.648182 5.494364 0.072727
1989 2.668182 2.595455 123.966667 5.390727 5.243818 0.072727
1990 2.668182 2.595455 130.658333 5.114000 4.974727 0.072727
1991 3.258182 3.181818 136.191667 5.991273 5.850909 0.076364
1992 3.530364 3.446727 140.316667 6.298182 6.148909 0.083636
High*2 Low*2
Year
1988 5.336364 5.190909
1989 5.336364 5.190909
1990 5.336364 5.190909
1991 6.516364 6.363636
1992 7.060727 6.893455
High_Value Low_Value CPI_Average High_2018 Low_2018 Difference \
Year
1988 3.35 3.35 118.258333 7.09 7.09 0.0
1989 3.35 3.35 123.966667 6.77 6.77 0.0
1990 3.35 3.35 130.658333 6.42 6.42 0.0
1991 3.80 3.80 136.191667 6.99 6.99 0.0
1992 4.25 4.25 140.316667 7.58 7.58 0.0
High*2 Low*2
Year
1988 6.7 6.7
1989 6.7 6.7
1990 6.7 6.7
1991 7.6 7.6
1992 8.5 8.5
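A grouped object can also compute several of these statistics in one pass with .agg(); a brief sketch using two of the columns above:

# Mean and median of two columns for each year, in a single call
print(by_year[['High_2018', 'Low_2018']].agg(['mean', 'median']).head())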