- Table of contents
- 1. Introduction
- 2. Updating Rows and Columns
- 3. apply() Function
- 4. Add or Remove Rows, Columns, and DataFrames
- 5. Sorting Data
- 6. EDA
- Two fundamental data structures in Pandas: `Series` and `DataFrame`
- One-dimensional array-like: `pd.Series(data, index=index)`
- Two attributes: `data` and `index`
  - data: list, NumPy array, scalar, or dictionary (keys become the index, values become the values)
  - index: can be non-sequential, can contain duplicates, and need not be numeric
import pandas as pd
import numpy as np

# Creating a Series with values [0.1 0.2 0.3] and the default index: RangeIndex(start=0, stop=3, step=1)
# List
data = pd.Series([0.1, 0.2, 0.3])
print(data)
0 0.1
1 0.2
2 0.3
dtype: float64
# NumPy array
print(pd.Series(np.array([11,8,3,1])))
# Scalar
print(pd.Series(100))
print(pd.Series(100, index=range(3)))
0 100
1 100
2 100
dtype: int64
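# Dictionary: keys become the index, values become the values
# (a minimal sketch to round out the constructor list above; the keys/values are illustrative)
print(pd.Series({'a': 1, 'b': 2, 'c': 3}))
# a    1
# b    2
# c    3
# dtype: int64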
# Index Example
val = np.arange(0, 5)
ind = val * 2 # non-sequential index
data = pd.Series(val, index=ind)
print(data)
0 0
2 1
4 2
6 3
8 4
dtype: int32
ind = [3, 9, 1, 10, 3] #duplicate index
data = pd.Series(val, index=ind)
ind = ['a', 'b', 'c', 'd', 'e'] #non-number index
data = pd.Series(val, index=ind)
- Two-dimensional: a row index and a column index
- Each column is a pandas `Series`
- "Holes" in a Series are filled with `NaN` (missing value)
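A minimal sketch of the "holes" behavior (the names s1/s2 and the values are illustrative): building a DataFrame from Series with mismatched indexes fills the gaps with `NaN`.
s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([3.0, 4.0], index=['b', 'c'])
print(pd.DataFrame({'x': s1, 'y': s2}))
#      x    y
# a  1.0  NaN
# b  2.0  3.0
# c  NaN  4.0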
#Example of pd.DataFrame
years = [2000, 2001, 2002, 2003, 2004]
population = {'area_1': [1.1, 1.2, 1.5, 1.7, 2.0],
'area_2': [0.4, 0.5, 0.7, 0.6, 0.7]}
df = pd.DataFrame(population, index = years)
# Example of set_index() function
marks = { 'Names': ['James', 'Andrew', 'Bob', 'Carlson'],
'English': [58, 62, 77, 65],
'Chinese': [55, 32, 64, 80],
'Math': [61, 70, 81, 54]}
df = pd.DataFrame(marks)
df.set_index("Names", inplace = True)
# Example of the "columns" parameter
marks = {'James': [58, 55, 61],
'Andrew': [62, 32, 70],
'Bob': [77, 64, 81],
'Carlson': [65, 80, 54]}
df = pd.DataFrame(list(marks.values()),
                  index=list(marks.keys()),
                  columns=['English', 'Chinese', 'Math'])
#          English  Chinese  Math
# James         58       55    61
# Andrew        62       32    70
# Bob           77       64    81
# Carlson       65       80    54
- Column Index: `df.columns`
- Column Access:
  - Single column: `df[col_name]` or `df.col_name` (attribute access requires no space in col_name)
  - Multiple columns: pass the column names in a list: `df[[col_name1, col_name2, ...]]`
# Single-column access
print(df['area_1'])
print(df.area_2)
# Multiple-column access
print(df[['area_1', 'area_2']])
- Column Addition: `df[new_col] = seq`, where seq can be a list or a pandas Series (see the sketch below)
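A minimal sketch of column addition, reusing the population DataFrame from earlier (the column names 'area_3' and 'total' are illustrative):
df['area_3'] = [0.9, 1.0, 1.1, 1.2, 1.3]   # from a list (one value per row)
df['total'] = df['area_1'] + df['area_2']  # from a pandas Series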
- Row Index:
df.index
- Row Access with .iloc:
  - `.iloc` indexer: integer position-based
  - `df.iloc[row_num]` selects a single row
  - `df.iloc[[row_nums]]` selects multiple rows
  - `df.iloc[start_row_num:end_row_num]` slicing, NOT including the end_row_num
- Row Access with .loc:
  - `.loc` indexer: label-based, which means you specify rows and columns by their row and column labels
  - Note: unlike `.iloc`, the end index IS included, not excluded
  - `df.loc[row_label]` selects a single row
  - `df.loc[[row_labels]]` selects multiple rows
  - `df.loc[start_row_label:end_row_label]` slicing, including the end_row_label (see the sketch below)
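A minimal sketch contrasting the two indexers, using the population DataFrame indexed by years from earlier:
df.iloc[0]         # first row, by integer position
df.iloc[0:2]       # rows at positions 0 and 1 (end excluded)
df.loc[2000]       # row with label 2000
df.loc[2000:2002]  # rows labelled 2000 through 2002 (end included)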
- Select a portion of data using `.iloc[rows, cols]` (integer position-based) & `.loc[rows, cols]` (label-based)
#iloc examples
df.iloc[:3, -1] #first 3 rows of the last column
df.iloc[[0, 2], :3] #rows 0 and 2, first 3 columns
#loc examples:
df.loc[[2000, 2001, 2003], :'area_2'] #rows 2000, 2001 and 2003, columns up to and including 'area_2'
df.loc[2001:2002, 'area_3'] #rows 2001 and 2002 (end included), column 'area_3'
Data selection with Filter Mask
- Filter Mask: `df[filter_mask]`
- Filter Mask with selected columns: `df.loc[filter_mask, [columns]]` (columns is OPTIONAL)
- String membership: `df[df["col"].isin([value1, value2, value3])]`
- String pattern: `df[df["col"].str.contains(r"value1|value2")]`
df[(df['age'] >= 10) & (df['height'] >= 10)]
df.loc[df['height'] >= 10, ['age', 'weight']]
#isin
data[data['Crime Code'].isin(['624', '510', '230'])]
#.str.contains
df[df["Name"].str.contains("Jonkheer")]['Name']
df[df["Name"].str.contains(r'Tyler|Goodwin')] #.str.contains regex
- `columns` and `index`: both are Index objects
  - immutable: you cannot change part of an Index
  - replaceable: replace it with a completely new index, or use `set_index(col_name)` to convert a column into the row index
df.index[0] = 1999 #Error as Index is immutable
df.index = range(len(df)) # len(df) gives the number of rows of DataFrame
df.columns = ['a1', 'a2', 'a3'] #replace with new column labels
df.set_index('year', inplace = True) #remove the col 'year' and set it as the index of df
- Updating rows: always use the `loc`/`iloc` indexer. Otherwise, the change only applies to a copy/view, not the DataFrame itself.
#SettingWithCopyWarning: Change only applies to a copy/view, NOT the df itself
df[df['last name'] == 'Smith']['last name'] = 'Anderson' #have to use .loc/.iloc in this case. See below solution
#Solution:
df.loc[df['last name'] == 'Smith', ['last name']] = 'Anderson' #change will be updated accordingly
- Updating columns:
  - Assign a sequence to the column, or
  - use the `.str` accessor on Series, which provides many string utilities:
    - case conversion: `lower`, `upper`, `title`, `capitalize`
    - checks: `isupper`, `islower`, `isalpha`, `isalnum`
    - search/replace: `find`, `replace`, `contains`
df['last name'] = list(df['last name'])[::-1] #Assign a sequence to the column
#Make use of str class under Series class
df['last name'] = df['last name'].str.lower()
df["Name"].str.contains("Mrs. Martin")
- Syntax: `pandas.Series.apply(func)`; func deals with one item of the Series at a time, not the Series itself
- No parentheses are required when passing the function name
# .apply(provided function)
df['last name'] = df['last name'].apply(str.title)
# .apply(self-defined function)
def change_Josh(fname):
    return 'Joe' if fname == 'Josh' else fname
df['first name'] = df['first name'].apply(change_Josh)
# .apply(lambda function)
import re
p = re.compile(r"\b(\w+)\.")
df["Title"] = df["Name"].apply(lambda s: p.search(s).group(1)) #Apply regex to split the "Title" out of "Name"
- Syntax: `pandas.DataFrame.apply(func)` applies a function along an axis of the DataFrame.
- Axis parameter: by default, axis=0, i.e., the function applies to each column.
  - 0 or 'index': apply the function to each column.
  - 1 or 'columns': apply the function to each row.
#Given below DataFrame
A B
0 4 9
1 4 9
2 4 9
#Apply function np.sqrt to all columns in df
df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
#Apply function np.sum to all rows in df
df.apply(np.sum, axis=1)
0 13
1 13
2 13
`pd.concat([df1, df2, df3])`
- axis = 0 (by default): append the next df by row, similar to `append`
- axis = 1: append the next df by column
  - `join="outer"` (default): take the union of the indexes
  - `join="inner"`: take the intersection
#By default, axis=0 => similar to appending by row
result = pd.concat([df1, df2, df3])
#By default, join="outer" => union of the row indexes
result = pd.concat([df1, df4], axis=1)
#join="inner" => intersection of the row indexes
result = pd.concat([df1, df4], axis=1, join="inner")
- A useful shortcut to `concat()` are the `append()` instance methods on Series and DataFrame. These methods actually predate concat. They concatenate along axis=0, namely the index.
- `df = df.append(df2, ignore_index=True)`: appending a DataFrame to another. The resulting DataFrame must be assigned to the original or a new df, since there is no inplace parameter to apply the change to the original DataFrame.
- ignore_index=True: resets the index
- Note: `append()` was deprecated in pandas 1.4 and removed in pandas 2.0; prefer `pd.concat()` (see the sketch after the examples below).
result = df1.append([df2, df3], ignore_index=True) #Append multiple DataFrames to df1
result = df1.append(df4, sort=False) #Append df1 & df4; since they have different columns, missing values are filled with NaN
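Since `append()` is gone in pandas 2.0+, a minimal `pd.concat` equivalent of the calls above (same DataFrame names as in the examples):
result = pd.concat([df1, df2, df3], ignore_index=True) #replaces df1.append([df2, df3], ignore_index=True)
result = pd.concat([df1, df4], sort=False) #replaces df1.append(df4, sort=False); unmatched columns filled with NaN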
`pandas.Series.drop(index)`
- remove items with the given indices from a Series
`pandas.DataFrame.drop(index, axis=0, inplace=True)`
- remove rows or columns from a DataFrame
- remove columns: axis = 1 or axis = 'columns'
df = df[~(df['gender'] == 'male')] #alternative to drop: keep only the rows where the mask is False
# Series
df['gender'].drop([3, 4]) #drop index =3, and =4 of Series 'gender'
# DataFrame
df.drop([3, 4]) #drop row 3 and row 4
df.drop(df[df['gender'] == 'male'].index) #drop the rows whose index matches the filtered rows
df.drop('year', axis = 1, inplace=True) #drop the col "year"
#Drop rows using ~ and isin
df = df[~df['your column'].isin(['list of strings', 'A'])]
`pandas.DataFrame.sort_values(by=col_names, ascending=True, inplace=True)`
- sorts the DataFrame by the given columns
df.sort_values(by='last name')
df.sort_values(by=['last name', 'first name']) #Sort by Two Cols
df.sort_values(by=['last name', 'first name'], ascending=[True, False])
`pandas.DataFrame.sort_index(inplace=True)`
- sorts the DataFrame by row index
df.sort_index(inplace=True)
df["SibSp"].value_counts().sort_index()
- Check the type of the columns, and convert categorical columns to type `'category'`
- `.astype('category')` converts the data type
df["Survived"] = df["Survived"].astype('category')
- List of Numerical Columns
numerical_col = [c for c in df.columns if df[c].dtype in ('float64', 'int64')]
- NaN or null values can be counted with the `.isna()` or `.isnull()` method.
df["Age"].isnull().sum()
df['Age'].isna().sum()
df.dropna(axis = 1)
- axis = 0 (default): drop rows that contain NaN
- axis = 1: drop columns that contain NaN
df['Age'].fillna(df["Age"].mean())
- fills the missing values with the mean; note that it returns a new Series rather than modifying df (see the sketch below)
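A minimal sketch of persisting the fill, since `fillna` returns a new Series by default:
df['Age'] = df['Age'].fillna(df['Age'].mean()) #assign back so the change is stored in the DataFrame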
- Need to understand how many unique values are in a column
df["Ticket"].nunique() #681
- `.value_counts(normalize=True, sort=False)` can help us find out the distribution of the values in the column
- `.describe()` helps to understand the overall distribution
- `.quantile(np.arange(1,11)*0.1)` gives the quantile information; the parameter should be between 0 and 1
- `.plot.hist()` plots a histogram of the distribution
df["Survived"].value_counts()
df["Fare"].describe()
df["Age"].quantile(np.arange(1,11)*0.1) # np.arange(1, 11) gives an np.array containing 1 to 10.
#0.1 14.0
#0.2 19.0 the densest area in the distribution is between 19 and 31
#0.3 22.0
#0.4 25.0
#0.5 28.0
#0.6 31.8
#0.7 36.0
#0.8 41.0
#0.9 50.0
#1.0 80.0
df["Age"].plot.hist() # plot a histogram
- `df.groupby()` creates groups within a DataFrame by a key, i.e., the DataFrame is separated into groups according to the values of the key.
- `__str__()` doesn't give you much information, as `groupby()` is lazy in nature. It doesn't really do any operations to produce a useful result until you say so.
- Various methods of `.groupby()`:
  - Meta methods & properties: `.get_group(value)` returns the rows belonging to one group
  - Aggregation methods: `.value_counts(normalize=True, sort=False)` and the other aggregation methods shown in the table below can be applied to the grouped dataset
  - The `agg()` method allows us to apply multiple such methods at once, e.g. `grouped_by_df.agg(['count', 'mean', 'median', 'std', 'max', 'min'])`

| Aggregation Method | Description |
| --- | --- |
| count | Number of non-NA values in the group |
| sum | Sum of non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| std, var | Sample standard deviation and variance |
| min, max | Minimum and maximum of non-NA values |
by_state = df.groupby("state")
# Example of __str__()
print(by_state) #__str__() doesn’t give you much information
#<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>
# Example of .get_group()
by_state.get_group("PA")
# last_name first_name birthday gender type state party
# 4 Clymer George 1739-03-16 M rep PA NaN
#19 Maclay William 1737-07-20 M sen PA Anti-Administration
#21 Morris Robert 1734-01-20 M sen PA Pro-Administration
- For Categorical Columns Analysis: `.value_counts()`
gend = df.groupby(by='Survived')['Sex'].value_counts(sort=False)
pclass_df = df.groupby(by='Survived')["Pclass"].value_counts(sort=False)
pclass_pct_df = df.groupby(by='Survived')["Pclass"].value_counts(sort=False, normalize=True)
pd.concat([pclass_df, pclass_pct_df], axis = 1)
- For Numerical Columns Analysis: `agg()`
fare_gp = df.groupby(by="Survived")["Fare"]
fare_gp.mean() #Calculate the mean of fare for each group in "Survived" col
#Survived
#0 22.117887
#1 48.395408
fare_gp.agg(['count', 'mean', 'median', 'std', 'max', 'min'])
`pd.cut(col, bins)`
- bins values into discrete intervals
- For example, bucketize the values of `df['Fare']` into the following intervals: (0, 20] < (20, 100] < (100, 513]
buckets = [0,20,100, 513]
df['Fare_buckets'] = pd.cut(df['Fare'], buckets)
PassengerId
1 (0, 20]
2 (20, 100]
3 (100, 513]
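A possible follow-up, reusing `.value_counts()` from earlier, to count how many rows fall into each bucket:
df['Fare_buckets'].value_counts(sort=False) #one count per interval, in interval order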