Getting Started with pandas

Pandas

pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. It is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops. The biggest difference is that pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited for homogeneous numerical array data.

import numpy as np  # used by several examples below
import pandas as pd

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data :

In [11]: obj = pd.Series([4, 7, -5, 3])
In [12]: obj

0 4
1 7
2 -5
3 3
dtype: int64

You can get the array representation and index object of the Series via its values and index attributes, respectively :

In [13]: obj.values

array([ 4, 7, -5, 3])

In [14]: obj.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label :

In [15]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [16]: obj2

d 4
b 7
a -5
c 3
dtype: int64

Compared with NumPy arrays, a Series lets you use labels in the index when selecting single values or a set of values :

In [18]: obj2['a']

-5
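
You can also select a set of values by passing a list of labels, and assign by label in the same way. A small sketch continuing with obj2, following the source example (the assignment to 'd' is why it shows the value 6 in the filtering example below) :

obj2['d'] = 6            # assign a value by label
obj2[['c', 'a', 'd']]    # select a set of values with a list of labels
# c    3
# a   -5
# d    6
# dtype: int64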

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link :

In [21]: obj2[obj2 > 0]

d 6
b 7
c 3
dtype: int64
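
Scalar multiplication and NumPy math functions likewise keep the link between labels and values; a quick sketch on the same obj2 :

obj2 * 2        # scalar multiplication preserves the index
# d    12
# b    14
# a   -10
# c     6
# dtype: int64

np.exp(obj2)    # NumPy ufuncs also return a Series with the same index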

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict :

In [24]: 'b' in obj2

True

Should you have data contained in a Python dict, you can create a Series from it by passing the dict :

In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [27]: obj3 = pd.Series(sdata)
In [28]: obj3

Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64

The isnull and notnull functions in pandas are used to detect missing data :

In [32]: pd.isnull(obj3)

Ohio False
Oregon False
Texas False
Utah False
dtype: bool

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations :

In [37]: obj3 + obj3

Ohio 70000
Oregon 32000
Texas 142000
Utah 10000
dtype: int64
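
Adding obj3 and obj4 (from the sketch above) makes the alignment clearer: labels present in only one of the two Series come out as NaN :

obj3 + obj4
# California         NaN
# Ohio           70000.0
# Oregon         32000.0
# Texas         142000.0
# Utah               NaN
# dtype: float64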

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality :

In [38]: obj3.name = 'population'
In [39]: obj3.index.name = 'state'
In [40]: obj3

state
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
Name: population, dtype: int64

DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays :

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and here the columns appear in sorted order (older pandas versions sorted the dict keys; recent versions preserve the dict’s insertion order) :

In [45]: frame

pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003

For large DataFrames, the head method selects only the first five rows :

In [46]: frame.head()

pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002

If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order :

In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])

year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
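
The examples that follow use frame2, built from the same data with an explicit column order and row labels; a sketch of that construction, with the names and order inferred from the output below :

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
# 'debt' is not a key in data, so it starts out as a column of NaN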

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute :

In [51]: frame2['state']
In [52]: frame2.year

Columns can be modified by assignment. For example, the empty ‘debt’ column could be assigned a scalar value or an array of values :

In [54]: frame2['debt'] = 16.5
In [55]: frame2

year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
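
When a Series is assigned to a column, its values are aligned on frame2's index and any holes become NaN. The debt values in the next output reflect such an assignment; a sketch of that intermediate step (the name val is used here for illustration) :

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val   # rows 'one', 'three' and 'six' get NaN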

Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.

In [61]: frame2['eastern'] = frame2.state == 'Ohio'
In [62]: frame2

year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False

The del method can then be used to remove this column :

In [63]: del frame2['eastern']
In [64]: frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dict of dicts :

In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
   ....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices :

In [66]: frame3 = pd.DataFrame(pop)
In [67]: frame3

Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
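
You can transpose the result with NumPy-like syntax, swapping rows and columns :

frame3.T
#         2000  2001  2002
# Nevada   NaN   2.4   2.9
# Ohio     1.5   1.7   3.6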

Possible data inputs to DataFrame constructor

Type Notes
2D ndarray A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array Treated as the “dict of arrays” case
dict of Series Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed
dict of dicts Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case
List of dicts or Series Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels
List of lists or tuples Treated as the “2D ndarray” case
Another DataFrame The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result

Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index :

In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [77]: index = obj.index
In [78]: index

Index(['a', 'b', 'c'], dtype='object')

In [79]: index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user. In addition to being array-like, an Index also behaves like a fixed-size set. Unlike Python sets, a pandas Index can contain duplicate labels. Selections with duplicate labels will select all occurrences of that label. Some index methods and properties :

Method Description
append Concatenate with additional Index objects, producing a new Index
difference Compute set difference as an Index
intersection Compute set intersection
union Compute set union
isin Compute boolean array indicating whether each value is contained in the passed collection
delete Compute new Index with element at index i deleted
drop Compute new Index by deleting passed values
insert Compute new Index by inserting element at index i
is_monotonic Returns True if each element is greater than or equal to the previous element
is_unique Returns True if the Index has no duplicate values
unique Compute the array of unique values in the Index
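
A small sketch of the set-like behavior described above, using a couple of the methods from this table :

'Ohio' in frame3.columns            # membership tests work like a set
# True
2003 in frame3.index
# False
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels.is_unique                # unlike a set, duplicate labels are allowed
# False
dup_labels.unique()
# Index(['foo', 'bar'], dtype='object')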

Essential Functionality

Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example :

In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [92]: obj

d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present :

In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [94]: obj2

a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values :

In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [96]: obj3

0 blue
2 purple
4 yellow
dtype: object

In [97]: obj3.reindex(range(6), method='ffill')

0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object

Reindex Function arguments :

Method Description
index New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying.
method Interpolation (fill) method; ‘ffill’ fills forward, while ‘bfill’ fills backward.
fill_value Substitute value to use when introducing missing data by reindexing.
limit When forward- or backfilling, maximum size gap (in number of elements) to fill.
tolerance When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches.
level Match simple Index on level of MultiIndex; otherwise select subset of.
copy If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent.

Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries, but that can require a bit of munging and set logic. The drop method instead returns a new object with the indicated value or values deleted from an axis.

In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [107]: new_obj = obj.drop('c')
In [108]: new_obj

a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame :

In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])
In [111]: data

one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Calling drop with a sequence of labels will drop values from the row labels (axis 0) :

In [112]: data.drop(['Colorado', 'Ohio'])

one two three four
Utah 8 9 10 11
New York 12 13 14 15

You can drop values from the columns by passing axis=1 or axis='columns' :

In [113]: data.drop('two', axis=1)

one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15

Indexing, Selection and Filtering

Series indexing (obj[…]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this :

In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [118]: obj

a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64

In [121]: obj[2:4]

c 2.0
d 3.0
dtype: float64

In [125]: obj['b':'c']

b 1.0
c 2.0
dtype: float64

Setting using these methods modifies the corresponding section of the Series :

In [126]: obj['b':'c'] = 5
In [127]: obj

a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64

Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence :

In [128]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])

In [129]: data

one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

In [131]: data[['three', 'one']]

three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12

In [133]: data[data['three'] > 5]

one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison :

In [134]: data < 5

one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False

For DataFrame label-indexing on the rows, we introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc). As a preliminary example, let’s select a single row and multiple columns by label :

In [137]: data.loc['Colorado', ['two', 'three']]

two 5
three 6
Name: Colorado, dtype: int64

We’ll then perform some similar selections with integers using iloc :

In [138]: data.iloc[2, [3, 0, 1]]

four 11
one 8
two 9
Name: Utah, dtype: int64
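
Both indexing functions also work with slices and with single integers or labels; a couple of sketches on the same data :

data.loc[:'Utah', 'two']            # label slices include the endpoint
# Ohio        1
# Colorado    5
# Utah        9
data.iloc[[1, 2], [3, 0, 1]]        # several rows and columns by position
data.iloc[:, :3][data.three > 5]    # first three columns, rows where 'three' > 5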

Indexing options with DataFrame :

Type Notes
df[val] Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val] Selects single row or subset of rows from the DataFrame by label
df.loc[:, val] Selects single column or subset of columns by label
df.loc[val1, val2] Select both rows and columns by label
df.iloc[where] Selects single row or subset of rows from the DataFrame by integer position
df.iloc[:, where] Selects single column or subset of columns by integer position
df.iloc[where_i, where_j] Select both rows and columns by integer position
df.at[label_i, label_j] Select a single scalar value by row and column label
df.iat[i, j] Select a single scalar value by row and column position (integers)
reindex method Select either rows or columns by labels
get_value, set_value methods Select single value by row and column label

Integer Indexes

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error :

ser = pd.Series(np.arange(3.))  # default integer index 0, 1, 2
ser
ser[-1]  # raises an error: -1 is ambiguous as a label versus an integer position

In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult. On the other hand, with a non-integer index, there is no potential for ambiguity :

In [145]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
In [146]: ser2[-1]

2.0
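
To keep things unambiguous, use loc when you mean labels and iloc when you mean integer positions; both behave consistently regardless of the index type. A sketch with the integer-indexed ser from above :

ser.loc[:1]     # label-based: includes the label 1
# 0    0.0
# 1    1.0
ser.iloc[:1]    # position-based: only the first element
# 0    0.0
ser.iloc[-1]    # negative positional indexing works fine here
# 2.0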

Arithmetic and Data Alignment

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example :

In [150]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [151]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
   .....:                index=['a', 'c', 'e', 'f', 'g'])
In [154]: s1 + s2

a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls :

In [160]: df1 = pd.DataFrame({'A': [1, 2]})
In [161]: df2 = pd.DataFrame({'B': [3, 4]})
In [164]: df1 - df2

A B
0 NaN NaN
1 NaN NaN

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other. Using the add method on df1, we pass df2 and an argument to fill_value :

In [165]: df1.add(df2, fill_value=0)

A B
0 1.0 3.0
1 2.0 4.0

Flexible arithmetic methods :

Method Description
add, radd Methods for addition (+)
sub, rsub Methods for subtraction (-)
div, rdiv Methods for division (/)
floordiv, rfloordiv Methods for floor division (//)
mul, rmul Methods for multiplication (*)
pow, rpow Methods for exponentiation (**)
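
The methods starting with r have their arguments flipped, so the two calls below are equivalent; a quick sketch using df1 from above :

1 / df1          # divide a scalar by every element
df1.rdiv(1)      # same result: rdiv swaps the operands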

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects :

In [190]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
   .....:                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [191]: frame

b d e
Utah -0.204708 0.478943 -0.519439
Ohio -0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 -1.296221

In [192]: np.abs(frame)

b d e
Utah 0.204708 0.478943 0.519439
Ohio 0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 1.296221

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this :

In [193]: f = lambda x: x.max() - x.min()
In [194]: frame.apply(f)

b 1.802165
d 1.684034
e 2.689627
dtype: float64

Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index. If you pass axis='columns' to apply, the function will be invoked once per row instead. Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap :

In [198]: format = lambda x: '%.2f' % x
In [199]: frame.applymap(format)

b d e
Utah -0.20 0.48 -0.52
Ohio -0.56 1.97 1.39
Texas 0.09 0.28 0.77
Oregon 1.25 1.01 -1.30
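
As noted above, passing axis='columns' to apply invokes the function once per row, and the function passed to apply may also return a Series with multiple values. A sketch on the same frame (min_max is a hypothetical helper name) :

frame.apply(f, axis='columns')   # max minus min of each row

def min_max(x):
    # return a couple of statistics per column as a labeled Series
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(min_max)             # a DataFrame with 'min' and 'max' as its row index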

Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object :

In [201]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [202]: obj.sort_index()

a 1
b 2
c 3
d 0
dtype: int64

To sort a Series by its values, use its sort_values method :

In [207]: obj = pd.Series([4, 7, -3, 2])
In [208]: obj.sort_values()

2 -3
3 2
0 4
1 7
dtype: int64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values :

In [211]: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
In [212]: frame

a b
0 0 4
1 1 7
2 0 -3
3 1 2

In [213]: frame.sort_values(by='b')

a b
2 0 -3
3 1 2
0 0 4
1 1 7
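
To sort by multiple columns, pass a list of names to by; descending order is available through ascending=False. A quick sketch :

frame.sort_values(by=['a', 'b'])              # sort by 'a' first, then 'b' to break ties
frame.sort_values(by='b', ascending=False)    # descending sort on 'b'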

Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank :

In [215]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
In [216]: obj.rank()

0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
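
Ties can also be broken by the order in which the values appear in the data, and ranks can be assigned in descending order; a sketch on the same obj :

obj.rank(method='first')                  # ties broken by observation order
# 0    6.0
# 1    1.0
# 2    7.0
# 3    4.0
# 4    3.0
# 5    2.0
# 6    5.0
obj.rank(ascending=False, method='max')   # descending ranks; ties get the group maximum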

Axis Indexes with Duplicate Labels

Up until now all of the examples we’ve looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices :

In [222]: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [223]: obj

a 0
a 1
b 2
b 3
c 4
dtype: int64

The index’s is_unique property can tell you whether its labels are unique or not :

In [224]: obj.index.is_unique

False

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value. This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not. The same logic extends to indexing rows in a DataFrame.
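
For example, with the obj defined above :

obj['a']    # 'a' is repeated, so a Series comes back
# a    0
# a    1
# dtype: int64

obj['c']    # 'c' occurs once, so a scalar comes back
# 4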

Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame :

In [230]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
   .....:                    [np.nan, np.nan], [0.75, -1.3]],
   .....:                   index=['a', 'b', 'c', 'd'],
   .....:                   columns=['one', 'two'])
In [231]: df

one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

Calling DataFrame’s sum method returns a Series containing column sums :

In [232]: df.sum()

one 9.25
two -5.80
dtype: float64

NA values are excluded unless the entire slice (row or column in this case) is NA. This can be disabled with the skipna option :

In [234]: df.mean(axis='columns', skipna=False)

a NaN
b 1.300
c NaN
d -0.275
dtype: float64
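
Some methods, like idxmin and idxmax, return indirect statistics such as the index label where the minimum or maximum is reached, while others, like cumsum, are accumulations; a sketch on the same df :

df.idxmax()       # label of the maximum in each column
# one    b
# two    d
# dtype: object

df.cumsum()       # cumulative sums; NA values stay missing in the result
#     one  two
# a  1.40  NaN
# b  8.50 -4.5
# c   NaN  NaN
# d  9.25 -5.8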

Another type of method is neither a reduction nor an accumulation. describe is one such example, producing multiple summary statistics in one shot :

In [237]: df.describe()

one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

Some Functions :

Method Description
count Number of non-NA values
describe Compute set of summary statistics for Series or each DataFrame column
min, max Compute minimum and maximum values
argmin, argmax Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax Compute index labels at which minimum or maximum value obtained, respectively
quantile Compute sample quantile ranging from 0 to 1
sum Sum of values
mean Mean of values
median Arithmetic median (50% quantile) of values
mad Mean absolute deviation from mean value
prod Product of all values
var Sample variance of values
std Sample standard deviation of values
skew Sample skewness (third moment) of values
kurt Sample kurtosis (fourth moment) of values
cumsum Cumulative sum of values
cummin, cummax Cumulative minimum or maximum of values, respectively
cumprod Cumulative product of values
diff Compute first arithmetic difference (useful for time series)
pct_change Compute percent changes

Correlation and Covariance

The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series, and cov similarly computes the covariance. In the examples below, returns is a DataFrame of daily stock returns, one column per ticker, as in the source example :

In [244]: returns['MSFT'].corr(returns['IBM'])
In [245]: returns['MSFT'].cov(returns['IBM'])

DataFrame’s corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively.
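
Since the stock-return data from the source example is not loaded here, a minimal stand-in built from random numbers (purely illustrative) shows the shape of the results :

returns = pd.DataFrame(np.random.randn(250, 3),
                       columns=['MSFT', 'IBM', 'GOOG'])   # fake daily returns
returns['MSFT'].corr(returns['IBM'])    # a single correlation value
returns.corr()                          # 3 x 3 correlation matrix as a DataFrame
returns.cov()                           # 3 x 3 covariance matrix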

Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example.

In [251]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

The first function is unique, which gives you an array of the unique values in a Series:

In [252]: uniques = obj.unique()
In [253]: uniques

array(['c', 'a', 'd', 'b'], dtype=object)

Relatedly, value_counts computes a Series containing value frequencies :

In [254]: obj.value_counts()

c 3
a 3
b 2
d 1
dtype: int64

isin performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame :

In [257]: mask = obj.isin(['b', 'c'])
In [258]: mask

0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
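
The boolean mask can then be used to filter the Series down to the matching values :

obj[mask]
# 0    c
# 5    b
# 6    b
# 7    c
# 8    c
# dtype: object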