Rohan Verma

Learning Machine Learning and other programming stuff

View on GitHub

Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time.

Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected :

In [10]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [11]: string_data

0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object

The built-in Python None value is also treated as NA in object arrays :

In [13]: string_data[0] = None
In [14]: string_data.isnull()

0 True
1 False
2 True
3 False
dtype: bool

NA handling methods :

Argument Description
dropna Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna Fill in missing data with some value or using an interpolation method such as ‘ffill’ or ‘bfill’.
isnull Return boolean values indicating which values are missing/NA.
notnull Negation of isnull.

Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, the dropna can be helpful. On a Series, it returns the Series with only the non-null data and index values :

In [15]: from numpy import nan as NA
In [16]: data = pd.Series([1, NA, 3.5, NA, 7])
In [17]: data.dropna()

0 1.0
2 3.5
4 7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value :

In [19]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                      [NA, NA, NA], [NA, 6.5, 3.]])
In [20]: cleaned = data.dropna()
In [22]: cleaned

0 1 2
0 1.0 6.5 3.0

Passing how=’all’ will only drop rows that are all NA. To drop columns in the same way, pass axis=1.

Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value :

In [33]: df.fillna(0)

0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221

Calling fillna with a dict, you can use a different fill value for each column :

In [34]: df.fillna({1: 0.5, 2: 0})

0 1 2
0 -0.204708 0.500000 0.000000
1 -0.555730 0.500000 0.000000
2 0.092908 0.500000 0.769023
3 1.246435 0.500000 -1.296221

fillna function arguments :

Argument Description
value Scalar value or dict-like object to use to fill missing values
method Interpolation; by default ‘ffill’ if function called with no other arguments
axis Axis to fill on; default axis=0
inplace Modify the calling object without producing a copy
limit For forward and backward filling, maximum number of consecutive periods to fill

Data Transformation

Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example :

In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                      'k2': [1, 1, 2, 3, 3, 4, 4]})
In [46]: data

k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not :

In [47]: data.duplicated()

0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool

Alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the ‘k1’ column :

In [49]: data['v1'] = range(7)
In [50]: data.drop_duplicates(['k1'])

k1 k2 v1
0 one 1 0
1 two 1 1

Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat :

In [52]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
   ....:                               'Pastrami', 'corned beef', 'Bacon',
   ....:                               'pastrami', 'honey ham', 'nova lox'],
   ....:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [53]: data

food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal :

meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method :

In [55]: lowercased = data['food'].str.lower()
In [56]: lowercased

0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object

In [57]: data['animal'] = lowercased.map(meat_to_animal)
In [58]: data

food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon

Replacing Values

Filling in missing data with the fillna method is a special case of more general value replacement. As you’ve already seen, map can be used to modify a subset of values in an object but replace provides a simpler and more flexible way to do so. Let’s consider this Series :

In [60]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [61]: data

0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True) :

In [62]: data.replace(-999, np.nan)

0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64

Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here’s a simple example :

In [66]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....:                     index=['Ohio', 'Colorado', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])

Like a Series, the axis indexes have a map method :

In [67]: transform = lambda x: x[:4].upper()
In [68]: data.index.map(transform)

Index([‘OHIO’, ‘COLO’, ‘NEW ‘], dtype=’object’)

In [69]: data.index = data.index.map(transform)
In [70]: data

one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11

If you want to create a transformed version of a dataset without modifying the original, a useful method is rename :

In [71]: data.rename(index=str.title, columns=str.upper)

ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11

Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets. Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas :

In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [76]: bins = [18, 25, 35, 60, 100]
In [77]: cats = pd.cut(ages, bins)
In [78]: cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], …, (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]

You can also pass your own bin names by passing a list or array to the labels option.

Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data :

In [92]: data = pd.DataFrame(np.random.randn(1000, 4))
In [93]: data.describe()

0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.049091 0.026112 -0.002544 -0.051827
std 0.996947 1.007458 0.995232 0.998311
min -3.645860 -3.184377 -3.745356 -3.428254
25% -0.599807 -0.612162 -0.687373 -0.747478
50% 0.047101 -0.013609 -0.022158 -0.088274
75% 0.756646 0.695298 0.699046 0.623331
max 2.653656 3.525865 2.735527 3.366626

Suppose you wanted to find values in one of the columns exceeding 3 in absolute value :

In [94]: col = data[2]
In [95]: col[np.abs(col) > 3]

41 -3.399312
136 -3.745356
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame :

In [96]: data[(np.abs(data) > 3).any(1)]

0 1 2 3
41 0.457246 -0.025907 -3.399312 -0.974657
60 1.951312 3.260383 0.963301 1.201206
136 0.508391 -0.196713 -3.745356 -1.520113
235 -0.242459 -3.056990 1.918403 -0.578828
258 0.682841 0.326045 0.425384 -3.428254
322 1.179227 -3.184377 1.369891 -1.074833
544 -3.548824 1.553205 -2.186301 1.277104
635 -0.578093 0.193299 1.397822 3.366626
782 -0.207434 3.525865 0.283070 0.544635
803 -3.645860 0.255475 -0.549574 -1.907459

Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let’s return to an earlier example DataFrame :

In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})
In [110]: pd.get_dummies(df['key'])

a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0

String Manipulation

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

String Object Methods

In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with split :

In [134]: val = 'a,b,  guido'
In [135]: val.split(',')

[‘a’, ‘b’, ‘ guido’]

Relatedly, count returns the number of occurrences of a particular substring :

In [145]: val.count(',')

2

replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string :

In [146]: val.replace(',', '::')

‘a::b:: guido’

In [147]: val.replace(',', '')

‘ab guido’

Python built-in string methods :

Argument Description
count Return the number of non-overlapping occurrences of substring in the string.
endswith Returns True if string ends with suffix.
startswith Returns True if string starts with prefix.
join Use string as delimiter for concatenating a sequence of other strings.
index Return position of first character in substring if found in the string; raises ValueError if not found.
find Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found.
rfind Return position of first character of last occurrence of substring in the string; returns –1 if not found.
replace Replace occurrences of string with another string.
strip, rstrip, lstrip Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element.
split Break string into list of substrings using passed delimiter.
lower Convert alphabet characters to lowercase.
upper Convert alphabet characters to uppercase.
casefold Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.
ljust, rjust Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.

Regular Expressions

Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings; I’ll give a number of examples of its use here. Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+ :

In [148]: import re
In [149]: text = "foo    bar\t baz  \tqux"
In [150]: re.split('\s+', text)

[‘foo’, ‘bar’, ‘baz’, ‘qux’]

regex.match returns None, as it only will match if the pattern occurs at the start of the string :

In [159]: print(regex.match(text))

None

Regular expression methods :

Argument Description
findall Return all non-overlapping matching patterns in a string as a list
finditer Like findall, but returns an iterator
match Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None
search Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning
split Break string into pieces at each occurrence of pattern
sub, subn Replace all (sub) or first n occurrences (subn) of pattern in string with replacement expression; use symbols \1, \2, … to refer to match group elements in the replacement string