NumPy and Pandas

The NumPy and Pandas libraries are central to data science in Python. NumPy provides efficient storage and processing of data arrays of varying sizes, shapes, and numbers of dimensions, while Pandas provides tools for reading in and working with data tables.

After working through this module you will be able to:

  1. create, reshape, and slice NumPy arrays.
  2. describe the data type, size, shape, and number of dimensions in an array.
  3. perform numeric and comparison operations on arrays.
  4. create and manipulate Series and DataFrames using Pandas.
  5. query and subset DataFrames.
  6. manipulate DataFrames.
  7. summarize and group DataFrames.

NumPy

Creating NumPy Arrays

The NumPy library allows for creating and working with arrays. It is particularly valuable when you want to perform mathematical operations on values stored in arrays, and it is very fast and memory efficient. As mentioned in the prior module, arrays are similar to lists in that they store a series of values or elements. However, arrays can be expanded to include many dimensions. For example, an image could be represented as an array with 3 dimensions: height, width, and channels. If you work with deep learning, tensors are the primary data model used to read and manipulate data and are essentially multidimensional arrays. In short, array-based calculations and manipulations are essential to data science, so NumPy is an important library to learn if you work in the Python environment.

The complete documentation for NumPy can be found here.

Before you can use NumPy, you must make sure that it is installed into your Anaconda environment, as demonstrated in the set-up module. Once NumPy is installed, you will need to import it in order to use it in your code. It is common to assign NumPy an alias name of "np" to simplify your code.

import numpy as np

Lists can be converted to NumPy arrays using the array() function. Once the list object is converted to an array, the type is defined as "numpy.ndarray", which indicates that it is specifically a NumPy array. Since this array has only one dimension, it is called a vector.

lst1 = [3, 6, 7, 8, 9]
arr1 = np.array(lst1)
print(type(lst1))
print(type(arr1))
print(arr1)
<class 'list'>
<class 'numpy.ndarray'>
[3 6 7 8 9]

A two dimensional array is known as a matrix. In the example below, I am generating a matrix from a list of lists.

lst2 = [[3, 6, 7, 8, 9], [3, 6, 7, 8, 9], [3, 6, 7, 8, 9]]
arr2 = np.array(lst2)
print(arr2)
[[3 6 7 8 9]
 [3 6 7 8 9]
 [3 6 7 8 9]]

Again, one of the powerful advantages of NumPy arrays is the ability to store data in arrays with many dimensions. In the example below, I am creating a three dimensional array from a list of lists of lists. This would be similar to an image with dimensions image height, image width, and image channels (for example, red, green, and blue). A four dimensional array could represent a time series (height, width, channels, time) or a video containing multiple frames (frame height, frame width, channels, and frame number).

Similar to lists, to make a copy of an array that does not simply reference the original, you should use the copy() method.

lst3 = [[[3, 6, 7, 8, 9], [3, 6, 7, 8, 9], [3, 6, 7, 8, 9]], 
         [[3, 6, 7, 8, 9], [3, 6, 7, 8, 9], [3, 6, 7, 8, 9]], 
         [[3, 6, 7, 8, 9], [3, 6, 7, 8, 9], [3, 6, 7, 8, 9]]]
arr3 = np.array(lst3)
print(arr3)
[[[3 6 7 8 9]
  [3 6 7 8 9]
  [3 6 7 8 9]]

 [[3 6 7 8 9]
  [3 6 7 8 9]
  [3 6 7 8 9]]

 [[3 6 7 8 9]
  [3 6 7 8 9]
  [3 6 7 8 9]]]
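As noted above, simply assigning an array to a new name creates a reference to the same data, not a copy. A minimal sketch of the difference (the array values here are illustrative):

```python
import numpy as np

arr = np.array([3, 6, 7, 8, 9])
ref = arr          # reference: both names point to the same data
cpy = arr.copy()   # independent copy of the data

arr[0] = 99
print(ref[0])  # 99: the reference sees the change
print(cpy[0])  # 3: the copy is unaffected
```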

The cell below provides some examples of NumPy methods for generating arrays.

The arange() method returns an array of evenly spaced values and accepts start, stop, step, and data type parameters. In the example, I have created an array of evenly spaced values from 0 to 100 with a step size of 5. I specifically define the data type as integer, but NumPy can infer a data type if it is not provided.

The linspace() method is similar to arange(); however, a number of samples is specified as opposed to a step size. In the example, since 5 samples are requested, 5 evenly spaced values between 0 and 100 are returned.

The ones() method is used to return an array of 1s. In the example, I have generated a three dimensional array where the first dimension has a length of 3, the second a length of 4, and the third a length of 4. The shape and dimensions of the array are specified using a tuple.

Similar to ones(), zeros() generates an array of zeros.

It is also possible to generate arrays of random values between 0 and 1 (random.rand()) and arrays of random integers drawn from a low value (inclusive) up to a high value (exclusive) (random.randint()).

arr4 = np.arange(0, 100, 5, dtype="int")
print(arr4)

arr5 = np.linspace(0, 100, 5, dtype="int")
print(arr5)

arr6 = np.ones((3, 4, 4))
print(arr6)

arr7 = np.zeros((3, 4, 2))
print(arr7)

arr8 = np.random.rand(3, 4, 5)
print(arr8)

arr9 = np.random.randint(1, 200, 7)
print(arr9)
[ 0  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95]
[  0  25  50  75 100]
[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
[[[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]]
[[[0.63212252 0.28139054 0.32678571 0.57472475 0.5459576 ]
  [0.9897669  0.22912723 0.4047774  0.30486559 0.03414862]
  [0.49746668 0.59689657 0.51962197 0.56545513 0.50955797]
  [0.50396066 0.13347408 0.54276816 0.1280929  0.05683246]]

 [[0.85218019 0.35436517 0.01749475 0.83686899 0.64099688]
  [0.04990366 0.59950451 0.5565049  0.75741816 0.79339763]
  [0.7018239  0.93146016 0.43380738 0.32625464 0.66060082]
  [0.31205816 0.63729095 0.33971363 0.86804703 0.87199357]]

 [[0.86696758 0.18435079 0.27993821 0.89973497 0.57005956]
  [0.38569423 0.45411688 0.5184146  0.3545654  0.75077276]
  [0.46772019 0.58720663 0.86815338 0.86719931 0.13468105]
  [0.9376317  0.12156825 0.29883329 0.7359851  0.69748038]]]
[159 168 103  91 170 160 126]

Understanding and Manipulating Array Shape and Dimensions

Let's spend some time discussing the dimensions and shape of an array. The shape of an array relates to the length of each dimension. The len() function will return the length of the first dimension (in this case 3). To obtain a tuple of the lengths for all dimensions, you must use the shape property. So, the array generated has three dimensions with lengths of 3, 4, and 4, respectively. The number of dimensions is returned with the ndim property. The size property returns the number of elements in the array. There are 48 elements in the example array: 3 x 4 x 4 = 48. The dtype property provides the data type.

arr6 = np.ones((3, 4, 4))

print("Length of first dimension: " + str(len(arr6)))
print("Shape of array: " + str(arr6.shape))
print("Number of dimensions: " + str(arr6.ndim))
print("Size of array: " + str(arr6.size))
print("Data type of array: " + str(arr6.dtype))
Length of first dimension: 3
Shape of array: (3, 4, 4)
Number of dimensions: 3
Size of array: 48
Data type of array: float64

NumPy has built-in methods for changing the shape of an array. Note that the number of elements, or size, of the array must exactly fill the new shape. In the first example, I am maintaining the number of dimensions but changing the shape or length of each dimension. In the next two examples, I am converting the three dimensional array to two dimensional arrays. Lastly, I convert the array to a one dimensional array, or vector, with a length of 48.

arr6b = arr6.reshape(4, 4, 3)
arr6c = arr6.reshape(4, 12)
arr6d = arr6.reshape(12, 4)
arr6e = arr6.reshape(48)
print(arr6b)
print(arr6c)
print(arr6d)
print(arr6e)
[[[1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]]

 [[1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]]

 [[1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]]

 [[1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]
  [1. 1. 1.]]]
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
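A handy variant worth knowing: passing -1 for one dimension lets reshape() infer that length from the array's size. A short sketch using the same ones array:

```python
import numpy as np

arr = np.ones((3, 4, 4))
flat = arr.reshape(-1)     # infer the single remaining length: 48
mat = arr.reshape(4, -1)   # infer the second length: 48 / 4 = 12
print(flat.shape)  # (48,)
print(mat.shape)   # (4, 12)
```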

As a second example, here I am converting a vector array into a two dimensional array, or matrix.

arr10 = np.random.randint(1, 1200, 100)
arr10b = arr10.reshape(10, 10)
print(arr10b)
[[ 309  663  429  597 1162  360  124 1103  428  877]
 [ 411   63  724 1026  329  169  968  421  926  135]
 [ 618  799  926  280 1042  267  846  682  730  394]
 [ 670 1076  109  448  225  480  102  969  686 1152]
 [ 671  807  286  211 1095  933  683   32  132  181]
 [1151   83  567 1162  946  766 1016  935  329  600]
 [1137  278  507  180  813  297  969 1188 1085  774]
 [ 626  383  283  350 1113 1146  176  867  376  877]
 [ 609  244 1009  287  178 1095  560  690  325  379]
 [1010  391    9 1150   75  373  508  541  280  689]]

NumPy Array Indexing

Similar to lists, NumPy arrays are indexed. So, values from the array can be extracted or referenced using their associated index. Since arrays often have multiple dimensions, indexes must also extend into multiple dimensions. See the comments below for general array indexing rules. Remember that indexing starts at 0, the first index provided is included, and the last index provided is not included. Extracting portions of an array is known as slicing.

arr11 = np.linspace(0, 50, 50, dtype="int")
arr12 = arr11.reshape(2,5,5)
print("Original array")
print(arr12)
print("All values in first index of first dimension")
print(arr12[0]) #This will extract just the values from the first index in the first dimension
print("All values in second index of first dimension")
print(arr12[1]) #This will extract just the values from the second index in the first dimension
print("All values in first index of first dimenion and first index of second dimension")
print(arr12[0][0]) #This will extract all values occuring in the first index of both the first and second dimensions
print("A single value specified with three indexes, one fo reach dimension")
print(arr12[1, 3, 3]) #This will extract a specific value based on an index in all three dimensions.
print("Incorporating ranges")
print(arr12[1, 0:2, 0:2]) #All values in second index of first dimension that are also include din the first to second index of the second and third dimensions
print("Using colons")
print(arr12[:, 0:2, 0:2]) #Only a colon means select all values in a dimension
print(arr12[:,2:,0:2]) #Can also use colons to select all values before or after an index. 
Original array
[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]
  [20 21 22 23 24]]

 [[25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]
  [40 41 42 43 44]
  [45 46 47 48 50]]]
All values in first index of first dimension
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
All values in second index of first dimension
[[25 26 27 28 29]
 [30 31 32 33 34]
 [35 36 37 38 39]
 [40 41 42 43 44]
 [45 46 47 48 50]]
All values in first index of first dimension and first index of second dimension
[0 1 2 3 4]
A single value specified with three indexes, one for each dimension
43
Incorporating ranges
[[25 26]
 [30 31]]
Using colons
[[[ 0  1]
  [ 5  6]]

 [[25 26]
  [30 31]]]
[[[10 11]
  [15 16]
  [20 21]]

 [[35 36]
  [40 41]
  [45 46]]]

Once values have been selected using index notation they can be changed. When a single value is assigned to a slice, it is stretched to fill the selected shape, a process known as broadcasting. In the example below I have converted all values in the first index of the first dimension and the first index of the second dimension to zeroes.

arr12[0][0] = 0
print(arr12)
[[[ 0  0  0  0  0]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]
  [20 21 22 23 24]]

 [[25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]
  [40 41 42 43 44]
  [45 46 47 48 50]]]
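The same assignment idea combines with the colon and range notation shown above. A small sketch (the array values here are illustrative, not the linspace array used earlier):

```python
import numpy as np

arr = np.arange(0, 50).reshape(2, 5, 5)
arr[:, 0:2, 0:2] = -1   # broadcast -1 across the slice in both blocks
print(arr[0, 0, 0])  # -1
print(arr[1, 1, 1])  # -1
print(arr[0, 2, 2])  # 12 (outside the slice, so unchanged)
```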

Boolean Arrays

It is also possible to create arrays of Boolean values as demonstrated below.

arr13 = np.array([True, False, True, False, True, False, True, False, False])
arr13b = arr13.reshape(3, 3)
print(arr13b)
[[ True False  True]
 [False  True False]
 [ True False False]]

Comparison operators can be used to compare each value in an array to a single value, returning the Boolean result at the associated position in a new array.

arr10 = np.random.randint(1, 1200, 100)
arr10b = arr10.reshape(10, 10)
print(arr10b)
arr10bool = arr10b > 150
print(arr10bool)
[[1173  244  318  762  435  447 1095 1055   32  709]
 [ 221  328  479  121  115  687  726 1005   24  937]
 [ 465  297  142  360  680  895 1030  801  881  498]
 [1165  858  338  396  396  573   11  364   54  205]
 [ 510  310 1068  891  901  997 1029  299 1118 1126]
 [  76  367  879  841   34  177   45  401  575 1186]
 [ 210  866  823  204  289  242  796 1069  399  540]
 [ 831  716  680  499  987  243  723 1135  666 1038]
 [1078 1196 1027  420   79  306   61 1028  761   40]
 [ 329  914  483 1012  273  554 1000   13  442  204]]
[[ True  True  True  True  True  True  True  True False  True]
 [ True  True  True False False  True  True  True False  True]
 [ True  True False  True  True  True  True  True  True  True]
 [ True  True  True  True  True  True False  True False  True]
 [ True  True  True  True  True  True  True  True  True  True]
 [False  True  True  True False  True False  True  True  True]
 [ True  True  True  True  True  True  True  True  True  True]
 [ True  True  True  True  True  True  True  True  True  True]
 [ True  True  True  True False  True False  True  True False]
 [ True  True  True  True  True  True  True False  True  True]]
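A Boolean array like this can also be used directly as an index, returning only the values where the mask is True. A minimal sketch (the values are illustrative):

```python
import numpy as np

arr = np.array([[10, 200, 30],
                [400, 50, 600]])
mask = arr > 150
print(arr[mask])       # [200 400 600]: flattened values where the mask is True
print(arr[arr > 150])  # same result in a single step
```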

Array Arithmetic and Operations

It is generally easy to perform mathematical operations on arrays, as demonstrated below.

arr14 = np.random.randint(1, 1200, 25)
arr14b = arr14.reshape(5, 5)
print(arr14b)
print(arr14b+21)
print(arr14b-52)
print(arr14b*2)
print(arr14b/3)
print(arr14b**2)
[[1199  413  374  738  721]
 [ 930  632  932 1083 1194]
 [ 600  994  434  308 1097]
 [ 782  142 1161  522 1172]
 [ 764  159  370 1125  430]]
[[1220  434  395  759  742]
 [ 951  653  953 1104 1215]
 [ 621 1015  455  329 1118]
 [ 803  163 1182  543 1193]
 [ 785  180  391 1146  451]]
[[1147  361  322  686  669]
 [ 878  580  880 1031 1142]
 [ 548  942  382  256 1045]
 [ 730   90 1109  470 1120]
 [ 712  107  318 1073  378]]
[[2398  826  748 1476 1442]
 [1860 1264 1864 2166 2388]
 [1200 1988  868  616 2194]
 [1564  284 2322 1044 2344]
 [1528  318  740 2250  860]]
[[399.66666667 137.66666667 124.66666667 246.         240.33333333]
 [310.         210.66666667 310.66666667 361.         398.        ]
 [200.         331.33333333 144.66666667 102.66666667 365.66666667]
 [260.66666667  47.33333333 387.         174.         390.66666667]
 [254.66666667  53.         123.33333333 375.         143.33333333]]
[[1437601  170569  139876  544644  519841]
 [ 864900  399424  868624 1172889 1425636]
 [ 360000  988036  188356   94864 1203409]
 [ 611524   20164 1347921  272484 1373584]
 [ 583696   25281  136900 1265625  184900]]

It is also possible to perform mathematical operations on sets of arrays as long as they have the same shape.

arr14 = np.random.randint(1, 1200, 25)
arr14b = arr14.reshape(5, 5)
print(arr14b)
print(arr14b+arr14b)
print(arr14b-arr14b)
[[  29  937  198   11  505]
 [ 204  736 1109 1078  217]
 [1113  622  342  597  630]
 [ 518  439  783  161  203]
 [ 829  374  847  641  178]]
[[  58 1874  396   22 1010]
 [ 408 1472 2218 2156  434]
 [2226 1244  684 1194 1260]
 [1036  878 1566  322  406]
 [1658  748 1694 1282  356]]
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]

NumPy provides mathematical functions for performing common tasks.

arr14 = np.random.randint(1, 1200, 25)
arr14b = arr14.reshape(5, 5)

print(np.max(arr14b))
print(np.min(arr14b))
print(np.sqrt(arr14b))
1181
16
[[16.88194302 29.35983651 16.76305461 14.73091986 26.81417536]
 [ 4.          6.70820393 31.36877428 25.63201124 14.86606875]
 [22.91287847 30.56141358 25.0998008  18.13835715 17.74823935]
 [14.59451952 22.20360331  8.77496439 17.2626765  29.68164416]
 [18.05547009 28.53068524 29.27456234 26.5894716  34.36568055]]
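Many of these functions accept an axis parameter so that the calculation is performed along a single dimension rather than over the whole array. A small sketch with illustrative values:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
print(np.sum(arr))           # 21: sum over all elements
print(np.sum(arr, axis=0))   # [5 7 9]: column sums
print(np.sum(arr, axis=1))   # [ 6 15]: row sums
print(np.mean(arr, axis=1))  # [2. 5.]: row means
```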

Pandas

Most scientists are first introduced to data tables in the form of spreadsheets, such as an Excel spreadsheet. In such a structure, each record is represented by a row of data, while each column holds one specific piece of information, such as academic year, age, or height, for every record. For example, each row in a table could contain all the data for a specific student, and each column could hold one attribute recorded for all students.

Further, spreadsheets are able to hold different data types in each column. For example, academic year is categorical or ordinal, while height is ratio or continuous.

A comparable data structure would be handy for use in Python. This is made available by the Pandas library. Pandas allows for data to be stored in DataFrames. If you work in the R environment, this is very similar to the concept of data frames in R. In fact, Pandas DataFrames were inspired by R data frames.

Pandas makes use of the NumPy library, so it is generally a good idea to import NumPy if you plan to use Pandas. Also, you will need to install Pandas and NumPy into your environment prior to using them.

The complete documentation for Pandas can be found here.

Intro to Series

import numpy as np
import pandas as pd

Before we talk about DataFrames, I will introduce the concept of a Series. A Series is actually very similar to a NumPy array except that it allows axis labels to be assigned. Examples of generating a Series from lists, NumPy arrays, and dictionaries are provided below.

A Series is comparable to a single column from a spreadsheet.

lst1 = ["GIS", "Remote Sensing", "Spatial Analysis", "Digital Cartography"]
arr1 = np.array([350, 455, 457, 642])
dict1 = {'Class1':"GIS", "Class2":"Remote Sensing", "Class3":"Spatial Analysis", "Class4":"Digital Cartography"}

s_lst = pd.Series(data=lst1, index = ["Class1", "Class2", "Class3", "Class4"])
s_arr = pd.Series(data=arr1, index = ["Class1", "Class2", "Class3", "Class4"])
s_dict = pd.Series(dict1)

print(s_lst)
print(s_arr)
print(s_dict)
Class1                    GIS
Class2         Remote Sensing
Class3       Spatial Analysis
Class4    Digital Cartography
dtype: object
Class1    350
Class2    455
Class3    457
Class4    642
dtype: int32
Class1                    GIS
Class2         Remote Sensing
Class3       Spatial Analysis
Class4    Digital Cartography
dtype: object

Labels or names can then be used to select data either using bracket notation or dot notation.

You can use whichever method you prefer. However, if you use dot notation, the labels must not include spaces.

print(s_dict["Class3"])
print(s_dict.Class3)
Spatial Analysis
Spatial Analysis

Intro to DataFrames

Let's start by building a DataFrame from a set of lists. First, I create three lists to hold different components of course title information. Next, I combine these lists into a dictionary. Finally, I convert the dictionary into a DataFrame. Note that a nice table is generated by just calling the DataFrame name without the print() function; however, I am using print() here since I have found that this works better when rendering a web page. Also, the keys from the dictionary have been used as the column names and a default index has been assigned to each row.

prefix = ["Geol", "Geol", "Geol", "Geol", "Geog", "Geog", "Geog"]
cnum = [103, 321, 331, 341, 350, 455, 462]
cname = ["Earth Through Time", "Geomorphology", "Paleontology", "Structural Geology", "GIScience", "Remote Sensing", "Digital Cartography"]
course_dict = {"prefix": prefix, "course_number": cnum, "course_name": cname}
course_df = pd.DataFrame(course_dict)
print(course_df)
  prefix  course_number          course_name
0   Geol            103   Earth Through Time
1   Geol            321        Geomorphology
2   Geol            331         Paleontology
3   Geol            341   Structural Geology
4   Geog            350            GIScience
5   Geog            455       Remote Sensing
6   Geog            462  Digital Cartography

Since column names are assigned, they can be used to select out individual columns using bracket or dot notation. Single columns can be saved as a Series.

print(course_df["course_name"])
# Or course_df.course_name
0     Earth Through Time
1          Geomorphology
2           Paleontology
3     Structural Geology
4              GIScience
5         Remote Sensing
6    Digital Cartography
Name: course_name, dtype: object
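That the selection really is a Series can be confirmed with type(). A quick sketch using a small stand-in for the course table built above:

```python
import pandas as pd

# Toy two-column version of the course table for illustration
course_df = pd.DataFrame({"prefix": ["Geol", "Geog"],
                          "course_number": [103, 350]})
names = course_df["course_number"]  # a single column, saved as its own variable
print(type(names))                  # <class 'pandas.core.series.Series'>
```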

A list of column names can be provided to subset out multiple columns.

print(course_df[["course_number", "course_name"]])
   course_number          course_name
0            103   Earth Through Time
1            321        Geomorphology
2            331         Paleontology
3            341   Structural Geology
4            350            GIScience
5            455       Remote Sensing
6            462  Digital Cartography

You can also subset based on column names and row labels combined using the loc indexer. The iloc indexer, in contrast, is used for selection based on integer positions.

print(course_df.loc[[1, 2, 4],["course_number", "course_name"]])
# Or course_df.iloc[[1, 2, 4],[1, 2]]
   course_number    course_name
1            321  Geomorphology
2            331   Paleontology
4            350      GIScience

You can even use the data stored in existing columns to create a new column. Note that the new column does not need to be declared prior to writing to it. In the example, I have written the entire course name to a new column. The map() method is used to make sure all data are treated as strings. Also, I am including blank spaces so that the components are not run together.

course_df["full_name"] = course_df["prefix"].map(str)  + " " + course_df["course_number"].map(str)  + ": " + course_df["course_name"].map(str) 
print(course_df)
  prefix  course_number          course_name                      full_name
0   Geol            103   Earth Through Time   Geol 103: Earth Through Time
1   Geol            321        Geomorphology        Geol 321: Geomorphology
2   Geol            331         Paleontology         Geol 331: Paleontology
3   Geol            341   Structural Geology   Geol 341: Structural Geology
4   Geog            350            GIScience            Geog 350: GIScience
5   Geog            455       Remote Sensing       Geog 455: Remote Sensing
6   Geog            462  Digital Cartography  Geog 462: Digital Cartography

Instead of creating data tables manually, you are probably more likely to read in a data table from a file or web link. Fortunately, Pandas provides functions for reading data in from a variety of formats. Here are some examples:

  • read_table(): delimited files (TXT, CSV, etc.)
  • read_csv(): comma-separated values (CSV)
  • read_excel(): Excel Spreadsheet
  • read_json(): JavaScript Object Notation (JSON)
  • read_html(): HTML tables
  • read_sas(): SAS files

Full documentation on reading in data can be found here.

In the example below, I am reading in a CSV file from my local computer. The sep argument is used to define the delimiter. However, commas are the default, so it isn't necessary to include this argument in this case. Setting the header argument to 0 indicates that the first row of data should be treated as column names or headers. It isn't always necessary to specify the character encoding; however, I found that it is necessary for this data set due to the use of special characters.

To view the first 10 rows of the data, I use the head() method. The len() function returns the number of rows.

This data set was created by my brother and provides his ratings for movies along with the movie's name, director, release year, genre, and whether or not he owns it.

movies_df = pd.read_csv("YOUR FILE PATH HERE", sep=",", header=0, encoding="ISO-8859-1")
print(movies_df.head(10))
                 Movie Name             Director  Release Year  My Rating  \
0             Almost Famous        Cameron Crowe          2000       9.99   
1  The Shawshank Redemption       Frank Darabont          1994       9.98   
2             Groundhog Day         Harold Ramis          1993       9.96   
3              Donnie Darko        Richard Kelly          2001       9.95   
4           Children of Men       Alfonso Cuaron          2006       9.94   
5                Annie Hall          Woody Allen          1977       9.93   
6                  Rushmore         Wes Anderson          1998       9.92   
7                   Memento    Christopher Nolan          2000       9.91   
8    No Country for Old Men  Joel and Ethan Coen          2007       9.90   
9                     Seven        David Fincher          1995       9.88

         Genre  Own  
0        Drama  Yes  
1        Drama  Yes  
2       Comedy  Yes  
3       Sci-Fi  Yes  
4       Sci-Fi  Yes  
5       Comedy  Yes  
6  Independent  Yes  
7     Thriller  Yes  
8     Thriller  Yes  
9     Thriller  Yes
print(len(movies_df))
1852

DataFrame Query and Subset

Let's now use this data table to explore data query and selection methods. In the first example, I am selecting out all of the movies that are dramas and saving them to a new DataFrame. Note the use of bracket notation: the code inside the brackets is used to perform the selection.

The second example includes a compound query. Note the use of parentheses within the query.

Lastly, it is also possible to subset out only certain columns that meet the query. In the last example, I am subsetting out just the movie name and director.

#Example 1
just_dramas = movies_df[movies_df["Genre"]=="Drama"]
print(just_dramas.head(10))
print(len(just_dramas))
#Example 2
dramas_gt_9 = movies_df[(movies_df["Genre"]=="Drama") & (movies_df["My Rating"]>9.0)]
print(len(dramas_gt_9))
#Example 3
dramas_gt_9_sub = movies_df[(movies_df["Genre"]=="Drama") & (movies_df["My Rating"]>9.0)][["Movie Name", "Director"]]
print(dramas_gt_9_sub.head(10))
                  Movie Name           Director  Release Year  My Rating  \
0              Almost Famous      Cameron Crowe          2000       9.99   
1   The Shawshank Redemption     Frank Darabont          1994       9.98   
21                   Boyhood  Richard Linklater          2014       9.75   
26           American Beauty         Sam Mendes          1999       9.70   
39              Mystic River     Clint Eastwood          2003       9.56   
51    Hannah and Her Sisters        Woody Allen          1986       9.47   
52            In the Bedroom         Todd Field          2001       9.47   
57                  Big Fish         Tim Burton          2003       9.44   
60       The Sweet Hereafter        Atom Egoyan          1997       9.42   
62               Stand By Me         Rob Reiner          1986       9.40

    Genre  Own  
0   Drama  Yes  
1   Drama  Yes  
21  Drama   No  
26  Drama  Yes  
39  Drama  Yes  
51  Drama  Yes  
52  Drama  Yes  
57  Drama  Yes  
60  Drama  Yes  
62  Drama  Yes  
321
23
                  Movie Name           Director
0              Almost Famous      Cameron Crowe
1   The Shawshank Redemption     Frank Darabont
21                   Boyhood  Richard Linklater
26           American Beauty         Sam Mendes
39              Mystic River     Clint Eastwood
51    Hannah and Her Sisters        Woody Allen
52            In the Bedroom         Todd Field
57                  Big Fish         Tim Burton
60       The Sweet Hereafter        Atom Egoyan
62               Stand By Me         Rob Reiner

Another option for performing queries is to use the query() method provided by Pandas. When using this method, the query will need to be provided as an expression in string form. Also, spaces in column names can be problematic, so spaces should be removed or replaced with underscores.

#Remove spaces in column names. 
movies_df.columns = [column.replace(" ", "_") for column in movies_df.columns]
#Example 1
just_dramas = movies_df.query('Genre=="Drama"')
print(just_dramas.head(10))
print(len(just_dramas))
#Example 2
dramas_gt_9 = movies_df.query('Genre=="Drama" and My_Rating > 9.0')
print(len(dramas_gt_9))
#Example 3
dramas_gt_9_sub = movies_df.query('Genre=="Drama" and My_Rating > 9.0')[["Movie_Name", "Director"]]
print(dramas_gt_9_sub.head(10))
                  Movie_Name           Director  Release_Year  My_Rating  \
0              Almost Famous      Cameron Crowe          2000       9.99   
1   The Shawshank Redemption     Frank Darabont          1994       9.98   
21                   Boyhood  Richard Linklater          2014       9.75   
26           American Beauty         Sam Mendes          1999       9.70   
39              Mystic River     Clint Eastwood          2003       9.56   
51    Hannah and Her Sisters        Woody Allen          1986       9.47   
52            In the Bedroom         Todd Field          2001       9.47   
57                  Big Fish         Tim Burton          2003       9.44   
60       The Sweet Hereafter        Atom Egoyan          1997       9.42   
62               Stand By Me         Rob Reiner          1986       9.40

    Genre  Own  
0   Drama  Yes  
1   Drama  Yes  
21  Drama   No  
26  Drama  Yes  
39  Drama  Yes  
51  Drama  Yes  
52  Drama  Yes  
57  Drama  Yes  
60  Drama  Yes  
62  Drama  Yes  
321
23
                  Movie_Name           Director
0              Almost Famous      Cameron Crowe
1   The Shawshank Redemption     Frank Darabont
21                   Boyhood  Richard Linklater
26           American Beauty         Sam Mendes
39              Mystic River     Clint Eastwood
51    Hannah and Her Sisters        Woody Allen
52            In the Bedroom         Todd Field
57                  Big Fish         Tim Burton
60       The Sweet Hereafter        Atom Egoyan
62               Stand By Me         Rob Reiner
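As an alternative to renaming the columns, recent versions of Pandas (0.25 and later) also allow column names containing spaces to be referenced inside query() by wrapping them in backticks. A small sketch on a toy table (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"My Rating": [9.5, 8.0, 9.9],
                   "Genre": ["Drama", "Comedy", "Drama"]})
# Backticks let query() reference a column name that contains a space
high = df.query('Genre == "Drama" and `My Rating` > 9.0')
print(len(high))  # 2
```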

Once a query is complete, you may want to save the result back to a file on your local machine. The code below provides an example for saving out the last subset of data to a CSV file. The Pandas documentation provides examples for saving to other formats.

dramas_gt_9_sub.to_csv("YOUR FILE PATH HERE", sep=",", header=True)

Missing Values

The null, NoData, or missing-value indicator in Pandas is NaN. To begin exploring missing values, let's recode some of the data to NaN in the movies data set. In the example below, I am changing the "Drama" and "Fantasy" genres to NaN. I am also recoding any ratings between 8.9 and 9.9 to NaN. The replace() method is used to change the categories while the mask() method is used to recode the rating values. np.nan is the NumPy constant that represents a null value.

movies_nan = movies_df.copy()
movies_nan["Genre"] = movies_nan[["Genre"]].replace(["Drama", "Fantasy"], np.nan)
movies_nan['My_Rating'].mask(movies_nan['My_Rating'].between(8.9, 9.9), inplace=True)
print(movies_nan.head(10))
print(len(movies_nan))
                 Movie_Name             Director  Release_Year  My_Rating  \
0             Almost Famous        Cameron Crowe          2000       9.99   
1  The Shawshank Redemption       Frank Darabont          1994       9.98   
2             Groundhog Day         Harold Ramis          1993       9.96   
3              Donnie Darko        Richard Kelly          2001       9.95   
4           Children of Men       Alfonso Cuaron          2006       9.94   
5                Annie Hall          Woody Allen          1977       9.93   
6                  Rushmore         Wes Anderson          1998       9.92   
7                   Memento    Christopher Nolan          2000       9.91   
8    No Country for Old Men  Joel and Ethan Coen          2007        NaN   
9                     Seven        David Fincher          1995        NaN

         Genre  Own  
0          NaN  Yes  
1          NaN  Yes  
2       Comedy  Yes  
3       Sci-Fi  Yes  
4       Sci-Fi  Yes  
5       Comedy  Yes  
6  Independent  Yes  
7     Thriller  Yes  
8     Thriller  Yes  
9     Thriller  Yes  
1852
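
Before removing or filling missing values, it usually helps to count them first. The sketch below uses a small made-up DataFrame (the column names mirror the movies data, but the values are invented for illustration); isna() returns a Boolean mask and sum() tallies the True values.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the movies DataFrame (values are made up)
toy = pd.DataFrame({
    "Movie_Name": ["A", "B", "C", "D"],
    "Genre": ["Comedy", np.nan, "Sci-Fi", np.nan],
    "My_Rating": [9.1, np.nan, 7.5, 8.2],
})

# isna() returns a Boolean DataFrame; sum() counts True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells
total_missing = toy.isna().sum().sum()
print(total_missing)  # 3
```

Running this on the real movies_nan DataFrame would show how many NaN values the recoding above introduced in each column.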

The dropna() method can be used to remove rows or columns that contain missing data. If the axis parameter is set to 0, rows with missing values in any column will be removed. If it is set to 1, columns with missing data in any row will be removed.

movies_drop = movies_nan.dropna(axis=0)
print(movies_drop.head(10))
print(len(movies_drop))

movies_dropc = movies_nan.dropna(axis=1)
print(movies_dropc.head(10))
print(len(movies_dropc))
           Movie_Name           Director  Release_Year  My_Rating  \
2       Groundhog Day       Harold Ramis          1993       9.96   
3        Donnie Darko      Richard Kelly          2001       9.95   
4     Children of Men     Alfonso Cuaron          2006       9.94   
5          Annie Hall        Woody Allen          1977       9.93   
6            Rushmore       Wes Anderson          1998       9.92   
7             Memento  Christopher Nolan          2000       9.91   
127       After Hours    Martin Scorsese          1985       8.89   
128   Shotgun Stories       Jeff Nichols          2007       8.89   
129  The Untouchables     Brian de Palma          1987       8.89   
130   2 Days in Paris        Julie Delpy          2007       8.88

           Genre  Own  
2         Comedy  Yes  
3         Sci-Fi  Yes  
4         Sci-Fi  Yes  
5         Comedy  Yes  
6    Independent  Yes  
7       Thriller  Yes  
127       Comedy  Yes  
128  Independent  Yes  
129       Action  Yes  
130      Romance  Yes  
1420
                 Movie_Name             Director  Release_Year  Own
0             Almost Famous        Cameron Crowe          2000  Yes
1  The Shawshank Redemption       Frank Darabont          1994  Yes
2             Groundhog Day         Harold Ramis          1993  Yes
3              Donnie Darko        Richard Kelly          2001  Yes
4           Children of Men       Alfonso Cuaron          2006  Yes
5                Annie Hall          Woody Allen          1977  Yes
6                  Rushmore         Wes Anderson          1998  Yes
7                   Memento    Christopher Nolan          2000  Yes
8    No Country for Old Men  Joel and Ethan Coen          2007  Yes
9                     Seven        David Fincher          1995  Yes
1852
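
dropna() also accepts a subset parameter, so that only missing values in particular columns trigger removal, and a thresh parameter that keeps rows with at least a given number of non-missing values. A minimal sketch on toy data (values invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Movie_Name": ["A", "B", "C"],
    "Genre": [np.nan, "Comedy", np.nan],
    "My_Rating": [9.1, np.nan, np.nan],
})

# Drop a row only when "My_Rating" is missing; a missing Genre is tolerated
drop_sub = toy.dropna(subset=["My_Rating"])
print(len(drop_sub))  # 1

# Keep rows that have at least 2 non-missing values out of 3 columns
drop_thresh = toy.dropna(thresh=2)
print(len(drop_thresh))  # 2
```

These options give finer control than dropping every row that has any missing value.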

The .fillna() method can be used to replace NA values with another value or string. In the example, I am changing the missing genres to "Unknown Genre".

movies_nan["Genre"] = movies_nan["Genre"].fillna(value="Unknown Genre")
print(movies_nan.head(10))
                 Movie_Name             Director  Release_Year  My_Rating  \
0             Almost Famous        Cameron Crowe          2000       9.99   
1  The Shawshank Redemption       Frank Darabont          1994       9.98   
2             Groundhog Day         Harold Ramis          1993       9.96   
3              Donnie Darko        Richard Kelly          2001       9.95   
4           Children of Men       Alfonso Cuaron          2006       9.94   
5                Annie Hall          Woody Allen          1977       9.93   
6                  Rushmore         Wes Anderson          1998       9.92   
7                   Memento    Christopher Nolan          2000       9.91   
8    No Country for Old Men  Joel and Ethan Coen          2007        NaN   
9                     Seven        David Fincher          1995        NaN

           Genre  Own  
0  Unknown Genre  Yes  
1  Unknown Genre  Yes  
2         Comedy  Yes  
3         Sci-Fi  Yes  
4         Sci-Fi  Yes  
5         Comedy  Yes  
6    Independent  Yes  
7       Thriller  Yes  
8       Thriller  Yes  
9       Thriller  Yes

It is also possible to replace null values with a statistic derived from the available values. In the example, I am replacing the missing ratings with the mean of all available ratings. This would likely not be valid in this case, since the missing genres might have different average ratings than the available genres. This is just an example.

movies_nan["My_Rating"] = movies_nan["My_Rating"].fillna(value=movies_nan["My_Rating"].mean())
print(movies_nan.head(10))
                 Movie_Name             Director  Release_Year  My_Rating  \
0             Almost Famous        Cameron Crowe          2000   9.990000   
1  The Shawshank Redemption       Frank Darabont          1994   9.980000   
2             Groundhog Day         Harold Ramis          1993   9.960000   
3              Donnie Darko        Richard Kelly          2001   9.950000   
4           Children of Men       Alfonso Cuaron          2006   9.940000   
5                Annie Hall          Woody Allen          1977   9.930000   
6                  Rushmore         Wes Anderson          1998   9.920000   
7                   Memento    Christopher Nolan          2000   9.910000   
8    No Country for Old Men  Joel and Ethan Coen          2007   6.786509   
9                     Seven        David Fincher          1995   6.786509

           Genre  Own  
0  Unknown Genre  Yes  
1  Unknown Genre  Yes  
2         Comedy  Yes  
3         Sci-Fi  Yes  
4         Sci-Fi  Yes  
5         Comedy  Yes  
6    Independent  Yes  
7       Thriller  Yes  
8       Thriller  Yes  
9       Thriller  Yes
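
As noted above, filling with the overall mean can be misleading when groups differ. One alternative, sketched below on toy data, is to fill each missing rating with the mean of its own genre using groupby() with transform(); this assumes the genre is known for the rows being filled.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Genre": ["Comedy", "Comedy", "Drama", "Drama"],
    "My_Rating": [8.0, np.nan, 6.0, np.nan],
})

# transform("mean") returns a Series aligned to the original rows,
# holding each row's group mean; fillna() uses it only where values are missing
group_means = toy.groupby("Genre")["My_Rating"].transform("mean")
toy["My_Rating"] = toy["My_Rating"].fillna(group_means)
print(toy)  # Comedy NaN -> 8.0, Drama NaN -> 6.0
```

Whether a group-level fill is appropriate still depends on why the values are missing in the first place.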

Grouping and Summarizing DataFrames

Pandas provides methods for summarizing data as described in the examples below. First, I am creating individual statistics and saving them to variables. I then create a Series from a dictionary of these statistics, convert it to a DataFrame using the to_frame() method, then transpose the DataFrame using transpose().

movies_df = pd.read_csv("YOUR FILE PATH HERE", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns = [column.replace(" ", "_") for column in movies_df.columns]
mov_cnt = movies_df["My_Rating"].count()
mov_mn = movies_df["My_Rating"].mean()
mov_max = movies_df["My_Rating"].max()
mov_min = movies_df["My_Rating"].min()
mov_rang = mov_max-mov_min
mov_stats = pd.Series({"Count": mov_cnt, "Mean": mov_mn, "Max": mov_max, "Min": mov_min, "Range": mov_rang}).to_frame().transpose()
print(mov_stats)
    Count      Mean   Max   Min  Range
0  1852.0  6.952527  9.99  0.67   9.32

It is also possible to obtain summary statistics for each group separately by applying the very useful groupby() method. In the example below, I am obtaining stats for each genre and saving them into a DataFrame. The columns do not need to be defined beforehand.

movies_df = pd.read_csv("YOUR FILE PATH HERE", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
genre_stats = pd.DataFrame()
genre_stats["Count"] = movies_df.groupby("Genre")["My_Rating"].count()
genre_stats["Mean"] = movies_df.groupby("Genre")["My_Rating"].mean()
genre_stats["Max"] = movies_df.groupby("Genre")["My_Rating"].max()
genre_stats["Min"] = movies_df.groupby("Genre")["My_Rating"].min()
genre_stats["Range"] = genre_stats["Max"] - genre_stats["Min"]
print(genre_stats)
             Count      Mean   Max   Min  Range
Genre                                          
Action         197  6.651878  9.87  0.67   9.20
Classic         47  6.741064  8.33  4.03   4.30
Comedy         275  6.738545  9.96  1.76   8.20
Documentary     78  7.283718  8.87  5.43   3.44
Drama          321  7.140561  9.99  0.88   9.11
Family          84  6.670714  9.86  1.01   8.85
Fantasy         21  7.473810  9.85  6.21   3.64
Foreign        157  7.453439  9.63  3.76   5.87
Horror          96  6.523021  9.60  1.65   7.95
Independent    190  6.940263  9.92  1.81   8.11
Romance         94  7.238723  9.84  3.45   6.39
Sci-Fi          62  6.872419  9.95  1.92   8.03
Sports          23  6.391304  8.78  2.94   5.84
Thriller       164  7.095976  9.91  3.11   6.80
War             27  6.542593  8.67  4.37   4.30
Western         14  6.960000  8.65  4.53   4.12
comedy           1  6.030000  6.03  6.03   0.00
thriller         1  5.750000  5.75  5.75   0.00
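
The same per-group statistics can also be computed in a single pass with agg(), which accepts a list of function names; a derived column like the range can then be added afterward. A sketch on toy data (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Genre": ["Comedy", "Comedy", "Drama", "Drama", "Drama"],
    "My_Rating": [8.0, 6.0, 9.0, 7.0, 5.0],
})

# One groupby call producing count, mean, max, and min per genre
stats = toy.groupby("Genre")["My_Rating"].agg(["count", "mean", "max", "min"])
stats["range"] = stats["max"] - stats["min"]
print(stats)
```

This avoids repeating the groupby for each statistic, as was done column by column above.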

You may have noticed above that there are some issues with the Genre names. For example, "Comedy" and "Thriller" show up twice in the summary because these genres were not always capitalized. This can be fixed.

In the example below, I am using the unique() method to obtain a list of all genre names. I then define a list of new names where capitalization is consistent. Next, I use the replace() method to change the original name set to my new name set and fix the capitalization issues.

If I re-execute the code from above, you can see that the duplicates have been removed in the summary.

orig_names = list(movies_df["Genre"].unique())
print(orig_names)
new_names = ['Drama', 'Comedy', 'Sci-Fi', 'Independent', 'Thriller', 'Action', 'Family', 
             'Fantasy', 'Romance', 'Foreign', 'Horror', 'Documentary', 'Sports', 'War', 
             'Western', 'Classic', 'Comedy', 'Thriller']
movies_df["Genre"] = movies_df["Genre"].replace(orig_names, new_names)
list(movies_df["Genre"].unique())

genre_stats = pd.DataFrame()
genre_stats["Count"] = movies_df.groupby("Genre")["My_Rating"].count()
genre_stats["Mean"] = movies_df.groupby("Genre")["My_Rating"].mean()
genre_stats["Max"] = movies_df.groupby("Genre")["My_Rating"].max()
genre_stats["Min"] = movies_df.groupby("Genre")["My_Rating"].min()
genre_stats["Range"] = genre_stats["Max"] - genre_stats["Min"]
print(genre_stats)
['Drama', 'Comedy', 'Sci-Fi', 'Independent', 'Thriller', 'Action', 'Family', 'Fantasy', 'Romance', 'Foreign', 'Horror', 'Documentary', 'Sports', 'War', 'Western', 'Classic', 'comedy', 'thriller']
             Count      Mean   Max   Min  Range
Genre                                          
Action         197  6.651878  9.87  0.67   9.20
Classic         47  6.741064  8.33  4.03   4.30
Comedy         276  6.735978  9.96  1.76   8.20
Documentary     78  7.283718  8.87  5.43   3.44
Drama          321  7.140561  9.99  0.88   9.11
Family          84  6.670714  9.86  1.01   8.85
Fantasy         21  7.473810  9.85  6.21   3.64
Foreign        157  7.453439  9.63  3.76   5.87
Horror          96  6.523021  9.60  1.65   7.95
Independent    190  6.940263  9.92  1.81   8.11
Romance         94  7.238723  9.84  3.45   6.39
Sci-Fi          62  6.872419  9.95  1.92   8.03
Sports          23  6.391304  8.78  2.94   5.84
Thriller       165  7.087818  9.91  3.11   6.80
War             27  6.542593  8.67  4.37   4.30
Western         14  6.960000  8.65  4.53   4.12
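
When the only problem is inconsistent capitalization, a pandas string method can fix every value at once instead of mapping name lists by hand. The sketch below uses str.title(), which capitalizes the first letter of each word (values invented for illustration):

```python
import pandas as pd

genres = pd.Series(["Drama", "comedy", "thriller", "Sci-Fi"])

# str.title() capitalizes each word, collapsing "comedy"/"Comedy"
# style duplicates into a single spelling
cleaned = genres.str.title()
print(sorted(cleaned.unique()))  # ['Comedy', 'Drama', 'Sci-Fi', 'Thriller']
```

The explicit name-list approach above is still useful when the fixes go beyond capitalization, such as merging misspelled categories.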

The describe() method can be used to obtain a set of default summary statistics for a column of data. Combining this with groupby() allows for the calculation of statistics by group.

print(movies_df.groupby("Genre")["My_Rating"].describe())
             count      mean       std   min     25%    50%     75%   max
Genre                                                                    
Action       197.0  6.651878  1.687098  0.67  5.8400  6.950  7.7300  9.87
Classic       47.0  6.741064  0.903364  4.03  6.3000  6.790  7.2550  8.33
Comedy       276.0  6.735978  1.494788  1.76  5.9025  6.800  7.7025  9.96
Documentary   78.0  7.283718  0.778165  5.43  6.8725  7.290  7.8800  8.87
Drama        321.0  7.140561  1.292603  0.88  6.5800  7.150  7.8900  9.99
Family        84.0  6.670714  1.654381  1.01  5.9550  6.930  7.4875  9.86
Fantasy       21.0  7.473810  1.113838  6.21  6.6500  7.030  7.9400  9.85
Foreign      157.0  7.453439  0.778073  3.76  6.9500  7.410  7.9200  9.63
Horror        96.0  6.523021  1.491829  1.65  5.9000  6.770  7.2425  9.60
Independent  190.0  6.940263  1.208283  1.81  6.3300  6.880  7.6525  9.92
Romance       94.0  7.238723  1.471279  3.45  6.4925  7.085  8.2150  9.84
Sci-Fi        62.0  6.872419  1.901382  1.92  5.9300  7.040  8.2000  9.95
Sports        23.0  6.391304  1.591958  2.94  5.5800  6.770  7.3000  8.78
Thriller     165.0  7.087818  1.427839  3.11  6.3200  7.120  8.0300  9.91
War           27.0  6.542593  1.096518  4.37  5.8400  6.560  7.2100  8.67
Western       14.0  6.960000  1.411033  4.53  6.0400  7.070  8.2275  8.65

Concatenate and Merge

The Pandas concat() function is used to concatenate DataFrames that have the same columns. This is comparable to copying and pasting rows from two spreadsheets into a new spreadsheet. To demonstrate this, I first extract two sets of rows using indexes. Next, I concatenate them back into a new DataFrame.

movies_df = pd.read_csv("YOUR FILE PATH HERE", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]

movies_sub1 = movies_df[100:500]
movies_sub2 = movies_df[900:1300]
movies_subc = pd.concat([movies_sub1, movies_sub2])
print(len(movies_subc))
800
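
By default, concat() keeps each piece's original row index, which can leave duplicate or non-sequential labels; passing ignore_index=True builds a fresh 0-to-n-1 index instead. A minimal sketch on toy data (values invented for illustration):

```python
import pandas as pd

part1 = pd.DataFrame({"Movie_Name": ["A", "B"], "My_Rating": [9.0, 8.0]})
part2 = pd.DataFrame({"Movie_Name": ["C", "D"], "My_Rating": [7.0, 6.0]})

# Stack the two pieces; ignore_index=True renumbers the rows 0..3
combined = pd.concat([part1, part2], ignore_index=True)
print(len(combined))         # 4
print(list(combined.index))  # [0, 1, 2, 3]
```

In the movies example above, the concatenated subset keeps the original index labels 100-499 and 900-1299 because ignore_index was not set.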

Merging is comparable to table joins when using SQL. This requires the use of keys and the declaration of a join method, such as "left", "right", "inner", or "outer".

In the example, I first create a unique ID by copying the row index to a column. I then break the data into two components, each containing the ID and a subset of the remaining columns. I then use the merge() method to merge the DataFrames using the "inner" method and the common "id" field. An "inner" join will only return rows that occur in both data sets. Since both DataFrames were derived from the same original DataFrame, they have identical rows, so the result would be the same as using "left", where all rows from the left table are maintained even if they don't occur in the right table, or "right", where all rows from the right table are maintained even if they don't occur in the left table.

In the second example, I use a query to separate out only movies released in 2008. When I perform a join with all of the data using the "inner" method, I only get back the common or shared rows.

Note that there is also a join() method that joins based on indexes. However, that will not be demonstrated here.

movies_df = pd.read_csv("YOUR FILE PATH HERE", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]

movies_df["id"] = movies_df.index
print(movies_df.head())

movies_first = movies_df[["id", "Movie_Name"]]
movies_second = movies_df[["id", "Director", "Release_Year"]]

movies_merge = pd.merge(movies_first, movies_second, how="inner", on="id")
print(movies_merge.head())
print(len(movies_merge))


movies_third = movies_df.query('Release_Year == 2008')[["id", "Director", "Release_Year"]]
movies_merge2 = pd.merge(movies_first, movies_third, how="inner", on="id")
print(movies_merge2.head())
print(len(movies_merge2))
                 Movie_Name        Director  Release_Year  My_Rating   Genre  \
0             Almost Famous   Cameron Crowe          2000       9.99   Drama   
1  The Shawshank Redemption  Frank Darabont          1994       9.98   Drama   
2             Groundhog Day    Harold Ramis          1993       9.96  Comedy   
3              Donnie Darko   Richard Kelly          2001       9.95  Sci-Fi   
4           Children of Men  Alfonso Cuaron          2006       9.94  Sci-Fi

   Own  id  
0  Yes   0  
1  Yes   1  
2  Yes   2  
3  Yes   3  
4  Yes   4  
   id                Movie_Name        Director  Release_Year
0   0             Almost Famous   Cameron Crowe          2000
1   1  The Shawshank Redemption  Frank Darabont          1994
2   2             Groundhog Day    Harold Ramis          1993
3   3              Donnie Darko   Richard Kelly          2001
4   4           Children of Men  Alfonso Cuaron          2006
1852
    id            Movie_Name           Director  Release_Year
0   11                Wall-E     Andrew Stanton          2008
1   38       The Dark Knight  Christopher Nolan          2008
2  100             In Bruges    Martin McDonagh          2008
3  102           Gran Torino     Clint Eastwood          2008
4  104  Let the Right One In    Tomas Alfredson          2008
93
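
To see which rows fail to match during a merge, an "outer" join with indicator=True adds a _merge column labeling each row as "left_only", "right_only", or "both". A sketch on toy keyed data (values invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({"id": [0, 1, 2], "Movie_Name": ["A", "B", "C"]})
right = pd.DataFrame({"id": [1, 2, 3], "Release_Year": [2008, 2008, 2008]})

# An outer join keeps every id from both sides; _merge shows the match status
merged = pd.merge(left, right, how="outer", on="id", indicator=True)
print(merged)
```

This is a handy diagnostic when an inner join, like the 2008 example above, returns fewer rows than expected.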

Concluding Remarks

There are many more topics that could be explored relating to NumPy and Pandas. However, the methods and techniques discussed here are the ones I use most often and have found most useful.

For more examples and details, please consult the documentation for NumPy and Pandas.

If you need to perform a specific task that I did not cover, Google it! Python has a large user base, so it is likely that you will find examples similar to your needs with a simple search.

In the next module, we will discuss methods for graphing and visualizing data using matplotlib, Seaborn, and Pandas.