Pandas
Most scientists are first introduced to data tables in the form of spreadsheets, such as an Excel spreadsheet. In such a structure, each record or feature is represented by a row of data, while each column holds a specific piece of information for every record. For example, each row in a table could contain all the data for a specific student, and each column could hold one attribute across students, such as academic year, age, or height.
Further, spreadsheets are able to hold different data types in each column. For example, academic year is categorical or ordinal while height is ratio or continuous.
A comparable data structure would be handy for use in Python. This is made available by the Pandas library. Pandas allows for data to be stored in DataFrames. If you work in the R environment, this is very similar to the concept of data frames in R. In fact, Pandas DataFrames were inspired by R data frames.
Pandas makes use of the NumPy library, so it is generally a good idea to import NumPy if you plan to use Pandas. Also, you will need to install Pandas and NumPy into your environment prior to using them.
The complete documentation for Pandas can be found here.
After working through this module you will be able to:
- create and manipulate Series and DataFrames using Pandas.
- query and subset DataFrames.
- manipulate DataFrames.
- summarize and group DataFrames.
Intro to Series
import numpy as np
import pandas as pd
Before we talk about DataFrames, I will introduce the concept of a Series. A Series is actually very similar to a NumPy array except that it allows axis labels to be assigned. Examples of generating a Series from a list, a NumPy array, and a dictionary are provided below.
A Series is comparable to a single column from a spreadsheet.
lst1 = ["GIS", "Remote Sensing", "Spatial Analysis", "Digital Cartography"]
arr1 = np.array([350, 455, 457, 642])
dict1 = {'Class1':"GIS", "Class2":"Remote Sensing", "Class3":"Spatial Analysis", "Class4":"Digital Cartography"}
s_lst = pd.Series(data=lst1, index = ["Class1", "Class2", "Class3", "Class4"])
s_arr = pd.Series(data=arr1, index = ["Class1", "Class2", "Class3", "Class4"])
s_dict = pd.Series(dict1)
print(s_lst)
print(s_arr)
print(s_dict)
Class1 GIS
Class2 Remote Sensing
Class3 Spatial Analysis
Class4 Digital Cartography
dtype: object
Class1 350
Class2 455
Class3 457
Class4 642
dtype: int32
Class1 GIS
Class2 Remote Sensing
Class3 Spatial Analysis
Class4 Digital Cartography
dtype: object
Labels or names can then be used to select data either using bracket notation or dot notation.
You can use whichever method you prefer. However, if you use dot notation, you should not include spaces in the column names.
print(s_dict["Class3"])
print(s_dict.Class3)
Spatial Analysis
Spatial Analysis
Intro to DataFrames
Let's start by building a DataFrame from a set of lists. First, I create three lists to hold different components of course title information. Next, I combine these lists into a dictionary. Finally, I convert the dictionary into a DataFrame. Note that a well-formatted table is generated by just calling the DataFrame name without the print() function; however, I am using print() here since I have found that this works better when rendering a web page. Also, the keys from the dictionary have been used as the column names, and a default index has been assigned to each row.
prefix = ["Geol", "Geol", "Geol", "Geol", "Geog", "Geog", "Geog"]
cnum = [103, 321, 331, 341, 350, 455, 462]
cname = ["Earth Through Time", "Geomorphology", "Paleontology", "Structural Geology", "GIScience", "Remote Sensing", "Digital Cartography"]
course_dict = {"prefix": prefix, "course_number": cnum, "course_name": cname}
course_df = pd.DataFrame(course_dict)
print(course_df)
prefix course_number course_name
0 Geol 103 Earth Through Time
1 Geol 321 Geomorphology
2 Geol 331 Paleontology
3 Geol 341 Structural Geology
4 Geog 350 GIScience
5 Geog 455 Remote Sensing
6 Geog 462 Digital Cartography
Since column names are assigned, they can be used to select out individual columns using bracket or dot notation. Single columns can be saved as a Series.
print(course_df["course_name"])
# Or course_df.course_name
0 Earth Through Time
1 Geomorphology
2 Paleontology
3 Structural Geology
4 GIScience
5 Remote Sensing
6 Digital Cartography
Name: course_name, dtype: object
A list of column names can be provided to subset out multiple columns.
print(course_df[["course_number", "course_name"]])
course_number course_name
0 103 Earth Through Time
1 321 Geomorphology
2 331 Paleontology
3 341 Structural Geology
4 350 GIScience
5 455 Remote Sensing
6 462 Digital Cartography
You can also subset based on column names and row labels combined using the .loc indexer. The .iloc indexer, in contrast, is used for selection based on integer positions.
print(course_df.loc[[1, 2, 4],["course_number", "course_name"]])
# Or course_df.iloc[[1, 2, 4],[1, 2]]
course_number course_name
1 321 Geomorphology
2 331 Paleontology
4 350 GIScience
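Because the default integer index above makes .loc and .iloc appear interchangeable, here is a minimal sketch using a labeled index to illustrate the difference (course_by_name is a hypothetical variable name):
# Set the course name as the row index so labels and positions differ.
course_by_name = course_df.set_index("course_name")
# Select by label with .loc and by integer position with .iloc; both return the same row here.
print(course_by_name.loc["Geomorphology"])
print(course_by_name.iloc[1])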
You can even use the data stored in existing columns to create a new column. Note that the new column does not need to be declared prior to writing to it. In the example, I have written the entire course name to a new column. The map() method is used to make sure all data are treated as strings. It allows the same function, in this case str(), to be applied to each element in an iterable, in this case each value in the column. I am including blank spaces so that the components are not run together.
course_df["full_name"] = course_df["prefix"].map(str) + " " + course_df["course_number"].map(str) + ": " + course_df["course_name"].map(str)
print(course_df)
prefix course_number course_name full_name
0 Geol 103 Earth Through Time Geol 103: Earth Through Time
1 Geol 321 Geomorphology Geol 321: Geomorphology
2 Geol 331 Paleontology Geol 331: Paleontology
3 Geol 341 Structural Geology Geol 341: Structural Geology
4 Geog 350 GIScience Geog 350: GIScience
5 Geog 455 Remote Sensing Geog 455: Remote Sensing
6 Geog 462 Digital Cartography Geog 462: Digital Cartography
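As an aside, the astype() method provides an equivalent way to cast each column to strings before concatenating; a minimal sketch producing the same full_name column:
# astype(str) casts the whole column to strings, much like map(str) above.
course_df["full_name"] = (course_df["prefix"].astype(str) + " "
                          + course_df["course_number"].astype(str) + ": "
                          + course_df["course_name"].astype(str))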
Instead of creating data tables or DataFrames manually, you are probably more likely to read in a data table from a file or web link. Fortunately, Pandas provides functions for reading data in from a variety of formats. Here are some examples:
- read_table(): delimited file (TXT, CSV, etc.)
- read_csv(): comma-separated values (CSV)
- read_excel(): Excel Spreadsheet
- read_json(): JavaScript Object Notation (JSON)
- read_html(): HTML table
- read_sas(): SAS file
Full documentation on reading in data can be found here.
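The other readers follow the same pattern of returning a DataFrame. A rough sketch is shown below; the file names are hypothetical:
# Hypothetical file names; read_excel() may require an engine such as openpyxl to be installed.
xlsx_df = pd.read_excel("D:/mydata/matts_movies.xlsx", sheet_name=0)
json_df = pd.read_json("D:/mydata/matts_movies.json")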
In the example below, I am reading in a CSV file from my local computer. The sep argument is used to define the delimiter. However, commas are the default, so it isn't necessary to include this argument in this case. Setting the header argument to 0 indicates that the first row of data should be treated as column names or headers. It isn't always necessary to specify the character encoding; however, I have found that it is necessary for this data set due to the use of special characters.
To view the first 10 rows of the data, I use the head() method. The len() function returns the number of rows.
This data set was created by my brother and provides his ratings for movies along with the movie's name, director, release year, genre, and whether or not he owns it.
movies_df = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
print(movies_df.head(10))
Movie Name Director Release Year My Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
8 No Country for Old Men Joel and Ethan Coen 2007 9.90
9 Seven David Fincher 1995 9.88
Genre Own
0 Drama Yes
1 Drama Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
print(len(movies_df))
1852
DataFrame Query and Subset
Let's now use this data table to explore data query and selection methods. In the first example, I am selecting out all movies that are dramas and saving them to a new DataFrame. Note the use of bracket notation.
The second example includes a compound query. Note the use of parentheses within the query.
Lastly, it is also possible to subset out only certain columns that meet the query. In the last example, I am subsetting out just the movie name and director.
#Example 1
just_dramas = movies_df[movies_df["Genre"]=="Drama"]
print(just_dramas.head(10))
print(len(just_dramas))
#Example 2
dramas_gt_9 = movies_df[(movies_df["Genre"]=="Drama") & (movies_df["My Rating"]>9.0)]
print(len(dramas_gt_9))
#Example 3
dramas_gt_9_sub = movies_df[(movies_df["Genre"]=="Drama") & (movies_df["My Rating"]>9.0)][["Movie Name", "Director"]]
print(dramas_gt_9_sub.head(10))
Movie Name Director Release Year My Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
21 Boyhood Richard Linklater 2014 9.75
26 American Beauty Sam Mendes 1999 9.70
39 Mystic River Clint Eastwood 2003 9.56
51 Hannah and Her Sisters Woody Allen 1986 9.47
52 In the Bedroom Todd Field 2001 9.47
57 Big Fish Tim Burton 2003 9.44
60 The Sweet Hereafter Atom Egoyan 1997 9.42
62 Stand By Me Rob Reiner 1986 9.40
Genre Own
0 Drama Yes
1 Drama Yes
21 Drama No
26 Drama Yes
39 Drama Yes
51 Drama Yes
52 Drama Yes
57 Drama Yes
60 Drama Yes
62 Drama Yes
321
23
Movie Name Director
0 Almost Famous Cameron Crowe
1 The Shawshank Redemption Frank Darabont
21 Boyhood Richard Linklater
26 American Beauty Sam Mendes
39 Mystic River Clint Eastwood
51 Hannah and Her Sisters Woody Allen
52 In the Bedroom Todd Field
57 Big Fish Tim Burton
60 The Sweet Hereafter Atom Egoyan
62 Stand By Me Rob Reiner
Another option for performing queries is to use the query() method provided by Pandas. When using this method, the query will need to be provided as an expression in string form. Also, spaces in column names can be problematic, so spaces should be removed or replaced with underscores.
#Remove spaces in column names using list comprehension
movies_df.columns = [column.replace(" ", "_") for column in movies_df.columns]
#Example 1
just_dramas = movies_df.query('Genre=="Drama"')
print(just_dramas.head(10))
print(len(just_dramas))
#Example 2
dramas_gt_9 = movies_df.query('Genre=="Drama" and My_Rating > 9.0')
print(len(dramas_gt_9))
#Example 3
dramas_gt_9_sub = movies_df.query('Genre=="Drama" and My_Rating > 9.0')[["Movie_Name", "Director"]]
print(dramas_gt_9_sub.head(10))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
21 Boyhood Richard Linklater 2014 9.75
26 American Beauty Sam Mendes 1999 9.70
39 Mystic River Clint Eastwood 2003 9.56
51 Hannah and Her Sisters Woody Allen 1986 9.47
52 In the Bedroom Todd Field 2001 9.47
57 Big Fish Tim Burton 2003 9.44
60 The Sweet Hereafter Atom Egoyan 1997 9.42
62 Stand By Me Rob Reiner 1986 9.40
Genre Own
0 Drama Yes
1 Drama Yes
21 Drama No
26 Drama Yes
39 Drama Yes
51 Drama Yes
52 Drama Yes
57 Drama Yes
60 Drama Yes
62 Drama Yes
321
23
Movie_Name Director
0 Almost Famous Cameron Crowe
1 The Shawshank Redemption Frank Darabont
21 Boyhood Richard Linklater
26 American Beauty Sam Mendes
39 Mystic River Clint Eastwood
51 Hannah and Her Sisters Woody Allen
52 In the Bedroom Todd Field
57 Big Fish Tim Burton
60 The Sweet Hereafter Atom Egoyan
62 Stand By Me Rob Reiner
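As an aside, more recent versions of Pandas also allow column names containing spaces to be referenced inside query() by wrapping them in backticks. A minimal sketch, assuming a recent Pandas version and re-reading the file so the original column names are intact:
# Re-read the raw file so the column names still contain spaces.
movies_raw = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
# Backticks let query() reference the "My Rating" column despite the space in its name.
dramas_gt_9_alt = movies_raw.query('Genre == "Drama" and `My Rating` > 9.0')
print(len(dramas_gt_9_alt))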
Once a query is complete, you may want to save the result back to a file on your local machine. The code below provides an example for saving out the last subset of data to a CSV file. The Pandas documentation provides examples for saving to other formats.
dramas_gt_9_sub.to_csv("D:/mydata/matts_movies_sub.csv", sep=",", header=True)
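Note that to_csv() also writes the row index as an extra column by default; setting index=False omits it. Writers for other formats follow the same pattern. A minimal sketch with a hypothetical Excel output path:
# index=False drops the row index from the output file.
dramas_gt_9_sub.to_csv("D:/mydata/matts_movies_sub.csv", sep=",", header=True, index=False)
# Hypothetical path; to_excel() may require an engine such as openpyxl to be installed.
dramas_gt_9_sub.to_excel("D:/mydata/matts_movies_sub.xlsx", index=False)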
Missing Values
The NULL, NoData, or missing indicator in Pandas is NaN. To begin exploring missing values, let's recode some of the data to NaN in the movies data set. In the example below, I am changing the "Drama" and "Fantasy" genres to NaN. I am also recoding any ratings between 8.9 and 9.9 to NaN. The replace() method is used to change the categories while the mask() method is used to recode the rating values. np.nan is the NumPy constant used to represent null values.
movies_nan = movies_df.copy()
movies_nan["Genre"] = movies_nan[["Genre"]].replace(["Drama", "Fantasy"], np.nan)
movies_nan['My_Rating'] = movies_nan['My_Rating'].mask(movies_nan['My_Rating'].between(8.9, 9.9))
print(movies_nan.head(10))
print(len(movies_nan))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
8 No Country for Old Men Joel and Ethan Coen 2007 NaN
9 Seven David Fincher 1995 NaN
Genre Own
0 NaN Yes
1 NaN Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
1852
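Before dropping or filling missing values, it is often helpful to count them per column. The isna() method returns Booleans that can be summed; a minimal sketch using the recoded DataFrame:
# isna() returns True where a value is missing; summing counts the missing values in each column.
print(movies_nan.isna().sum())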
The dropna() method can be used to remove rows or columns that contain missing data. If the axis parameter is set to 0, rows with missing values in any column will be removed. If it is set to 1, columns with missing data in any row will be removed.
movies_drop = movies_nan.dropna(axis=0)
print(movies_drop.head(10))
print(len(movies_drop))
movies_dropc = movies_nan.dropna(axis=1)
print(movies_dropc.head(10))
print(len(movies_dropc))
Movie_Name Director Release_Year My_Rating \
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
127 After Hours Martin Scorsese 1985 8.89
128 Shotgun Stories Jeff Nichols 2007 8.89
129 The Untouchables Brian de Palma 1987 8.89
130 2 Days in Paris Julie Delpy 2007 8.88
Genre Own
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
127 Comedy Yes
128 Independent Yes
129 Action Yes
130 Romance Yes
1420
Movie_Name Director Release_Year Own
0 Almost Famous Cameron Crowe 2000 Yes
1 The Shawshank Redemption Frank Darabont 1994 Yes
2 Groundhog Day Harold Ramis 1993 Yes
3 Donnie Darko Richard Kelly 2001 Yes
4 Children of Men Alfonso Cuaron 2006 Yes
5 Annie Hall Woody Allen 1977 Yes
6 Rushmore Wes Anderson 1998 Yes
7 Memento Christopher Nolan 2000 Yes
8 No Country for Old Men Joel and Ethan Coen 2007 Yes
9 Seven David Fincher 1995 Yes
1852
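dropna() also accepts a subset argument when only missing values in certain columns matter; a minimal sketch that drops rows only when the genre is missing:
# Rows with a missing Genre are removed; rows that are only missing a rating are kept.
movies_drop_genre = movies_nan.dropna(axis=0, subset=["Genre"])
print(len(movies_drop_genre))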
The .fillna() method can be used to replace NA values with another value or string. In the example, I am changing the missing genres to "Unknown Genre".
movies_nan["Genre"] = movies_nan["Genre"].fillna(value="Unknown Genre")
print(movies_nan.head(10))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
8 No Country for Old Men Joel and Ethan Coen 2007 NaN
9 Seven David Fincher 1995 NaN
Genre Own
0 Unknown Genre Yes
1 Unknown Genre Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
It is also possible to replace null values with a statistic derived from the available values. In the example, I am replacing the missing ratings with the mean of all available ratings. This would likely not be valid in this case: the values recoded to NaN were all high ratings, and different genres may also have different average ratings, so the overall mean is a poor estimate. This is just an example.
movies_nan["My_Rating"] = movies_nan["My_Rating"].fillna(value=movies_nan["My_Rating"].mean())
print(movies_nan.head(10))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.990000
1 The Shawshank Redemption Frank Darabont 1994 9.980000
2 Groundhog Day Harold Ramis 1993 9.960000
3 Donnie Darko Richard Kelly 2001 9.950000
4 Children of Men Alfonso Cuaron 2006 9.940000
5 Annie Hall Woody Allen 1977 9.930000
6 Rushmore Wes Anderson 1998 9.920000
7 Memento Christopher Nolan 2000 9.910000
8 No Country for Old Men Joel and Ethan Coen 2007 6.786509
9 Seven David Fincher 1995 6.786509
Genre Own
0 Unknown Genre Yes
1 Unknown Genre Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
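If per-genre differences are the concern, one option is to fill each missing rating with the mean of its own genre using groupby() and transform(). A minimal sketch, assuming it is applied in place of the overall-mean fill above (i.e., while My_Rating still contains NaN):
# transform("mean") returns a Series of per-genre means aligned to the original rows.
genre_means = movies_nan.groupby("Genre")["My_Rating"].transform("mean")
movies_nan["My_Rating"] = movies_nan["My_Rating"].fillna(genre_means)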
Grouping and Summarizing DataFrames
Pandas provides methods for summarizing data as described in the examples below. First, I am creating individual statistics and saving them to variables. I then create a Series from a dictionary of these statistics, convert it to a DataFrame using the to_frame() method, then transpose the DataFrame using transpose().
movies_df = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns = [column.replace(" ", "_") for column in movies_df.columns]
mov_cnt = movies_df["My_Rating"].count()
mov_mn = movies_df["My_Rating"].mean()
mov_max = movies_df["My_Rating"].max()
mov_min = movies_df["My_Rating"].min()
mov_rang = mov_max-mov_min
mov_stats= pd.Series({"Count": mov_cnt, "Mean": mov_mn, "Max": mov_max, "Min": mov_min, "Range": mov_rang}).to_frame().transpose()
print(mov_stats)
Count Mean Max Min Range
0 1852.0 6.952527 9.99 0.67 9.32
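The same statistics can also be obtained in a single call using the agg() method; a minimal sketch:
# agg() accepts a list of statistic names and returns a Series of results.
print(movies_df["My_Rating"].agg(["count", "mean", "max", "min"]))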
It is also possible to obtain summary statistics for each group separately by applying the very useful groupby() method. In the example below, I am obtaining stats for each genre and saving them into a DataFrame. The columns do not need to be defined beforehand.
movies_df = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
genre_stats = pd.DataFrame()
genre_stats["Count"] = movies_df.groupby("Genre")["My_Rating"].count()
genre_stats["Mean"] = movies_df.groupby("Genre")["My_Rating"].mean()
genre_stats["Max"] = movies_df.groupby("Genre")["My_Rating"].max()
genre_stats["Min"] = movies_df.groupby("Genre")["My_Rating"].min()
genre_stats["Range"] = genre_stats["Max"] - genre_stats["Min"]
print(genre_stats)
Count Mean Max Min Range
Genre
Action 197 6.651878 9.87 0.67 9.20
Classic 47 6.741064 8.33 4.03 4.30
Comedy 275 6.738545 9.96 1.76 8.20
Documentary 78 7.283718 8.87 5.43 3.44
Drama 321 7.140561 9.99 0.88 9.11
Family 84 6.670714 9.86 1.01 8.85
Fantasy 21 7.473810 9.85 6.21 3.64
Foreign 157 7.453439 9.63 3.76 5.87
Horror 96 6.523021 9.60 1.65 7.95
Independent 190 6.940263 9.92 1.81 8.11
Romance 94 7.238723 9.84 3.45 6.39
Sci-Fi 62 6.872419 9.95 1.92 8.03
Sports 23 6.391304 8.78 2.94 5.84
Thriller 164 7.095976 9.91 3.11 6.80
War 27 6.542593 8.67 4.37 4.30
Western 14 6.960000 8.65 4.53 4.12
comedy 1 6.030000 6.03 6.03 0.00
thriller 1 5.750000 5.75 5.75 0.00
You may have noticed above that there are some issues with the Genre names. For example, "Comedy" and "Thriller" show up twice in the summary because these genres were not always capitalized. This can be fixed.
In the example below, I am using the unique() method to obtain a list of all genre names. I then define a list of new names where capitalization is consistent. Next, I use the replace() method to change the original name set to my new name set and fix the capitalization issues.
If I re-execute the code from above, you can see that the duplicates have been removed in the summary.
orig_names = list(movies_df["Genre"].unique())
print(orig_names)
new_names = ['Drama', 'Comedy', 'Sci-Fi', 'Independent', 'Thriller', 'Action', 'Family',
'Fantasy', 'Romance', 'Foreign', 'Horror', 'Documentary', 'Sports', 'War',
'Western', 'Classic', 'Comedy', 'Thriller']
movies_df["Genre"] = movies_df["Genre"].replace(orig_names, new_names)
list(movies_df["Genre"].unique())
genre_stats = pd.DataFrame()
genre_stats["Count"] = movies_df.groupby("Genre")["My_Rating"].count()
genre_stats["Mean"] = movies_df.groupby("Genre")["My_Rating"].mean()
genre_stats["Max"] = movies_df.groupby("Genre")["My_Rating"].max()
genre_stats["Min"] = movies_df.groupby("Genre")["My_Rating"].min()
genre_stats["Range"] = genre_stats["Max"] - genre_stats["Min"]
print(genre_stats)
['Drama', 'Comedy', 'Sci-Fi', 'Independent', 'Thriller', 'Action', 'Family', 'Fantasy', 'Romance', 'Foreign', 'Horror', 'Documentary', 'Sports', 'War', 'Western', 'Classic', 'comedy', 'thriller']
Count Mean Max Min Range
Genre
Action 197 6.651878 9.87 0.67 9.20
Classic 47 6.741064 8.33 4.03 4.30
Comedy 276 6.735978 9.96 1.76 8.20
Documentary 78 7.283718 8.87 5.43 3.44
Drama 321 7.140561 9.99 0.88 9.11
Family 84 6.670714 9.86 1.01 8.85
Fantasy 21 7.473810 9.85 6.21 3.64
Foreign 157 7.453439 9.63 3.76 5.87
Horror 96 6.523021 9.60 1.65 7.95
Independent 190 6.940263 9.92 1.81 8.11
Romance 94 7.238723 9.84 3.45 6.39
Sci-Fi 62 6.872419 9.95 1.92 8.03
Sports 23 6.391304 8.78 2.94 5.84
Thriller 165 7.087818 9.91 3.11 6.80
War 27 6.542593 8.67 4.37 4.30
Western 14 6.960000 8.65 4.53 4.12
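The per-genre statistics above can also be computed in one pass by combining groupby() with agg(); a minimal sketch:
# Passing a list of statistic names to agg() computes each one for every group.
genre_stats_alt = movies_df.groupby("Genre")["My_Rating"].agg(["count", "mean", "max", "min"])
genre_stats_alt["range"] = genre_stats_alt["max"] - genre_stats_alt["min"]
print(genre_stats_alt)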
The describe() method can be used to obtain a set of default summary statistics for a column of data. Combining this with groupby() allows for the calculation of statistics by group.
print(movies_df.groupby("Genre")["My_Rating"].describe())
count mean std min 25% 50% 75% max
Genre
Action 197.0 6.651878 1.687098 0.67 5.8400 6.950 7.7300 9.87
Classic 47.0 6.741064 0.903364 4.03 6.3000 6.790 7.2550 8.33
Comedy 276.0 6.735978 1.494788 1.76 5.9025 6.800 7.7025 9.96
Documentary 78.0 7.283718 0.778165 5.43 6.8725 7.290 7.8800 8.87
Drama 321.0 7.140561 1.292603 0.88 6.5800 7.150 7.8900 9.99
Family 84.0 6.670714 1.654381 1.01 5.9550 6.930 7.4875 9.86
Fantasy 21.0 7.473810 1.113838 6.21 6.6500 7.030 7.9400 9.85
Foreign 157.0 7.453439 0.778073 3.76 6.9500 7.410 7.9200 9.63
Horror 96.0 6.523021 1.491829 1.65 5.9000 6.770 7.2425 9.60
Independent 190.0 6.940263 1.208283 1.81 6.3300 6.880 7.6525 9.92
Romance 94.0 7.238723 1.471279 3.45 6.4925 7.085 8.2150 9.84
Sci-Fi 62.0 6.872419 1.901382 1.92 5.9300 7.040 8.2000 9.95
Sports 23.0 6.391304 1.591958 2.94 5.5800 6.770 7.3000 8.78
Thriller 165.0 7.087818 1.427839 3.11 6.3200 7.120 8.0300 9.91
War 27.0 6.542593 1.096518 4.37 5.8400 6.560 7.2100 8.67
Western 14.0 6.960000 1.411033 4.53 6.0400 7.070 8.2275 8.65
Concatenate and Merge
The Pandas concat() function is used to concatenate DataFrames that have the same columns. This is comparable to copying and pasting rows from two spreadsheets into a new spreadsheet. To demonstrate this, I have extracted two subsets of rows using index slicing. Next, I concatenate them back into a new DataFrame.
movies_df = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
movies_sub1 = movies_df[100:500]
movies_sub2 = movies_df[900:1300]
movies_subc = pd.concat([movies_sub1, movies_sub2])
print(len(movies_subc))
800
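Note that concat() keeps the original row indexes by default, so the new DataFrame above retains the 100-499 and 900-1299 labels; passing ignore_index=True generates a fresh index. A minimal sketch:
# ignore_index=True renumbers the rows from 0 instead of keeping the original indexes.
movies_subc2 = pd.concat([movies_sub1, movies_sub2], ignore_index=True)
print(movies_subc2.index)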
Merge is comparable to table joins in SQL. This requires the use of keys and the declaration of a join type, such as "left", "right", "inner", or "outer".
In the example, I first create a unique ID by copying the row index to a column. I then break the data into two components, each containing the ID and a subset of the remaining columns. I then use the merge() function to merge the DataFrames using the "inner" method and the common "id" field. "Inner" only returns rows that occur in both data sets. Since both DataFrames were derived from the same original DataFrame, they contain identical rows, so the result would be the same as using "left", where all rows from the left table are kept even if they don't occur in the right table, or "right", where all rows from the right table are kept even if they don't occur in the left table.
In the second example, I use a query to separate out only movies released in 2008. When I perform a join with all of the data using the "inner" method, I only get back the common or shared rows.
Note that there is also a join() method that joins based on indexes. However, that will not be demonstrated here.
movies_df = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
movies_df["id"] = movies_df.index
print(movies_df.head())
movies_first = movies_df[["id", "Movie_Name"]]
movies_second = movies_df[["id", "Director", "Release_Year"]]
movies_merge = pd.merge(movies_first, movies_second, how="inner", on="id")
print(movies_merge.head())
print(len(movies_merge))
movies_third = movies_df.query('Release_Year == 2008')[["id", "Director", "Release_Year"]]
movies_merge2 = pd.merge(movies_first, movies_third, how="inner", on="id")
print(movies_merge2.head())
print(len(movies_merge2))
Movie_Name Director Release_Year My_Rating Genre \
0 Almost Famous Cameron Crowe 2000 9.99 Drama
1 The Shawshank Redemption Frank Darabont 1994 9.98 Drama
2 Groundhog Day Harold Ramis 1993 9.96 Comedy
3 Donnie Darko Richard Kelly 2001 9.95 Sci-Fi
4 Children of Men Alfonso Cuaron 2006 9.94 Sci-Fi
Own id
0 Yes 0
1 Yes 1
2 Yes 2
3 Yes 3
4 Yes 4
id Movie_Name Director Release_Year
0 0 Almost Famous Cameron Crowe 2000
1 1 The Shawshank Redemption Frank Darabont 1994
2 2 Groundhog Day Harold Ramis 1993
3 3 Donnie Darko Richard Kelly 2001
4 4 Children of Men Alfonso Cuaron 2006
1852
id Movie_Name Director Release_Year
0 11 Wall-E Andrew Stanton 2008
1 38 The Dark Knight Christopher Nolan 2008
2 100 In Bruges Martin McDonagh 2008
3 102 Gran Torino Clint Eastwood 2008
4 104 Let the Right One In Tomas Alfredson 2008
93
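To illustrate the difference between the join types, a left merge keeps every row of the left DataFrame and fills the unmatched columns with NaN; a minimal sketch using the same subsets:
# All rows of movies_first are kept; Director and Release_Year are NaN where no 2008 match exists.
movies_merge3 = pd.merge(movies_first, movies_third, how="left", on="id")
print(len(movies_merge3))
print(movies_merge3.head())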
Concluding Remarks
There are many more topics that could be explored relating to Pandas. However, I have found the methods and techniques discussed here to be the ones I use most often and find most useful.
For more examples and details, please consult the documentation for Pandas.
If you need to perform a specific task that I did not cover, Google it! Python has a large user base, so it is likely that you will find examples similar to your needs with a simple search.
In the next module, we will discuss methods for graphing and visualizing data using matplotlib, Seaborn, and Pandas.