Project: SQLite Clean and Upload data (Academy Awards)


Questions to address:

  • Clean up the dataset Academy Awards
  • Rename columns, drop columns, extract information from columns and map values
  • Load the table into a local database
  • Query the database to check if the upload was successful


Tools:

  • Series.str[0:4].astype(np.int64)
  • df[df['col1'].isin(col_list)].copy()
  • Series.map(dict)
  • df.drop(labels=cols)
  • df.reset_index(drop=True)
  • string manipulation
  • df.to_sql
  • pd.read_sql_query


load defaults

In [1]:
import sqlite3
import pandas as pd
import re
import numpy as np
import requests 
import os
import shutil

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))


Dataset: Academy Awards for different years

In [72]:
df = pd.read_csv('./data/academy_awards.csv', encoding='ISO-8859-1')
display(df.iloc[:5,:7])
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6
0 2010 (83rd) Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN
1 2010 (83rd) Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN
2 2010 (83rd) Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN
3 2010 (83rd) Actor -- Leading Role Colin Firth The King's Speech {'King George VI'} YES NaN NaN
4 2010 (83rd) Actor -- Leading Role James Franco 127 Hours {'Aron Ralston'} NO NaN NaN


Analysis:

Q1: Clean up the Year column and select only years after 2010

In [73]:
df['Year'] = df['Year'].str[0:4].astype(np.int64)
#select only years after 2000
later_than_2000 = df[df['Year']>2000].copy()

display(later_than_2000.iloc[:3,:7])
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN


Q2: Select only a subset of Categories

In [74]:
#select only the following categories
award_categories = ['Actor -- Leading Role', 'Actor -- Supporting Role', 'Actress -- Leading Role', 'Actress -- Supporting Role']
nominations = later_than_2000.loc[later_than_2000['Category'].isin(award_categories),:].copy()
display(nominations.iloc[:3,:7])
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} NO NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} NO NaN NaN
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} NO NaN NaN


Q3: Use map to replace Yes or No with 1 or 0 in the 'Won' columns

In [75]:
replace_dict = {'YES': 1, 'NO': 0}
nominations['Won?'] = nominations['Won?'].map(replace_dict)
display(nominations.iloc[:2,:7])
Year Category Nominee Additional Info Won? Unnamed: 5 Unnamed: 6
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} 0 NaN NaN
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} 0 NaN NaN


Q4: Rename 'Won?' to 'Won' and drop Unnamed columns

In [76]:
nominations['Won'] = nominations['Won?']
cols_to_drop = ['Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10']
nominations.drop(labels=cols_to_drop, axis=1, inplace=True)

display(nominations[:3])
Year Category Nominee Additional Info Won
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} 0
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} 0
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} 0


Q5: Reset index after all the changes which involved droping some columns

In [77]:
final_nominations = nominations.reset_index(drop=True).copy()
display(nominations[:3])
Year Category Nominee Additional Info Won
0 2010 Actor -- Leading Role Javier Bardem Biutiful {'Uxbal'} 0
1 2010 Actor -- Leading Role Jeff Bridges True Grit {'Rooster Cogburn'} 0
2 2010 Actor -- Leading Role Jesse Eisenberg The Social Network {'Mark Zuckerberg'} 0


Q6: Get the movie name and the Character played by the actor from the 'Additional info' columns

In [78]:
additional_info_one = final_nominations['Additional Info'].str.rstrip("'}")
additional_info_two = additional_info_one.str.split("{'")
additional_info_two = additional_info_two.str
movie_names = additional_info_two[0]
characters = additional_info_two[1]

final_nominations['Movie'] = movie_names
final_nominations['Character'] = characters
final_nominations = final_nominations.drop('Additional Info', axis=1)
display(final_nominations[:3])
Year Category Nominee Won Movie Character
0 2010 Actor -- Leading Role Javier Bardem 0 Biutiful Uxbal
1 2010 Actor -- Leading Role Jeff Bridges 0 True Grit Rooster Cogburn
2 2010 Actor -- Leading Role Jesse Eisenberg 0 The Social Network Mark Zuckerberg


Q7: Use pd.to_Sql to upload tables into a database

To create a new database file, we use the sqlite3 library to connect to a file path that doesn't exist yet. If Python can't find the file we specified, it will create it for us and treat it as a SQLite database file.

In [81]:
conn = sqlite3.connect('./data/nominations.db')
final_nominations.to_sql('nominations', conn, index=False, if_exists='replace')
conn.close()
print('table loaded')
table loaded


Q8: Access the newly created database to see if the upload was done correctly

In [82]:
conn = sqlite3.connect('./data/nominations.db')
query = 'pragma table_info(nominations)'
schema = conn.execute(query).fetchall()
display(schema)

query = 'select * from nominations LIMIT 10'
df = pd.read_sql_query(query, conn)
display(df[:3])
conn.close()
[(0, 'Year', 'INTEGER', 0, None, 0),
 (1, 'Category', 'TEXT', 0, None, 0),
 (2, 'Nominee', 'TEXT', 0, None, 0),
 (3, 'Won', 'INTEGER', 0, None, 0),
 (4, 'Movie', 'TEXT', 0, None, 0),
 (5, 'Character', 'TEXT', 0, None, 0)]
Year Category Nominee Won Movie Character
0 2010 Actor -- Leading Role Javier Bardem 0 Biutiful Uxbal
1 2010 Actor -- Leading Role Jeff Bridges 0 True Grit Rooster Cogburn
2 2010 Actor -- Leading Role Jesse Eisenberg 0 The Social Network Mark Zuckerberg

Future steps, explore academy_awards.csv and fix the rest of the dataset:

  • The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?
  • What are all the different formatting styles that the Additional Info column contains. Use regular expressions to capture these patterns and clean them up?
    • The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?
    • Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data?
  • Handle awards ceremonies that included movies from 2 years: keep only later year
    • E.g. see 1927/28 (1st) in the Year column.
In [ ]: