Project: SQLite Clean and Upload data (Academy Awards)¶

Questions to address:¶

Clean up the dataset Academy Awards
Rename columns, drop columns, extract information from columns and map values
Load the table into a local database
Query the database to check if the upload was successful

Tools:¶

Series.str[0:4].astype(np.int64)
df[df['col1'].isin(col_list)].copy()
Series.map(dict)
df.drop(labels=cols)
df.reset_index(drop=True)
string manipulation
df.to_sql
pd.read_sql_query

load defaults¶

import sqlite3
import pandas as pd
import re
import numpy as np
import requests 
import os
import shutil

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))

Dataset: Academy Awards for different years¶

df = pd.read_csv('./data/academy_awards.csv', encoding='ISO-8859-1')
display(df.iloc[:5,:7])

Analysis:¶

Q1: Clean up the Year column and select only years after 2010¶

df['Year'] = df['Year'].str[0:4].astype(np.int64)
#select only years after 2000
later_than_2000 = df[df['Year']>2000].copy()

display(later_than_2000.iloc[:3,:7])

Q2: Select only a subset of Categories¶

#select only the following categories
award_categories = ['Actor -- Leading Role', 'Actor -- Supporting Role', 'Actress -- Leading Role', 'Actress -- Supporting Role']
nominations = later_than_2000.loc[later_than_2000['Category'].isin(award_categories),:].copy()
display(nominations.iloc[:3,:7])

Q3: Use map to replace Yes or No with 1 or 0 in the 'Won' columns¶

replace_dict = {'YES': 1, 'NO': 0}
nominations['Won?'] = nominations['Won?'].map(replace_dict)
display(nominations.iloc[:2,:7])

Q4: Rename 'Won?' to 'Won' and drop Unnamed columns¶

nominations['Won'] = nominations['Won?']
cols_to_drop = ['Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10']
nominations.drop(labels=cols_to_drop, axis=1, inplace=True)

display(nominations[:3])

Q5: Reset index after all the changes which involved droping some columns¶

final_nominations = nominations.reset_index(drop=True).copy()
display(nominations[:3])

Q6: Get the movie name and the Character played by the actor from the 'Additional info' columns¶

additional_info_one = final_nominations['Additional Info'].str.rstrip("'}")
additional_info_two = additional_info_one.str.split("{'")
additional_info_two = additional_info_two.str
movie_names = additional_info_two[0]
characters = additional_info_two[1]

final_nominations['Movie'] = movie_names
final_nominations['Character'] = characters
final_nominations = final_nominations.drop('Additional Info', axis=1)
display(final_nominations[:3])

Q7: Use pd.to_Sql to upload tables into a database¶

To create a new database file, we use the sqlite3 library to connect to a file path that doesn't exist yet. If Python can't find the file we specified, it will create it for us and treat it as a SQLite database file.

conn = sqlite3.connect('./data/nominations.db')
final_nominations.to_sql('nominations', conn, index=False, if_exists='replace')
conn.close()
print('table loaded')

table loaded

Q8: Access the newly created database to see if the upload was done correctly¶

conn = sqlite3.connect('./data/nominations.db')
query = 'pragma table_info(nominations)'
schema = conn.execute(query).fetchall()
display(schema)

query = 'select * from nominations LIMIT 10'
df = pd.read_sql_query(query, conn)
display(df[:3])
conn.close()

[(0, 'Year', 'INTEGER', 0, None, 0),
 (1, 'Category', 'TEXT', 0, None, 0),
 (2, 'Nominee', 'TEXT', 0, None, 0),
 (3, 'Won', 'INTEGER', 0, None, 0),
 (4, 'Movie', 'TEXT', 0, None, 0),
 (5, 'Character', 'TEXT', 0, None, 0)]

Future steps, explore academy_awards.csv and fix the rest of the dataset:

The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?
What are all the different formatting styles that the Additional Info column contains. Use regular expressions to capture these patterns and clean them up?
- The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?
- Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data?
Handle awards ceremonies that included movies from 2 years: keep only later year
- E.g. see 1927/28 (1st) in the Year column.

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6
0	2010 (83rd)	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO	NaN	NaN
1	2010 (83rd)	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	NO	NaN	NaN
2	2010 (83rd)	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	NO	NaN	NaN
3	2010 (83rd)	Actor -- Leading Role	Colin Firth	The King's Speech {'King George VI'}	YES	NaN	NaN
4	2010 (83rd)	Actor -- Leading Role	James Franco	127 Hours {'Aron Ralston'}	NO	NaN	NaN

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6
0	2010	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO	NaN	NaN
1	2010	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	NO	NaN	NaN
2	2010	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	NO	NaN	NaN