Project: SQLile and DBDesigner Designing and Creating a Database (Baseball Stats)


Questions to address:

  • Start with exploratory data analysis (EDA)
  • Create database from original files
  • Normalize database
  • Create new tables according to normalization decisions and insert data


Tools:

  • df.to_sql
  • https://www.dbdesigner.net/
  • ALTER TABLE game_log ADD COLUMN
  • UPDATE game_log set game_id
  • CREATE TABLE IF NOT EXISTS
  • INSERT OR IGNORE INTO
  • UNION
  • insert via dictionary:
    • query_vars = {"hv": hv, "num": num}
    • run_command(database_file, template.format(**query_vars))


load defaults

In [1]:
import sqlite3
import pandas as pd
import re
import numpy as np
import requests 
import os
import shutil

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))


EDA (Exploratory Data Analysis)

  • Becoming familiar, at a high level, with the meaning of each column in each file.
  • Thinking about the relationships between columns within each file.
  • Thinking about the relationships between columns across different files.
In [2]:
game_log = pd.read_csv('./data/game_log.csv', low_memory=False)
park_c = pd.read_csv('./data/park_codes.csv')
person_c = pd.read_csv('./data/person_codes.csv')
team_c = pd.read_csv('./data/team_codes.csv')

print("reading done")
reading done


game_log.csv: one line per game with stats like date, teams, player names, game stats for victory and loosing team

In [3]:
print(game_log.shape)
display(game_log.iloc[:3,[0,1,2,3,4,5,6,7,8,20,80]])
(171907, 161)
date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number h_line_score 1b_umpire_name
0 18710504 0 Thu CL1 NaN 1 FW1 NaN 1 010010000 NaN
1 18710505 0 Fri BS1 NaN 1 WS3 NaN 1 640113030 NaN
2 18710506 0 Sat CL1 NaN 2 RC1 NaN 1 010020100 NaN


park_codes.csv: stadium information like city, state, league

In [4]:
print(park_c.shape)
display(park_c[:3])
(252, 9)
park_id name aka city state start end league notes
0 ALB01 Riverside Park NaN Albany NY 09/11/1880 05/30/1882 NL TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1 ALT01 Columbia Park NaN Altoona PA 04/30/1884 05/31/1884 UA NaN
2 ANA01 Angel Stadium of Anaheim Edison Field; Anaheim Stadium Anaheim CA 04/19/1966 NaN AL NaN


person_codes.csv: personal information on name and role (player, coach, etc)

In [5]:
print(person_c.shape)
display(person_c[:3])
(20494, 7)
id last first player_debut mgr_debut coach_debut ump_debut
0 aardd001 Aardsma David 04/06/2004 NaN NaN NaN
1 aaroh101 Aaron Hank 04/13/1954 NaN NaN NaN
2 aarot101 Aaron Tommie 04/10/1962 NaN 04/06/1979 NaN


team_codes.csv: teams information like ID, league, city, dates for the start and end of the team

In [6]:
print(team_c.shape)
display(team_c[:3])
(150, 8)
team_id league start end city nickname franch_id seq
0 ALT UA 1884 1884 Altoona Mountain Cities ALT 1
1 ARI NL 1998 0 Arizona Diamondbacks ARI 1
2 BFN NL 1879 1885 Buffalo Bisons BFN 1


First notes on EDA

  • game_log, park codes and team code all have league
  • park_codes have date of the first and last game that can be obtained from the game log


Information on the league codes:

  • NL: national league
  • UA: union association (extinct)
  • PL: players league
  • AA: double-A (minor league)
  • AL: American League
  • FL: florida league (college summer)
In [3]:
game_log["h_league"].value_counts()
Out[3]:
NL    88867
AL    74712
AA     5039
FL     1243
PL      532
UA      428
Name: h_league, dtype: int64


Analysis

We will create a new database so the first thing to do is to delete it if it exists on disk

In [3]:
try:
    os.remove("./data/mlb.db")
    print("file removed")
except:
    print("file not present")
file removed


Q1: Create Database with existing tables

In [4]:
conn = sqlite3.connect('./data/mlb.db')
game_log.to_sql('game_log', conn, index=False)
person_c.to_sql('person_codes', conn, index=False)
team_c.to_sql('team_codes', conn, index=False)
park_c.to_sql('park_codes', conn, index=False)
conn.close()

database_file = './data/mlb.db'

print("databases created")
databases created


Q2: Create unique ID in game_log: h_name + date + number_of_game

In [5]:
q=""" ALTER TABLE game_log
       ADD COLUMN game_id text
  """
run_command(database_file, q)

q=""" UPDATE game_log
         set game_id = h_name || date || number_of_game
  """
run_command(database_file, q)

q = "select game_id from game_log LIMIT 10"
df = run_query(database_file, q)
display(df[:3])
game_id
0 FW1187105040
1 WS3187105050
2 RC1187105060


Q3: Design Normalization

A few facts about the data to take into account for the normalization decisions:

  • teams can move between leagues.
  • The same person might be in a single game as both a player and a manager
  • Because of how pitchers are represented in the game log, not all pitchers used in a game will be shown. We only want to worry about the pitchers mentioned via position or the 'winning pitcher'/ 'losing pitcher'.

https://www.dbdesigner.net/

Alt text

key aspects of normalization:

  • separate game stats player info from game_log and move them into team_appearence and player_appearence (foreign key game_id in game_log)
  • connect game_log to park via park_id
  • connect team_appearence to team via team_id and person_appearence to person via player_id
  • create a league table and connect both team and team_appearence to it via league_id
  • separate information of the person role into appearence_type and connect person_appearance to it via appearence_type_ID


Q4: create person table

In [8]:
q=""" CREATE TABLE IF NOT EXISTS person (person_id text ,
                                         first_name text,
                                         last_name text,
                                         PRIMARY KEY (person_id))
  """
run_command(database_file, q)

q=""" INSERT OR IGNORE INTO person
                     SELECT id, first, last
                       FROM person_codes
  """
run_command(database_file, q)

q="select * from person"
df = run_query(database_file, q)
display(df[:2])
person_id first_name last_name
0 aardd001 David Aardsma
1 aaroh101 Hank Aaron


Q5: create park table

In [9]:
q=""" CREATE TABLE IF NOT EXISTS park (park_id text, 
                                       name text,
                                       nickname text,
                                       city text,
                                       state text,
                                       notes text,
                                       PRIMARY KEY (park_id))
  """
run_command(database_file, q)

q=""" INSERT OR IGNORE INTO park
                     select park_id,
                            name,
                            aka,
                            city,
                            state,
                            notes
                       from park_codes
  """
run_command(database_file, q)

q="select * from park"
df = run_query(database_file, q)
display(df[:2])
park_id name nickname city state notes
0 ALB01 Riverside Park None Albany NY TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1 ALT01 Columbia Park None Altoona PA None


Q6: create league table

In [10]:
q=""" CREATE TABLE IF NOT EXISTS league (league_id text,
                                         name text,
                                         PRIMARY KEY (league_id))
  """
run_command(database_file, q)

q=""" INSERT OR IGNORE INTO league VALUES ('NL', 'National League'),
                                          ('UA', 'Unions Association'),
                                          ('PL', 'Players League'),
                                          ('AA', 'Double-A League'),
                                          ('AL', 'American League'),
                                          ('FL', 'Florida League')
  """
run_command(database_file, q)

q="select * from league"
df = run_query(database_file, q)
display(df[:2])
league_id name
0 NL National League
1 UA Unions Association


Q7: create appearance_type table

In [12]:
q=""" CREATE TABLE IF NOT EXISTS appearance_type (appearance_type_id text PRIMARY KEY,
                                                  name text,
                                                  category text)
"""
run_command(database_file, q)

#this information is in the original tables, but to make it simpler it is loaded from a file
conn = sqlite3.connect('./data/mlb.db')
df = pd.read_csv('./data/appearance_type.csv')
df.to_sql('appearance_type_full', conn, index=False)
conn.close()

q="INSERT OR IGNORE INTO appearance_type select * from appearance_type_full"
run_command(database_file, q)

q="select * from appearance_type"
df = run_query(database_file, q)
display(df[:2])
appearance_type_id name category
0 O1 Batter 1 offense
1 O2 Batter 2 offense


Q8: create team table

In [13]:
q=""" CREATE TABLE IF NOT EXISTS team (team_id text PRIMARY KEY,
                                       league_id text,
                                       city text,
                                       nickname text,
                                       franch_id text,
                                       FOREIGN KEY (league_id) REFERENCES league(league_id))
   """
run_command(database_file, q)

q=""" INSERT OR IGNORE INTO team
                     SELECT team_id,
                            league,
                            city,
                            nickname,
                            franch_id
                       FROM team_codes
  """
run_command(database_file, q)

q="select * from team"
df = run_query(database_file, q)
display(df[:2])
team_id league_id city nickname franch_id
0 ALT UA Altoona Mountain Cities ALT
1 ARI NL Arizona Diamondbacks ARI


Q9: create game table

In [14]:
q=""" CREATE TABLE IF NOT EXISTS game (game_id text PRIMARY KEY,
                                       date text,
                                       number_of_game integer,
                                       park_id text,
                                       length_outs float,
                                       day BOOLEAN,
                                       completion text,
                                       forfeit text,
                                       protest text,
                                       attendance float,
                                       length_minutes float,
                                       additional_info text,
                                       acquisition_info text,
                                       FOREIGN KEY (park_id) REFERENCES park(park_id))
  """
run_command(database_file, q)

q=""" INSERT OR IGNORE INTO game
                     SELECT game_id,
                            date,
                            number_of_game,
                            park_id,
                            length_outs,
                            CASE WHEN 'day' THEN 1 ELSE 0 END as day,
                            completion,
                            forefeit,
                            protest,
                            attendance,
                            length_minutes,
                            additional_info,
                            acquisition_info
                       FROM game_log
"""
run_command(database_file, q)

q="select * from game"
df = run_query(database_file, q)
display(df.iloc[:2,:-1])
game_id date number_of_game park_id length_outs day completion forfeit protest attendance length_minutes additional_info
0 FW1187105040 18710504 0 FOR01 54.0 0 None None None 200.0 120.0 None
1 WS3187105050 18710505 0 WAS01 54.0 0 None None None 5000.0 145.0 HTBF


Q10: create team_appearance

In [18]:
q="""CREATE TABLE IF NOT EXISTS team_appearance (team_id text, 
                                                 game_id text, 
                                                 home BOOLEAN,
                                                 league_id text, 
                                                 score INTEGER, 
                                                 line_score TEXT,
                                                 at_bats ,
                                                 hits REAL,
                                                 doubles REAL,
                                                 triples REAL,
                                                 homeruns REAL,
                                                 rbi REAL,
                                                 sacrifice_hits REAL,
                                                 sacrifice_flies REAL,
                                                 hit_by_pitch REAL,
                                                 walks REAL,
                                                 intentional_walks REAL,
                                                 strikeouts REAL,
                                                 stolen_bases REAL,
                                                 caught_stealing REAL,
                                                 grounded_into_double REAL,
                                                 first_catcher_interference REAL,
                                                 left_on_base REAL,
                                                 pitchers_used REAL,
                                                 individual_earned_runs REAL,
                                                 team_earned_runs REAL,
                                                 wild_pitches REAL,
                                                 balks REAL,
                                                 putouts REAL,
                                                 assists REAL,
                                                 errors REAL,
                                                 passed_balls REAL,
                                                 double_plays REAL,
                                                 triple_plays REAL,

                                                 PRIMARY KEY (team_id, game_id),
                                                 FOREIGN KEY (team_id) REFERENCES team(team_id),
                                                 FOREIGN KEY (game_id) REFERENCES game(game_id),
                                                 FOREIGN KEY (league_id) REFERENCES league(league_id))
"""
run_command(database_file, q)

#for each line in game_log there will be two entrie in team_apperance
#one for the home and one for the away team
q=""" INSERT OR IGNORE INTO team_appearance
                     SELECT h_name,
                     game_id,
                     1 AS home,
                     h_league,
                     h_score, 
                     h_line_score,
                     h_at_bats,  
                     h_hits,
                     h_doubles,
                     h_triples,
                     h_homeruns,
                     h_rbi,
                     h_sacrifice_hits,
                     h_sacrifice_flies,
                     h_hit_by_pitch,
                     h_walks,
                     h_intentional_walks,
                     h_strikeouts,
                     h_stolen_bases,
                     h_caught_stealing,
                     h_grounded_into_double,
                     h_first_catcher_interference,
                     h_left_on_base,
                     h_pitchers_used,
                     h_individual_earned_runs,
                     h_team_earned_runs,
                     h_wild_pitches,
                     h_balks,
                     h_putouts,
                     h_assists,
                     h_errors,
                     h_passed_balls,
                     h_double_plays,
                     h_triple_plays
                from game_log
  
   UNION

              SELECT v_name,
                     game_id,
                     0 AS home,
                     v_league,
                     v_score, 
                     v_line_score,
                     v_at_bats,  
                     v_hits,
                     v_doubles,
                     v_triples,
                     v_homeruns,
                     v_rbi,
                     v_sacrifice_hits,
                     v_sacrifice_flies,
                     v_hit_by_pitch,
                     v_walks,
                     v_intentional_walks,
                     v_strikeouts,
                     v_stolen_bases,
                     v_caught_stealing,
                     v_grounded_into_double,
                     v_first_catcher_interference,
                     v_left_on_base,
                     v_pitchers_used,
                     v_individual_earned_runs,
                     v_team_earned_runs,
                     v_wild_pitches,
                     v_balks,
                     v_putouts,
                     v_assists,
                     v_errors,
                     v_passed_balls,
                     v_double_plays,
                     v_triple_plays
                from game_log


"""
run_command(database_file, q)


q="select * from team_appearance"
df = run_query(database_file, q)
display(df.iloc[:2,:9])
team_id game_id home league_id score line_score at_bats hits doubles
0 ALT ALT188404300 1 UA 2 None NaN NaN NaN
1 ALT ALT188405020 1 UA 3 None NaN NaN NaN


Q11: create person_appearance table

In [19]:
q=""" CREATE TABLE IF NOT EXISTS person_appearance (appearance_id INTEGER,
                                                    game_id text,
                                                    team_id integer,
                                                    person_id text,
                                                    appearance_type_id text,
                                                    PRIMARY KEY (appearance_id),
                                                    FOREIGN KEY (person_id) REFERENCES person(person_id),
                                                    FOREIGN KEY (team_id) REFERENCES team(team_id),
                                                    FOREIGN KEY (game_id) REFERENCES game(game_id),
                                                    FOREIGN KEY (appearance_type_id) REFERENCES appearance_type(appearance_type_id))
"""
run_command(database_file, q)
print("table created")
table created

insert into person_appearance first part with explicit query, below with for loop for players

In [21]:
q=""" INSERT OR IGNORE INTO person_appearance(game_id, team_id, person_id, appearance_type_id)
                     SELECT game_id,
                            NULL,
                            lf_umpire_id,
                            "ULF"
                       FROM game_log
                      WHERE lf_umpire_id IS NOT NULL
 
                     UNION

                     SELECT game_id,
                            NULL,
                            rf_umpire_id,
                            "URF"
                       FROM game_log
                      WHERE rf_umpire_id IS NOT NULL
 
                      UNION

                     SELECT game_id,
                            NULL,
                            hp_umpire_id,
                            "UHP"
                       FROM game_log
                      WHERE hp_umpire_id IS NOT NULL
  
                      UNION

                     SELECT game_id,
                            NULL,
                            [1b_umpire_id],
                            "U1B"
                       FROM game_log
                      WHERE [1b_umpire_id] IS NOT NULL

                      UNION

                     SELECT game_id,
                            NULL,
                            [2b_umpire_id],
                            "U2B"
                       FROM game_log
                      WHERE [2b_umpire_id] IS NOT NULL

                      UNION

                     SELECT game_id,
                            NULL,
                            [3b_umpire_id],
                            "U3B"
                       FROM game_log
                      WHERE [3b_umpire_id] IS NOT NULL

                      UNION

                     SELECT game_id,
                            v_name,
                            v_manager_id,
                            "MM"
                       FROM game_log
                      WHERE v_manager_id IS NOT NULL

                      UNION

                     SELECT game_id,
                            h_name,
                            h_manager_id,
                            "MM"
                       FROM game_log
                      WHERE h_manager_id IS NOT NULL
    
                      UNION

                     SELECT game_id,
                            CASE WHEN h_score > v_score THEN h_name ELSE v_name END,
                            winning_pitcher_id,
                            "AWP"
                       FROM game_log
                      WHERE winning_pitcher_id IS NOT NULL  
    
                      UNION

                     SELECT game_id,
                            CASE WHEN h_score < v_score THEN h_name ELSE v_name END,
                            losing_pitcher_id,
                            "ALP"
                       FROM game_log
                      WHERE losing_pitcher_id IS NOT NULL  
    
                      UNION

                     SELECT game_id,
                            CASE WHEN h_score > v_score THEN h_name ELSE v_name END,
                            saving_pitcher_id,
                            "ASP"
                       FROM game_log
                      WHERE saving_pitcher_id IS NOT NULL 
    
                      UNION

                     SELECT game_id,
                            CASE WHEN h_score > v_score THEN h_name ELSE v_name END,
                            winning_rbi_batter_id,
                            "AWB"
                       FROM game_log
                      WHERE winning_rbi_batter_id IS NOT NULL 
        
                      UNION

                     SELECT game_id,
                            h_name,
                            h_starting_pitcher_id,
                            "PSP"
                       FROM game_log
                      WHERE h_starting_pitcher_id IS NOT NULL
    
                      UNION

                     SELECT game_id,
                            v_name,
                            v_starting_pitcher_id,
                            "PSP"
                       FROM game_log
                      WHERE v_starting_pitcher_id IS NOT NULL
    """
run_command(database_file, q)
print("insert done")
insert done

insert player positions with for loop, each line in game log will have multiple entries plus offensive and defensive positions must be joined

In [22]:
template = """ INSERT INTO person_appearance (game_id,
                                              team_id,
                                              person_id,
                                              appearance_type_id) 
                    SELECT game_id,
                           {hv}_name,
                           {hv}_player_{num}_id,
                           "O{num}"
                      FROM game_log
                     WHERE {hv}_player_{num}_id IS NOT NULL

                     UNION

                    SELECT game_id,
                           {hv}_name,
                           {hv}_player_{num}_id,
                           "D" || CAST({hv}_player_{num}_def_pos AS INT)
                      FROM game_log
                     WHERE {hv}_player_{num}_id IS NOT NULL;
"""

#run_command(c1)
#run_command(c2)

for hv in ["h","v"]:
    for num in range(1,10):
        query_vars = {
            "hv": hv,
            "num": num
        }
        run_command(database_file, template.format(**query_vars))
print("insert done")
insert done
In [23]:
q="""select appearance_type_id, count(*) as count
       from person_appearance
   group by appearance_type_id
   order by 1"""
df = run_query(database_file, q)
display(df[:3])
appearance_type_id count
0 ALP 280458
1 ASP 96036
2 AWB 211398


Q12: Drop original tables

In [24]:
q="drop table game_log"
run_command(database_file, q)
q="drop table park_codes"
run_command(database_file, q)
q="drop table team_codes"
run_command(database_file, q)
q="drop table person_codes"
run_command(database_file, q)

print("done!")
done!