Project: Cleaning Data and String Manipulation (Jeopardy)¶

Questions to address:¶

Check patterns in Jeopardy questions that could help you win.
First clean the dataset
Decide whether to study past questions, general knowledge, or not study it all based on whether questions reoccur or whether the answer is deducible from the question.

Tools:¶

string manipulation/normalization: lower case words and remove punctuation
convert some columns to numeric and datetime.
- pd.apply(), pd.to_datetime()
set()

load defaults¶

import pandas as pd
import re
import numpy as np
import requests 
from bs4 import BeautifulSoup

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))

Dataset: Jeopardy Questions and Answers¶

jeopardy = pd.read_csv('./data/jeopardy.csv')
display(jeopardy[:3])

remove spaces from column names¶

new_columns=[re.sub('^ ','', ii) for ii in jeopardy.columns]
jeopardy.columns = new_columns
print(jeopardy.columns.values)

['Show Number' 'Air Date' 'Round' 'Category' 'Value' 'Question' 'Answer']

Normalize Question and Answer columns: lowercaser words and remove punctuation¶

def normalize (string):
    #lower case
    new_string = string.lower()
    #remove punctuation
    new_string = re.sub('[:;,\'\".!?]','', new_string)
    return new_string

jeopardy.loc[:, 'clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy.loc[:, 'clean_answer'] = jeopardy['Answer'].apply(normalize)

display(jeopardy[:1])

Convert Value to numeric and Air Date to datetime¶

def normalize (string):
    new_string = re.sub('[$:;,\'\".!?]','', string)
    try:
        integer = int(new_string)
    except ValueError:
        integer =0
    return integer

#remove dolar sign and convert to integer
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize)

#convert to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date']);

Analysis:¶

Decide whether to study past questions, general knowledge, or not study it all:

How often the answer is deducible from the question.
- how many times words in the answer occur in the question
How often new questions are repeats of older questions.
- how often complex words (> 6 characters) reoccur.

Q1: How many times words from the answer appear in the question¶

def funct (series):
    #split answer and question into list of words
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    
    #remove 'The' from word list
    if('The' in split_answer):
        split_answer.remove('The')
    if(len(split_answer)==0):
        return 0
    
    #count number of words in common
    for element in split_answer:
        if(element in split_question):
            match_count+=1
    match_count = match_count/len(split_answer)
    
    return match_count

answer_in_question = jeopardy.apply(funct, axis=1)

print("Average fraction of words shared by question and answer: %0.2f" % answer_in_question.mean())

#count number of aswers for which there is at least on common word in the question
fraction = len(answer_in_question[answer_in_question>0.])/ len(answer_in_question)
print("Fraction of answers with at least one word in the question: %0.2f" % fraction)

Average fraction of words shared by question and answer: 0.08
Fraction of answers with at least one word in the question: 0.18

Q2: How often new questions are repeats of older ones¶

Sort jeopardy in order of ascending air date.
Split questions into words, remove words shorter than 6 characters, and check if words have occured before

question_overlap = []
terms_used = set()

for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')    
    split_question = [ii for ii in split_question if len(ii)>6]
    match_count = 0
    
    for element in split_question:
        if(element in terms_used):
            match_count+=1
        #since terms_used is a set, the new worded is only added if it's not present
        terms_used.add(element)
        
    if(len(split_question)>0):
        match_count=match_count/len(split_question)
        
    #fraction of words that have occured before
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

print("Average fraction of reoccuring words: %0.2f" % jeopardy['question_overlap'].mean())

Average fraction of reoccuring words: 0.62

Most of the words overlap!

Q3: Find how many times reoccuring word appear in high and low value questions (below and above $800)¶

#high or low value question
def funct (row):
    if(row['clean_value']>800):
        value=1
    else:
        value=0
    return value

jeopardy['high_value'] = jeopardy.apply(funct, axis=1)

def funct_word(word):
    low_count=0
    high_count=0
    for idx, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if(word in split_question):
            if(row['high_value']==1):
                high_count+=1
            else:
                low_count+=1
    return high_count, low_count

observed_counts = []
#select only a subset of repeated words since this takes a long time
comparison_terms = list(terms_used)[:19]

#find the number of low and high value questions that the terms appeared in
for element in comparison_terms:
    observed_counts.append(funct_word(element))

print("Number of occurences in high and low value questions")
for ii in range(0, len(observed_counts)):
    print("%s - %d | %d" %(list(terms_used)[ii], observed_counts[ii][0], observed_counts[ii][1]))

Number of occurences in high and low value questions
sailing - 2 | 9
leonard - 3 | 4
chrysanthemum - 0 | 1
hillman - 0 | 1
href=http//wwwj-archivecom/media/2008-03-13_dj_30jpg - 1 | 0
diligently - 1 | 0
nominees - 0 | 2
target=_blank>the - 3 | 0
classics - 2 | 2
indulgence - 0 | 1
satirical - 1 | 1
microsoft - 0 | 2
pledged - 0 | 2
grandfathers - 0 | 2
bedding - 0 | 1
marinate - 0 | 1
crawfords - 0 | 1
generic - 0 | 2
fleet-ingly - 0 | 1

only for the first 20 terms since it takes a long time

Q4: Find if repeated words are more likely to appear in low or high value questions (compute expected counts and chi-squared for the sub-set of terms)¶

Find the percentage of questions the word occurs in.
Based on the percentage of questions the word occurs in, find expected low and high value counts.
Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

from scipy.stats import chisquare

high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = len(jeopardy[jeopardy['high_value']==0])

chi_squared=[]

for element in observed_counts:
    #take number of words that appeared more than once
    total = element[0]+element[1]
    #divide by total number of words
    total_prop = total/jeopardy.shape[0]
    #get expected number of high_values for the number of words that are repeated
    expected_high_value_count = total_prop*high_value_count
    expected_low_value_count = total_prop*low_value_count     
    observed = [element[0], element[1]]
    expected = [expected_high_value_count, expected_low_value_count]
    
    chi, p_value = chisquare(observed, expected)
    chi_squared.append([chi, p_value])

for idx, element in enumerate(chi_squared):
    print("%s - p_value: %0.2f" % (comparison_terms[idx], element[1]))

sailing - p_value: 0.44
leonard - p_value: 0.41
chrysanthemum - p_value: 0.53
hillman - p_value: 0.53
href=http//wwwj-archivecom/media/2008-03-13_dj_30jpg - p_value: 0.11
diligently - p_value: 0.11
nominees - p_value: 0.37
target=_blank>the - p_value: 0.01
classics - p_value: 0.35
indulgence - p_value: 0.53
satirical - p_value: 0.50
microsoft - p_value: 0.37
pledged - p_value: 0.37
grandfathers - p_value: 0.37
bedding - p_value: 0.53
marinate - p_value: 0.53
crawfords - p_value: 0.53
generic - p_value: 0.37
fleet-ingly - p_value: 0.53

There are no very small p values implying that the re_occuring low and high value counts are similar so the overall low and high value counts

Potential next steps:

Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
- Manually create a list of words to remove, like the, than, etc.
- Find a list of stopwords to remove.
- Remove words that occur in more than a certain percentage (like 5%) of questions.
Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
- Use the apply method to make the code that calculates frequencies more efficient.
- Only select terms that have high frequencies across the dataset, and ignore the others.

Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
- See which categories appear the most often.
- Find the probability of each category appearing in each round.
Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.

	Show Number	Air Date	Round	Category	Value	Question	Answer
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona