Project: Cleaning Data and String Manipulation (Jeopardy)


Questions to address:

  • Check patterns in Jeopardy questions that could help you win.
  • First clean the dataset
  • Decide whether to study past questions, general knowledge, or not study it all based on whether questions reoccur or whether the answer is deducible from the question.


Tools:

  • string manipulation/normalization: lower case words and remove punctuation
  • convert some columns to numeric and datetime.
    • pd.apply(), pd.to_datetime()
  • set()


load defaults

In [1]:
import pandas as pd
import re
import numpy as np
import requests 
from bs4 import BeautifulSoup

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))


Dataset: Jeopardy Questions and Answers

In [2]:
jeopardy = pd.read_csv('./data/jeopardy.csv')
display(jeopardy[:3])
Show Number Air Date Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona

remove spaces from column names

In [9]:
new_columns=[re.sub('^ ','', ii) for ii in jeopardy.columns]
jeopardy.columns = new_columns
print(jeopardy.columns.values)
['Show Number' 'Air Date' 'Round' 'Category' 'Value' 'Question' 'Answer']


Normalize Question and Answer columns: lowercaser words and remove punctuation

In [11]:
def normalize (string):
    #lower case
    new_string = string.lower()
    #remove punctuation
    new_string = re.sub('[:;,\'\".!?]','', new_string)
    return new_string

jeopardy.loc[:, 'clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy.loc[:, 'clean_answer'] = jeopardy['Answer'].apply(normalize)

display(jeopardy[:1])
Show Number Air Date Round Category Value Question Answer clean_question clean_answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus


Convert Value to numeric and Air Date to datetime

In [12]:
def normalize (string):
    new_string = re.sub('[$:;,\'\".!?]','', string)
    try:
        integer = int(new_string)
    except ValueError:
        integer =0
    return integer

#remove dolar sign and convert to integer
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize)

#convert to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date']);


Analysis:

Decide whether to study past questions, general knowledge, or not study it all:

  • How often the answer is deducible from the question.
    • how many times words in the answer occur in the question
  • How often new questions are repeats of older questions.
    • how often complex words (> 6 characters) reoccur.


Q1: How many times words from the answer appear in the question

In [34]:
def funct (series):
    #split answer and question into list of words
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    
    #remove 'The' from word list
    if('The' in split_answer):
        split_answer.remove('The')
    if(len(split_answer)==0):
        return 0
    
    #count number of words in common
    for element in split_answer:
        if(element in split_question):
            match_count+=1
    match_count = match_count/len(split_answer)
    
    return match_count

answer_in_question = jeopardy.apply(funct, axis=1)

print("Average fraction of words shared by question and answer: %0.2f" % answer_in_question.mean())

#count number of aswers for which there is at least on common word in the question
fraction = len(answer_in_question[answer_in_question>0.])/ len(answer_in_question)
print("Fraction of answers with at least one word in the question: %0.2f" % fraction)
Average fraction of words shared by question and answer: 0.08
Fraction of answers with at least one word in the question: 0.18


Q2: How often new questions are repeats of older ones

  • Sort jeopardy in order of ascending air date.
  • Split questions into words, remove words shorter than 6 characters, and check if words have occured before
In [35]:
question_overlap = []
terms_used = set()

for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')    
    split_question = [ii for ii in split_question if len(ii)>6]
    match_count = 0
    
    for element in split_question:
        if(element in terms_used):
            match_count+=1
        #since terms_used is a set, the new worded is only added if it's not present
        terms_used.add(element)
        
    if(len(split_question)>0):
        match_count=match_count/len(split_question)
        
    #fraction of words that have occured before
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

print("Average fraction of reoccuring words: %0.2f" % jeopardy['question_overlap'].mean())
Average fraction of reoccuring words: 0.62
  • Most of the words overlap!


Q3: Find how many times reoccuring word appear in high and low value questions (below and above $800)

In [56]:
#high or low value question
def funct (row):
    if(row['clean_value']>800):
        value=1
    else:
        value=0
    return value

jeopardy['high_value'] = jeopardy.apply(funct, axis=1)

def funct_word(word):
    low_count=0
    high_count=0
    for idx, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if(word in split_question):
            if(row['high_value']==1):
                high_count+=1
            else:
                low_count+=1
    return high_count, low_count

observed_counts = []
#select only a subset of repeated words since this takes a long time
comparison_terms = list(terms_used)[:19]

#find the number of low and high value questions that the terms appeared in
for element in comparison_terms:
    observed_counts.append(funct_word(element))

print("Number of occurences in high and low value questions")
for ii in range(0, len(observed_counts)):
    print("%s - %d | %d" %(list(terms_used)[ii], observed_counts[ii][0], observed_counts[ii][1]))
Number of occurences in high and low value questions
sailing - 2 | 9
leonard - 3 | 4
chrysanthemum - 0 | 1
hillman - 0 | 1
href=http//wwwj-archivecom/media/2008-03-13_dj_30jpg - 1 | 0
diligently - 1 | 0
nominees - 0 | 2
target=_blank>the - 3 | 0
classics - 2 | 2
indulgence - 0 | 1
satirical - 1 | 1
microsoft - 0 | 2
pledged - 0 | 2
grandfathers - 0 | 2
bedding - 0 | 1
marinate - 0 | 1
crawfords - 0 | 1
generic - 0 | 2
fleet-ingly - 0 | 1
  • only for the first 20 terms since it takes a long time


Q4: Find if repeated words are more likely to appear in low or high value questions (compute expected counts and chi-squared for the sub-set of terms)

  • Find the percentage of questions the word occurs in.
  • Based on the percentage of questions the word occurs in, find expected low and high value counts.
  • Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
In [58]:
from scipy.stats import chisquare

high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = len(jeopardy[jeopardy['high_value']==0])

chi_squared=[]

for element in observed_counts:
    #take number of words that appeared more than once
    total = element[0]+element[1]
    #divide by total number of words
    total_prop = total/jeopardy.shape[0]
    #get expected number of high_values for the number of words that are repeated
    expected_high_value_count = total_prop*high_value_count
    expected_low_value_count = total_prop*low_value_count     
    observed = [element[0], element[1]]
    expected = [expected_high_value_count, expected_low_value_count]
    
    chi, p_value = chisquare(observed, expected)
    chi_squared.append([chi, p_value])

for idx, element in enumerate(chi_squared):
    print("%s - p_value: %0.2f" % (comparison_terms[idx], element[1]))
    
sailing - p_value: 0.44
leonard - p_value: 0.41
chrysanthemum - p_value: 0.53
hillman - p_value: 0.53
href=http//wwwj-archivecom/media/2008-03-13_dj_30jpg - p_value: 0.11
diligently - p_value: 0.11
nominees - p_value: 0.37
target=_blank>the - p_value: 0.01
classics - p_value: 0.35
indulgence - p_value: 0.53
satirical - p_value: 0.50
microsoft - p_value: 0.37
pledged - p_value: 0.37
grandfathers - p_value: 0.37
bedding - p_value: 0.53
marinate - p_value: 0.53
crawfords - p_value: 0.53
generic - p_value: 0.37
fleet-ingly - p_value: 0.53
  • There are no very small p values implying that the re_occuring low and high value counts are similar so the overall low and high value counts

Potential next steps:

  • Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    • Manually create a list of words to remove, like the, than, etc.
    • Find a list of stopwords to remove.
    • Remove words that occur in more than a certain percentage (like 5%) of questions.
  • Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    • Use the apply method to make the code that calculates frequencies more efficient.
    • Only select terms that have high frequencies across the dataset, and ignore the others.
  • Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    • See which categories appear the most often.
    • Find the probability of each category appearing in each round.
  • Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
In [ ]: