Project: (Find the Best Markets to Advertise in)¶

Questions to address:¶

Check whether or not the sample we have is relevant for the study we want to perform
Calculate how much money students spent per month and select top 4 countries
Look at the distribution of money spent per month and remove outliers
Find the markets with the best combination of a lot of money spent and a lot of people

Tools:¶

drop rows with NaN: df.dropna(axis='rows', subset=['col1'], inplace=True)
group by col1, calculate mean of col2: df.groupby(['col1'])['col2'].mean()
box plots of distributions: sns.boxplot
drop the rows of df_1 whose index labels appear in df_2: df_1.drop(df_2.index)
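
The tools listed above can be tried on a small made-up table before touching the survey data. The sketch below uses a toy DataFrame (the column names `country` and `spend` and all values are invented for illustration) to show dropping rows with missing values, taking a grouped mean, and removing rows by index label.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', None],
    'spend': [10.0, 30.0, 5.0, 500.0, 20.0],
})

# Drop rows where 'country' is missing
toy = toy.dropna(axis='rows', subset=['country'])

# Group by country and compute the mean of spend
means_before = toy.groupby('country')['spend'].mean()

# Select the rows to discard, then drop them by their index labels
too_high = toy[toy['spend'] > 100]
toy_clean = toy.drop(too_high.index)
means_after = toy_clean.groupby('country')['spend'].mean()

print(means_before)
print(means_after)
```

Dropping by index label works here because `too_high` keeps the row labels of `toy`, which is the same pattern the outlier removal in Q6 relies on.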

load defaults¶

import pandas as pd
import re
import numpy as np
import requests 
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib import rcParams
import matplotlib.dates as mdates

from functions import *

plt.rcParams.update({'axes.titlepad': 20, 'font.size': 12, 'axes.titlesize':20})

colors = [(0/255,107/255,164/255), (255/255, 128/255, 14/255), 'red', 'green', '#9E80BA', '#8EDB8E', '#58517A']
Ncolors = 10
color_map = plt.cm.Blues_r(np.linspace(0.2, 0.5, Ncolors))
#color_map = plt.cm.tab20c_r(np.linspace(0.2, 0.5, Ncolors))

Dataset: New Coder Survey (information on what type of online courses new developers took)¶

df = pd.read_csv('./data/2017-fCC-New-Coders-Survey-Data.csv')
display(df.iloc[:3,:8])

Analysis¶

Q1: Check Whether the sample we have is relevant for our study¶

Do we have new developers interested in web development?

df1 = df['JobRoleInterest'].value_counts().reset_index()
df1.columns = ['Job Title', 'Frequency']
display(df1[:9])

Yes, many respondents are interested in web-related roles.

Are people interested in only one subject, or can they be interested in more than one?

print('Total answers=%d, Total people=%d' % (sum(df1['Frequency']),len(df['JobRoleInterest'])))

Total answers=6992, Total people=18175

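Only 6992 of the 18175 respondents answered this question; the rest left it blank. The totals alone do not show how many subjects each person chose, but the frequency table includes combined answers such as 'Full-Stack Web Developer, Front-End Web Deve...', so at least some respondents are interested in more than one subject.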
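
Whether respondents picked one role or several can be checked directly rather than inferred from totals. The sketch below runs on a small invented Series standing in for `df['JobRoleInterest']` and assumes, as the combined entries in the frequency table suggest, that several roles are stored as one comma-separated string.

```python
import numpy as np
import pandas as pd

# Invented stand-in for df['JobRoleInterest']; NaN means the question was left blank
roles = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer, Mobile Developer',
    np.nan,
    'Data Scientist',
    np.nan,
])

answered = roles.dropna()

# Assumption: several roles are stored as one comma-separated string
n_roles = answered.str.split(',').str.len()

n_blank = roles.isna().sum()
n_single = (n_roles == 1).sum()
n_multiple = (n_roles > 1).sum()

print('blank=%d, single role=%d, multiple roles=%d' % (n_blank, n_single, n_multiple))
```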

How many people are interested in at least one of the two subjects we teach (web and mobile development)?

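# Sum the frequencies of every answer that mentions a web role
number_of_web = 0
for index, row in df1.iterrows():
    if 'Web' in row['Job Title']:
        number_of_web += row['Frequency']

# Percentages are relative to all respondents, including those who left the question blank
print("People interested in web: %0.1f%%" % (100.*int(number_of_web)/len(df['JobRoleInterest'])))

# Same count for answers that mention a mobile role
number_of_mobile = 0
for index, row in df1.iterrows():
    if 'Mobile' in row['Job Title']:
        number_of_mobile += row['Frequency']

print("People interested in mobile: %0.1f%%" % (100.*int(number_of_mobile)/len(df['JobRoleInterest'])))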

People interested in web: 31.8%
People interested in mobile: 12.7%
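
The loops above can also be written without iterating over rows, and the two conditions can be combined to answer the "at least one of the two subjects" question directly. This is a sketch on an invented Series, not the notebook's own method: `str.contains` with `na=False` treats blank answers as not interested, and the mean of the boolean result gives the share of all respondents, blanks included, which is the same denominator the loops use.

```python
import numpy as np
import pandas as pd

# Invented stand-in for df['JobRoleInterest']
roles = pd.Series([
    'Full-Stack Web Developer',
    'Mobile Developer, Front-End Web Developer',
    'Data Scientist',
    np.nan,
])

# Boolean masks; na=False counts blank answers as "not interested"
web = roles.str.contains('Web', na=False)
mobile = roles.str.contains('Mobile', na=False)
either = web | mobile

# The mean of a boolean Series is the fraction of True values
pct_web = 100 * web.mean()
pct_mobile = 100 * mobile.mean()
pct_either = 100 * either.mean()

print('web: %0.1f%%, mobile: %0.1f%%, either: %0.1f%%' % (pct_web, pct_mobile, pct_either))
```

Because a respondent can mention both subjects, the "either" share is generally smaller than the sum of the two separate shares.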

Q2: Drop rows without interest specified and show country frequency distribution¶

df.dropna(axis='rows', subset=['JobRoleInterest'], inplace=True)
print('Total people=%d' % len(df['JobRoleInterest']))

df1 = df['CountryLive'].value_counts().reset_index()
df1.columns = ['CountryLive', 'Frequency']
print(df1[:5])

Total people=6992
                CountryLive  Frequency
0  United States of America       3125
1                     India        528
2            United Kingdom        315
3                    Canada        260
4                    Poland        131

Q3: Narrow down analysis to top 4 countries and compute money spent per month¶

country_list=['United States of America', 'India', 'United Kingdom', 'Canada']
final_df = df.loc[df['CountryLive'].isin(country_list)].copy()

display(final_df[['Age','CountryLive']].iloc[:3])

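#compute money per month
#replace 0 months with 1 to avoid dividing by zero; assign back instead of a chained inplace call
final_df['MonthsProgramming'] = final_df['MonthsProgramming'].replace(0, 1)
final_df['MoneyPerMonth'] = final_df['MoneyForLearning']/final_df['MonthsProgramming']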

#drop null values
final_df = final_df[final_df['JobRoleInterest'].notnull()].copy()
final_df = final_df[final_df['MoneyPerMonth'].notnull()]
final_df = final_df[final_df['CountryLive'].notnull()]
print('Number of students willing to pay: %d' % len(final_df))

#group by country and calculate mean of money spent per month
print(final_df.groupby(['CountryLive'])['MoneyPerMonth'].mean())

Number of students willing to pay: 3915
CountryLive
Canada                      113.510961
India                       135.100982
United Kingdom               45.534443
United States of America    227.997996
Name: MoneyPerMonth, dtype: float64

The mean amount of money spent seems to vary a lot between countries so we need to investigate further what is causing the variation
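
A common cause of this kind of variation is a handful of very large values pulling the mean up. The sketch below uses invented monthly spending figures and the standard library's `statistics` module to show how one extreme value shifts the mean while the median barely moves.

```python
from statistics import mean, median

# Invented monthly spending values, for illustration only
typical = [0, 10, 20, 25, 30, 50]
with_outlier = typical + [5000]

print('without outlier: mean=%.1f, median=%.1f' % (mean(typical), median(typical)))
print('with outlier:    mean=%.1f, median=%.1f' % (mean(with_outlier), median(with_outlier)))
```

Comparing the mean with the median per country is a quick way to see whether a high mean reflects typical spending or a few extreme respondents.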

Q4: Generate box plots to visualize distributions of money spent per month in the top 4 countries¶

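# Fix the category order explicitly so the short tick labels below are guaranteed to match the boxes
sns.boxplot(y = 'MoneyPerMonth', x = 'CountryLive', data = final_df, order = country_list)
plt.title('Money Spent Per Month Per Country\n(Distributions)',fontsize = 16)
plt.ylabel('Money per month (US dollars)')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'India', 'UK', 'Canada']) # same order as country_list; avoids tick labels overlap
plt.show()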

There are a lot of outliers in the distributions, particularly in the US and India, the countries with the highest means.
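
Box plots drawn with default settings mark as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied by hand, sketched here on invented values. The notebook itself uses hand-picked thresholds in Q5 and Q6 rather than this rule, so treat this only as a way to reproduce what the plot flags.

```python
import pandas as pd

# Invented monthly spending values, for illustration only
spend = pd.Series([0, 5, 10, 20, 25, 40, 60, 3000])

q1 = spend.quantile(0.25)
q3 = spend.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = spend[(spend < lower_fence) | (spend > upper_fence)]
print(flagged)
```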

Q5: Remove outliers (broad condition)¶

# Isolate only those participants who spend less than 20000 per month
final_df = final_df[final_df['MoneyPerMonth'] < 20000]
print(final_df.groupby(['CountryLive'])['MoneyPerMonth'].mean())

CountryLive
Canada                      113.510961
India                       135.100982
United Kingdom               45.534443
United States of America    183.800110
Name: MoneyPerMonth, dtype: float64

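# Fix the category order explicitly so the short tick labels below are guaranteed to match the boxes
sns.boxplot(y = 'MoneyPerMonth', x = 'CountryLive', data = final_df, order = country_list)
plt.title('Money Spent Per Month Per Country\n(Distributions)',fontsize = 16)
plt.ylabel('Money per month (US dollars)')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'India', 'UK', 'Canada']) # same order as country_list; avoids tick labels overlap
plt.show()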

Q6: Remove outliers (detailed level per country)¶

Remove India's outliers: MoneyPerMonth >= 2500

india_outliers = final_df[(final_df['CountryLive'] == 'India') & (final_df['MoneyPerMonth'] >= 2500)]
final_df = final_df.drop(india_outliers.index) # using the row labels

Remove Canada's outliers: MoneyPerMonth > 4500

canada_outliers = final_df[(final_df['CountryLive'] == 'Canada') & (final_df['MoneyPerMonth'] > 4500)]
# Remove the extreme outliers for Canada
final_df = final_df.drop(canada_outliers.index)

Remove US outliers (MoneyPerMonth >= 6000): respondents who did not attend a bootcamp or had been programming for 3 months or less (one-off buyers)

# US respondents spending at least 6000 per month (kept for inspection only)
us_outliers = final_df[(final_df['CountryLive'] == 'United States of America') & (final_df['MoneyPerMonth'] >= 6000)]
# Remove the respondents who did not attend a bootcamp
no_bootcamp = final_df[(final_df['CountryLive'] == 'United States of America') & 
                       (final_df['MoneyPerMonth'] >= 6000) &
                       (final_df['AttendedBootcamp'] == 0) ]

final_df = final_df.drop(no_bootcamp.index)

# Remove the respondents who had been programming for 3 months or less
less_than_3_months = final_df[(final_df['CountryLive'] == 'United States of America') & 
                              (final_df['MoneyPerMonth'] >= 6000) &
                              (final_df['MonthsProgramming'] <= 3) ]

final_df = final_df.drop(less_than_3_months.index)
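
The two US removals above can also be expressed as one boolean mask. The sketch below is a rewrite on an invented four-row frame (the column names match the survey, the values do not come from it): a row is dropped when it belongs to a US respondent spending at least 6000 per month who either did not attend a bootcamp or had been programming for 3 months or less.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    'CountryLive': ['United States of America'] * 3 + ['India'],
    'MoneyPerMonth': [8000.0, 9000.0, 7000.0, 8000.0],
    'AttendedBootcamp': [0.0, 1.0, 1.0, 0.0],
    'MonthsProgramming': [12.0, 2.0, 10.0, 1.0],
})

is_us_big_spender = ((toy['CountryLive'] == 'United States of America') &
                     (toy['MoneyPerMonth'] >= 6000))
no_bootcamp = toy['AttendedBootcamp'] == 0
short_history = toy['MonthsProgramming'] <= 3

# Union of the two removal conditions, restricted to US big spenders
to_drop = is_us_big_spender & (no_bootcamp | short_history)
kept = toy[~to_drop]

print(kept)
```

Dropping the union in one step gives the same rows as the two sequential drops, and it keeps the rule readable in one place. The Indian respondent in the toy data is untouched because the mask is limited to the US.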

Q7: Best market to advertise in after removing outliers?¶

money_spent = final_df.groupby(['CountryLive'])['MoneyPerMonth'].mean()
population = final_df['CountryLive'].value_counts(normalize = True) * 100

print(money_spent)

CountryLive
Canada                       93.065400
India                        65.758763
United Kingdom               45.534443
United States of America    142.654608
Name: MoneyPerMonth, dtype: float64

print(population)

United States of America    74.967908
India                       11.732991
United Kingdom               7.163030
Canada                       6.136072
Name: CountryLive, dtype: float64

print(population*money_spent)

Canada                        571.055986
India                         771.546974
United Kingdom                326.164561
United States of America    10694.517462
dtype: float64

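The US is clearly the best market to advertise in: it has both the highest mean spending and by far the largest share of respondents.
India is the second choice: it has less money per person than Canada, but almost twice as many respondents.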
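
The ranking can be reproduced from the printed numbers alone. The sketch below copies the mean spending and the respondent shares from the outputs above into plain dictionaries and multiplies them; the product is only a rough combined score (mean dollars per month times percent of respondents), not an estimate of revenue.

```python
# Values copied from the printed outputs above
mean_spend = {
    'United States of America': 142.654608,
    'India': 65.758763,
    'United Kingdom': 45.534443,
    'Canada': 93.065400,
}
share_pct = {
    'United States of America': 74.967908,
    'India': 11.732991,
    'United Kingdom': 7.163030,
    'Canada': 6.136072,
}

# Combined score: mean spending weighted by share of respondents
score = {country: mean_spend[country] * share_pct[country] for country in mean_spend}

# Countries from highest to lowest score
ranking = sorted(score, key=score.get, reverse=True)

for country in ranking:
    print('%-26s %10.2f' % (country, score[country]))
```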

Output of display(df1[:9]) in Q1:

	Job Title	Frequency
0	Full-Stack Web Developer	823
1	Front-End Web Developer	450
2	Data Scientist	152
3	Back-End Web Developer	142
4	Mobile Developer	117
5	Game Developer	114
6	Information Security	92
7	Full-Stack Web Developer, Front-End Web Deve...	64
8	Front-End Web Developer, Full-Stack Web Deve...	56

Output of display(df.iloc[:3,:8]) in the Dataset section:

	Age	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation
0	27.0	NaN	NaN	NaN	NaN	NaN	more than 1 million
1	34.0	NaN	NaN	NaN	NaN	NaN	less than 100,000
2	21.0	NaN	NaN	NaN	NaN	NaN	more than 1 million

Output of display(final_df[['Age','CountryLive']].iloc[:3]) in Q3:

	Age	CountryLive
1	34.0	United States of America
2	21.0	United States of America
6	29.0	United Kingdom