Many people play video games, but a growing community enjoys an extra challenge: finishing a game as fast as possible. Practically every game has a speedrunning scene. In the early days, the practice was incredibly niche and splintered across various websites, but over the past two decades Speedrun.com has become the largest congregation of speedrunning content on the internet. As a member of the Minecraft: Java Edition speedrunning scene, I noticed an interesting phenomenon: the number of speedruns submitted during quarantine exploded, to the point where moderation became backlogged. Many have attributed this to the spare time afforded by quarantine, but I posit that one particular runner, Dream, may have had an outsized effect.
import requests
import json
import sys
import matplotlib.pyplot as plt
from matplotlib.dates import drange
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import math
import time
import statsmodels.api as sm
import matplotlib.dates as mdates
from patsy import dmatrices
import pickle
%matplotlib inline
Thankfully, Speedrun.com provides a REST API for us to pull information about games, leaderboards, and even individual runs. The API requires a User-Agent header that briefly describes the use, which we set here:
init_headers = {'User-Agent': 'uni-project-bot/1.0'}
Now we can pull every game that is on Speedrun.com. We request each game's ID, which uniquely identifies it and lets us pull more detailed information later.
# If data is already in directory, load that
if os.path.isfile("game_IDs.txt"):  # if the file exists we have already pickled a list
    with open("game_IDs.txt", 'rb') as f:
        game_IDs = pickle.load(f)
else:
    # Request every game in batches of 1000
    game_IDs = []
    offset = 0
    while True:
        # Request and unpack 1000 games
        URL = 'https://www.speedrun.com/api/v1/games?_bulk=yes&max=1000&offset=' + str(offset)
        response = requests.get(URL, init_headers)
        data = response.json()['data']
        # Add each game to the list
        for game in data:
            game_IDs.append(game['id'])
        offset += 1000
        # If length is less than max, we have reached the end
        if len(data) < 1000:
            break
print(len(game_IDs))
28755
We see that, in all, there are over 28 thousand games on Speedrun.com! With the IDs we can now pull and store all the data we need. Here is the dataframe we will use; we collect data in the following fields:
Game: This is the international name of the game
Category: The category of the speedrun, which defines the leaderboard. A game can have multiple types of speedruns, such as beating the full game vs. a single level, and the category differentiates these types. It can be split into further subcategories with the values.
Run Time: The length of the speedrun in seconds, defined as whatever time is used for rankings on the leaderboard
Date: The date the speedrun was submitted (not always present), in Year-Month-Day format
Values: Set of aspects of a run that can put it in a subcategory, such as using glitches vs. glitchless. Only present if it creates a subcategory. The values combined with the category designate what 'type' of run a person is submitting.
Game ID, Cat ID: Unique identifiers the API uses for finding games and categories
Note that the API returns runs current to the date of data collection, so future data collection may look different, as it will include runs that did not exist when I collected.
df = pd.DataFrame(columns = ['Game','Category','Run Time','Date','Values', 'Game ID', 'Cat ID'])
Finally, we use our game IDs to collect the data! For each game, we collect every verified run ever submitted to a leaderboard; rejected runs are excluded. This means we are essentially collecting the entire speedrun history of each game.
#game_IDs = ['j1npme6p']
path = os.getcwd()
path += '/FinalData(2)'
# If data is already in directory, load that
if os.path.isfile(path):
    df = pd.read_csv('FinalData(2)')
    df.drop('Unnamed: 0', axis=1, inplace=True)
# Else, collect the data
else:
    maxim = 200  # Max number of runs we pull at a time, 200 is the maximum allowed
    sec = 15     # Cooldown time for when the API throttles us
    track = 0
    # Extracts all runs for each game, stores them in df
    for game_ID in tqdm(game_IDs):
        # Every 500 games, save the dataframe to disk
        track += 1
        if track % 500 == 0:
            cwd = os.getcwd()
            path = cwd + "/DataV" + str(track // 500)
            df.to_csv(path)
        # Get info about game, categories, and variables
        URL = 'https://www.speedrun.com/api/v1/games/' + str(game_ID) + '?embed=categories.variables'
        response = requests.get(URL, init_headers)
        data = response.json()
        try:
            data = data['data']
        except KeyError:
            # Occurs if we get a throttling error. We wait 15 seconds then try again.
            if 'status' in data and data['status'] == 420:
                while 'status' in data and data['status'] == 420:
                    time.sleep(sec)
                    response = requests.get(URL, init_headers)
                    data = response.json()
                data = data['data']
            else:
                # If other error, print and move on
                # The only error in my case was a game not being found, presumably deleted between pulling the game list and pulling the runs
                print('1b')
                print(data)
                continue
        game = data['names']['international']
        cats = data['categories']['data']
        # Finds all the runs for each category
        for categ in cats:
            cat = categ['id']         # Category ID
            cat_name = categ['name']  # Category name
            offset = 0
            dir = 'asc'
            fin = ''
            sub_categories = []  # Collection of the variables that define subcategories
            all_vars = categ['variables']['data']  # Collects all variables of a run
            for var in all_vars:
                if var['is-subcategory']:
                    sub_categories.append(var['values']['values'])
            sub_keys = {}
            for s in sub_categories:
                # Assumes no two subcategories in the same category share a variable ID
                temp_dict = dict(s)
                for t in temp_dict.keys():
                    temp_dict[t] = temp_dict[t]['label']
                sub_keys.update(temp_dict)
            # Collect data on every run
            while True:
                # Asks the API for verified runs from this category, ordered by date submitted
                URL = 'https://www.speedrun.com/api/v1/runs?game=' + str(game_ID) + '&category=' + str(cat) + '&orderby=submitted&direction=' + str(dir) + '&status=verified&max=' + str(maxim) + '&offset=' + str(offset)
                response = requests.get(URL, init_headers)
                data2 = response.json()
                try:
                    data2 = data2['data']
                except KeyError:
                    # Throttling error. Wait 15 seconds and try again.
                    if 'status' in data2 and data2['status'] == 420:
                        while 'status' in data2 and data2['status'] == 420:
                            time.sleep(sec)
                            response = requests.get(URL, init_headers)
                            data2 = response.json()
                        data2 = data2['data']
                    elif 'times' in data2:
                        # Response already looks like run data; keep it
                        pass
                    else:
                        # If other error, print and move on
                        print(2)
                        print(data2)
                        continue
                for run in data2:
                    # Add game, category, time, date, and options
                    sub_cat = set()
                    # We store the label of the subcategory for ease of reading
                    for var in run['values'].values():
                        if var in sub_keys:
                            sub_cat.add(sub_keys[var])
                    df.loc[len(df.index)] = [game, cat_name, run['times']['primary_t'], run['date'], sub_cat, game_ID, cat]
                # If we collected fewer runs than the maximum, we're at the end of the list and break
                if len(data2) < maxim:
                    break
                # Need to work from the back of the list if the offset exceeds 10k (known API bug)
                if offset + maxim >= 10000:
                    fin = data2[-1]
                    dir = 'desc'
                    offset = 0
                    continue
                # If we're working backwards and find the run we ended on going forwards, we've found all runs and break
                if dir == 'desc' and fin in data2:
                    dir = 'asc'
                    fin = ''
                    break
                # If we collect 0 runs we break immediately (happens when no runs in category)
                if len(data2) == 0:
                    break
                offset += maxim
# Convert the dates from a string to a datetime object, which is easier to use
def time_convert(x):
    if pd.isna(x):
        return np.nan
    try:
        return datetime.strptime(x, '%Y-%m-%d')
    except ValueError:
        # If the date doesn't parse, report the offending type and treat it as missing
        print(type(x))
        return np.nan

df['Date'] = [time_convert(x) for x in df['Date']]
df
 | Game | Category | Run Time | Date | Values | Game ID | Cat ID
---|---|---|---|---|---|---|---
0 | Bibi & Tina: New Adventures With Horses | Main Missions | 3531.0 | 2022-04-21 | set() | ldej22j1 | wdmm094d |
1 | Bibi & Tina: New Adventures With Horses | Main Missions | 3482.0 | 2022-04-22 | set() | ldej22j1 | wdmm094d |
2 | Bibi & Tina: New Adventures With Horses | Main Missions | 3396.0 | 2022-04-23 | set() | ldej22j1 | wdmm094d |
3 | Bibi & Tina: New Adventures With Horses | Main Missions | 3346.0 | 2022-04-26 | set() | ldej22j1 | wdmm094d |
4 | Burger & Frights | Any% | 906.0 | 2021-09-01 | set() | 3698y4ld | zdnzx59d |
... | ... | ... | ... | ... | ... | ... | ... |
2580131 | 暖雪 Warm Snow | White Ash% NMG | 1045.0 | 2022-04-19 | set() | v1pxz946 | ndxnwvvk |
2580132 | 暖雪 Warm Snow | Fresh File% NMG | 2569.0 | 2022-02-10 | set() | v1pxz946 | vdoy5my2 |
2580133 | 暖雪 Warm Snow | Fresh File% NMG | 2351.0 | 2022-04-21 | set() | v1pxz946 | vdoy5my2 |
2580134 | 暖雪 Warm Snow | Fresh File% NMG | 1676.0 | 2022-04-21 | set() | v1pxz946 | vdoy5my2 |
2580135 | 鬼神童子ZENKI | Any% | 1390.0 | 2021-08-10 | set() | 9d387701 | 5dw180ek |
2580136 rows × 7 columns
80 hours and over 2.5 million runs later, we have finally finished data collection.
To examine how the number of submitted runs changes with time, I group them by month, using the average length of a month to split the data.
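The 4.345 figure used below is simply the average number of weeks in a month:

$$\frac{365\ \text{days}}{12\ \text{months} \times 7\ \text{days/week}} \approx 4.345\ \text{weeks per month}$$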
# Split range of dates in to approximately 1 month bins
bins = int(round((max(df['Date'])-min(df['Date']))/timedelta(weeks = 4.345),0))
print(bins)
605
We see we have about 605 months' worth of runs; now we can split the data between these months.
# Cut data into the bins based on submission date
df['Date_Cut'] = pd.cut(df.Date, bins = bins)
# We don't need the full interval for graphing; take left endpoints
def relabel(x):
    if pd.isna(x):
        return np.nan
    else:
        return x.left
df['Date_Cut'] = [relabel(x) for x in df['Date_Cut']]
Finally, we plot the runs as a bar chart, with a bar for each month.
# Count how many runs fall in each of the cuts
counts = df['Date_Cut'].value_counts()
counts = dict(counts)
# Plot these counts
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.Com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
plt.show()
This graph reveals several things, including both very old speedruns that were imported to Speedrun.com and runs with plain errors in their input date (such as the run supposedly submitted at the start of Unix time). To rein in the scope of this project, we will limit the data to runs submitted in 2014 or later, after Speedrun.com went online.
# Subset of more recent data
rec = df[df['Date'] >= '2014-01-01']  # unambiguous ISO date
# Count how many runs fall in each of the cuts
tot_counts = rec['Date_Cut'].value_counts()
tot_counts = dict(tot_counts)
# Plot these counts
plt.bar(*zip(*tot_counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.Com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
plt.show()
We see a rather large spike in 2020 that rises so rapidly the growth could be exponential. We can test this theory with a logarithmic y-axis:
plt.bar(*zip(*tot_counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.Com')
plt.xlabel('Date Submitted')
plt.yscale('log')
plt.ylabel('Number of runs submitted (log scale)')
plt.show()
Look at that! With this scale we see a nearly linear relationship, suggesting the number of runs submitted to Speedrun.com grows approximately exponentially. At this scale we can also see the arc of the speedrunning scene as a whole: gradual growth from 2014, a slowdown around 2018, then a huge and sustained spike in submitted speedruns in 2020. Let's see whether Minecraft matches this pattern.
# All runs with game ID associated with Minecraft: JE
mine = rec[rec['Game ID'] == 'j1npme6p']
# Count how many runs fall in each of the cuts
counts = mine['Date_Cut'].value_counts()
counts = dict(counts)
# Plot these counts
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Runs Submitted to Minecraft Leaderboards')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
plt.show()
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Runs Submitted to Minecraft Leaderboards')
plt.yscale('log')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted (log scale)')
plt.show()
Here we can see why people believed Minecraft speedrunning really took off after quarantine began. We see a modest number of runs submitted continuously up until 2019, then gradual growth through 2020, then an explosion going into 2021. However, the log graph shows the key differences between Minecraft and Speedrun.com as a whole. While we could draw a general linear trend from 2013 to 2020, it appears quite weak. More importantly, we see a huge spike starting in 2020 and going into 2021, even on the log graph, suggesting Minecraft's surge in popularity far exceeded its earlier growth. Further, we see a sharp decline starting in 2021 and leading into 2022. These last two features differ drastically from the site-wide results. This suggests that Minecraft's speedrunning popularity behaves differently from the site as a whole, and we can show this quantitatively with linear regressions.
We can use a linear regression to make an exponential fit of the Speedrun.com data by taking the log of the number of runs per month, then fitting a linear regression with respect to time. To start, let's copy the data we want: the months and the log of the number of runs in those months.
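To be explicit about what this transformation buys us: if $N_t$ is the number of runs submitted in month $t$, we are fitting

$$\log_{10} N_t = \beta_0 + \beta_1 t \quad \Longleftrightarrow \quad N_t = 10^{\beta_0 + \beta_1 t},$$

so a straight line in log space corresponds exactly to exponential growth in the raw counts.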
tot_freq = pd.DataFrame.from_dict([dict(tot_counts)]).melt()
tot_freq.rename(columns = {'variable': 'Month', 'value': 'Count'}, inplace = True)
# Take the log (base 10) of the counts
tot_freq["Count"] = tot_freq['Count'].apply(lambda x: math.log10(x))
# Change how time is represented, as datetime objects don't fit well with statsmodels
copy = tot_freq['Month'].copy()
tot_freq['Month'] = mdates.date2num(tot_freq['Month'])
tot_freq.head()
 | Month | Count
---|---|---
0 | 18631.705785 | 4.951022 |
1 | 18783.672727 | 4.945469 |
2 | 18662.099174 | 4.929122 |
3 | 18692.492562 | 4.925451 |
4 | 18722.885950 | 4.897220 |
Note that the months are converted to numbers, which is necessary for our linear regression. Speaking of which, we will now fit a linear regression to all the data using just the log-counts and the months.
X = tot_freq['Month']
X = sm.add_constant(X)
mod = sm.OLS(tot_freq['Count'], X)
res = mod.fit()
res.params
const   -5.562301
Month    0.000554
dtype: float64
These parameters create a linear fit that looks like the following:
tot_freq['res'] = res.resid
tot_freq['fit'] = res.fittedvalues
tot_freq['Month'] = copy
tot_freq = tot_freq.sort_values(by = 'Month')
copy = tot_freq['Month'].copy()
fig,ax = plt.subplots()
ax.bar(tot_freq['Month'], tot_freq['Count'], width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted (log scale)')
ax.set_ylim(2.5,5)
#plt.yscale('log')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(tot_freq['Month'], tot_freq['fit'], color='k', label='Regression')
plt.show()
That seems to match expectations pretty closely! We can also plot the fit on the original data by exponentiating the model's predictions.
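That is, since the model was fit on $\log_{10}$ counts, the prediction on the original scale is simply

$$\hat{N}_t = 10^{\hat{\beta}_0 + \hat{\beta}_1 t}.$$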
fig,ax = plt.subplots()
#ax.bar(tot_freq['Month'], tot_freq['Count'], width = 31)
tot_freq['exp_fit'] = tot_freq['fit'].apply(lambda x: 10**x)
plt.bar(*zip(*tot_counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(tot_freq['Month'], tot_freq['exp_fit'], color='k', label='Regression')
plt.show()
And finally, let's plot the residuals this fit produces:
fig,ax = plt.subplots()
ax.bar(copy, tot_freq['res'], width = 31)
plt.title('Residuals of Linear Model')
plt.xlabel('Date Submitted')
plt.ylabel('Residual of log-count')
plt.show()
These residuals suggest a pattern of dips and peaks in speedruns that our model doesn't capture, which may stem from the wealth of information outside of just the date (such as game genre, number of players, etc.). Despite this, the p-values don't lie:
res.summary2().tables[1]['P>|t|']
const    7.883564e-34
Month    2.025207e-54
Name: P>|t|, dtype: float64
These p-values are minuscule, giving very strong evidence that the correlation between date and number of submitted runs does exist.
Now, ideally we would apply this model to Minecraft and see how the residuals look, but this would be misleading because Minecraft has significantly fewer runs than the site as a whole. So instead we will standardize both the Speedrun.com and Minecraft counts, fit our linear model to the standardized site-wide data, then apply it to the standardized Minecraft data.
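Standardizing here means converting each monthly log-count $x$ to a z-score,

$$z = \frac{x - \bar{x}}{s},$$

where $\bar{x}$ and $s$ are the mean and standard deviation of that series' monthly log-counts. This puts both series on a common scale regardless of their absolute volume.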
# Find standardized count scores
avg_totcount = tot_freq['Count'].mean()
std_totcount = tot_freq['Count'].std()
tot_freq['Standard Count'] = (tot_freq['Count'] - avg_totcount)/std_totcount
tot_freq.head()
 | Month | Count | res | fit | exp_fit | Standard Count
---|---|---|---|---|---|---
100 | 2013-12-09 06:25:35.206611456 | 2.721811 | -0.603005 | 3.324815 | 2112.590560 | -2.799662 |
98 | 2014-01-08 15:52:03.966942208 | 3.062206 | -0.279441 | 3.341646 | 2196.071160 | -2.139951 |
99 | 2014-02-08 01:18:32.727272704 | 3.054613 | -0.303864 | 3.358477 | 2282.850559 | -2.154666 |
97 | 2014-03-10 10:45:01.487603200 | 3.088136 | -0.287172 | 3.375309 | 2373.059111 | -2.089696 |
96 | 2014-04-09 20:11:30.247933952 | 3.130012 | -0.262128 | 3.392140 | 2466.832322 | -2.008537 |
tot_freq['Month']=mdates.date2num(tot_freq['Month'])
X = tot_freq['Month']
X = sm.add_constant(X)
mod = sm.OLS(tot_freq['Standard Count'], X)
res = mod.fit()
res.params
const   -18.854888
Month     0.001073
dtype: float64
tot_freq['res'] = res.resid
tot_freq['fit'] = res.fittedvalues
tot_freq['Month'] = copy
tot_freq = tot_freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
ax.bar(tot_freq['Month'], tot_freq['Standard Count'], width = 31)
plt.title('Standardized Speedrun.com Runs with Linear Fit')
plt.xlabel('Date Submitted')
plt.ylabel('Standardized Runs')
#plt.yscale('log')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(tot_freq['Month'], tot_freq['fit'], color='k', label='Regression')
plt.show()
We also note the sum of squared residuals, for later comparison.
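For reference, this quantity is just

$$\mathrm{SSR} = \sum_{i} \hat{e}_i^{\,2},$$

the sum of the squared residuals $\hat{e}_i$ from the standardized fit.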
r_sq = res.resid.apply(lambda x: x**2)
sum(r_sq)
8.650766598067708
And now we can standardize Minecraft and see if it fits this model.
freq = pd.DataFrame.from_dict([dict(counts)]).melt()
freq.rename(columns = {'variable': 'Month', 'value': 'Count'}, inplace = True)
# Take the log (base 10) of the counts
freq["Count"] = freq['Count'].apply(lambda x: math.log10(x))
# Change how time is represented, as datetime objects don't fit well with statsmodels
copy = freq['Month'].copy()
freq['Month'] = mdates.date2num(freq['Month'])
# Find standard scores
avg = freq['Count'].mean()
std = freq['Count'].std()
freq['Standard Count'] = (freq['Count'] - avg)/std
freq['fit'] = freq['Month'] * res.params[1] + res.params[0]
freq.head()
 | Month | Count | Standard Count | fit
---|---|---|---|---
0 | 18662.099174 | 3.075182 | 1.922728 | 1.174316 |
1 | 18692.492562 | 3.068928 | 1.915992 | 1.206935 |
2 | 18631.705785 | 3.025715 | 1.869447 | 1.141696 |
3 | 18722.885950 | 3.022841 | 1.866351 | 1.239555 |
4 | 18753.279339 | 2.943495 | 1.780886 | 1.272175 |
freq['Month'] = copy
freq['square_res'] = (freq['fit'] - freq['Standard Count'])**2
freq = freq.sort_values(by = 'Month')
copy = freq['Month'].copy()
fig,ax = plt.subplots()
ax.bar(freq['Month'], freq['Standard Count'], width = 31)
plt.title('Standardized Minecraft runs vs. Site fit')
plt.xlabel('Date Submitted')
plt.ylabel('Standardized log-count of runs')
#plt.yscale('log')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()
This fit somewhat follows the leaderboards, but the shape of Minecraft's runs doesn't match nearly as well as it did for the general Speedrun.com data. Also note the sum of squared residuals:
r_sq = freq['square_res'].sum()
r_sq
36.09247732952555
It is much higher, which is a bigger deal when we remember we are on a log scale. Let's fit Minecraft with its own data to see if that does a better job.
freq['Month']=mdates.date2num(freq['Month'])
X = freq['Month']
X = sm.add_constant(X)
mod = sm.OLS(freq['Count'], X)
res = mod.fit()
res.params
const   -13.530621
Month     0.000842
dtype: float64
freq['res'] = res.resid
freq['fit'] = res.fittedvalues
freq['Month'] = copy
freq = freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
plt.title('Runs Submitted to Minecraft Leaderboards')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs (log scale)')
ax.bar(freq['Month'], freq['Count'], width = 31)
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()
We see it roughly follows the data, but with a seemingly much looser fit. Let's examine the residuals:
fig,ax = plt.subplots()
ax.bar(copy, freq['res'], width = 31)
plt.title('Residuals of Minecraft Linear Model')
plt.xlabel('Date Submitted')
plt.ylabel('Residual of log-count')
plt.show()
Compared to before, this model has much larger residuals for Minecraft, with the same patterns that strain the assumptions of a linear model. However, a looser model can still be significant; let's look at the p-values:
res.summary2().tables[1]['P>|t|']
const    8.032060e-21
Month    3.208438e-23
Name: P>|t|, dtype: float64
These are minuscule, showing there is certainly a relationship between the number of speedruns and time. However, these values are quite different from those of the overall speedrunning community, so there must be something about Minecraft affecting its results. While quarantine obviously played a part in Minecraft's surge, it would have played a similar role in the overall speedrunning scene, yet we see a difference. So, I suggest another variable that boosted Minecraft only until its peak: Dream. To summarize, Dream is a very popular Minecraft YouTuber who, from 2019 through 2020, was Minecraft's most popular speedrunner. However, in December of 2020 it was found that Dream had cheated on his speedruns, leading to him publicly disavowing the community on Speedrun.com. Notice how close December 2020 is to the peak of Minecraft speedrunning; perhaps this could be an explanatory variable.
Dream uploaded his first world record on March 16, 2020, and the proof of his cheating was published on December 11, 2020. Thus we can consider the time between these two dates to be "peak Dream influence." We can add whether a month occurred between these two dates to our model.
freq['Month'] = mdates.date2num(freq['Month'])
start = mdates.datestr2num('03/16/2020')
end = mdates.datestr2num('12/11/2020')
freq['Dream'] = freq['Month'].between(start, end)
# 1 if the month occurs in 'peak Dream', 0 otherwise
freq['Dream'] = freq['Dream'].apply(lambda x: 1 if x else 0)
freq.head()
 | Month | Count | Standard Count | fit | square_res | res | Dream
---|---|---|---|---|---|---|---
94 | 16078.661157 | 0.000000 | -1.389596 | 0.006438 | 0.043588 | -0.006438 | 0 |
76 | 16109.054545 | 0.477121 | -0.875682 | 0.032027 | 0.476200 | 0.445095 | 0 |
90 | 16139.447934 | 0.301030 | -1.065352 | 0.057616 | 0.218820 | 0.243414 | 0 |
96 | 16169.841322 | 0.000000 | -1.389596 | 0.083205 | 0.012303 | -0.083205 | 0 |
80 | 16200.234711 | 0.477121 | -0.875682 | 0.108794 | 0.350716 | 0.368327 | 0 |
We want to add an interaction term, as we are suggesting that Dream both brought an influx of new runners and continued to bring popularity to the community; thus the growth rate of speedrunning with respect to time would have changed as well.
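Concretely, with $D_t$ as the 'peak Dream' indicator, the patsy formula `Count ~ Month*Dream` used below expands to

$$\log_{10} N_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t \times D_t),$$

so $\beta_2$ shifts the intercept during the Dream window while $\beta_3$ changes the growth rate there.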
y,X = dmatrices('Count ~ Month*Dream',freq, return_type = 'dataframe')
y = np.ravel(y)
X.head()
 | Intercept | Month | Dream | Month:Dream
---|---|---|---|---
94 | 1.0 | 16078.661157 | 0.0 | 0.0 |
76 | 1.0 | 16109.054545 | 0.0 | 0.0 |
90 | 1.0 | 16139.447934 | 0.0 | 0.0 |
96 | 1.0 | 16169.841322 | 0.0 | 0.0 |
80 | 1.0 | 16200.234711 | 0.0 | 0.0 |
Now we fit the model and plot.
mod = sm.OLS(y,X)
fit = mod.fit()
fit.params
Intercept     -12.196165
Month           0.000762
Dream         -35.893774
Month:Dream     0.001983
dtype: float64
Note how the Month:Dream coefficient is much larger than the Month coefficient.
freq['res'] = fit.resid
freq['fit'] = fit.fittedvalues
freq['Month'] = copy
freq = freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
ax.bar(freq['Month'], freq['Count'], width = 31)
plt.title('Dream Minecraft Model')
plt.ylabel('Number of Runs (log scale)')
plt.xlabel('Date')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()
fig,ax = plt.subplots()
freq['exp_fit'] = freq['fit'].apply(lambda x: 10**x)
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Dream Minecraft Model')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['exp_fit'], color='k', label='Regression')
plt.show()
We see on both scales that this model fits much more closely than the previous one. Let's examine the sum of squared residuals and the p-values:
r_sq = fit.resid.apply(lambda x: x**2)
sum(r_sq)
24.85337402573328
fit.summary2().tables[1]['P>|t|']
Intercept      1.069543e-18
Month          6.716696e-21
Dream          3.788544e-01
Month:Dream    3.690292e-01
Name: P>|t|, dtype: float64
This residual sum is significantly smaller, but the p-values suggest the Dream 'sweet spot' isn't significant, despite how nice the graph looks. But what if we also added the time after Dream was caught cheating, when he had told his fans not to associate with the leaderboard anymore? Any month after December 11, 2020 falls in this period.
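With $C_t$ as the post-cheating indicator, the formula `Count ~ Month*Dream + Month*Cheat` we fit shortly expands to

$$\log_{10} N_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t \times D_t) + \beta_4 C_t + \beta_5 (t \times C_t).$$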
freq['Month']=mdates.date2num(freq['Month'])
freq['Cheat'] = freq['Month'].apply(lambda x: 1 if x > end else 0)
freq.head()
 | Month | Count | Standard Count | fit | square_res | res | Dream | exp_fit | Cheat
---|---|---|---|---|---|---|---|---|---
94 | 16078.661157 | 0.000000 | -1.389596 | 0.058446 | 0.043588 | -0.058446 | 0 | 1.144051 | 0 |
76 | 16109.054545 | 0.477121 | -0.875682 | 0.081610 | 0.476200 | 0.395511 | 0 | 1.206731 | 0 |
90 | 16139.447934 | 0.301030 | -1.065352 | 0.104775 | 0.218820 | 0.196255 | 0 | 1.272844 | 0 |
96 | 16169.841322 | 0.000000 | -1.389596 | 0.127940 | 0.012303 | -0.127940 | 0 | 1.342579 | 0 |
80 | 16200.234711 | 0.477121 | -0.875682 | 0.151105 | 0.350716 | 0.326016 | 0 | 1.416135 | 0 |
y,X = dmatrices('Count ~ Month*Dream + Month*Cheat',freq, return_type = 'dataframe')
y = np.ravel(y)
mod = sm.OLS(y,X)
fit = mod.fit()
fit.params
Intercept       -6.464687
Month            0.000423
Dream          -41.625253
Month:Dream      0.002322
Cheat           61.912920
Month:Cheat     -0.003224
dtype: float64
freq['res'] = fit.resid
freq['fit'] = fit.fittedvalues
freq['Month'] = copy
freq = freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
ax.bar(freq['Month'], freq['Count'], width = 31)
plt.title('Dream popularity AND cheating Minecraft model')
plt.ylabel('Number of runs (log scale)')
plt.xlabel('Date')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()
fig,ax = plt.subplots()
#ax.bar(tot_freq['Month'], tot_freq['Count'], width = 31)
freq['exp_fit'] = freq['fit'].apply(lambda x: 10**x)
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Dream popularity AND cheating Minecraft model')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['exp_fit'], color='k', label='Regression')
plt.show()
We see a much tighter fit around the 'hump'. What about the p-values for this fit?
fit.summary2().tables[1]['P>|t|']
Intercept      2.191631e-07
Month          9.302297e-09
Dream          1.660017e-01
Month:Dream    1.535356e-01
Cheat          5.530645e-06
Month:Cheat    7.928875e-06
Name: P>|t|, dtype: float64
Well, the Dream 'sweet spot' still isn't significant, but the Dream 'cheat spot' certainly seems to be! This suggests the period after Dream was caught cheating is correlated with a decrease in speedrun submissions over the following year. Finally, let's plot the residuals.
freq['res'] = fit.resid
fig,ax = plt.subplots()
ax.bar(copy, freq['res'], width = 30)
plt.title('New Model Residuals')
plt.ylabel('Residual')
plt.show()
We see the residuals are much more closely centered around zero, with more uniform variance and less variation with respect to time, which is what we want from a linear model.
It appears from these regressions that Dream may have had an impact on Minecraft speedrunning, but not in the way I was expecting. There may be room to argue that Dream brought more people into speedrunning with his world record and fame, but there is a much stronger argument that his cheating and subsequent departure from the community was at least correlated with a shrinking of the community overall. This would explain why Minecraft seemingly nosedived in 2021 where Speedrun.com did not. However, there are other reasons why Minecraft's speedruns may have been irregular that could be explored in future research. These include the demographics of Minecraft players, which skew heavily toward children, for whom quarantine's disruption of school would make their presence (or absence) more pronounced. There are also other aspects, like game genre, which could have an effect. It would also be useful to rerun this analysis with change points other than those relating to Dream, to ensure the significance found here isn't a matter of overfitting.