Did Dream Make or Break the Minecraft Speedrunning Scene?

A data analysis by Johnny Rajala

Introduction

Many people play video games, but there is a growing community of people who enjoy an extra challenge: finishing the game as fast as possible. For practically every game, there is a speedrunning scene. In the early days of speedrunning, the practice was incredibly niche and splintered across various websites, but Speedrun.com has since become the biggest congregation of speedrunning content on the internet. As a member of the Minecraft: Java Edition speedrunning scene, I noticed an interesting phenomenon: the number of speedruns submitted during quarantine exploded, to the point where moderation became backlogged. Many runners have attributed this to the spare time afforded by quarantine, but I posit that one particular runner, Dream, may have had an outsized effect.

Collecting the data

In [1]:
import requests
import json
import sys
import matplotlib.pyplot as plt
from matplotlib.dates import drange
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
from tqdm import tqdm
import os
import math
import time
import statsmodels.api as sm
import matplotlib.dates as mdates
from patsy import dmatrices
import pickle
%matplotlib inline

Thankfully, Speedrun.com provides a REST API for us to pull information about games, leaderboards, and even individual runs. The API requires a User-Agent header that briefly describes the use, which we define here:

In [2]:
init_headers = {'User-Agent': 'uni-project-bot/1.0'}

Now we can pull every game that is on Speedrun.com. We request each game's ID, which uniquely identifies it and lets us request more detailed information later.

In [3]:
# If data is already in directory, load that
if os.path.isfile("game_IDs.txt"): # if file exists we have already pickled a list
    with open("game_IDs.txt", 'rb') as f:
        game_IDs = pickle.load(f)
else:
    # Request every game in batches of 1000
    game_IDs = []
    offset = 0
    while True:
        # Request and unpack 1000 games
        URL = 'https://www.speedrun.com/api/v1/games?_bulk=yes&max=1000&offset=' + str(offset)
        response = requests.get(URL, headers=init_headers)
        data = response.json()['data']
        # Add each game's ID to the list
        for game in data:
            game_IDs.append(game['id'])
        offset += 1000
        # If this batch is smaller than the max, we have reached the end
        if len(data) < 1000:
            break
print(len(game_IDs))
28755

We see that in all, there are over 28 thousand games on Speedrun.com! With the IDs, we can now pull and store all the data we need. Here is the dataframe we will use; we collect data in the following fields:

  • Game: This is the international name of the game

  • Category: The category of the speedrun, which defines the leaderboard. A game can have multiple types of speedruns, such as beating the full game vs. a single level, and the category differentiates these types. It can be split into further subcategories with the Values field.

  • Run Time: The length of the speedrun, defined as whatever time is used for rankings on the leaderboard, in seconds

  • Date: The date the speedrun was submitted (not always present), in Year-Month-Day format

  • Values: Set of aspects of a run that can put it in a subcategory, such as using glitches vs. glitchless. Only present if it creates a subcategory. The values combined with the category designate what 'type' of run a person is submitting

  • Game ID, Cat ID: Unique identifiers the API uses for finding games and categories

Note that the API returns runs that are current to the date of data collection, so future rounds of data collection may look different, as they will include runs that did not exist at the time I collected.

In [4]:
df = pd.DataFrame(columns = ['Game','Category','Run Time','Date','Values', 'Game ID', 'Cat ID'])

Finally, we use our game IDs to collect the data! For each game, we collect every verified run that was ever submitted to a leaderboard; rejected runs are excluded. This means we are essentially collecting the entire speedrun history of each game.

In [5]:
#game_IDs = ['j1npme6p']
path = os.getcwd()
path += '/FinalData(2)'
# If data is already in directory, load that
if os.path.isfile(path):
    df = pd.read_csv(path)
    df.drop('Unnamed: 0', axis=1, inplace=True)
# Else, collect the data
else:
    maxim = 200 # Max number of runs we pull at a time, 200 is the maximum allowed
    sec = 15 # Cooldown time in seconds for when the API throttles us
    track = 0
    # Extracts all runs for each game, stores them in df
    for game_ID in tqdm(game_IDs):
        # Every 500 games, save the dataframe to disk as a checkpoint
        track += 1
        if track % 500 == 0:
            cwd = os.getcwd()
            path = cwd + "/DataV" + str(track // 500)
            df.to_csv(path)
        same = ''
        # Get info about game, categories, and variables
        URL = 'https://www.speedrun.com/api/v1/games/' + str(game_ID) + '?embed=categories.variables'
        response = requests.get(URL, headers=init_headers)
        data = response.json() # This has failed exactly once for reasons unknown

        try:
            data = data['data']
        except KeyError:
            # Occurs if we get a throttling error. We wait 15 seconds then try again.
            if 'status' in data and data['status'] == 420:
                while 'status' in data and data['status'] == 420:
                    time.sleep(sec)
                    response = requests.get(URL, headers=init_headers)
                    data = response.json()
                data = data['data']
            else:
                # If other error, print and move on.
                # The only error in my case was a game not being found, presumably deleted
                # between pulling the game list and pulling its runs.
                print('Error fetching game', game_ID, ':', data)
                continue
        game = data['names']['international']
        cats = data['categories']['data']

        # Finds all the runs for each category
        for categ in cats:
            cat = categ['id'] # Category ID
            cat_name = categ['name'] # Category Name
            offset = 0
            dir = 'asc'
            fin = ''
            sub_categories = [] # Collection of the variables that define subcategories
            all_vars = categ['variables']['data'] # Collects all variables of a run

            for var in all_vars:
                if var['is-subcategory']:
                    sub_categories.append(var['values']['values'])
            sub_keys = {}
            for s in sub_categories:
                # Assumed no two sub-categories in the same category will have the same variable ID
                temp_dict = dict(s)
                for t in temp_dict.keys():
                    temp_dict[t] = temp_dict[t]['label']
                sub_keys.update(temp_dict)


            # Collect data on every run. 
            while(True):
                # Asks API for verified runs from this category, ordered by date submitted
                URL = 'https://www.speedrun.com/api/v1/runs?game=' + str(game_ID) + '&category=' + str(cat) + '&orderby=submitted&direction=' + str(dir) + '&status=verified&max=' + str(maxim) + '&offset=' + str(offset)
                response = requests.get(URL,init_headers)
                data2 = response.json()
                try:
                    data2 = data2['data']
                except:
                    # Throttling error. Wait 15 seconds and try again.
                    if 'status' in data2 and data2['status'] == 420:
                        while 'status' in data2 and data2['status'] == 420:
                            time.sleep(sec)
                            response = requests.get(URL,init_headers)
                            data2 = response.json()
                        data2 = data2['data']
                    elif 'times' in data2:
                        data2 = data2
                    else:
                        # If other error, print and move on
                        print(2)
                        print(data2)
                        continue


                for run in data2:
                    # Add game, category, time, date, and options
                    sub_cat = set()
                    # We store the label of the subcategory for ease of reading
                    for var in run['values'].values():
                        if var in sub_keys:
                            sub_cat.add(sub_keys[var])
                    df.loc[len(df.index)] = [game, cat_name, run['times']['primary_t'], run['date'], sub_cat, game_ID, cat]


                # If length of collected data is smaller than maximum we can collect, we're at the end of the list and break
                if len(data2) < maxim:
                    break

                # Need to work from the back of the list if the offset is more than 10k (known bug)
                if offset + maxim >= 10000:
                    fin = data2[-1]
                    dir = 'desc'
                    offset = 0
                    continue

                # If we're working backwards and find the run we ended on going forward, we've found all runs and break
                if dir == 'desc' and fin in data2:
                    dir = 'asc'
                    fin = ''
                    break

                # If we collect 0 runs we break immediately (happens when a category has no runs)
                if len(data2) == 0:
                    break

                offset += maxim
# Convert the dates from a string to a datetime object, which is easier to use
def time_convert(x):
    if pd.isna(x):
        return np.nan
    try:
        return datetime.strptime(x, '%Y-%m-%d')
    except ValueError:
        # If a date doesn't match the expected format, report it and treat it as missing
        print(type(x), x)
        return np.nan
df['Date'] = [time_convert(x) for x in df['Date']]
df
Out[5]:
Game Category Run Time Date Values Game ID Cat ID
0 Bibi & Tina: New Adventures With Horses Main Missions 3531.0 2022-04-21 set() ldej22j1 wdmm094d
1 Bibi & Tina: New Adventures With Horses Main Missions 3482.0 2022-04-22 set() ldej22j1 wdmm094d
2 Bibi & Tina: New Adventures With Horses Main Missions 3396.0 2022-04-23 set() ldej22j1 wdmm094d
3 Bibi & Tina: New Adventures With Horses Main Missions 3346.0 2022-04-26 set() ldej22j1 wdmm094d
4 Burger & Frights Any% 906.0 2021-09-01 set() 3698y4ld zdnzx59d
... ... ... ... ... ... ... ...
2580131 暖雪 Warm Snow White Ash% NMG 1045.0 2022-04-19 set() v1pxz946 ndxnwvvk
2580132 暖雪 Warm Snow Fresh File% NMG 2569.0 2022-02-10 set() v1pxz946 vdoy5my2
2580133 暖雪 Warm Snow Fresh File% NMG 2351.0 2022-04-21 set() v1pxz946 vdoy5my2
2580134 暖雪 Warm Snow Fresh File% NMG 1676.0 2022-04-21 set() v1pxz946 vdoy5my2
2580135 鬼神童子ZENKI Any% 1390.0 2021-08-10 set() 9d387701 5dw180ek

2580136 rows × 7 columns

80 hours and over 2.5 million runs later, we have finally finished data collection.

Exploratory Data Analysis

To examine how the number of submitted runs changes with time, I decided to group them by month, using the average length of a month (about 4.345 weeks) to split the data.

In [6]:
# Split range of dates in to approximately 1 month bins
bins = int(round((max(df['Date'])-min(df['Date']))/timedelta(weeks = 4.345),0))
In [7]:
print(bins)
605

We see we have about 605 months' worth of runs (roughly 50 years, a hint that some submission dates are suspect). Now we can split the data between these months.

In [8]:
# Cut data into the bins based on submission date
df['Date_Cut'] = pd.cut(df.Date, bins = bins)
# We don't need to know full interval for graphing, take left endpoints
def relabel(x):
    if pd.isna(x):
        return np.nan
    else:
        return x.left

df['Date_Cut'] = [relabel(x) for x in df['Date_Cut']]

Finally, we plot the runs as a bar chart, with a bar for each month.

In [9]:
# Count how many runs fall in each of the cuts
counts = df['Date_Cut'].value_counts()
counts = dict(counts)
# Plot these counts
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
plt.show()

This graph reveals several things, including both really old speedruns that were imported to Speedrun.com and runs with plain errors in their submission date (such as the run supposedly submitted at the start of Unix time). To rein in the scope of this project, we will limit the data to runs submitted in 2014 or later, when Speedrun.com went online.

In [10]:
# Subset of more recent data
rec = df[df['Date'] >= '2014-01-01']

# Count how many runs fall in each of the cuts
tot_counts = rec['Date_Cut'].value_counts()
tot_counts = dict(tot_counts)
# Plot these counts
plt.bar(*zip(*tot_counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
plt.show()

We see a rather large spike in 2020 that increases so rapidly the growth could be exponential. We can test this theory with a graph with a logarithmic y-axis:

In [11]:
plt.bar(*zip(*tot_counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.yscale('log')
plt.ylabel('Number of runs submitted (log scale)')
plt.show()

Look at that! On the log scale we see a nearly linear relationship, suggesting the number of runs submitted to Speedrun.com grows approximately exponentially. At this scale, we can see the explosion of the speedrunning scene as a whole: gradual growth from 2014, slowing around 2018, before a huge and sustained spike in submitted speedruns in 2020. Let's do the same thing with Minecraft and see if it matches.

In [12]:
# All runs with game ID associated with Minecraft: JE
mine = rec[rec['Game ID'] == 'j1npme6p']

# Count how many runs fall in each of the cuts
counts = mine['Date_Cut'].value_counts()
counts = dict(counts)
# Plot these counts
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Runs Submitted to Minecraft Leaderboards')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
plt.show()
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Runs Submitted to Minecraft Leaderboards')
plt.yscale('log')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted (log scale)')
plt.show()

Here we can see why people believed Minecraft speedrunning really took off after quarantine began. We see a modest number of runs continuously submitted up until 2019, then gradual growth through 2020, then an explosion going into 2021. However, the log graph shows the key differences between Minecraft and Speedrun.com as a whole. While we could draw a general linear trend from 2014 to 2020, it appears to be quite weak. More importantly, we see a huge spike starting in 2020 and going into 2021, even on the log graph, suggesting Minecraft's surge in popularity far exceeded its earlier trend. Further, we see a sharp decline starting in 2021 and leading into 2022. These last two features differ drastically from the site-wide results. This suggests that Minecraft's speedrunning popularity behaves differently from the site as a whole, and we can show this quantitatively with linear regressions.

Linear Regressions

We can use a linear regression to make an exponential fit of the Speedrun.com data by taking the log of the number of runs per month, then fitting a linear regression with respect to time. To start, let's copy the data we want: the months and the log of the number of runs in those months.
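
Concretely, fitting a line to the log-counts is equivalent to fitting an exponential curve to the counts themselves:

$$\log_{10}(\text{count}) = \beta_0 + \beta_1 \cdot \text{Month} \quad\Longleftrightarrow\quad \text{count} = 10^{\beta_0} \cdot 10^{\beta_1 \cdot \text{Month}}$$

so the slope $\beta_1$ corresponds to a constant multiplicative growth rate per unit of time.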

In [13]:
tot_freq = pd.DataFrame.from_dict([dict(tot_counts)]).melt()
tot_freq.rename(columns = {'variable': 'Month', 'value': 'Count'}, inplace = True)
# Take the log of the monthly counts
tot_freq["Count"] = tot_freq['Count'].apply(lambda x: math.log10(x))
# Change how time is represented, as datetime objects don't fit well with statsmodels
copy = tot_freq['Month'].copy()
tot_freq['Month'] = mdates.date2num(tot_freq['Month'])
tot_freq.head()
Out[13]:
Month Count
0 18631.705785 4.951022
1 18783.672727 4.945469
2 18662.099174 4.929122
3 18692.492562 4.925451
4 18722.885950 4.897220

Note that the months are converted to numbers, which is necessary for our linear regression. Speaking of which, we will now fit a linear regression of the log-counts against the months.

In [14]:
X = tot_freq['Month']
X = sm.add_constant(X)
mod = sm.OLS(tot_freq['Count'], X)
res = mod.fit()
res.params
Out[14]:
const   -5.562301
Month    0.000554
dtype: float64

These parameters create a linear fit that looks like the following:

In [15]:
tot_freq['res'] = res.resid
tot_freq['fit'] = res.fittedvalues
tot_freq['Month'] = copy
tot_freq = tot_freq.sort_values(by = 'Month')
copy = tot_freq['Month'].copy()
fig,ax = plt.subplots()
ax.bar(tot_freq['Month'], tot_freq['Count'], width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted (log scale)')
ax.set_ylim(2.5,5)
#plt.yscale('log')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(tot_freq['Month'], tot_freq['fit'], color='k', label='Regression')
plt.show()

That seems to match expectations pretty closely! We can also plot the fit on the original data by exponentiating the model's predictions (taking 10 to the power of each fitted value).

In [16]:
fig,ax = plt.subplots()
#ax.bar(tot_freq['Month'], tot_freq['Count'], width = 31)
tot_freq['exp_fit'] = tot_freq['fit'].apply(lambda x: 10**x)
plt.bar(*zip(*tot_counts.items()), width = 31)
plt.title('Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(tot_freq['Month'], tot_freq['exp_fit'], color='k', label='Regression')
plt.show()

And finally, let's plot the residuals this fit produces:

In [17]:
fig,ax = plt.subplots()
ax.bar(copy, tot_freq['res'], width = 31)
plt.title('Residuals of Linear Model')
plt.xlabel('Date Submitted')
plt.ylabel('Residual of log-count')
plt.show()

These residuals show a pattern of dips and peaks in submissions that our model doesn't capture, which likely stems from information beyond the date alone (such as game genre, number of players, etc.). Despite this, the p-values don't lie:

In [18]:
res.summary2().tables[1]['P>|t|']
Out[18]:
const    7.883564e-34
Month    2.025207e-54
Name: P>|t|, dtype: float64

These p-values are minuscule, giving very strong evidence that this correlation between time and the number of submitted runs does exist.

Now, ideally we would apply this model to Minecraft and see how the residuals looked, but this would be misleading because Minecraft has significantly fewer runs than the site as a whole. So instead we will standardize both the Speedrun.com and Minecraft counts, then find our linear fit and apply it to the standardized Minecraft data.
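
Standardizing means converting each monthly log-count to a z-score:

$$z_i = \frac{\log_{10}(\text{count}_i) - \bar{x}}{s}$$

where $\bar{x}$ and $s$ are the mean and standard deviation of that series' log-counts. This puts both datasets on a common, unitless scale, so their sums of squared residuals can be compared fairly.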

In [19]:
# Find standardized count scores
avg_totcount = tot_freq['Count'].mean()
std_totcount = tot_freq['Count'].std()
tot_freq['Standard Count'] = (tot_freq['Count'] - avg_totcount)/std_totcount
tot_freq.head()
Out[19]:
Month Count res fit exp_fit Standard Count
100 2013-12-09 06:25:35.206611456 2.721811 -0.603005 3.324815 2112.590560 -2.799662
98 2014-01-08 15:52:03.966942208 3.062206 -0.279441 3.341646 2196.071160 -2.139951
99 2014-02-08 01:18:32.727272704 3.054613 -0.303864 3.358477 2282.850559 -2.154666
97 2014-03-10 10:45:01.487603200 3.088136 -0.287172 3.375309 2373.059111 -2.089696
96 2014-04-09 20:11:30.247933952 3.130012 -0.262128 3.392140 2466.832322 -2.008537
In [20]:
tot_freq['Month']=mdates.date2num(tot_freq['Month'])
X = tot_freq['Month']
X = sm.add_constant(X)
mod = sm.OLS(tot_freq['Standard Count'], X)
res = mod.fit()
res.params
Out[20]:
const   -18.854888
Month     0.001073
dtype: float64
In [21]:
tot_freq['res'] = res.resid
tot_freq['fit'] = res.fittedvalues
tot_freq['Month'] = copy
tot_freq = tot_freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
ax.bar(tot_freq['Month'], tot_freq['Standard Count'], width = 31)
plt.title('Standardized Runs Submitted to Speedrun.com')
plt.xlabel('Date Submitted')
plt.ylabel('Standardized Runs')
#plt.yscale('log')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(tot_freq['Month'], tot_freq['fit'], color='k', label='Regression')
plt.show()

We also note the sum of squared residuals, for later comparison.

In [22]:
r_sq = res.resid.apply(lambda x: x**2)
sum(r_sq)
Out[22]:
8.650766598067708

And now we can standardize Minecraft's counts and see if they fit this model.

In [23]:
freq = pd.DataFrame.from_dict([dict(counts)]).melt()
freq.rename(columns = {'variable': 'Month', 'value': 'Count'}, inplace = True)
# Take the log of the monthly counts
freq["Count"] = freq['Count'].apply(lambda x: math.log10(x))
# Change how time is represented, as datetime objects don't fit well with statsmodels
copy = freq['Month'].copy()
freq['Month'] = mdates.date2num(freq['Month'])

# Find standard scores
avg = freq['Count'].mean()
std = freq['Count'].std()
freq['Standard Count'] = (freq['Count'] - avg)/std
# Apply the site-wide standardized fit to Minecraft's months
freq['fit'] = freq['Month'] * res.params['Month'] + res.params['const']
freq.head()
Out[23]:
Month Count Standard Count fit
0 18662.099174 3.075182 1.922728 1.174316
1 18692.492562 3.068928 1.915992 1.206935
2 18631.705785 3.025715 1.869447 1.141696
3 18722.885950 3.022841 1.866351 1.239555
4 18753.279339 2.943495 1.780886 1.272175
In [24]:
freq['Month'] = copy
freq['square_res'] = (freq['fit'] - freq['Standard Count'])**2
freq = freq.sort_values(by = 'Month')
copy = freq['Month'].copy()
fig,ax = plt.subplots()
ax.bar(freq['Month'], freq['Standard Count'], width = 31)
plt.title('Standardized Minecraft runs vs. Site fit')
plt.xlabel('Date Submitted')
plt.ylabel('Standardized Runs')
#plt.yscale('log')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()

This seems to somewhat fit the leaderboards, but the shape of Minecraft's runs doesn't match nearly as well as it did for the general Speedrun.com data. Also note the sum of squared residuals:

In [25]:
r_sq = freq['square_res'].sum()
r_sq
Out[25]:
36.09247732952555

It is much higher (about 36.1 vs. 8.65), which is an even bigger deal when we remember we are working on a log scale.
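
To put these residuals in perspective: the counts are on a log scale, so a residual of $r$ (after scaling back by the series' standard deviation) means the model misses the actual count by a multiplicative factor of $10^{r}$; for example, $r = 0.5$ is a miss by a factor of $10^{0.5} \approx 3.2$. Let's make a fit of Minecraft with its own data to see if it does a better job.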

In [26]:
freq['Month']=mdates.date2num(freq['Month'])
X = freq['Month']
X = sm.add_constant(X)
mod = sm.OLS(freq['Count'], X)
res = mod.fit()
res.params
Out[26]:
const   -13.530621
Month     0.000842
dtype: float64
In [27]:
freq['res'] = res.resid
freq['fit'] = res.fittedvalues
freq['Month'] = copy
freq = freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
plt.title('Runs Submitted to Minecraft Leaderboards')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs (log scale)')
ax.bar(freq['Month'], freq['Count'], width = 31)
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()

We see it roughly follows the data, but with a seemingly much looser fit. Let's examine the residuals:

In [28]:
fig,ax = plt.subplots()
ax.bar(copy, freq['res'], width = 31)
plt.title('Residuals of Minecraft Linear Model')
plt.xlabel('Date Submitted')
plt.ylabel('Residual of log-count')
plt.show()

Compared to before, this model has much larger residuals for Minecraft, with the same patterned structure that strains the assumptions of a linear model. However, a looser model can still be significant; let's look at the p-values:

In [29]:
res.summary2().tables[1]['P>|t|']
Out[29]:
const    8.032060e-21
Month    3.208438e-23
Name: P>|t|, dtype: float64

These are minuscule, showing there is certainly a relationship between the number of speedruns and time. However, these values are quite different from those of the overall speedrunning community, so there must be something about Minecraft affecting its results. While quarantine obviously played a part in Minecraft's surge, it would have played a similar role in the overall speedrunning scene, yet we see a difference. So, I suggest another variable which boosted Minecraft only until its peak: Dream. To summarize, Dream is a very popular Minecraft YouTuber who, from 2019 through 2020, was Minecraft's most popular speedrunner. However, in December of 2020 it was found that Dream had cheated on his speedruns, leading to him publicly disavowing the community on Speedrun.com. Notice how close December 2020 is to the peak of Minecraft speedrunning, so perhaps this could be an explanatory variable.

Dream uploaded his first world record on March 16, 2020, and the evidence of his cheating was published on December 11, 2020. Thus we can consider the time between these two dates to be "peak Dream influence." We can add whether a month occurred between these two dates to our model.

In [30]:
freq['Month']=mdates.date2num(freq['Month'])
start = mdates.datestr2num('03/16/2020')
#mdates.date2num(freq['Month']) 'Mar 16, 2020'
end = mdates.datestr2num('12/11/2020')
freq['Dream'] = freq['Month'].between(start,end)
# 1 if month occurs in 'peak Dream,' 0 otherwise
freq['Dream'] = freq['Dream'].apply(lambda x: 1 if x else 0)
freq.head()
Out[30]:
Month Count Standard Count fit square_res res Dream
94 16078.661157 0.000000 -1.389596 0.006438 0.043588 -0.006438 0
76 16109.054545 0.477121 -0.875682 0.032027 0.476200 0.445095 0
90 16139.447934 0.301030 -1.065352 0.057616 0.218820 0.243414 0
96 16169.841322 0.000000 -1.389596 0.083205 0.012303 -0.083205 0
80 16200.234711 0.477121 -0.875682 0.108794 0.350716 0.368327 0

We want to add an interaction term, as we are suggesting that Dream both brought an influx of new runners and continued to raise the community's popularity while active; thus the growth of speedrunning with respect to time would have changed as well, not just the baseline.
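
In other words, the model we are fitting is

$$\log_{10}(\text{count}) = \beta_0 + \beta_1 \cdot \text{Month} + \beta_2 \cdot \text{Dream} + \beta_3 \cdot (\text{Month} \times \text{Dream})$$

where Dream is the 0/1 indicator defined above: $\beta_2$ shifts the baseline during the "peak Dream" window, while the interaction coefficient $\beta_3$ captures the change in growth rate.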

In [31]:
y,X = dmatrices('Count ~ Month*Dream',freq, return_type = 'dataframe')
y = np.ravel(y)
X.head()
Out[31]:
Intercept Month Dream Month:Dream
94 1.0 16078.661157 0.0 0.0
76 1.0 16109.054545 0.0 0.0
90 1.0 16139.447934 0.0 0.0
96 1.0 16169.841322 0.0 0.0
80 1.0 16200.234711 0.0 0.0

Now we fit the model and plot.

In [32]:
mod = sm.OLS(y,X)
fit = mod.fit()
fit.params
Out[32]:
Intercept     -12.196165
Month           0.000762
Dream         -35.893774
Month:Dream     0.001983
dtype: float64

Note how the Month:Dream coefficient is much larger than the Month coefficient.

In [33]:
freq['res'] = fit.resid
freq['fit'] = fit.fittedvalues
freq['Month'] = copy
freq = freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
ax.bar(freq['Month'], freq['Count'], width = 31)
plt.title('Dream Minecraft Model')
plt.ylabel('Number of Runs (log scale)')
plt.xlabel('Date')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()

fig,ax = plt.subplots()
freq['exp_fit'] = freq['fit'].apply(lambda x: 10**x)
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Dream Minecraft Model')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['exp_fit'], color='k', label='Regression')
plt.show()

We see on both scales that this model fits much more closely than the previous one. Let's examine the residuals and p-values:

In [34]:
r_sq = fit.resid.apply(lambda x: x**2)
sum(r_sq)
Out[34]:
24.85337402573328
In [35]:
fit.summary2().tables[1]['P>|t|']
Out[35]:
Intercept      1.069543e-18
Month          6.716696e-21
Dream          3.788544e-01
Month:Dream    3.690292e-01
Name: P>|t|, dtype: float64

This sum of squared residuals is significantly smaller, but these p-values suggest the Dream "sweet spot" isn't significant, despite how nice the graph looks. But what if we also added the time after Dream was caught cheating, when he had told his fans not to associate with the leaderboard anymore? Any month after December 11, 2020 falls in this period.
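
The extended model adds a second indicator and interaction term:

$$\log_{10}(\text{count}) = \beta_0 + \beta_1 \cdot \text{Month} + \beta_2 \cdot \text{Dream} + \beta_3 \cdot (\text{Month} \times \text{Dream}) + \beta_4 \cdot \text{Cheat} + \beta_5 \cdot (\text{Month} \times \text{Cheat})$$

where Cheat is 1 for any month after December 11, 2020 and 0 otherwise, so $\beta_5$ captures the change in growth rate after the cheating scandal.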

In [36]:
freq['Month']=mdates.date2num(freq['Month'])
freq['Cheat'] = freq['Month'].apply(lambda x: 1 if x > end else 0)
freq.head()
Out[36]:
Month Count Standard Count fit square_res res Dream exp_fit Cheat
94 16078.661157 0.000000 -1.389596 0.058446 0.043588 -0.058446 0 1.144051 0
76 16109.054545 0.477121 -0.875682 0.081610 0.476200 0.395511 0 1.206731 0
90 16139.447934 0.301030 -1.065352 0.104775 0.218820 0.196255 0 1.272844 0
96 16169.841322 0.000000 -1.389596 0.127940 0.012303 -0.127940 0 1.342579 0
80 16200.234711 0.477121 -0.875682 0.151105 0.350716 0.326016 0 1.416135 0
In [37]:
y,X = dmatrices('Count ~ Month*Dream + Month*Cheat',freq, return_type = 'dataframe')
y = np.ravel(y)
mod = sm.OLS(y,X)
fit = mod.fit()
fit.params
Out[37]:
Intercept      -6.464687
Month           0.000423
Dream         -41.625253
Month:Dream     0.002322
Cheat          61.912920
Month:Cheat    -0.003224
dtype: float64
In [38]:
freq['res'] = fit.resid
freq['fit'] = fit.fittedvalues
freq['Month'] = copy
freq = freq.sort_values(by = 'Month')
fig,ax = plt.subplots()
ax.bar(freq['Month'], freq['Count'], width = 31)
plt.title('Dream popularity AND cheating Minecraft model')
plt.ylabel('Number of runs (log scale)')
plt.xlabel('Date')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['fit'], color='k', label='Regression')
plt.show()

fig,ax = plt.subplots()
#ax.bar(tot_freq['Month'], tot_freq['Count'], width = 31)
freq['exp_fit'] = freq['fit'].apply(lambda x: 10**x)
plt.bar(*zip(*counts.items()), width = 31)
plt.title('Dream popularity AND cheating Minecraft model')
plt.xlabel('Date Submitted')
plt.ylabel('Number of runs submitted')
ax2 = plt.twinx()
ax2.set_ylim(ax.get_ylim())
ax2.plot(freq['Month'], freq['exp_fit'], color='k', label='Regression')
plt.show()

We see a much tighter fit around the "hump." What about the p-values for this fit?

In [39]:
fit.summary2().tables[1]['P>|t|']
Out[39]:
Intercept      2.191631e-07
Month          9.302297e-09
Dream          1.660017e-01
Month:Dream    1.535356e-01
Cheat          5.530645e-06
Month:Cheat    7.928875e-06
Name: P>|t|, dtype: float64

Well, the Dream "sweet spot" still isn't significant, but the Dream "cheatspot" certainly seems to be! This suggests the period after Dream was caught cheating is correlated with a decrease in speedrun submissions over the following year.
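
We can read this decline directly off the fitted coefficients. Since Month is measured in days (matplotlib date numbers), the implied slope of the log-count after the scandal is

$$\beta_1 + \beta_5 \approx 0.000423 - 0.003224 = -0.002801$$

per day, which is negative (submissions shrinking), versus $\beta_1 + \beta_3 \approx 0.000423 + 0.002322 = 0.002745$ per day during the "peak Dream" window. Finally, let's plot the residuals.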

In [41]:
freq['res'] = fit.resid
fig,ax = plt.subplots()
ax.bar(copy, freq['res'], width = 30)
plt.title('New Model Residuals')
plt.ylabel('Residual')
plt.show()

We see the residuals are much more closely centered around zero, with more uniform variance and less structure with respect to time, which is what we want from a linear model.

Discussion and Future Research

It appears from these regressions that Dream may have had an impact on Minecraft speedrunning, but not in the way I was expecting. There may be room to argue that Dream brought more people into speedrunning with his world record and fame, but there is a much stronger argument that his cheating and subsequent departure from the community was at least correlated with a shrinkage of the community overall. This would explain why Minecraft seemingly nosedived in 2021 where Speedrun.com did not. However, there are other reasons Minecraft's speedruns may have been irregular that future research in this area could address. These include the demographics of Minecraft players, who skew young: quarantine keeping children in and out of school would make their presence (or absence) more pronounced. Other aspects like game genre could also have an effect. It would also be useful to rerun this analysis with cutoff dates other than those relating to Dream, to ensure the significance found here isn't a matter of overfitting.