Vtuber Superchat Data Analysis - A Data Science Pipeline Tutorial

If you want to follow along, the associated files are hosted here.

Table of Contents

  1. Introduction
    1. What are Vtubers?
    2. Why care?
  2. Data Collection
    1. Part 2 - Scraping
  3. Data Management and Representation
  4. Exploratory Data Analysis
    1. Correlation between time of day and superchats
      1. Findings
      2. Possible Future Work
    2. Correlation between view count and superchats
      1. Regression Warning
      2. Residuals Warning
      3. Video Game Streams
      4. Findings
  5. Machine Learning
  6. Summary/Conclusions
  7. Possible Future Work


What are Vtubers?

Vtubers, short for "Virtual Youtubers," are internet personalities who appear as virtual avatars (usually anime-style, and animated in Live2D, 3D, both, or neither). Originally popular mostly in Japan, they have gained a large audience overseas in recent years, to the point that many Westerners and "normal streamers" have adopted Vtuber personas themselves.

The Vtuber industry has grown explosively in the last 2-3 years, with entire agencies such as Hololive and Nijisanji being formed to manage and promote their Vtubers.

As stated on Wikipedia, YouTube's 2020 Culture and Trends report highlights VTubers as one of the notable trends of that year, with 1.5 billion views per month by October.

In summary, if you are familiar with Twitch streamers, you can think of Vtubers as similar personas but with anime avatars instead of showing their physical selves on camera.

Hololive Vtuber Hoshimachi Suisei (left) playing Minecraft with fellow Vtuber Shiranui Flare (right)


These Vtubers primarily make money through memberships, where viewers pay to get certain perks, and through superchats, where viewers pay to highlight their message. (Joining a membership can also have a message associated with it, so those messages are classified as superchats as well.) This is big business - at the time of writing (5/1/2021), 15 of the top 20 most superchatted channels worldwide were those of Vtubers (https://playboard.co/en/youtube-ranking/most-superchatted-all-channels-in-worldwide-total). Furthermore, the top Vtubers made upwards of \$1,000,000 USD in superchat revenue in 2020, and the top 193 channels each made over \$100,000 USD (again, many of them Vtubers). (Of course, not all of this money goes directly to the Vtuber - YouTube and the Vtuber's agency, if any, both take a cut.)

In this project, I aim to analyze the superchat earnings of various Vtubers, and hope to see if I can find any trends or indicators that might correlate with higher superchat earnings.

Why care?

As stated earlier, Vtubers are a growing business (from 2000 active Vtubers in May 2018 to over 10,000 in January 2020), and there are many people who hope to become Vtubers. The industry is so profitable that agencies have been formed to manage them. Both current and aspiring Vtubers thus stand to gain from knowing how to maximize their superchat earnings, or predict how much they might make based on how they plan to stream. This analysis aims to provide insight into both of those aspects. For fans, I feel that knowing this data is interesting in and of itself, and others have done studies regarding vtubers for fun. For example, one reddit user conducted a study of the languages used in various vtubers' chats. I hope to also post this study to reddit and get gilded.

This study could probably also be extended to non-vtuber streamers, widening the audience that might find this data useful.

For non-vtuber fans (and vtuber fans too!), you might learn about the data science pipeline, ranging from data collection to exploratory data analysis to some regression with machine learning.

Data Collection

Of course, the first thing we need to do is collect the data we want to analyze. To my knowledge, no public dataset of Vtuber superchat data exists, so we need to create it ourselves. To do this, we will use a combination of the Youtube Data API, web scraping, and the chat-downloader library (pip install chat-downloader), which also seems to do web scraping in the background.

For the dataset, I picked 50 different VTubers, most of whom were under the Vtuber talent agency Hololive (because it is by far the most popular agency), along with some from other agencies and a few independent Vtubers. The selection was not exactly scientific; I mostly just chose VTubers I liked or had heard of before, but made sure to get some spread in their popularities/subscriber counts. For each VTuber, I attempted to retrieve data on their last 50 videos. However, only livestreams with live chat replay counted towards the data, so this resulted in an average of closer to 45 videos per Vtuber (for a total of over 2200 individual videos). To start, some data on each of the 50 Vtubers - their anglicized name, group affiliation, and channel ID - was manually populated into a CSV.

Next, some functions were created to facilitate the collection of data.

get_superchats returns the timestamps of each superchat, paid sticker message, or membership message (each of which is paid, and classified as a superchat) for a given video URL, as well as the sum of the dollar value of those superchats. This function takes a while to run (on the scale of minutes to almost an hour per video, depending on the video's length).

Unfortunately I did not have the foresight to retain the currency data or the individual dollar values of each superchat.
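As a rough idea of what such a function could look like, here is a minimal sketch, not the code that was actually run for this project. It assumes that chat-downloader yields one dict per message with message_type, timestamp, and money (amount, currency) keys, and that the three paid message types are named as below; summarize_paid_messages and get_superchat_summary are hypothetical names. Unlike the original, it keeps a separate total per currency instead of a single dollar sum.

```python
from collections import defaultdict

# Message types treated as "superchats" in this tutorial. These string values
# are an assumption about chat-downloader's YouTube output.
PAID_TYPES = {"paid_message", "paid_sticker", "membership_item"}


def summarize_paid_messages(messages):
    """Collect timestamps and per-currency totals from chat message dicts."""
    timestamps = []
    totals = defaultdict(float)
    for msg in messages:
        if msg.get("message_type") not in PAID_TYPES:
            continue  # skip ordinary chat messages
        timestamps.append(msg["timestamp"])
        money = msg.get("money")  # membership messages may carry no amount
        if money:
            totals[money["currency"]] += money["amount"]
    return timestamps, dict(totals)


def get_superchat_summary(url):
    """Hypothetical wrapper: stream one video's chat replay and summarize it."""
    # Imported here so the summarizing logic above works without the library.
    from chat_downloader import ChatDownloader  # pip install chat-downloader

    # The message_groups values are an assumption about the library's options.
    chat = ChatDownloader().get_chat(url, message_groups=["messages", "superchat"])
    return summarize_paid_messages(chat)
```

Keeping the summarizing step separate from the download means it can be tried on a handful of hand-written dicts before committing to an hour-long download.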

get_last_50_videos returns a list of the last 50 videos for a given vtuber's channel. That list is then fed into get_all_vids_details, which in turn calls get_video_details to return some metadata for each video.

The data we will be logging are the following fields:

Field Description Ended up being used
video_name The title of the video
video_id Youtube's id for the video
description The description of the video, as written by the Vtuber
published_at The time at which the video was published
video_start_time The time at which the livestream actually started
video_end_time The time at which the livestream actually ended
num_superchats The amount of individual superchats received in a video
val_superchats The total value of superchats received in a video, in USD
locale Supposedly the language in which the video was made, but seems inaccurate
viewcount The number of views the video received.1
tags The tags that YouTube has assigned the video (gaming, entertainment, etc)
timestamps Timestamps for each individual superchat in the video

1 Note that this includes views from people who watched the recording after the livestream ended, but that number is comparatively very small, so we can treat the view count as the number of livestream viewers.

This code is the body of the data retrieval. As stated earlier, it took a very long time - so much so that I gave up on running it single-threaded in a Jupyter notebook. Instead, I rented two GCP instances, copied the code (as well as the above snippets) into new python scripts, and ran each one in parallel (3 scripts running on one machine, 2 on the other). Each script would ingest 10 Vtubers' worth of data (one would do the first 10 in the csv, one would do the next 10, etc.)
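The way the work was divided between the scripts can be sketched as follows. This is only an illustration: the placeholder names stand in for the rows of the CSV, and split_into_batches is a hypothetical helper rather than part of the original scripts.

```python
def split_into_batches(rows, batch_size=10):
    """Split a list into consecutive slices of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


# Placeholder names standing in for the 50 rows of the Vtuber CSV.
vtubers = [f"vtuber_{i}" for i in range(50)]
batches = split_into_batches(vtubers)

# Each batch would be handed to its own script / process.
for n, batch in enumerate(batches):
    print(f"script {n}: {batch[0]} .. {batch[-1]} ({len(batch)} channels)")
```

Because each batch is independent, a script that crashes only costs the progress on its own 10 channels.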

If anyone is interested, the two VMs were hosted using Google Compute Engine in us-east4, with the General-purpose e2-medium machine type and the default Debian GNU/Linux 10 (buster) image.

After waiting the whole day and night...

After running it for ~20 hours on the two GCP instances, we have all the data. They are stored as data_10.h5, data_20.h5, data_30.h5, data_40.h5, and data_49.h5 (for 50 total Vtubers). We can open them and see that our code worked!

Data Collection, Continued - Web Scraping

Although there is no API-based way to determine what exact game is being played in a stream, oftentimes YouTube will display the game title below the video. (This is not 100% foolproof, as I saw at least one instance of a game video that was not labelled, but it's about as good as we can get. Manually going through each video would be way too much of a hassle).

Monster Hunter Rise picture

Thus, for each video that the API did label as some sort of video game, we can request its page over HTTP, scrape it, and parse out the name of the game. If the video wasn't about a game, we simply populate the field with N/A.

Next, we merge the 5 individual parts of our dataframe into one single large dataframe. We probably should have done this before the previous step, but of course it's better late than never. With this, we're also able to easily see how many videos we were able to successfully ingest and parse: 2223 videos! Given that many of the 2500 videos we checked weren't live streams (so we can't use them), this is great because it offers us a lot of data points to work with. (If, for example, 80% of the videos we parsed hadn't been live streams, we would be left with only 500 data points, which would be a lot less meaningful to work with).

And just in case, I saved this combined dataframe to a file, so we can start from here later. This is the end of the data collection/curation process - next up is parsing.

Data Management + Representation

Well, we now have all of the data we need for our study, but a lot of it is not yet ready to work with. For example, the video start and end times are given as ISO 8601 strings (with a T between the date and the time, and a Z indicating that the time zone is UTC+0). We can't work with these strings directly and need to change them to datetime objects first. The same goes for the superchat timestamps, which are given in microseconds and also need to be changed to datetimes.

Since this part is run separately from the previous ones (so that we don't need to do that gruesome data collection again), we start by opening the file we saved earlier.

First up is changing all the timestamps to datetime objects. The datetime.utcfromtimestamp function (like most other timestamp utilities) expects timestamps in seconds, and we don't really need the granularity of microseconds anyway. Therefore we divide each timestamp by 1 million before converting it to a datetime.
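A minimal sketch of that conversion is below. The helper name micros_to_datetime is hypothetical, and it uses datetime.fromtimestamp with an explicit UTC time zone, a close relative of utcfromtimestamp that returns a timezone-aware object.

```python
from datetime import datetime, timezone


def micros_to_datetime(timestamp_us):
    """Convert a Unix timestamp in microseconds to a UTC datetime."""
    seconds = timestamp_us / 1_000_000  # microseconds -> seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Example value chosen for illustration, not taken from the dataset.
print(micros_to_datetime(1_600_000_000_000_000))
```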

Next, we convert the ISO strings to datetimes as well.
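One way to do this with the standard library is sketched here. It assumes the strings look like 2021-04-30T12:00:05Z, with no fractional seconds; parse_iso_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_iso_utc(text):
    """Parse an ISO 8601 string such as '2021-04-30T12:00:05Z' as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing Z means UTC+0, so attach that time zone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Example string made up for illustration.
print(parse_iso_utc("2021-04-30T12:00:05Z"))
```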

It'll also probably help for us to have the length of each video down the line. We can subtract the start time from the end time of the video very easily now that they're datetime objects, and then populate the column accordingly. We also need to convert some columns that look numeric - but are actually represented by strings - into numeric values.
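Both steps are short. The sketch below uses a made-up start time, end time, and row of values purely for illustration; the column names follow the field list from the data collection section.

```python
from datetime import datetime

# Made-up start and end times for one stream.
start = datetime(2021, 4, 30, 12, 0, 5)
end = datetime(2021, 4, 30, 14, 30, 5)

video_length = end - start                     # subtracting datetimes gives a timedelta
length_seconds = video_length.total_seconds()  # the same length as a plain number

# Columns that look numeric but arrive as strings (values are illustrative).
row = {"viewcount": "48213", "num_superchats": "127"}
for col in ("viewcount", "num_superchats"):
    row[col] = int(row[col])

print(video_length, length_seconds, row)
```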

Finally, we make a few more columns that reflect the data from other columns, but numerically so that we can work with them more easily. For example, we create an is_gaming column based on the tag data (similar to how we looked for streams to web scrape for the game title), with a value of 1 if it was a gaming stream and 0 if not. This is called a categorical variable - each stream either falls into the category or not, and there isn't really an ordering between gaming and non-gaming streams. We also make a vtuber_ordinal and affiliation_ordinal for each vtuber, based on their subscriber count and agency affiliation, respectively. Vtubers with lower subscriber counts receive a smaller ordinal number, and ones with higher subscriber counts get a higher number (based on their ranking among all 50 of the Vtubers in my dataset). Similarly, I ordered the agency affiliations by my best estimate of their size and popularity, with more popular agencies receiving higher affiliation ordinals. As implied by their column names, these are ordinal variables - similar to categorical variables, except now there is a clear ordering between the categories (popularity).
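A minimal sketch of both encodings follows. The channel names and subscriber counts are invented for illustration, the tag check is an assumption about how gaming streams are labelled, and ordinal_by_rank and is_gaming are hypothetical helper names.

```python
def ordinal_by_rank(values_by_name):
    """Map each name to its 1-based rank, smallest value first."""
    ordered = sorted(values_by_name, key=values_by_name.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def is_gaming(tags):
    """Return 1 if any tag is 'gaming' (case-insensitive), else 0."""
    return 1 if any(tag.lower() == "gaming" for tag in tags) else 0


# Invented subscriber counts for three placeholder channels.
subscribers = {"channel_a": 250_000, "channel_b": 1_200_000, "channel_c": 40_000}
vtuber_ordinal = ordinal_by_rank(subscribers)

print(vtuber_ordinal)
print(is_gaming(["Gaming", "Entertainment"]), is_gaming(["Music"]))
```

The same ordinal_by_rank idea applies to the agencies, with a hand-assigned popularity estimate in place of the subscriber count.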

Here is a useful resource to learn more about the differences between the kinds of variables we use.

With that, our data representations are complete, and we can move on to exploratory data analysis!

Exploratory Data Analysis

Finally, we can start to analyze our data! We would like to see if there is any way to predict or approximate the superchat earnings that a Vtuber will make in a stream, or made in a stream for which this information is not available. Formally, we wish to test the null hypothesis:

$$H_0: \textrm{There is no relationship between any of the factors we analyze and the superchat earnings}$$

against the alternative hypothesis:

$$H_1: \textrm{There is some sort of relationship between at least one of the factors we analyze and the superchat earnings}$$

at the significance level $\alpha = 0.05$.

As an initial step, we can use correlation matrices to determine if any of the values in our dataframe appear to be correlated. We use both a Pearson correlation (for the continuous values) and a Spearman correlation (for the categorical/ordinal variables such as vtuber_ordinal and is_gaming).
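To make the difference between the two concrete, here is a small from-scratch sketch using only the standard library. It is not the code used for the matrices in this tutorial (pandas' DataFrame.corr offers both via its method argument), and the example data is made up. Spearman is simply Pearson applied to the ranks of the values, with tied values sharing their average rank.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: y grows with x, but not along a straight line.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(pearson(xs, ys), spearman(xs, ys))
```

On data like this, Spearman reports a perfect monotonic relationship while Pearson comes out slightly lower, because the relationship is not a straight line.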

Here is a resource on Pearson vs Spearman correlation.