Vtuber Superchat Data Analysis - A Data Science Pipeline Tutorial

If you want to follow along, the associated files are hosted here.

Table of Contents

  1. Introduction
    1. What are Vtubers?
    2. Why care?
  2. Data Collection
    1. Web Scraping
  3. Data Management and Representation
  4. Exploratory Data Analysis
    1. Correlation between time of day and superchats
      1. Findings
      2. Possible Future Work
    2. Correlation between view count and superchats
      1. Regression Warning
      2. Residuals Warning
      3. Video Game Streams
      4. Findings
  5. Machine Learning
  6. Summary/Conclusions
  7. Possible Future Work

Introduction

What are Vtubers?

Vtubers, short for "Virtual Youtubers," are internet personalities who use virtual avatars (usually anime-style avatars, which may be animated in Live2D, 3D, both, or neither). Originally mostly popular in Japan, in recent years they have become more popular overseas, to the point that many westerners and "normal streamers" have adopted Vtuber personas themselves.

The Vtuber industry has grown explosively in the last 2-3 years, with entire agencies such as Hololive and Nijisanji being formed to manage and promote their Vtubers.

As stated on Wikipedia, YouTube's 2020 Culture and Trends report highlights VTubers as one of the notable trends of that year, with 1.5 billion views per month by October.

In summary, if you are familiar with Twitch streamers, you can think of Vtubers as similar personas but with anime avatars instead of showing their physical selves on camera.

Hololive Vtuber Hoshimachi Suisei (left) playing Minecraft with fellow Vtuber Shiranui Flare (right)

stats

These Vtubers primarily make money through memberships, where viewers can pay to get certain perks, and through superchats, where viewers can pay money to highlight their message. (Joining a membership can also have a message associated with it, so memberships are classified as superchats as well.) This is big business - as of this writing (May 1, 2021), 15 of the top 20 most superchatted channels worldwide were those of Vtubers (https://playboard.co/en/youtube-ranking/most-superchatted-all-channels-in-worldwide-total). Furthermore, the top Vtubers made upwards of \$1,000,000 USD in superchat revenue in 2020, and the top 193 channels all made over \$100,000 USD (again, many of them Vtubers). (Of course, not all of this money goes directly to the Vtuber - YouTube itself, and the agency the Vtuber works for [if any], both take a cut.)

In this project, I aim to analyze the superchat earnings of various Vtubers, and hope to see if I can find any trends or indicators that might correlate with higher superchat earnings.

Why care?

As stated earlier, Vtubers are a growing business (from 2,000 active Vtubers in May 2018 to over 10,000 in January 2020), and there are many people who hope to become Vtubers. The industry is so profitable that agencies have been formed to manage them. Both current and aspiring Vtubers thus stand to gain from knowing how to maximize their superchat earnings, or from predicting how much they might make based on how they plan to stream. This analysis aims to provide insight into both of those aspects. For fans, I feel that this data is interesting in and of itself, and others have done studies regarding Vtubers for fun. For example, one Reddit user conducted a study of the languages used in various Vtubers' chats. I hope to also post this study to Reddit and get gilded.

This study could probably also be extended to non-Vtuber streamers, potentially broadening the audience that finds this data useful.

For non-Vtuber fans (and Vtuber fans too!), you might learn about the data science pipeline, ranging from data collection to exploratory data analysis to some regression with machine learning.

Data Collection

Of course, the first thing we need to do is collect the data we want to analyze. To my knowledge, no dataset on Vtuber superchats publicly exists, so we need to create it ourselves. To do this, we will use a combination of the YouTube Data API, web scraping, and the chat-downloader library (pip install chat-downloader), which itself appears to do web scraping in the background.

For the dataset, I picked 50 different VTubers, most of them under the Vtuber talent agency Hololive (because they are by far the most popular agency), plus some from other agencies and a few independent Vtubers. The selection was not exactly scientific; I mostly just chose VTubers I liked or had heard of before, but made sure to get some spread in their popularities/subscriber counts. For each VTuber, I attempted to retrieve data on their last 50 videos. However, only livestreams with live chat replay counted towards the data, so this resulted in an average of closer to 45 videos per Vtuber (for a total of over 2200 individual videos). To start, some data on each of the 50 Vtubers - their anglicized name, group affiliation, and channel ID - was manually populated into a CSV.

Next, some functions were created to facilitate the collection of data.

get_superchats returns the timestamp of each superchat, paid sticker message, or membership message (each of which is paid, and thus classified as a superchat), as well as the summed dollar value of those superchats, for a given video URL. This function takes a while to run (on the scale of minutes to almost an hour per video, depending on the video's length).
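
The original helper isn't reproduced here, but a minimal sketch of what get_superchats might look like with chat-downloader follows. The 'superchat' message group and the 'timestamp'/'money' fields are the library's names as I understand them, and the currency handling is deliberately simplified:

```python
from chat_downloader import ChatDownloader

def get_superchats(video_url):
    """Collect the timestamp of every paid message in a video, plus their total value.

    The 'superchat' message group covers paid messages, paid stickers,
    and membership messages.
    """
    timestamps = []
    total_value = 0.0
    chat = ChatDownloader().get_chat(video_url, message_groups=['superchat'])
    for message in chat:
        timestamps.append(message['timestamp'])  # microseconds since the epoch
        money = message.get('money')
        if money:
            # Simplification: treats every amount as USD. A real run would
            # need to convert each currency to USD first.
            total_value += float(money['amount'])
    return timestamps, total_value
```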

Unfortunately I did not have the foresight to retain the currency data or the individual dollar values of each superchat.

get_last_50_videos returns a list of the last 50 videos for a given Vtuber's channel; its return value is fed into get_all_vids_details and subsequently get_video_details to retrieve some metadata for each video.
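
As a rough idea of how these might be implemented with the official google-api-python-client (the parts and field names below follow the YouTube Data API v3; the API key is a placeholder):

```python
from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')  # placeholder key

def get_last_50_videos(channel_id):
    """Return the IDs of the channel's 50 most recent videos (50 is the API's page limit)."""
    response = youtube.search().list(
        part='id', channelId=channel_id,
        maxResults=50, order='date', type='video',
    ).execute()
    return [item['id']['videoId'] for item in response['items']]

def get_video_details(video_id):
    """Return one video's metadata: title, publish time, stream start/end, view count, etc."""
    response = youtube.videos().list(
        part='snippet,liveStreamingDetails,statistics', id=video_id,
    ).execute()
    return response['items'][0]
```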

The data we will be logging are the following fields:

video_name: The title of the video
video_id: YouTube's ID for the video
description: The description of the video, as written by the Vtuber
published_at: The time at which the video was published to YouTube
video_start_time: The time at which the livestream actually started
video_end_time: The time at which the livestream actually ended
num_superchats: The number of individual superchats received in a video
val_superchats: The total value of superchats received in a video, in USD
locale: Supposedly the language in which the video was made, but seems inaccurate
viewcount: The number of views the video received¹
tags: The tags that YouTube has assigned the video (gaming, entertainment, etc.)
timestamps: Timestamps for each individual superchat in the video

¹ Note that this count includes views that came after the livestream ended (when it was watched as a regular video), but that amount is comparatively very small, so we can treat the view count as the livestream viewership.

This code is the body of the data retrieval. As stated earlier, it took a very long time - so much so that I gave up on running it single-threaded in a Jupyter notebook. Instead, I rented two GCP instances, copied the code (along with the snippets above) into new Python scripts, and ran them in parallel (3 scripts running on one machine, 2 on the other). Each script ingested 10 Vtubers' worth of data (one handled the first 10 in the CSV, the next handled the following 10, and so on).
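
A condensed sketch of what each script's main loop might have looked like, reusing the helpers sketched above (the file and column names are assumptions):

```python
import pandas as pd

START, END = 0, 10  # this copy of the script handles Vtubers 0-9

vtubers = pd.read_csv('vtubers.csv')  # columns assumed: name, affiliation, channel_id
rows = []
for _, vtuber in vtubers.iloc[START:END].iterrows():
    for video_id in get_last_50_videos(vtuber['channel_id']):
        details = get_video_details(video_id)
        if 'liveStreamingDetails' not in details:
            continue  # not a livestream (or no chat replay), so skip it
        timestamps, total_value = get_superchats(
            f'https://www.youtube.com/watch?v={video_id}')
        rows.append({
            'video_name': details['snippet']['title'],
            'video_id': video_id,
            'published_at': details['snippet']['publishedAt'],
            'video_start_time': details['liveStreamingDetails'].get('actualStartTime'),
            'video_end_time': details['liveStreamingDetails'].get('actualEndTime'),
            'num_superchats': len(timestamps),
            'val_superchats': total_value,
            'viewcount': details['statistics'].get('viewCount'),
            'timestamps': timestamps,
        })

pd.DataFrame(rows).to_hdf(f'data_{END}.h5', key='df')
```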

If anyone is interested, the two VMs were hosted using Google Compute Engine in us-east4, as general-purpose e2-medium instances running the default Debian GNU/Linux 10 (buster).

After waiting the whole day and night...

After running it for ~20 hours on the two GCP instances, we have all the data. They are stored as data_10.h5, data_20.h5, data_30.h5, data_40.h5, and data_49.h5 (for 50 total Vtubers). We can open them and see that our code worked!

Data Collection, Continued - Web Scraping

Although there is no API-based way to determine what exact game is being played in a stream, oftentimes YouTube will display the game title below the video. (This is not 100% foolproof, as I saw at least one instance of a game video that was not labelled, but it's about as good as we can get. Manually going through each video would be way too much of a hassle).

Monster Hunter Rise picture

Thus, we can take the videos that the API did label as some sort of video game, issue HTTP requests for their watch pages, and parse the game's name out of each response. If the video wasn't about a game, we simply populate the field with N/A.
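
A fragile sketch of such a scraper: the game title appears inside the page's embedded ytInitialData JSON (under a richMetadataRenderer block at the time of writing), but this markup is an assumption and YouTube can change it at any time:

```python
import re
import requests

def get_game_title(video_id):
    """Best-effort scrape of the game name shown under a YouTube video."""
    html = requests.get(f'https://www.youtube.com/watch?v={video_id}').text
    # The game's name sits in the page's embedded JSON; this pattern is an
    # assumption based on the page structure at the time of writing.
    match = re.search(
        r'"richMetadataRenderer".*?"title":\{"simpleText":"(.*?)"\}', html)
    return match.group(1) if match else 'N/A'
```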

Next, we merge the 5 individual parts of our dataframe into one single large dataframe. We probably should have done this before the previous step, but of course it's better late than never. With this, we're also able to easily see how many videos we were able to successfully ingest and parse: 2223 videos! Given that many of the 2500 videos we checked weren't live streams (so we can't use them), this is great because it offers us a lot of data points to work with. (If, for example, 80% of the videos we parsed hadn't been live streams, we would be left with only 500 data points, which would be a lot less meaningful to work with).
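
The merge itself is only a few lines with pandas (the HDF5 key name and the checkpoint filename are assumptions):

```python
import pandas as pd

# Combine the five per-script files into one dataframe.
parts = [pd.read_hdf(f'data_{n}.h5', key='df') for n in (10, 20, 30, 40, 49)]
df = pd.concat(parts, ignore_index=True)
print(len(df))  # 2223 rows, one per successfully parsed livestream

df.to_pickle('combined.pkl')  # checkpoint so we never have to re-scrape
```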

And just in case, I saved this combined dataframe to a file, so we can start from here later. This is the end of the data collection/curation process - next up is parsing.

Data Management + Representation

Well, we now have all of the data we need to conduct our studies, but a lot of it is not yet ready to work with. For example, the video start and end times are given as ISO8601-format strings (with a T between the date and the time, and a Z indicating that the time zone is UTC+0). We can't immediately work with these strings and need to convert them to datetime objects first. We also need to do the same with the superchat timestamps, which are given in microseconds; those need to become datetimes as well.

Since this part is run separately from the previous ones (so that we don't need to do that gruesome data collection again), we start by opening the file we saved earlier.

First up is changing all the timestamps to datetime objects. The datetime.utcfromtimestamp function (and most other timestamp utilities) prefer timestamps to be in seconds, and we don't really need the granularity of microseconds anyways. Therefore we divide each timestamp by 1 million before converting it to a datetime.

Next, we convert the ISO strings to datetimes as well.

It'll also probably help for us to have the length of each video down the line. We can subtract the start time from the end time of the video very easily now that they're datetime objects, and then populate the column accordingly. We also need to convert some columns that look numeric - but are actually represented by strings - into numeric values.
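
Putting the last few steps together, a sketch of the conversions might look like this (column names follow the field table above; the checkpoint filename is an assumption):

```python
from datetime import datetime
import pandas as pd

df = pd.read_pickle('combined.pkl')

# Superchat timestamps are in microseconds; scale to seconds before converting.
df['timestamps'] = df['timestamps'].apply(
    lambda stamps: [datetime.utcfromtimestamp(ts / 1_000_000) for ts in stamps])

# The ISO8601 strings (e.g. '2021-04-30T12:00:00Z') parse directly with pandas.
df['video_start_time'] = pd.to_datetime(df['video_start_time'])
df['video_end_time'] = pd.to_datetime(df['video_end_time'])

# Video length in minutes, now that both endpoints are datetimes.
df['video_length'] = (
    df['video_end_time'] - df['video_start_time']).dt.total_seconds() / 60

# Numeric-looking strings (e.g. the view counts from the API) become real numbers.
df['viewcount'] = pd.to_numeric(df['viewcount'])
```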

Finally, we make a few more columns that reflect data from other columns, but numerically, so that we can work with them more easily. For example, we create an is_gaming column based on the tag data (similar to how we looked for streams to web scrape for the game title), with a value of 1 if it was a gaming stream and 0 if not. This is called a categorical variable - each stream either falls into the category or not, and there is no inherent ordering between the categories. We also make a vtuber_ordinal and affiliation_ordinal for each Vtuber, based on their subscriber count and agency affiliation, respectively. Vtubers with lower subscriber counts receive a smaller ordinal number, and ones with higher subscriber counts get a higher number (based on their ranking among all 50 of the Vtubers in my dataset). Similarly, I ordered the agency affiliations by my best estimate of their size and popularity, with more popular agencies receiving higher affiliation ordinals. As implied by their column names, these are ordinal variables - similar to categorical variables, except now there is a clear ordering between the categories (popularity).
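
A sketch of how these columns might be built. The tag matching, the subscriber_counts series, and the agency ordering below are all illustrative assumptions:

```python
# Categorical: 1 if YouTube's tags mark the stream as gaming, else 0.
df['is_gaming'] = df['tags'].apply(
    lambda tags: int(any('gaming' in t.lower() for t in tags)))

# Ordinal: rank each Vtuber by subscriber count (1 = fewest subscribers).
# 'subscriber_counts' is a hypothetical Series indexed by Vtuber name.
vtuber_rank = subscriber_counts.rank().astype(int)
df['vtuber_ordinal'] = df['vtuber_name'].map(vtuber_rank)

# Ordinal: hand-ordered agency popularity (illustrative values only).
affiliation_order = {'independent': 1, 'voms': 2, 'holostars': 3,
                     'nijisanji': 4, 'hololive_id': 5, 'hololive_en': 6,
                     'hololive': 7}
df['affiliation_ordinal'] = df['affiliation'].map(affiliation_order)
```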

Here is a useful resource to learn more about the differences between the kinds of variables we use.

With that, our data representations are complete, and we can move on to exploratory data analysis!

Exploratory Data Analysis

Finally, we can start to analyze our data! We would like to see if there are any ways to predict/approximate the amount of superchat earnings that a Vtuber will make, or that they made in a past stream where this information is not available. Formally, we wish to test the null hypothesis:

$$H_0: \textrm{There is no relationship between any of the factors we analyze and the superchat earnings}$$

against the alternative hypothesis:

$$H_1: \textrm{There is a relationship between at least one of the factors we analyze and the superchat earnings}$$

at the significance level $\alpha = 0.05$.

As an initial step, we can use correlation matrices to determine whether any of the values in our dataframe appear to be correlated. We use both a Pearson correlation (for the continuous values) and a Spearman correlation (for the categorical/ordinal variables: vtuber_ordinal, is_gaming).

Here is a resource on Pearson vs Spearman correlation.
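
In pandas, each matrix is a single call (the column lists here are assumptions):

```python
import pandas as pd

continuous = ['viewcount', 'num_superchats', 'val_superchats',
              'video_length', 'stream_start_hour', 'stream_end_hour']
print(df[continuous].corr(method='pearson'))

ranked = ['vtuber_ordinal', 'affiliation_ordinal', 'is_gaming',
          'viewcount', 'num_superchats', 'val_superchats']
print(df[ranked].corr(method='spearman'))

# The scatter matrix referenced below:
pd.plotting.scatter_matrix(df[continuous], figsize=(12, 12))
```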

We see a few things. Obviously, num_superchats and val_superchats are highly correlated: the more superchats you have, the more total superchat income you will have. So are stream_start_hour and stream_end_hour: end hour is almost always after start hour (unless the day wraps around), and since most streams are 1-2 hours long, there's going to be a pretty evident trend. This can actually be seen in the scatter matrix.

Perhaps more interesting is the moderate correlation between the view count and both the number and value of superchats (but the much lower correlation between view count and average superchat value), which implies that streams with more viewers get more superchats. This is to be expected, but we will have to see in what way later.

Also of note is the correlation between is_gaming and video_length. This is also intuitive for those familiar with Vtubers, because longer streams are often video game streams.

The vtuber_ordinal is also quite correlated with num_superchats, val_superchats, and viewcount. Of course, we would expect it to be very highly correlated with viewcount, since we ranked the ordinal based on subscriber count (and usually, higher subscriber counts mean higher view counts). The other two correlations also make sense: more popular Vtubers probably make more money.

Finally, affiliation_ordinal is highly correlated with vtuber_ordinal (obviously, since we know which affiliation each Vtuber is in), and thus it is also correlated with the superchats they earn.

While most of the Vtubers are targeted towards a Japanese or Southeast Asian audience (with similar time zones), our dataset also includes Hololive EN, targeted towards English speakers (in particular, North America). Perhaps looking at only these Vtubers might give us some insight.

Interestingly, we see that stream_start_hour is slightly (negatively) correlated with num_superchats and viewcount (but not so much with val_superchats). Perhaps this means that afternoon streams get more views and superchats than morning streams? (Keep in mind hours are measured in UTC.) Alternatively, maybe one of the Vtubers streams a lot at a certain time and skews the data. Either way, it is worth looking into.

Correlation between time of day and superchats

Although the initial correlation matrix didn't really seem to show any correlation between the time of day of streams and the amount of superchats sent, there still might be a relationship. Streams usually last for at least an hour, with many being 6+ hours long, and this data is not initially reflected in our dataframe. So a 12-hour-long stream might have the majority of the superchats concentrated in the evening, and this would not be indicated in the correlation matrix. We did take timestamps of each individual superchat, though, so we can manually plot them out and see.

Unfortunately, I did not scrape individual dollar values of each superchat (only the sum for a given stream), which I regret somewhat; but as I am not keen on running the scraping code for another two days, we will make do with the number of superchats (rather than dollar amounts) for this analysis. Most superchats average around \$5-\$10 anyways, and outlier "red superchats" worth \$100+ are few and presumably very hard to predict, so this might be for the better. Future work could scrape individual superchat amounts if desired.

Note that "time of day" varies around the world. While most of the Vtubers I chose to put in this dataset are tailored for Japanese audiences (and a few Indonesian, but the time zones for that are similar enough), there are also a few (5) targeted towards Western audiences. In this case, I feel it would be prudent to split them into two groups: Hololive EN (tailored for Western audiences), and not Hololive EN (for everyone else). Let's start with the non-EN group for now, since we will have more data there.

Note: the blue line represents the number of streams in the dataset that were live at any given minute, and uses the scale on the left (for example, at 14:00 UTC there were over 700 streams live in the dataset). The yellow bars represent the total number of superchats that came in during an hour, and uses the scale on the right (for example, the first yellow bar shows that around 2500 superchats came in between 00:00 and 01:00 UTC in the dataset).

There's a lot of information going on with this graph, so let's dissect it.

The line graph by itself is pretty interesting: we can see that the number of live streams spikes on the hour. This makes sense because Vtubers often stream on a schedule, and usually start their streams on the hour. We can also see a rapid drop in stream count over the course of each hour, presumably as other streams end. It is also interesting that the most streams happen near 11:00PM JST, which sounds a bit late to be "prime time." However, after I consulted with some other Vtuber fans, they confirmed this was indeed the case, due to a combination of Japan's workplace culture resulting in workers getting home late, and poor sleep schedules on many Vtubers' parts.

We can see that while the number of superchats does generally follow the times when more Vtubers are streaming, there are some deviations. Between 04:00-11:00 UTC (1:00-8:00PM JST) there are comparatively fewer superchats (the bars are significantly below the line here). Meanwhile, between 15:00-18:00 UTC (12:00-3:00AM JST) there are comparatively more superchats, with the bars above the line. My hypothesis is that this also has to do with viewer behavior: 1:00-8:00PM JST is the afternoon and early evening, when many workers are still at work or commuting home, and are not able to catch streams - and are thus unable to superchat. On the other hand, dedicated fans (the ones who are more likely to superchat) are willing to stay up late to watch their Vtubers, so the number of superchats remains high even into the wee hours of the night. Furthermore, Western audiences who choose to watch these streamers (not an insignificant number) may be waking up around this time, increasing the chances of more superchats. (Of course, this is just speculation.)

For the EN Vtubers, it is a lot harder to draw conclusions about the data. The main reason for this is that there are only 5 Vtubers considered here (as opposed to 45 in the previous graph), so the individual differences between each one significantly impact our data (for example, one of them might be earning a lot more superchats than the rest, which would greatly skew the data). As such, we cannot make any definitive guesses about this data, but I will say that it does seem like most of the superchats come in at around 04:00-06:00 UTC (which is 12:00-2:00AM Eastern Time, or 9:00-11:00PM Pacific Time), which again reflects our previous hypothesis that more superchats tend to come in during the late evening (relative to the amount of streams that occur during said time).

Time of Day and Superchats: Summary

Although we did not do any quantitative analyses here nor find anything groundbreaking, we did corroborate some concepts that might have seemed like common sense.

Possible Future Work

While we qualitatively asserted that the shape of the "superchats per hour" distribution was similar to, but distinct from, the "streams live at any time" distribution, we did not back this up with numbers. In future work, one could use the 2-sample Kolmogorov-Smirnov test (another link for more info) to test whether the two samples come from the same distribution.
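
A sketch of that test with scipy: one sample holds the hour of every minute a stream was live, the other the hour of every superchat:

```python
import pandas as pd
from scipy.stats import ks_2samp

# One observation per minute each stream was live.
stream_live_hours = [
    (start + pd.Timedelta(minutes=m)).hour
    for start, end in zip(df['video_start_time'], df['video_end_time'])
    for m in range(int((end - start).total_seconds() // 60))
]

# One observation per individual superchat.
superchat_hours = [ts.hour for stamps in df['timestamps'] for ts in stamps]

statistic, p_value = ks_2samp(stream_live_hours, superchat_hours)
print(statistic, p_value)  # p < 0.05 would suggest the two shapes differ
```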

Relationship between view count and superchat earnings

This might be an interesting plot to see: how does the amount of views of a livestream relate to the actual superchat earnings it generated? We would certainly expect a more popular livestream with more viewers to earn more income, but is this actually what we observe? Is it linear? We can plot view count against superchat earnings for each stream and find out.

Note: View count can be somewhat misleading, since it shows all views including those after the livestream ended. However, the vast majority of views should come during the livestream, so it should not impact our data by much.

Hmmm... It doesn't seem like this graph is very informative. While most of the values are concentrated in relatively low-view, low-earnings videos, there are plenty of videos with much higher earnings and much higher view counts. In this case, it might help to plot each axis on a logarithmic scale. This is called a log-log plot.
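
Re-plotting with both axes on log scales is only a small change in matplotlib:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
for ax, title in ((ax1, 'Linear scale'), (ax2, 'Log-log scale')):
    ax.scatter(df['viewcount'], df['val_superchats'], s=5)
    ax.set(title=title, xlabel='View count', ylabel='Superchat earnings (USD)')
ax2.set_xscale('log')
ax2.set_yscale('log')
plt.show()
```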

Now this is very interesting! On the log-log scale, the data show a much clearer trend.

Let's try doing a regression on this data to see if we can find out exactly what this trend is!

The following regression is not what we want. It has been left as a demonstration.

Well, there certainly seems to be some sort of correlation, as evidenced by the p-value, but the r-value (the correlation coefficient, whose square $r^2$ tells us how much of the variance is accounted for by our model) leaves a bit to be desired, as it is somewhat low. Now, this isn't really a big problem, but let's plot our line of best fit along with our graph to see if it checks out.

Wait a second, that isn't even a straight line! This is because, on a log-log plot, things that appear to be linear are actually monomials of the form $y = ax^k$. Although this regression is okay, it definitely isn't what we wanted, and really misses the mark on the lower-view-count, less-superchatted videos.
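
To see why, take the log of both sides of the monomial:

$$y = ax^k \iff \log y = \log a + k \log x$$

so a "line" with slope $k$ and intercept $\log a$ on the log-log plot corresponds to the curve $y = ax^k$ on the original axes.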

This illustrates why it is important to double-check our work by plotting the fit and making sure it actually looks reasonable.

To fix this issue of having a curved line that should be straight, we can either:

  1. Fit a power-law (monomial) model directly on the original data, or
  2. Take the logarithm of both variables and perform an ordinary linear regression on the transformed data.

The second option should be a lot easier, so let's do that.

Note that a few streams earned $0 in superchats (exactly 66 of them) - perhaps superchats were disabled for those streams, or simply no viewer chose to send one.

In any case, since log(0) is undefined, we have to handle these points somehow: we can either keep them (for example, by adding a small constant to every value before taking the log) or drop them from the dataset.

These points only constitute about 3% of the data anyways, so whether we keep or ignore them should not meaningfully affect our results. In this case, I chose to ignore them.
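
A sketch of the transformed regression with scipy's linregress (column names as before):

```python
import numpy as np
from scipy.stats import linregress

# Drop the 66 zero-earning streams, since log(0) is undefined.
nonzero = df[df['val_superchats'] > 0]
log_views = np.log(nonzero['viewcount'])
log_earnings = np.log(nonzero['val_superchats'])

fit = linregress(log_views, log_earnings)
print(fit.slope, fit.intercept, fit.rvalue, fit.pvalue)
```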

Already, we can see the r-value is higher. This is good, as it indicates the model matches the data better. Let's plot it:

We can also just use the transformed variables themselves. This will be very important later, when we do operations on the data, and especially when we get into the machine learning. One problem with transforming variables is that some data might become less intuitively understandable, but thankfully in our case it is quite easy to transform the data back - just take the exp of the transformed view count and superchat earnings.

We can also plot the residuals, to see if our regression is good.

We can split our residuals into 4 equally-wide groups (based on the view count) and make a violin plot to check for normality.
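
A sketch of that check, reusing the log-space fit from above:

```python
import matplotlib.pyplot as plt
import pandas as pd

residuals = log_earnings - (fit.slope * log_views + fit.intercept)

# Four equally-wide bins over the (log) view count, one violin per bin.
bins = pd.cut(log_views, 4)
groups = [residuals[bins == interval] for interval in bins.cat.categories]
groups = [g for g in groups if len(g) > 0]  # violinplot cannot handle empty bins
plt.violinplot(groups, showmedians=True)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('View-count bin (equal width in log space)')
plt.ylabel('Residual (log space)')
plt.show()
```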

Great! The residuals seem quite random and somewhat uniformly distributed, so it's probable that our line of best fit is actually quite good (when applied to the log-log data). It does seem like the lower-viewed videos are not fitted quite as well by our model as the rest of the data, but only slightly.


Warning! The following section describes a pitfall that you should avoid.

Next up, I will show some code where I screwed up and attempted to take the residuals on the original (untransformed) data, which made the residuals look a lot worse. This actually goes against the whole point of transforming the data in the first place (so that our model fits better). It isn't really relevant anymore, but it's a good learning experience to cover, and we get to talk about a big word: "heteroscedasticity."

Unfortunately, this residual plot exhibits heteroscedasticity, which means that the spread of the residuals increases as the prediction (here, the view count) increases. This is not necessarily a bad thing, since we're bound to get higher errors with more popular streams just due to the nature of the data, so it might be hard to model. Usually a good way to fix this is transforming the variables (which we did, and that fixes the issue!)

Note that most of the residuals are above y=0 because our model can under-predict a stream by an unbounded amount (there is no upper limit on how much a video can be superchatted), but can only over-predict by a bounded amount (since there are no negative superchats).

Here is a great resource about the kinds of issues we might see in a residuals plot, including heteroscedasticity.

Here is a resource for working with and minimizing heteroscedasticity. It involves using generalized least squares instead of ordinary least squares for the regression. We might use it later in this tutorial.

Back to regularly scheduled data science


Anyways, from our basic regression, it seems like superchat earnings can be roughly modelled by the equation:

$y = 0.0663126 x^{0.758142}$

Where x is the number of viewers and y is the superchat revenue.
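
These two constants fall straight out of the log-space fit: the slope is the exponent, and the exponentiated intercept is the leading coefficient:

$$\log y = b + k \log x \implies y = e^b x^k, \qquad k \approx 0.758142, \quad a = e^b \approx e^{-2.713} \approx 0.0663126$$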

Video game streams - better or worse?

Next, let's see if video game streams do better or worse than this average.

Interesting. The r-value is higher (giving us $r^2 \approx 0.4$), implying that this regression fits this subset of the data better than our original line fit the entire dataset, but that's probably to be expected. Also, this regression line has a greater slope but lower intercept than the original regression. A possible hypothesis to explain this: in general, people do not superchat as much on game streams (as evidenced by the fact that most of the data points have fewer views than the intersection of the two regression lines), but for popular games that a lot of people enjoy (and thus also watch), viewers are more willing to shell out money. Of course, checking whether this is the case is out of scope for this project.

Of course, the residuals exhibit similar behavior as before, and there is no reason to believe that this is a bad fit.

Next, let's take a closer look at the most popular games of the past few weeks.
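
Counting them is a quick value_counts (the game-title column name is an assumption):

```python
# Streams per labelled game; keep the games with more than 25 streams.
game_counts = df.loc[df['game_title'] != 'N/A', 'game_title'].value_counts()
print(game_counts[game_counts > 25])
```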

Seems like there are 6 games that were streamed over 25 times among all Vtubers in this study. We can plot these all to see if anything interesting is apparent immediately.

No trends jump out yet. Let's take a closer look at each one, and add regression lines. (Residuals have been omitted for brevity, but if we really wanted to be rigorous we could add those).

While it does seem like some of these games have significantly different slopes (in particular, the Among Us streams), it is important to note that only 24 points were used for that game's regression, so it's highly likely that this regression is inaccurate. Overall, we're starting to get too few data points in each category to be reliable, so we should probably end the views vs. superchat earnings analysis here, and conclude that individual game titles make little difference to superchat earnings.

Note: I talked with a Vtuber fan friend of mine who said that Among Us streams tend to not get many superchats. This seems to be corroborated by the data.

Views vs Superchats Findings, summarized

Machine Learning

Let's see if machine learning can use several of the factors we've examined (and seen correlate with earnings) to accurately model the amount of superchat earnings a stream will make!

First up is deciding the type of model we want to use. Since we want to predict a quantity, we want to use regression. scikit-learn provides a graphic for which algorithms to use:

sklearn algo cheat sheet

From our previous findings, there weren't too many factors that seemed to make a big impact on superchat earnings. We'll use video_length, viewcount (log-transformed), is_gaming, vtuber_ordinal, and affiliation_ordinal to predict val_superchats (log-transformed). According to the chart, we should use Lasso or ElasticNet¹, but I actually tried both of them and neither provided as good a result as plain old linear regression (which is equivalent to Lasso(alpha=0)). Thus, we will just use a LinearRegression on these variables.

While it is possible that we could perhaps transform the variables further, such as through a StandardScaler to scale our data to mean 0 and variance 1, in order to improve our regression (and make it work better with Lasso or ElasticNet), this comes with big drawbacks. It might become a lot harder to interpret the results of our regression, and ultimately less informative about the result as a whole. Right now, the only transformation we have is taking the log of the viewcount and the log of the superchat value: so to transform it back, you just take exp(value). This is still very understandable for humans. However, if we were to scale the data to unit variance and shift all the data around, it would be a lot harder to make sense of. For this reason, we will not apply any further preprocessing.

¹ Essentially, Lasso and ElasticNet build upon the standard LinearRegression (which uses ordinary least squares as its loss function to minimize), where they use the $\ell_1$ norm (Lasso) or both the $\ell_1$ and $\ell_2$ norms (ElasticNet) in the loss function as well. Learn more here.
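
A sketch of the whole fit with scikit-learn, assuming the log-transformed columns described earlier and a 75/25 train/test split (matching the score reported below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = ['video_length', 'log_viewcount', 'is_gaming',
            'vtuber_ordinal', 'affiliation_ordinal']

data = df[df['val_superchats'] > 0].copy()
data['log_viewcount'] = np.log(data['viewcount'])
data['log_val_superchats'] = np.log(data['val_superchats'])

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['log_val_superchats'],
    test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_, model.intercept_)   # one coefficient per feature
print(model.score(X_test, y_test))     # r^2 on the held-out 25%
```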

Essentially, what this model is saying is,

$$\log(\textrm{Superchat Earnings}) = 0.0033(\textrm{video_length}) + 0.315 \log(\textrm{viewcount}) - 0.785 (\textrm{is_gaming}) + 0.024(\textrm{vtuber_ordinal}) + 0.144(\textrm{affiliation_ordinal})$$

(Where log is the natural logarithm, video length is measured in minutes, and is_gaming, vtuber_ordinal, and affiliation_ordinal are as defined earlier. If a Vtuber who was not in this dataset wanted to use this formula, they would estimate their own ordinals based on their subscriber count and their agency's popularity.)

This is still quite easy for us to plug values in and get a good estimate - this would be a lot harder if we had done more transformations on our parameters.

Furthermore, recall that our simpler, non-ML regression earlier (based only on view count) had an $r^2$ of about 0.25 - and it used the entire dataset! While the $r^2$ of our new model is still somewhat low, at ~0.35, it is certainly better than before, and it only used 75% of the data to train on. We would not expect a very high score anyways, since human behavior is unpredictable. Overall, our machine learning model is indeed a better model for Vtuber earnings.

Our residuals look vaguely normal and are generally symmetric around 0, so this is good too.

Perhaps more importantly, if we drop the viewcount column (since we don't necessarily know how many viewers are going to watch a stream beforehand), we can still get a meaningful regression.
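
Refitting without the view count is a one-line change to the feature list from the previous sketch:

```python
# Same model, minus the view count (unknown before a stream starts).
features_no_views = [f for f in features if f != 'log_viewcount']
model_no_views = LinearRegression().fit(X_train[features_no_views], y_train)
print(model_no_views.score(X_test[features_no_views], y_test))
```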

So this model gives us the equation:

$$\log(\textrm{Superchat Earnings}) = 0.0037(\textrm{video_length}) - 0.83 (\textrm{is_gaming}) + 1.895(\textrm{vtuber_ordinal}) + 0.15(\textrm{affiliation_ordinal})$$

So even with only knowledge of video_length, is_gaming, vtuber_ordinal, and affiliation_ordinal (essentially, which Vtuber one is and what group they belong to), we can get almost as good of a regression! This is important because a Vtuber might be able to estimate how much they will make only given what topic/game they plan on streaming and how long they want to stream for.

Takeaways

While we weren't able to make a one-size-fits-all formula to model the amount of money a Vtuber will make from superchats, we gained a lot of insight into the nature of superchat earnings.

Hopefully you found this interesting and learned something new!

Possible Future Work

Vtubers list

This is a list of all the Vtubers I used in my dataset, with links to their channels. It is sorted in descending order of subscribers. Sorry if your favorite isn't here!

Name Affiliation
A.I.Channel independent
Gawr Gura Ch. hololive-EN hololive_en
Korone Ch. 戌神ころね hololive
Pekora Ch. 兎田ぺこら hololive
フブキCh。白上フブキ hololive
Mori Calliope Ch. hololive-EN hololive_en
Marine Ch. 宝鐘マリン hololive
Aqua Ch. 湊あくあ hololive
Watson Amelia Ch. hololive-EN hololive_en
Rushia Ch. 潤羽るしあ hololive
HAACHAMA Ch. 赤井はあと hololive
Coco Ch. 桐生ココ hololive
Noel Ch. 白銀ノエル hololive
Okayu Ch. 猫又おかゆ hololive
Matsuri Channel 夏色まつり hololive
Takanashi Kiara Ch. hololive-EN hololive_en
Ninomae Ina'nis Ch. hololive-EN hololive_en
Suisei Channel hololive
Subaru Ch. 大空スバル hololive
Watame Ch. 角巻わため hololive
Kanata Ch. 天音かなた hololive
Botan Ch.獅白ぼたん hololive
SoraCh. ときのそらチャンネル hololive
月ノ美兎 nijisanji
Mio Channel 大神ミオ hololive
Moona Hoshinova hololive-ID hololive_id
本間ひまわり - Himawari Honma - nijisanji
犬山たまき / 佃煮のりおチャンネル independent
Towa Ch. 常闇トワ hololive
Nene Ch.桃鈴ねね hololive
Kureiji Ollie Ch. hololive-ID hololive_id
鈴原るる【にじさんじ所属】 nijisanji
リゼ・ヘルエスタ -Lize Helesta- nijisanji
アルス・アルマル -ars almal- 【にじさんじ】 nijisanji
戌亥とこ -Inui Toko- nijisanji
Ayunda Risu Ch. hololive-ID hololive_id
竜胆 尊 / Rindou Mikoto nijisanji
天野ピカミィ. Pikamee voms
Pavolia Reine Ch. hololive-ID hololive_id
Airani Iofifteen Channel hololive-ID hololive_id
鷹宮リオン / Rion Takamiya nijisanji
夢月ロア🌖Yuzuki Roa nijisanji
Anya Melfissa Ch. hololive-ID hololive_id
魔界ノりりむ nijisanji
ぽちまる:POCHI-GOYA channel independent
Roberu Ch. 夕刻ロベル holostars
緋笠トモシカ - Tomoshika Hikasa - voms
Shien Ch.影山シエン holostars
ピーナッツくん!オシャレになりたい! independent
Aruran Ch. アルランディス holostars