Project 4: Twitch Users' Dataset

Hoanglan Nguyen
Apr 6, 2022
3 min read

Updated: Apr 19, 2022

Gaming has become one of the big industries in the entertainment world. Every year, millions of dollars are invested in Esports, and many companies, both new and established, want to invest in it. Streamers have become famous by showcasing and playing games that are popular or trending, which depends on the users' interests and preferences. Now that streaming has caught everyone's attention, let's analyze the user experience and the time spent on one of the most popular streaming sites, Twitch. How long do Twitch users spend on the site each day? What is the total number of viewing sessions per week? To investigate the users, this project applies clustering to a dataset of Twitch streamers.

What is clustering and how does it work?

Clustering is a machine learning technique that groups data points. It is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. Given a set of data points, we can use a clustering algorithm to assign each data point to a specific group. Data points in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features.

How does it work? In general, we can use cluster analysis to gain valuable insights from the data by seeing which groups the data points fall into when we apply a clustering algorithm. Many methods are used for clustering: K-means clustering, Mean-Shift clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Expectation-Maximization clustering using Mixture Models, and Agglomerative Hierarchical clustering.

The most popular clustering methods are K-means clustering and hierarchical (agglomerative) clustering. K-means clustering starts by assigning each data point to one of k groups, then repeatedly moves each point to the group whose center (the mean of its points) is nearest and recomputes the centers, so that points with similar characteristics gradually end up in the same cluster. Agglomerative clustering starts with each object as its own cluster and repeatedly merges the most similar clusters. The endpoint is a set of clusters, where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other.

K-means Clustering:
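
To illustrate K-means, here is a minimal sketch using scikit-learn's KMeans. The 2D points are invented for this example and are not from the Twitch dataset, and the choice of two clusters is an assumption that only makes sense for this toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (invented for illustration): two well-separated blobs in 2D.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# n_clusters=2 is an assumption that matches how the toy data was built.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # one cluster label per point

# Each row is the center (mean) of one cluster.
print(kmeans.cluster_centers_)
```

On real data the number of clusters is usually not known in advance, so it would have to be chosen by trying several values and comparing the results.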

Hierarchical/Agglomerative Clustering:
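
As a rough sketch of agglomerative clustering, the example below uses scikit-learn's AgglomerativeClustering on invented 2D points. The data, the Ward linkage, and the choice to stop at two clusters are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data (invented for illustration): two well-separated blobs in 2D.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[8.0, 8.0], scale=0.5, size=(30, 2)),
])

# Every point starts as its own cluster; the closest clusters are merged
# step by step until only n_clusters remain. Ward linkage merges the pair
# that increases the within-cluster variance the least.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(points)

print(labels)
```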

Introduce the data

This dataset consists of attributes/features such as number of viewers, number of active viewers, followers gained, and many other relevant columns about a particular streamer. The data helps us understand what the users are like and whom they usually watch.

The features and columns that are included in this dataset are:

Channel
Watch time (minutes)
Stream time (minutes)
Peak viewers
Average Viewers
Followers
Followers gained
Viewers gained
Partnered
Mature
Language

Before creating the data visualizations, I removed the "Partnered", "Mature", and "Language" columns since they are irrelevant for this project.
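
Removing those columns could look something like the sketch below, assuming the data is loaded into a pandas DataFrame. The rows here are invented stand-ins, and only a few of the listed columns are included to keep the example short.

```python
import pandas as pd

# Invented rows standing in for the real dataset; only the column
# names are taken from the list above.
df = pd.DataFrame({
    "Channel": ["streamer_a", "streamer_b", "streamer_c"],
    "Watch time (minutes)": [1_000_000, 2_500_000, 400_000],
    "Followers": [50_000, 72_000, 8_000],
    "Partnered": [True, True, False],
    "Mature": [False, True, False],
    "Language": ["English", "Korean", "English"],
})

# Drop the three columns that are not used in this project.
df = df.drop(columns=["Partnered", "Mature", "Language"])

print(df.columns.tolist())
```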

Data Understanding

I began by previewing the data frame and doing some data analysis, looking at the relationships between the various features.

First, I created a heatmap to get a broad overview of the relationships between the attributes. It shows strong linear relationships between some of the attributes.
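
A correlation heatmap of this kind can be sketched with pandas and seaborn as shown below. The numbers are randomly generated for illustration, so the correlations in this toy plot say nothing about the real Twitch data, and the output file name is a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data; only the column names come from
# the dataset description.
rng = np.random.default_rng(0)
followers = rng.integers(1_000, 1_000_000, size=40)
df = pd.DataFrame({
    "Followers": followers,
    "Average Viewers": followers * 0.01 + rng.normal(0, 500, size=40),
    "Peak viewers": followers * 0.05 + rng.normal(0, 2000, size=40),
    "Stream time (minutes)": rng.integers(10_000, 200_000, size=40),
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()

ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation between attributes (toy data)")
plt.tight_layout()
plt.savefig("twitch_heatmap.png")  # hypothetical output file name
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.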

Next, I created a pairplot to see how the attributes correlate with each other.

Pre-processing the data

In the first part of the preprocessing, I dropped the nulls to avoid errors. This should leave the dataset relatively clean.

There are no null values, which is great. I removed the "Partnered", "Mature", and "Language" columns just so I could run the data visualizations easily.
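
Checking for and dropping nulls might look like the sketch below, assuming a pandas DataFrame. The rows are invented, and one missing value is added on purpose so that the effect of dropping is visible; the real dataset had none.

```python
import numpy as np
import pandas as pd

# Invented rows; the NaN is deliberate so the drop has something to remove.
df = pd.DataFrame({
    "Channel": ["streamer_a", "streamer_b", "streamer_c"],
    "Peak viewers": [1200, np.nan, 300],
    "Followers": [50_000, 72_000, 8_000],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value.
clean_df = df.dropna()
print(len(clean_df))
```

If every count is zero, as it was here, dropna() leaves the data unchanged.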