I have been teaching myself data science and machine learning lately using MOOCs on Coursera and Udemy. And I am perpetually watching YouTube videos! So I decided to acquire data about YouTube videos and run my newly learned DS and ML agorithms to derive insights from that data. My goal is to analyze and sort channels based on their views, trending videos, likes, and dislikes. I also wish to research factors that effect the popularity of a YouTube Video. I retrieved this data from Kaggle. Kaggle is a really good resource to get authentic data sets to play around and practice with. I did some data preparation and cleaning prior to making insights to get rid of lengthy description column, skewed rows, and insufficient or incompete rows.
Before I could connect the data to tableau or upload the csv files into MATLAB or Pandas I decided to clean the data. I noticed there were a lot of uneven or empty rows. I also noticed how the description was way too long and didn't contain information that I could use for the kind of insights I wanted to draw. I decided to delete the description column and upload the the data in SSIS to practice and perform extraction, transformation, and load(ETL)
Following is a screenshot describing the process.
I used the first conditional split to get rid of the extra
rows that don't have sufficient information in the columns that I would need for getting my desired insights. I did this by using a boolean condition checking for empty columns. This got rid of 3323 rows. All of these were just leftover truncated data from that messy description column I manually removed. I then checked for skewed data to see what caused empty columns 15-26 by changing non empty columns. This got rid of 6 identical rows. The leftover data was send to an OLE DB destination hence producing a Raw Data Table of records in SQL
The data cleaning and preparation didn't end here. My Udemy instructor Kirill Eremenko mentioned how ETL process often becomes TETL
or TETLT instead of just ETL because we continue to prep and transfer data to get the most robust model. Before turning RAW Table into a WORK Table I check if empty Columns 15 through 26
were indeed empty and conditional split was indeed successful in removing all the skewed unecessary description. Turns out columns 15-26 are empty in the filtered out data so they were safe to discard without losing any rows. So now I proceeded towards making a WRK table using Procedures in SQL as follows:
A stored procedure lets us create a WORK table using the RAW table. Upon executing this command I got the following error.
Stored procedure was fully relying on implicit conversion by SQL but this error shows that there might just be some more anomalies in our data that we need to get rid of. Since it wasn't able to convert string to date
this tells me that there is a possibility that there is a record that still has some text instead of publish or trending date. Here is how I checked further.
This command returned all the records that have texts in the dates columns> next thing, I deleted all of these records because trending date and publish date are crucial to my research and any records that don't have that information won't be of any use to me.
This command removed all those records and I now had 19772. Before excluding these rows the data 19793 rows. So 21 records were removed.
I did some additional checks to thoroughly check my data. They didn't return any anamolous rows but it's always good to do extra checks.
Some times number of views is less than total likes and everyone is leaving comments like- "YouTube you'se drunk" or "YouTube be Trippin'" and I wanted to check for that lol.
I checked to make sure there is no data where the publish date is after trending date because that would have been an error or anomaly I wouldn't want to include in my study.
Fortunately all of my additional checks failed and didn't return any rows. At this point I felt my data was clean and I could proceed towards drawing insights from my data.
I have heard from a lot of social media enthusiasts like myself that there is a certain time in the day when one tends to get more likes and views. Some of my friends say weekday after 5 or during lunch are good times to change your Facebook profile picture because apparently that's prime browsing time. Some say Saturday morning is a good time to grab likes and comments on an Instagram! I always wondered if these rules apply to YouTube videos as well. So distributed the publish time along days of the week and found the following information. So it looks like people tend to upload the most on Wednesdays. And since this is a data full of all videos that have trended at some point I am guessing they are on to something with this. I could see how people would want spend most time on YouTube on humpday. And it makes so much why most of my favorite YouTubers tend to post new videos on Wedensdays. I also ran a similar test on trending dates and that data wasn't ver interesting or insightful to me. All it said was videos usually tend to trend on Tuesdays. I guess that's when YouTube decides to update their Trending page maybe, who knows? Here is the data I got