An analysis of (my 3-year-old son's) YouTube view history using Pandas




Motivation

My 3-year-old boy has watched a lot of videos on YouTube over the past 2 years (under his parents' supervision, of course). I roughly know his taste (car cartoons, LEGO, etc.) but never had the chance to look into the details. Since YouTube keeps your watch history, I finally took the time to look at the data, and in this post I share what I found.

Get and clean the data

Obtain the data

There are about 1700 videos on the history feed page, most of them watched (or at least opened) by the boy. Initially I thought there would be tools available to download the data. I tried several, but none could return the full list. I didn’t want to scrape the data myself, which would mean dealing with authentication and AJAX issues. So I simply clicked the “Load more” button multiple times until all the videos were loaded. Finally, I copied the video list using Chrome dev tools and saved it to a text file.

Extract videos from HTML

I used BeautifulSoup to extract the video metadata from the raw HTML and saved the structured data in a CSV file.

# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import csv
import codecs


def parse_item(soup):
    video_div = soup.find('div', {'class': 'yt-lockup-thumbnail'})
    thumbnail_span = video_div.find('span', {'class': 'yt-thumb-clip'})
    thumbnail_image = thumbnail_span.find('img')['src']
    video_time = video_div.find('span', {'class': 'video-time'}).text

    content_div = soup.find("div", {"class": "yt-lockup-content"})
    title_div = content_div.find("h3", {"class": "yt-lockup-title"})
    titlelink = title_div.find('a')
    title = titlelink.text.encode('utf-8')
    video_link = titlelink['href']

    byline_div = content_div.find("div", {"class": "yt-lockup-byline"})
    bylink = byline_div.find('a')
    by_name = bylink.text.encode('utf-8')
    if not by_name:
        by_name = ''
    by_link = bylink['href']

    meta_div = content_div.find("ul", {"class": "yt-lockup-meta-info"})
    view_count = int(meta_div.text.split()[0].replace(',', ''))

    try:
        desc_div = content_div.find("div", {"class": "yt-lockup-description yt-ui-ellipsis yt-ui-ellipsis-2"})
        video_desc = desc_div.text.encode('utf-8')
    except AttributeError:
        # find() returned None: this video has no description
        video_desc = ''

    item = {
        'title': title,
        'video_link': video_link,
        'thumbnail_image': thumbnail_image,
        # 'video_desc': video_desc,
        'channel_name': by_name,
        'channel_link': by_link,
        'view_count': view_count,
        'video_time': video_time
    }
    return item


def write_csv(videos):
    keys = videos[0].keys()
    # binary mode for the Python 2 csv module; the file starts with a UTF-8 BOM
    with open('videos.csv', 'wb') as f:
        f.write(codecs.BOM_UTF8)
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(videos)


def process_html():
    # history.html holds the video list copied from the browser
    with open('history.html') as html_file:
        soup = BeautifulSoup(html_file)
    divs = soup.findAll("div", {"class": "feed-item-main-content"})

    videos = []
    for ind, div in enumerate(divs):
        print(ind)  # progress indicator
        item = parse_item(div)
        videos.append(item)
    write_csv(videos)

if __name__ == '__main__':
    process_html()
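
The view-count parsing inside parse_item can be tried in isolation. The following is only a minimal sketch: the helper name parse_view_count and the sample strings (such as '21,552 views') are my own illustrations of what the meta-info text might look like, not something taken from the actual page.

```python
def parse_view_count(meta_text):
    """Return the leading number of a string like '21,552 views' as an int."""
    # The count is assumed to be the first whitespace-separated token.
    first_token = meta_text.split()[0]
    # Drop the thousands separators before converting.
    return int(first_token.replace(',', ''))


print(parse_view_count('21,552 views'))
print(parse_view_count('6,276,713 views'))
```

If the first token is not a number (for example, when the view count is hidden), int() raises a ValueError, so a real script may want to catch that case.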

Explore the data using Pandas

In [1]:

import pandas as pd

df = pd.read_csv('videos.csv')
df = df.reindex(index=df.index[::-1])

df.head()
view_count title channel_link channel_name thumbnail_image video_link video_time
1523 21552 Create and deploy python web application in le... /channel/UCOQlyjAn_KD-3SEkUl4Sc9Q timothy crosley //i.ytimg.com/vi_webp/0L8TsmrZPLg/mqdefault.webp /watch?v=0L8TsmrZPLg 15:02
1522 6276713 로보카폴리16편 테리가 아파요 /channel/UCdlAM1KRcP5Skc84FiJG0Zw Tae hee Shin //i.ytimg.com/vi_webp/-It5hyuVQUI/mqdefault.webp /watch?v=-It5hyuVQUI 12:33
1521 4229857 로보카폴리 26화 새 친구, 후퍼.avi /channel/UChWhQByYjOwaINrEESG4HjA byulnoree //i.ytimg.com/vi_webp/vclQlXNIoos/mqdefault.webp /watch?v=vclQlXNIoos 11:18
1520 32113556 로보카 폴리 E03 콘크리트 대소동 HDTV x264 720p Known /channel/UCKnTxNGvpQW5vxhaMw7e5Gg 김우성 //i.ytimg.com/vi_webp/18oH5lFhRPk/mqdefault.webp /watch?v=18oH5lFhRPk 13:39
1519 133905 超可愛動畫 小汽車總動員3 好動小子 /channel/UCaSIarb0BNGUwg6fdPNzJnA 南強 //i.ytimg.com/vi_webp/u8Hc8GAJVL0/mqdefault.webp /watch?v=u8Hc8GAJVL0 4:59
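
The reindex call above reverses the row order, so that the earliest entries of the saved list come first (which suggests the history page lists the newest videos at the top). Here is a small sketch of the same idiom on a made-up three-row frame; the toy titles are placeholders, not real data.

```python
import pandas as pd

# Toy stand-in for videos.csv: row 0 is the most recently watched video.
toy = pd.DataFrame({'title': ['newest', 'middle', 'oldest']})

# Same idiom as in the post: reorder the rows by the reversed index labels.
oldest_first = toy.reindex(index=toy.index[::-1])

print(oldest_first)
```

Note that the original index labels are kept, so after the reversal the first row carries the largest label. That is why the head() output above starts at 1523.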

Question 1: what are his favorite videos?

In [2]:

df.title.value_counts().head()
VooV ブーブ 変身  Transform                                          5
Cars Toon - ENGLISH - Mater's Tall Tales - Maters - McQueen - kids movie - Mater Toons - the cars    5
Cool BIG FIRE TRUCKS Kids Song | Music Video | DVD gift for child    5
[English Version] Tobot Season1 Ep.2                            4
Dickie Toys Fire Engine Garbage Truck Train Lightning McQueen Toy Crash Testing Mega Review    4
dtype: int64
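
To see what value_counts does here, this is a minimal sketch on a made-up list of titles (the titles are placeholders). It also shows one possible follow-up, keeping only the titles that were opened more than once.

```python
import pandas as pd

# Hypothetical watch history: one entry per time a video was opened.
titles = pd.Series(['fire truck song', 'tobot ep.2', 'fire truck song',
                    'lego review', 'fire truck song', 'tobot ep.2'])

# Count how often each title occurs, most frequent first.
counts = titles.value_counts()

# Keep only the titles that appear more than once.
rewatched = counts[counts > 1]
print(rewatched)
```

Counting by title treats two different videos with the same title as one; counting the video_link column instead would avoid that.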

Question 2: what are his favorite channels?

In [3]:

df.channel_name.value_counts().head()
TOBOTYOUNGTOYS                79
BluCollection ToyCollector    46
DC Toys Collector             43
Tayo                          28
qihuu                         28
dtype: int64

Question 3: for a single channel, is there any pattern?

I will use the TOBOTYOUNGTOYS channel as an example, specifically the [English Version] Tobot Season1 series of videos. There are 29 episodes in Season 1. The diagram at the end of this section shows that the boy watched the season twice. In each round, he watched the season one episode after another (with some exceptions). However, when he watched the season the second time, he skipped some episodes and finished the round much more quickly than the first.

In [4]:

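tobot_channel_df = df[df.channel_name=='TOBOTYOUNGTOYS']
# .copy() so that adding a column does not trigger SettingWithCopyWarning
english_videos = tobot_channel_df[tobot_channel_df.title.str.startswith('[English Version]')].copy()
# the episode number is the text after the last '.', e.g. 'Ep.12' -> 12
# (assumes every matching title ends in 'Ep.<number>')
english_videos['episode'] = [int(item[-1]) for item in english_videos.title.str.split('.').tolist()]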
english_videos.head()
view_count title channel_link channel_name thumbnail_image video_link video_time episode
256 167372 [English Version] Tobot Season1 Ep.1 /channel/UC1WEOQjm8tyT1pXY6xaoYZQ TOBOTYOUNGTOYS //i.ytimg.com/vi_webp/3M6bH7CHJqM/mqdefault.webp /watch?v=3M6bH7CHJqM 21:01 1
255 66753 [English Version] Tobot Season1 Ep.2 /channel/UC1WEOQjm8tyT1pXY6xaoYZQ TOBOTYOUNGTOYS //i.ytimg.com/vi_webp/CN8VcN1sVBk/mqdefault.webp /watch?v=CN8VcN1sVBk 21:08 2
254 197296 [English Version] Tobot Season1 Ep.3 /channel/UC1WEOQjm8tyT1pXY6xaoYZQ TOBOTYOUNGTOYS //i.ytimg.com/vi_webp/Qcfq1GvX17Q/mqdefault.webp /watch?v=Qcfq1GvX17Q 21:00 3
253 41275 [English Version] Tobot Season1 Ep.4 /channel/UC1WEOQjm8tyT1pXY6xaoYZQ TOBOTYOUNGTOYS //i.ytimg.com/vi_webp/R9NFJ3x0wto/mqdefault.webp /watch?v=R9NFJ3x0wto 21:07 4
252 26022 [English Version] Tobot Season1 Ep.5 /channel/UC1WEOQjm8tyT1pXY6xaoYZQ TOBOTYOUNGTOYS //i.ytimg.com/vi_webp/XEzmmYekZuY/mqdefault.webp /watch?v=XEzmmYekZuY 21:19 5
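
Splitting the title on '.' works as long as every title ends in 'Ep.<number>'. As a hedged alternative, the sketch below uses a regular expression and returns None for titles that do not match. The helper name episode_number and the pattern are my own assumptions about the title format, based only on the titles shown above.

```python
import re

# Assumed title format: '... Ep.<digits>' at the very end of the title.
EPISODE_RE = re.compile(r'Ep\.(\d+)\s*$')


def episode_number(title):
    """Return the episode number at the end of a title, or None if absent."""
    match = EPISODE_RE.search(title)
    return int(match.group(1)) if match else None


print(episode_number('[English Version] Tobot Season1 Ep.12'))
print(episode_number('Some unrelated video title'))
```

With the DataFrame from this section, it could be applied as english_videos.title.map(episode_number).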

In [5]:

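import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
plt.style.use('ggplot')

fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(1,1,1)
# plot numeric episodes against position (view order), not the reversed index labels
ax.plot(english_videos.episode.astype(int).values, marker='o')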
ax.set_ylabel(r'episode index')
ax.set_xlabel(r'view order')
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: 'Ep.%d' % x))
ax.grid(True)
fig.suptitle('[English Version] Tobot Season1')
[Figure: episode number plotted against view order for the “[English Version] Tobot Season1” videos]
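
The "two rounds" reading of the plot can also be checked with a few lines of code. This is only a sketch of one possible heuristic, not what I used for the post: a new round is assumed to start whenever the episode number drops by more than max_step_back compared with the previous video. The function name, the threshold, and the sample sequence are all made up for illustration.

```python
def split_into_rounds(episodes, max_step_back=2):
    """Split a viewing sequence into rounds of roughly increasing episodes."""
    rounds = []
    for ep in episodes:
        # Small steps back (re-watching a recent episode) stay in the round;
        # a large drop is treated as starting the season over.
        if rounds and rounds[-1][-1] - ep <= max_step_back:
            rounds[-1].append(ep)
        else:
            rounds.append([ep])
    return rounds


# Hypothetical sequence: a full pass with one repeat, then a quicker pass.
sample = [1, 2, 3, 2, 4, 5, 1, 3, 5]
print(split_into_rounds(sample))
```

The length of each returned list then gives a rough measure of how many videos each round took.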

Question 4: how does he pick videos? Does he watch one channel after another, or multiple channels during the same period of time?

It turns out that the boy prefers to watch multiple series at the same time. The diagram at the bottom of this section shows that some channels (e.g., Tayo, which is a very simple cartoon) he only watched when he was 2 years old. Later, after he turned 3, his interests shifted to more complex cartoons, like TOBOTYOUNGTOYS.

In [6]:

top_channels = df.channel_name.value_counts().head(7)
# take the names from the index so they keep the descending-count order
top_channels_names = top_channels.index.tolist()
print(top_channels)
top_channels_df = df[df.channel_name.isin(top_channels_names)]
top_channels_df.head()
TOBOTYOUNGTOYS                79
BluCollection ToyCollector    46
DC Toys Collector             43
Tayo                          28
qihuu                         28
AniSky                        23
MADABOUTLEGO                  22
dtype: int64
view_count title channel_link channel_name thumbnail_image video_link video_time
1439 9162255 Gear Up and Go Lightning McQueen Buildable toy... /channel/UC7nGdI1YVbdj0lBqEeg8KxQ BluCollection ToyCollector https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif /watch?v=sYLLxOE5GoM 4:44
1435 5617601 Talking Mack Truck Ramp Playset NEW Cars Trans... /channel/UCqdGW_m8Rim4FeMM29keDEg DC Toys Collector https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif /watch?v=2fc8ADiMUkU 5:03
1434 5631599 Tomica Dancing Lightning Mcqueen BEAT Disney C... /channel/UC7nGdI1YVbdj0lBqEeg8KxQ BluCollection ToyCollector https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif /watch?v=qs1sl5fkn0o 3:52
1432 4223021 Cars Mack Truck Ramp Playset - Caminhão Falant... /channel/UCqdGW_m8Rim4FeMM29keDEg DC Toys Collector https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif /watch?v=yTaZDabwemk 1:19
1431 1153131 Disney CARS STAR WARS TOYS REVIEW Jedi Lightni... /channel/UC7nGdI1YVbdj0lBqEeg8KxQ BluCollection ToyCollector https://s.ytimg.com/yts/img/pixel-vfl3z5WfW.gif /watch?v=g-ylZVx98TY 4:43

In [15]:

max_index = top_channels_df.index.max()
x, y = [], []
for (index, row) in top_channels_df.iterrows():
    x.append(max_index - index)
    y.append(top_channels_names.index(row.channel_name)+1)
    
fig, ax = plt.subplots(figsize=(8, 3.5))
ax.scatter(x, y, alpha=0.5, color='orchid')

def formater(x, pos):
    ind = int(x)-1
    if ind in range(len(top_channels_names)):
        return top_channels_names[ind]
    else:
        return ''
ax.yaxis.set_major_formatter(FuncFormatter(formater))
ax.set_ylabel(r'channels')
ax.set_xlabel(r'view order')
ax.set_xlim([0,1600])
fig.suptitle('The view history overlap of top channels')
ax.grid(True)

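[Figure: scatter plot of channel against view order for the top 7 channels, titled “The view history overlap of top channels”]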
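
Besides reading the overlap off the plot, it can be summarized per channel. The sketch below uses a made-up six-row frame (the channel names and their order are placeholders) and records the first and last view position of each channel; two channels overlap in time when those ranges intersect. The view_order column and the overlaps helper are my own additions for illustration.

```python
import pandas as pd

# Hypothetical history, already sorted from oldest to newest.
toy = pd.DataFrame({
    'channel_name': ['Tayo', 'Tayo', 'TOBOT', 'Tayo', 'TOBOT', 'TOBOT'],
})
toy['view_order'] = range(len(toy))

# First position, last position and number of views for each channel.
spans = toy.groupby('channel_name')['view_order'].agg(['min', 'max', 'count'])
print(spans)


def overlaps(a, b):
    """True if channels a and b were watched during overlapping periods."""
    return bool(spans.loc[a, 'min'] <= spans.loc[b, 'max'] and
                spans.loc[b, 'min'] <= spans.loc[a, 'max'])


print(overlaps('Tayo', 'TOBOT'))
```

Since the watch time itself is not available, view position is only a stand-in for time here.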

Conclusion

Because the data provided by YouTube is incomplete (for example, the time when each video was watched is not included), many interesting questions cannot be explored. However, the data already tells me something about my child’s preferences. I will return to share more when I get more data or come up with more interesting questions.