Python 3: parse more than 30 videos at a time from YouTube

Ellis

I recently decided to get into parsing with Python, and I made up a project where I need to get data from all of a YouTuber's videos. I decided it would be easy to just go to the Videos tab on their channel and parse it for all of its links. However, when I parse it I can only get 30 videos at a time. I was wondering why this is, because the link never seems to change when you load more, and whether there is a way around it. Here is my code:

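import bs4 as bs
import requests

# NOTE: the argument below is a local file path, exactly as posted.
# requests.get() needs the URL of the channel's videos page here instead.
page = requests.get("/run/media/morpheous/PORTEUS/Workspace/Python/Parsing/parse.py")
soup = bs.BeautifulSoup(page.text, 'html.parser')
soup.find_all("a", "watch-view-count")
k = soup.find_all("div", "yt-uix-sessionlink yt-uix-tile-link  spf-link  yt-ui-ellipsis yt-ui-ellipsis-2")
# find_all() returns a list of tags, so write each tag's href on its own line
with open('data.csv', 'a') as storage:
    for tag in k:
        href = tag.get('href')
        if href:
            storage.write(href + '\n')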

Any help is appreciated, thanks

Taylan Aydinli

I should first say that I agree with @jonrsharpe. Using the YouTube API is the more sensible choice.

However, if you must do this by scraping, here's a suggestion.

Let's take MKBHD's videos page as an example. The Load more button at the bottom of the page has a button tag with this attribute (you can use your browser's 'inspect element' feature to see this value):

data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"

When you click the Load more button, it makes an AJAX request to this /browse_ajax URL. The response is a JSON object that looks like this:

{
    content_html: "the html for the videos",
    load_more_widget_html: "      \n\n\n\n    \u003cbutton class=\"yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button\" type=\"button\" onclick=\";return false;\" aria-label=\"Load more\n\" data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk03Z0JBQSUzRCUzRA%253D%253D\" data-uix-load-more-target-id=\"channels-browse-content-grid\"\u003e\u003cspan class=\"yt-uix-button-content\"\u003e  \u003cspan class=\"load-more-loading hid\"\u003e\n      \u003cspan class=\"yt-spinner\"\u003e\n      \u003cspan class=\"yt-spinner-img  yt-sprite\" title=\"Loading icon\"\u003e\u003c\/span\u003e\n\nLoading...\n  \u003c\/span\u003e\n\n  \u003c\/span\u003e\n  \u003cspan class=\"load-more-text\"\u003e\n    Load more\n\n  \u003c\/span\u003e\n\u003c\/span\u003e\u003c\/button\u003e\n\n\n"
}
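As a rough sketch of handling one such response, the snippet below parses the JSON body and collects video links from content_html. It uses the standard library's HTMLParser rather than BeautifulSoup to stay dependency-free. The names WatchLinkCollector, video_links_from_response and sample_body are made up for illustration, and it assumes video links appear as a tags whose href starts with /watch, which the response above does not confirm.

```python
import json
from html.parser import HTMLParser


class WatchLinkCollector(HTMLParser):
    """Collects href values of <a> tags that point at /watch URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: video links look like /watch?v=<id>.
        if href and href.startswith("/watch") and href not in self.links:
            self.links.append(href)


def video_links_from_response(body):
    """Parse one /browse_ajax JSON body and return the video links in it."""
    data = json.loads(body)
    collector = WatchLinkCollector()
    collector.feed(data.get("content_html", ""))
    return collector.links


# Hypothetical response body, invented for this example.
sample_body = json.dumps({
    "content_html": '<li><a href="/watch?v=abc123">First</a></li>'
                    '<li><a href="/watch?v=def456">Second</a></li>',
})
print(video_links_from_response(sample_body))
```

In real use, body would be the text of the response to the /browse_ajax request.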

The content_html value contains the HTML for the new page of videos. You can parse that to get the videos on that page. To get to the next page, you need to use the load_more_widget_html value and extract the URL, which again looks like:

data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
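One possible way to pull that attribute out of load_more_widget_html is sketched below, again with the standard library's HTMLParser instead of BeautifulSoup. LoadMoreHrefFinder, next_page_url and the TOKEN123 continuation value are hypothetical names and data used only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoadMoreHrefFinder(HTMLParser):
    """Remembers the first data-uix-load-more-href attribute it sees."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if self.href is None:
            # HTMLParser already unescapes &amp; to & in attribute values.
            self.href = dict(attrs).get("data-uix-load-more-href")


def next_page_url(widget_html, base="https://www.youtube.com"):
    """Return the absolute URL of the next page, or None if there is none."""
    finder = LoadMoreHrefFinder()
    finder.feed(widget_html or "")
    if finder.href is None:
        return None
    return urljoin(base, finder.href)


# Shortened, made-up widget HTML with a placeholder continuation token.
sample_widget = (
    '<button class="load-more-button" type="button" '
    'data-uix-load-more-href="/browse_ajax?action_continuation=1'
    '&amp;continuation=TOKEN123">Load more</button>'
)
print(next_page_url(sample_widget))
```

Returning None when the attribute is missing gives the caller a simple signal that there are no more pages.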

The only thing in that URL that changes is the value of the continuation parameter. You can keep making requests to this 'continuation' URL until the returned JSON object no longer contains load_more_widget_html.
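The loop described here could look roughly like the sketch below. To keep it runnable without network access, the HTTP call is passed in as a fetch function. iter_pages, fetch and fake_pages are hypothetical names, and the fake responses are invented for the demonstration.

```python
import html
import json
import re

# Matches the continuation path inside load_more_widget_html.
HREF_RE = re.compile(r'data-uix-load-more-href="([^"]+)"')


def iter_pages(first_path, fetch):
    """Yield content_html for each page until no load-more widget is returned.

    fetch is any callable that takes a path and returns the JSON body as text.
    """
    path = first_path
    while path:
        data = json.loads(fetch(path))
        yield data.get("content_html", "")
        widget = data.get("load_more_widget_html")
        if not widget:
            break  # last page: nothing more to load
        match = HREF_RE.search(widget)
        # The attribute value is HTML-escaped (&amp;), so unescape it.
        path = html.unescape(match.group(1)) if match else None


# Invented two-page example standing in for real responses.
fake_pages = {
    "/browse_ajax?continuation=A": json.dumps({
        "content_html": "page 1",
        "load_more_widget_html":
            '<button data-uix-load-more-href="/browse_ajax?continuation=B">'
            "</button>",
    }),
    "/browse_ajax?continuation=B": json.dumps({"content_html": "page 2"}),
}
pages = list(iter_pages("/browse_ajax?continuation=A", fake_pages.__getitem__))
print(pages)
```

Against the live site, fetch might be something like lambda path: requests.get("https://www.youtube.com" + path).text, with the first path taken from the Load more button on the channel's videos page.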

Edited at 2021-02-23