I recently decided to get into parsing with Python. I made up a project where I need to get data from all of a YouTuber's videos. I decided it would be easy to just go to the videos tab of their channel and parse it for all of its links. However, when I do parse it, I can only get 30 videos at a time. I was wondering why this is, because the link never seems to change when you load more, and whether there is a way around it. Here is my code:
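import bs4 as bs
import requests

# NOTE: requests.get() needs an http(s) URL (the channel's videos page); a local path like this one will not work
page = requests.get("/run/media/morpheous/PORTEUS/Workspace/Python/Parsing/parse.py")
soup = bs.BeautifulSoup(page.text, 'html.parser')
soup.find_all("a", "watch-view-count")
# find_all() returns a list of tags, so read href from each tag rather than from the list
k = soup.find_all("div", "yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2")
with open('data.csv', 'a') as storage:
    for tag in k:
        href = tag.get('href')
        if href:
            storage.write(href + '\n')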
Any help is appreciated, thanks
I should first say that I agree with @jonrsharpe. Using the YouTube API is the more sensible choice.
However, if you must do this by scraping, here's a suggestion.
Let's take MKBHD's videos page as an example. The Load more button at the bottom of the page is a button tag with this attribute (you can use your browser's 'inspect element' feature to see this value):
data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
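If you'd rather pull that attribute out in code than with the inspector, here is a minimal sketch. It uses only the standard library's html.parser so it runs without extra installs (BeautifulSoup would work just as well). The names LoadMoreHrefParser and find_load_more_href and the shortened sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LoadMoreHrefParser(HTMLParser):
    """Remembers data-uix-load-more-href from the first <button> that carries it."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "button" and self.href is None:
            # attrs is a list of (name, value) pairs; entities such as &amp;
            # in attribute values have already been unescaped by the parser.
            self.href = dict(attrs).get("data-uix-load-more-href")


def find_load_more_href(html_text):
    """Return the load-more URL found in html_text, or None if there is none."""
    parser = LoadMoreHrefParser()
    parser.feed(html_text)
    return parser.href


# Hypothetical, shortened markup standing in for the real videos page.
sample = (
    '<button class="load-more-button" '
    'data-uix-load-more-href="/browse_ajax?action_continuation=1&amp;continuation=ABC">'
    'Load more</button>'
)
print(find_load_more_href(sample))
```

The value is a relative URL, so it still has to be joined with the site's base URL before it can be requested.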
When you click the Load more button, it makes an AJAX request to that /browse_ajax URL. The response is a JSON object that looks like this:
{
content_html: "the html for the videos",
load_more_widget_html: " \n\n\n\n \u003cbutton class=\"yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button\" type=\"button\" onclick=\";return false;\" aria-label=\"Load more\n\" data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk03Z0JBQSUzRCUzRA%253D%253D\" data-uix-load-more-target-id=\"channels-browse-content-grid\"\u003e\u003cspan class=\"yt-uix-button-content\"\u003e \u003cspan class=\"load-more-loading hid\"\u003e\n \u003cspan class=\"yt-spinner\"\u003e\n \u003cspan class=\"yt-spinner-img yt-sprite\" title=\"Loading icon\"\u003e\u003c\/span\u003e\n\nLoading...\n \u003c\/span\u003e\n\n \u003c\/span\u003e\n \u003cspan class=\"load-more-text\"\u003e\n Load more\n\n \u003c\/span\u003e\n\u003c\/span\u003e\u003c\/button\u003e\n\n\n"
}
The content_html value contains the HTML for the new page of videos. You can parse that to get the videos in that page. To get to the next page, you need to take the load_more_widget_html value and extract the URL from it, which again looks like:
data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
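As a rough sketch of handling one such response: decode the JSON, collect the /watch links from content_html, and pull the next URL out of load_more_widget_html. The sample body below is a shortened, made-up stand-in for a real response, and parse_page is a hypothetical helper name. Note that after JSON decoding the URL still contains &amp;, so it needs an HTML unescape as well.

```python
import json
import re
from html import unescape

# Hypothetical, shortened response body; a real one carries full HTML in both fields.
raw = r'''{
  "content_html": "<a href=\"/watch?v=VIDEO1\">one</a><a href=\"/watch?v=VIDEO2\">two</a>",
  "load_more_widget_html": "<button data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=NEXT\">Load more</button>"
}'''


def parse_page(raw_json):
    """Return (video_hrefs, next_href) for one /browse_ajax response body."""
    data = json.loads(raw_json)
    found = re.findall(r'href="(/watch\?v=[^"&]+)', data.get("content_html", ""))
    # The same video may be linked more than once; keep first occurrences in order.
    video_hrefs = list(dict.fromkeys(found))
    widget = data.get("load_more_widget_html") or ""
    match = re.search(r'data-uix-load-more-href="([^"]+)"', widget)
    # No widget means this was the last page.
    next_href = unescape(match.group(1)) if match else None
    return video_hrefs, next_href


videos, next_href = parse_page(raw)
print(videos)
print(next_href)
```

A regular expression is enough for this small sketch; for anything beyond grabbing hrefs, a real HTML parser such as BeautifulSoup is the safer choice.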
The only thing that changes in the load-more URL is the value of the continuation parameter. You can keep making requests to this 'continuation' URL until the returned JSON object no longer has a load_more_widget_html value.
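Putting the loop together, a sketch could look like the following. It assumes the response shape described above. BASE, fetch_text and iter_video_hrefs are names made up for this example, and whether a bare GET without extra headers or cookies is enough for the real endpoint is also an assumption. The fetch function is passed in as a parameter so the loop can be tried offline with canned pages, as the demo at the bottom does.

```python
import json
import re
from html import unescape
from urllib.parse import urljoin
from urllib.request import urlopen

BASE = "https://www.youtube.com"  # assumed base for the relative /browse_ajax URLs


def fetch_text(url):
    """Default fetcher: plain GET, body decoded as UTF-8."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def iter_video_hrefs(first_href, fetch=fetch_text, max_pages=50):
    """Yield /watch hrefs page by page until no load-more widget comes back."""
    href = first_href
    seen = set()
    for _ in range(max_pages):  # hard cap so unexpected markup cannot loop forever
        if not href:
            break
        data = json.loads(fetch(urljoin(BASE, href)))
        for link in re.findall(r'href="(/watch\?v=[^"&]+)', data.get("content_html", "")):
            if link not in seen:
                seen.add(link)
                yield link
        widget = data.get("load_more_widget_html") or ""
        match = re.search(r'data-uix-load-more-href="([^"]+)"', widget)
        href = unescape(match.group(1)) if match else None


# Offline demo: two canned pages instead of real network calls.
fake_pages = {
    BASE + "/browse_ajax?action_continuation=1&continuation=P1": json.dumps({
        "content_html": '<a href="/watch?v=A"></a><a href="/watch?v=B"></a>',
        "load_more_widget_html": '<button data-uix-load-more-href='
                                 '"/browse_ajax?action_continuation=1&amp;continuation=P2"></button>',
    }),
    BASE + "/browse_ajax?action_continuation=1&continuation=P2": json.dumps({
        "content_html": '<a href="/watch?v=C"></a>',
    }),
}
result = list(iter_video_hrefs(
    "/browse_ajax?action_continuation=1&continuation=P1",
    fetch=fake_pages.__getitem__,
))
print(result)
```

For a real run, drop the fetch argument and pass in the href taken from the Load more button on the channel's videos page; adding a short pause between requests would be polite.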