Cannot Write Web Crawler in Python

Adam_G

I'm having an issue writing a basic web crawler. I'd like to write the raw HTML of about 500 pages to files. The problem is that my search is either too broad or too narrow: it either goes too deep and never gets past the first loop, or it doesn't go deep enough and returns nothing.

I've tried playing around with the limit= parameter in find_all(), but I'm not having any luck with that.

Any advice would be appreciated.

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()

            with open(page.replace("/", "_") + ".txt", "a+") as f:
                f.write(s)
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True,limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')

1.618

I modified your function so that it doesn't write to a file and just prints the URLs.
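Concretely, the modification looks roughly like this (same imports as your script; the only change is swapping the file write for a print, and I've renamed it print_crawler here just to keep it separate):

def print_crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            print page  # print the URL instead of writing the page to a file
            s = urlopen(page).read()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True, limit=5):
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

This is what I got: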

http://www.nytimes.com/
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/

So it looks like your code would work, but there's a redirect loop. Maybe try rewriting this as a recursive function so you're doing a depth-first search instead of a breadth-first search, which I'm pretty sure is what's happening now.
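
Also, about the limit= you were playing with: as far as I know it only caps how many matching tags find_all returns from a single page; it doesn't control how deep the crawl goes, so it won't fix this on its own. A quick way to see that (just a toy soup, not your actual page):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='x'>1</a>" * 10)
print len(soup.find_all('a', href=True, limit=5))  # prints 5: limit caps results per page, not crawl depth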

EDIT: here's a recursive function:

def recursive_crawler(url, crawled):
    # stop once 500 URLs have been collected
    if len(crawled) >= 500:
        return
    print url
    page_source = urlopen(url)
    s = page_source.read()

    #write to file here, if desired

    soup = BeautifulSoup(s)
    for link in soup.find_all('a', href=True):
        a = link['href']
        if a != url and a.startswith("http") and a not in crawled:
            crawled.add(a)
            recursive_crawler(a, crawled)

Pass it an empty set for crawled:

c = set()
recursive_crawler('http://www.nytimes.com', c)

output (I interrupted it after a few seconds):

http://www.nytimes.com
http://www.nytimes.com/content/help/site/ie8-support.html
http://international.nytimes.com
http://cn.nytimes.com
http://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
http://www.nytimes.com/video
http://www.nytimes.com/pages/world/index.html
http://www.nytimes.com/pages/national/index.html
http://www.nytimes.com/pages/politics/index.html
http://www.nytimes.com/pages/nyregion/index.html
http://www.nytimes.com/pages/business/index.html

Thanks to whoever it was who suggested using an already_crawled set.
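
For completeness, here's a rough sketch of how the already_crawled idea could be bolted onto your original loop-based version instead, so it stays iterative and still writes each page to a file. The name crawl_with_visited and the max_pages argument are just mine for illustration:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawl_with_visited(seed_url, max_pages=500):
    to_crawl = [seed_url]
    visited = set()
    while to_crawl and len(visited) < max_pages:
        page = to_crawl.pop()
        # skip non-http links and anything already fetched
        if not page.startswith("http") or page in visited:
            continue
        visited.add(page)
        s = urlopen(page).read()
        # each page is fetched at most once, so plain "w" mode is enough
        with open(page.replace("/", "_") + ".txt", "w") as f:
            f.write(s)
        soup = BeautifulSoup(s)
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in visited:
                to_crawl.append(a)

crawl_with_visited('http://www.nytimes.com/')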
