Cannot Write Web Crawler in Python

Adam_G

I'm having an issue writing a basic web crawler. I'd like to write the raw HTML of about 500 pages to files. The problem is that my search is either too broad or too narrow: it either goes too deep and never gets past the first loop, or it doesn't go deep enough and returns nothing.

I've tried playing around with the limit= parameter in find_all(), but I'm not having any luck with that.

Any advice would be appreciated.

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()

            with open(page.replace("/", "_") + ".txt", "a+") as f:
                f.write(s)
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True,limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')

1.618

I modified your function so that it doesn't write to a file and just prints the URLs.
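Concretely, the modification looks roughly like this (same imports as your script; the only change is swapping the file write for a print, and I've renamed it print_crawler here just to keep it separate):

def print_crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            print page  # print the URL instead of writing the page to a file
            s = urlopen(page).read()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True, limit=5):
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

This is what I got: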

http://www.nytimes.com/
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/

So it looks like your code would work, but there's a redirect loop. Maybe try rewriting this as a recursive function so you're doing a depth-first search instead of a breadth-first search, which I'm pretty sure is what's happening now.
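
Also, about the limit= you were playing with: as far as I know it only caps how many matching tags find_all returns from a single page; it doesn't control how deep the crawl goes, so it won't fix this on its own. A quick way to see that (just a toy soup, not your actual page):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='x'>1</a>" * 10)
print len(soup.find_all('a', href=True, limit=5))  # prints 5: limit caps results per page, not crawl depth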

EDIT: here's a recursive function:

def recursive_crawler(url, crawled):
    # stop once 500 URLs have been collected
    if len(crawled) >= 500:
        return
    print url
    page_source = urlopen(url)
    s = page_source.read()

    #write to file here, if desired

    soup = BeautifulSoup(s)
    for link in soup.find_all('a', href=True):
        a = link['href']
        if a != url and a.startswith("http") and a not in crawled:
            crawled.add(a)
            recursive_crawler(a, crawled)

Pass it an empty set for crawled:

c = set()
recursive_crawler('http://www.nytimes.com', c)

output (I interrupted it after a few seconds):

http://www.nytimes.com
http://www.nytimes.com/content/help/site/ie8-support.html
http://international.nytimes.com
http://cn.nytimes.com
http://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
http://www.nytimes.com/video
http://www.nytimes.com/pages/world/index.html
http://www.nytimes.com/pages/national/index.html
http://www.nytimes.com/pages/politics/index.html
http://www.nytimes.com/pages/nyregion/index.html
http://www.nytimes.com/pages/business/index.html

Thanks to whoever it was who suggested using an already_crawled set.
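
For completeness, here's a rough sketch of how the already_crawled idea could be bolted onto your original loop-based version instead, so it stays iterative and still writes each page to a file. The name crawl_with_visited and the max_pages argument are just mine for illustration:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawl_with_visited(seed_url, max_pages=500):
    to_crawl = [seed_url]
    visited = set()
    while to_crawl and len(visited) < max_pages:
        page = to_crawl.pop()
        # skip non-http links and anything already fetched
        if not page.startswith("http") or page in visited:
            continue
        visited.add(page)
        s = urlopen(page).read()
        # each page is fetched at most once, so plain "w" mode is enough
        with open(page.replace("/", "_") + ".txt", "w") as f:
            f.write(s)
        soup = BeautifulSoup(s)
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in visited:
                to_crawl.append(a)

crawl_with_visited('http://www.nytimes.com/')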
