Python: Using a wildcard inside of strings

user3667623

I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am running into a logic error that I cannot figure out. Essentially, I want to take the top 100 movies and write the data to a CSV file.

I am currently using html from this site for testing (Other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm

There's a lot of code, but this is the main part I am struggling with. The code block looks like this:

def grab_yearly_data(self, page, year):
    # page is the downloaded HTML of the chart page, year in this case is 2014.

    rank_pattern = r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern = r'.htm">*?</a></font></b></td>' # Testing

    self.rank = [g for g in re.findall(rank_pattern, page)]
    self.mov_title = [g for g in re.findall(mov_title_pattern, page)]

self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements containing the movie titles, but instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.

alecxe

Just don't attempt to parse HTML with regex - it will save you time and, most importantly, hair, and it will make your life easier.
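For what it's worth, here is a minimal sketch of why the original pattern returns empty strings, using a simplified HTML fragment (an assumption, not the exact markup from the page): the capturing group is repeated with the lazy quantifier *?, so the engine prefers zero repetitions and the capture comes back empty.

import re

# Simplified title cell, for illustration only - the real page's markup differs.
snippet = ('<td><b><font size="2">'
           '<a href="/movies/?id=guardiansofthegalaxy.htm">Guardians of the Galaxy</a>'
           '</font></b></td>')

# Original pattern: the lazily repeated group matches zero times,
# so findall reports an empty capture for every hit.
print(re.findall(r'(.htm">[A-Z])*?</a></font></b></td>', snippet))
# ['']

# A pattern that captures the title text itself (still brittle compared to a parser):
print(re.findall(r'\.htm">(.*?)</a></font></b></td>', snippet))
# ['Guardians of the Galaxy']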

Here is a solution using BeautifulSoup HTML parser:

from bs4 import BeautifulSoup
import requests

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# The chart rows sit in a borderless table inside a td spanning 3 columns;
# the [1:-3] slice drops the header row and the summary rows at the bottom.
for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:  # skip rows that are not data rows
        continue

    rank = cells[0].text
    title = cells[1].text
    print(rank, title)

Prints:

1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below

The expression inside the select() call is a CSS selector - a convenient and powerful way of locating elements. Since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice eliminates the header and total rows.


For this page, to get to the table you can also rely on the chart_container element and get its next table sibling:

for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
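Since the end goal is a CSV of the top 100, here is a minimal sketch of writing the scraped rows out with the csv module, building on the chart_container approach above (the output filename and column names are assumptions):

import csv

import requests
from bs4 import BeautifulSoup

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Same row selection as above: the table following the chart_container div,
# minus the header and total rows.
rows = soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]

with open('top_movies_2014.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'title'])  # assumed column names
    for row in rows:
        cells = row.find_all('td')
        if len(cells) < 2:
            continue
        writer.writerow([cells[0].text, cells[1].text])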
