I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am running into a logical error that I cannot figure out. Essentially I want to take the top 100 movies and write the data to a CSV file.
I am currently using the HTML from this page for testing (other years have the same layout): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm
There's a lot of code, but this is the main part I am struggling with:
def grab_yearly_data(self, page, year):
    # page is the downloaded HTML; year in this case is 2014.
    rank_pattern = r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern = r'.htm">*?</a></font></b></td>' # Testing
    self.rank = re.findall(rank_pattern, page)
    self.mov_title = re.findall(mov_title_pattern, page)
self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements containing the movie titles, but instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.
Just don't attempt to parse HTML with regex - it will save you time and, most importantly, hair, and will make your life easier.
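Incidentally, the empty strings come from the quantifier placement: in mov_title_pattern the lazy *? is applied to the capturing group itself, so the group is allowed to repeat zero times, and re.findall reports an unmatched group as ''. A minimal sketch (with a made-up pattern and string, not your actual HTML) shows the same behavior, and how capturing the text itself instead of quantifying the group avoids it:

```python
import re

# The quantified group can match zero times; findall then yields '' for it.
print(re.findall(r'(<b>\w)*?</b>', '<b>Title</b>'))   # ['']

# Capture the text you want directly instead of quantifying the group.
print(re.findall(r'<b>(\w+)</b>', '<b>Title</b>'))    # ['Title']
```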
Here is a solution using the BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
import requests

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # explicit parser avoids a warning

for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue
    rank = cells[0].text
    title = cells[1].text
    print(rank, title)
Prints:
1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below
The expression inside the select() call is a CSS selector - a convenient and powerful way of locating elements. But since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice is here to eliminate the header and total rows.
For this page, you can also get to the table by locating the chart_container element and taking its next table sibling:

for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
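Since the original goal was to write the top-100 data to a CSV file, the rows collected by either loop can be saved with the stdlib csv module. A minimal sketch - the filename, header names, and the hard-coded sample rows are assumptions standing in for the scraped data:

```python
import csv

# Stand-in rows; in practice these would be the (rank, title) pairs
# collected in the scraping loop above.
rows = [("1", "Guardians of the Galaxy"),
        ("2", "The Hunger Games: Mockingjay - Part 1")]

# Filename is an assumption; newline='' prevents blank lines on Windows.
with open("boxoffice_2014.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title"])  # assumed header names
    writer.writerows(rows)
```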