Python: Using a wildcard inside of strings

user3667623

I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am running into a logic error that I cannot figure out. Essentially, I want to take the top 100 movies and write the data to a CSV file.

I am currently using html from this site for testing (Other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm

There's a lot of code, but this is the main part I am struggling with. The code block looks like this:

def grab_yearly_data(self, page, year):
    # page is the downloaded HTML of the chart page, year in this case is 2014.

    rank_pattern = r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern = r'.htm">*?</a></font></b></td>' # Testing

    self.rank = [g for g in re.findall(rank_pattern, page)]
    self.mov_title = [g for g in re.findall(mov_title_pattern, page)]

self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list of 102 elements containing the movie titles, but instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.

alecxe

Just don't attempt to parse HTML with regex - it will save you time and, most importantly, hair, and it will make your life easier.
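For what it's worth, here is a minimal sketch of why the original pattern returns empty strings, using a simplified HTML fragment (an assumption, not the exact markup from the page): the capturing group is repeated with the lazy quantifier *?, so the engine prefers zero repetitions and the capture comes back empty.

import re

# Simplified title cell, for illustration only - the real page's markup differs.
snippet = ('<td><b><font size="2">'
           '<a href="/movies/?id=guardiansofthegalaxy.htm">Guardians of the Galaxy</a>'
           '</font></b></td>')

# Original pattern: the lazily repeated group matches zero times,
# so findall reports an empty capture for every hit.
print(re.findall(r'(.htm">[A-Z])*?</a></font></b></td>', snippet))
# ['']

# A pattern that captures the title text itself (still brittle compared to a parser):
print(re.findall(r'\.htm">(.*?)</a></font></b></td>', snippet))
# ['Guardians of the Galaxy']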

Here is a solution using BeautifulSoup HTML parser:

from bs4 import BeautifulSoup
import requests

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# The chart rows sit in a borderless table inside a td spanning 3 columns;
# the [1:-3] slice drops the header row and the summary rows at the bottom.
for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:  # skip rows that are not data rows
        continue

    rank = cells[0].text
    title = cells[1].text
    print(rank, title)

Prints:

1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below

The expression inside the select() call is a CSS selector - a convenient and powerful way of locating elements. Since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice eliminates the header and total rows.


For this page, to get to the table you can also rely on the chart_container element and get its next table sibling:

for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
    ...
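Since the end goal is a CSV of the top 100, here is a minimal sketch of writing the scraped rows out with the csv module, building on the chart_container approach above (the output filename and column names are assumptions):

import csv

import requests
from bs4 import BeautifulSoup

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Same row selection as above: the table following the chart_container div,
# minus the header and total rows.
rows = soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]

with open('top_movies_2014.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'title'])  # assumed column names
    for row in rows:
        cells = row.find_all('td')
        if len(cells) < 2:
            continue
        writer.writerow([cells[0].text, cells[1].text])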
