Parsing for text under specific tags in HTML, Python

Jackson Blankenship Published at Dev

Jackson Blankenship

How can I find all the text on a page that meets the following criteria using Beautiful Soup?

<tr>
    <td class="d_g_l_e" style="border-right:none;">
        <img src="/d2l/img/LP/pixel.gif" width="20" height="20" alt="">
    </td>
    <th scope="row" class="d_gt d_ich" style="border-left:none;">
        <div class="dco">
            <div class="dco_c">
                <div class="dco">
                    <div class="dco_c">
                        <strong> **EXTRACT THIS (NAME)** </strong>
                    </div>
                </div>
            </div>
        </div>
    </th>
<td class="d_gn d_gr d_gt">
    <div class="dco">
        <div class="dco_c">
            <div class="dco">
                <div class="dco_c" style="text-align:right;">
                    <div style="text-align:center;display:inline;">
                        <label id="z_c"> **EXTRACT THIS (GRADE)** </label>
                    </div>
                </div>
            </div>
        </div>
    </div>
</td>
<td class="d_gn d_gr d_gt">&nbsp;</td>
</tr>

I want the program to scan the whole HTML page and collect all of the values that appear in this form. If a "tr" tag (the main tag I'm looking for) has both a NAME and a GRADE underneath it, add the name to one list (List1) and the grade to a separate list (List2). If either of the two is missing underneath the "tr" tag, skip that row and don't record anything. By the time the script is done scanning the page, the lists would look something like:

List1 = [Grade 1, Grade 2, Grade 3, Grade 4]
List2 = [10/20, 20/40, 50/50, 33/44]

Also, the "z" label ID for the grade text changes from grade to grade, e.g. z_a, z_b, z_c.

alecxe

For each tr on the page, find the strong tag inside the th and the label tag inside the td. Selecting by tag name means the changing label IDs don't matter:

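from bs4 import BeautifulSoup

# `data` is the HTML source of the page, as a string
soup = BeautifulSoup(data, 'html.parser')

for row in soup.find_all('tr'):
    name = row.select('th strong')   # the name is in the <strong> under <th>
    grade = row.select('td label')   # the grade is in the <label> under <td>
    if name and grade:               # skip rows missing either one
        print(name[0].text.strip(), grade[0].text.strip())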
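
To end up with the two lists the question asks for instead of printed output, a minimal sketch might look like the following. The function name collect_grades and the sample markup are made up for illustration, and the regex assumes every grade label's id starts with "z_", as the question describes.

```python
import re
from bs4 import BeautifulSoup

def collect_grades(html):
    """Return (names, grades) for rows that contain both a name and a grade."""
    soup = BeautifulSoup(html, "html.parser")
    names, grades = [], []
    for row in soup.find_all("tr"):
        name_tag = row.select_one("th strong")
        # Label ids vary (z_a, z_b, ...), so match on the assumed "z_" prefix.
        grade_tag = row.find("label", id=re.compile(r"^z_"))
        # Record the row only when both pieces are present.
        if name_tag is not None and grade_tag is not None:
            names.append(name_tag.get_text(strip=True))
            grades.append(grade_tag.get_text(strip=True))
    return names, grades

# Hypothetical sample: the middle row has no grade label and is skipped.
sample = """
<table>
<tr><th scope="row"><div><strong> Grade 1 </strong></div></th>
    <td><div><label id="z_a"> 10/20 </label></div></td></tr>
<tr><th scope="row"><strong> No grade yet </strong></th><td>&nbsp;</td></tr>
<tr><th scope="row"><strong> Grade 2 </strong></th>
    <td><label id="z_b"> 20/40 </label></td></tr>
</table>
"""
names, grades = collect_grades(sample)
print(names)   # ['Grade 1', 'Grade 2']
print(grades)  # ['10/20', '20/40']
```

Because both values are appended in the same branch, names[i] and grades[i] always refer to the same row.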
