Combine items in a list until an item containing specific text is found?

GreenRaccoon23

This is going to be hard to explain.

I'm fetching some webpages with BeautifulSoup, and I'm trying to organize them into a list. I'm fetching only the elements on the page that have the class "text". Like this:

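import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get(url, verify=True)
# Parse only the <p> elements, then collect the spans with class "text" or "indent-1".
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))
filtered_soup = soup.find_all("span", {"class":["text",
                                                "indent-1"]})
line_list = [line for line in filtered_soup]
#text_list = [line.get_text() for line in filtered_soup]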

This works great, but I'd also like to combine some of the items in the list. On the webpage, some of the items with class="text..." also have id="en...". They technically SHOULD be the parents of the other class="text..." elements, but the webpage has not been set up this way.

In my "line_list" list, there is an item with both the class="text..." and id="en..." attributes, then a few items with only class="text...", then another item with both attributes, and this pattern keeps repeating. Here's a way to think of it:

line_list = [A, a, a, a, B, b, b, C, c, c, c, c]

Now here's the hard part to explain. Let's say line_list[0] has both attributes, line_list[1] through line_list[3] only have the "class" attribute, and line_list[4] has both attributes again. I'd like to iterate through line_list and combine the items into a single string. But when the iteration hits an item containing both "id" and "class" (i.e. line_list[4]), I'd like it to start creating a new string.

Or, if someone can think of a better way to do this, that'd be awesome. I was going to try to do this:

line_string = ''.join(line_list)
split_list = line_string.split('id="en')

But the join command complains that line_list contains tags, not strings.

I wonder if it'd be easier to do this with a dictionary? For example, make the elements that have both "class" and "id" the keys and the elements that only have "class" their values. It'd look like this:

line_dic = {A: [a, a, a], B: [b, b], C: [c, c, c, c]}
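
A minimal sketch of that dictionary idea, using plain strings in place of the bs4 tags so it runs on its own. The helper name group_by_head and the use of str.isupper as the "has an id" test are assumptions made for illustration only; with real tags the test would check the id attribute instead, and keying the dictionary by the id string rather than by the tag itself is probably safer.

```python
def group_by_head(sequence, is_head):
    """Map each head item to the list of non-head items that follow it."""
    groups = {}
    current_key = None
    for item in sequence:
        if is_head(item):
            # A head starts a new group; later items attach to it.
            current_key = item
            groups[current_key] = []
        elif current_key is not None:
            groups[current_key].append(item)
        # Items that appear before the first head are skipped.
    return groups


# Stand-in data: uppercase letters play the role of the id="en..." items.
letters = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']
line_dic = group_by_head(letters, str.isupper)
print(line_dic)
```

One caveat with this layout: if two head items compare equal, the second one replaces the first in the dictionary, so a list of groups may be a better fit when heads are not guaranteed to be unique.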

Here's example html if anyone would like to play with it:

line_list = [<span class="text 1" id="en-13987">A<span class="small-caps" style="font-variant: small-caps">A</span></span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 1">a</span></span>,
             <span class="text 1">a</span>,
             <span class="text 2" id="en-13988">B<span class="small-caps" style="font-variant: small-caps">B</span>B</span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span></span>,
             <span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span>,
             <span class="text 3" id="en-13989">C</span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span></span>,
             <span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span>,]

Awesome ideas, guys. Thanks a ton!

Alex Martelli

Not a cool one-liner, but, the following should work...:

text_list = []
current = []
for line in line_list:
    if line.get('id', '').startswith('en'):
        if current:
            text_list.append(' '.join(current))
            current = []
    current.append(line.text)
if current:
    text_list.append(' '.join(current))

For example, adding this code after a sample test-start of

import bs4

content = '''
<span class='text' class='indent-1' id='en00'>And one</span>
<span class='text' class='indent-1'>And two</span>
<span class='text' class='indent-1'>And three</span>
<span class='text' class='indent-1' id='en01'>And four</span>
<span class='text' class='indent-1'>And five</span>
'''

soup = bs4.BeautifulSoup(content, 'html.parser')
filtered_soup = soup.find_all("span", {"class":["text", "indent-1"]})
line_list = [line for line in filtered_soup]

a loop such as for x in text_list: print(x) will display

And one And two And three
And four And five

which seems to match the desired result.

Added: here's an arguably better-factored solution, which does however end up being more verbose:

def has_id_en(elem):
    return elem.get('id', '').startswith('en')

def segment(sequence, is_head):
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
                current = []
        current.append(x)
    if current:
        yield current

text_list = [' '.join(e.text for e in bunch)
             for bunch in segment(line_list, has_id_en)]

At least, this way, the segment logic is reusable for similar tasks where the items in the sequence need not be bs4 objects, and/or the way to determine whether an item needs to "head" a subsequence is different than in this specific problem.
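
To illustrate that reusability, here is a small sketch that runs the same segment generator over plain strings shaped like the [A, a, a, a, B, b, b, ...] pattern from the question. Using str.isupper as the head test is an assumption made purely for the demo; nothing here depends on bs4.

```python
def segment(sequence, is_head):
    """Yield consecutive sublists, starting a new one at each head item."""
    current = []
    for x in sequence:
        if is_head(x) and current:
            yield current
            current = []
        current.append(x)
    if current:
        yield current


# Stand-in data: uppercase letters act as the "head" of each group.
letters = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']

groups = list(segment(letters, str.isupper))
joined = [' '.join(group) for group in groups]

for text in joined:
    print(text)
```

Any items that come before the first head end up in a leading group of their own, which may or may not be what a given caller wants.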
