Combine items in a list until an item containing specific text is found?

GreenRaccoon23

This is going to be hard to explain.

I'm fetching some webpages with BeautifulSoup, and I'm trying to organize them into a list. I'm fetching only the elements on the page that have the class "text". Like this:

import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get(url, verify=True)
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))
filtered_soup = soup.find_all("span", {"class":["text",
                                                "indent-1"]})
line_list = [line for line in filtered_soup]
#text_list = [line.get_text() for line in filtered_soup]

This works great, but I'd also like to combine some of the items in the list. On the webpage, some of the items with class="text..." also have id="en...". They technically SHOULD be the parents of the other class="text..." elements, but the webpage has not been set up this way.

In my "line_list" list, there is an item with both the class="text..." and id="en..." attributes, then a few items with only class="text...", then another item with both attributes, and this pattern keeps repeating. Here's a way to think of it:

line_list = [A, a, a, a, B, b, b, C, c, c, c, c]

Now here's the hard part to explain. Let's say line_list[0] has both attributes, line_list[1] through line_list[3] have only the "class" attribute, and line_list[4] has both attributes again. I'd like to iterate through line_list and combine the items into a single string. But when the iteration hits an item containing both "id" and "class" (i.e. line_list[4]), I'd like it to start creating a new string.

Or, if someone can think of a better way to do this, that'd be awesome. I was going to try to do this:

line_string = ''.join(line_list)
split_list = line_string.split('id="en')

But the join command complains that line_list contains tags, not strings.
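For what it's worth, str.join only accepts strings, which is why the call above fails. Here is a minimal sketch of the join-then-split idea, assuming each item is converted with str() first. FakeTag is a hypothetical stand-in for a bs4 Tag (an object that is not a str but has a string form), so the snippet runs without fetching anything:

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: not a str, but convertible with str()."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


line_list = [
    FakeTag('<span class="text" id="en-1">A</span>'),
    FakeTag('<span class="text">a</span>'),
    FakeTag('<span class="text" id="en-2">B</span>'),
]

# Joining the objects directly raises TypeError because they are not str.
try:
    ''.join(line_list)
    join_failed = False
except TypeError:
    join_failed = True

# Converting each item to a string first makes the join work.
line_string = ''.join(str(item) for item in line_list)
split_list = line_string.split('id="en')
```

Note that splitting on 'id="en' cuts through the middle of a tag, so the pieces in split_list are broken markup fragments rather than clean text. That is one reason grouping the items themselves may be the easier route.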

I wonder if it'd be easier to do this with a dictionary? For example, make the elements that have both "class" and "id" the keys and the elements that only have "class" their values. It'd look like this:

line_dic = {A: [a, a, a], B: [b, b], C: [c, c, c, c]}
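The dictionary idea could be sketched roughly as follows. This is only an illustration on plain strings: group_under_heads, is_head and key are made-up names, and uppercase letters stand in for the items that carry an id. With real bs4 tags, is_head might check whether tag.get('id', '') starts with 'en', and key might return tag['id'].

```python
def group_under_heads(items, is_head, key=lambda item: item):
    """Map each head item's key to the list of non-head items that follow it."""
    groups = {}
    current = None
    for item in items:
        if is_head(item):
            # Start (or reuse, if the key repeats) the list for this head.
            current = groups.setdefault(key(item), [])
        elif current is not None:
            current.append(item)
        # Items that appear before the first head are skipped.
    return groups


line_list = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']
line_dic = group_under_heads(line_list, str.isupper)
```

One caveat with this layout: if two heads produce the same key, their followers end up in one list, so the key should be something unique such as the id value.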

Here's example html if anyone would like to play with it:

line_list = [<span class="text 1" id="en-13987">A<span class="small-caps" style="font-variant: small-caps">A</span></span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 1">a</span></span>,
             <span class="text 1">a</span>,
             <span class="text 2" id="en-13988">B<span class="small-caps" style="font-variant: small-caps">B</span>B</span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span></span>,
             <span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span>,
             <span class="text 3" id="en-13989">C</span>,
             <span class="indent-1"><span class="indent-1-breaks">    </span><span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span></span>,
             <span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span>]

Awesome ideas, guys. Thanks a ton!

Alex Martelli

Not a cool one-liner, but, the following should work...:

text_list = []
current = []
for line in line_list:
    if line.get('id', '').startswith('en'):
        if current:
            text_list.append(' '.join(current))
            current = []
    current.append(line.text)
if current:
    text_list.append(' '.join(current))

For example, adding this code after a sample test-start of

import bs4

content = '''
<span class='text' class='indent-1' id='en00'>And one</span>
<span class='text' class='indent-1'>And two</span>
<span class='text' class='indent-1'>And three</span>
<span class='text' class='indent-1' id='en01'>And four</span>
<span class='text' class='indent-1'>And five</span>
'''

soup = bs4.BeautifulSoup(content, 'html.parser')
filtered_soup = soup.find_all("span", {"class":["text", "indent-1"]})
line_list = [line for line in filtered_soup]

a for x in text_list: print(x) loop will display

And one And two And three
And four And five

which seems to match the desired result.

Added: here's an arguably better-factored solution, which does however end up being more verbose:

def has_id_en(elem):
    return elem.get('id', '').startswith('en')

def segment(sequence, is_head):
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
                current = []
        current.append(x)
    if current:
        yield current

text_list = [' '.join(e.text for e in bunch)
             for bunch in segment(line_list, has_id_en)]

At least, this way, the segment logic is reusable for similar tasks where the items in the sequence need not be bs4 objects, and/or the way to determine whether an item needs to "head" a subsequence is different than in this specific problem.
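To illustrate the reuse point, here is a small sketch that runs the same segment generator over plain strings instead of bs4 objects. The sample lines and the '#' head test are invented for the example; segment itself is repeated only so the snippet runs on its own.

```python
def segment(sequence, is_head):
    """Yield lists of consecutive items, starting a new list at each head item."""
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
                current = []
        current.append(x)
    if current:
        yield current


# Hypothetical input: lines of a text file where '#' marks a section heading.
lines = ['# intro', 'one', 'two', '# usage', 'three']
sections = list(segment(lines, lambda s: s.startswith('#')))
# sections == [['# intro', 'one', 'two'], ['# usage', 'three']]
```

Any items that come before the first head are still yielded, as a leading group of their own, so callers that want to drop them need to filter that group out.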

edited at 2020-11-18