This is going to be hard to explain.
I'm fetching some webpages with BeautifulSoup, and I'm trying to organize them into a list. I'm fetching only the elements on the page that have the class "text". Like this:
import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get(url, verify=True)
soup = BeautifulSoup(content.text, 'lxml', parse_only=SoupStrainer('p'))
filtered_soup = soup.find_all("span", {"class": ["text", "indent-1"]})
line_list = [line for line in filtered_soup]
#text_list = [line.get_text() for line in filtered_soup]
This works great, but I'd also like to combine some of the items in the list. On the webpage, some of the items with class="text..." also have id="en...". They technically SHOULD be the parents of the other class="text..." elements, but the webpage has not been set up this way.
In my "line_list" list, there is an item with both class="text..." and id="en..." attributes, then there are a few items with only class="text...", then there is another item with both class="text..." and id="en...", and this pattern keeps repeating. Here's a way to think of it:
line_list = [A, a, a, a, B, b, b, C, c, c, c, c]
Now here's the hard part to explain. Let's say line_list[0] has both attributes, line_list[1] through line_list[3] only have the "class" attribute, and line_list[4] has both attributes again. I'd like to iterate through line_list and combine the items into a single string. But when the iteration hits an item containing both "id" and "class" (i.e. line_list[4]), I'd like it to start creating a new string.
Or, if someone can think of a better way to do this, that'd be awesome. I was going to try to do this:
line_string = ''.join(line_list)
split_list = line_string.split('id="en')
But the join command complains that line_list contains tags, not strings.
I wonder if it'd be easier to do this with a dictionary? For example, make the elements that have both "class" and "id" the keys and the elements that only have "class" their values. It'd look like this:
line_dic = {A: [a, a, a], B: [b, b], C: [c, c, c, c]}
Here's example html if anyone would like to play with it:
line_list = [<span class="text 1" id="en-13987">A<span class="small-caps" style="font-variant: small-caps">A</span></span>,
<span class="indent-1"><span class="indent-1-breaks"> </span><span class="text 1">a</span></span>,
<span class="text 1">a</span>,
<span class="text 2" id="en-13988">B<span class="small-caps" style="font-variant: small-caps">B</span>B</span>,
<span class="indent-1"><span class="indent-1-breaks"> </span><span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span></span>,
<span class="text 2">b<span class="small-caps" style="font-variant: small-caps">b</span>b</span>,
<span class="text 3" id="en-13989">C</span>,
<span class="indent-1"><span class="indent-1-breaks"> </span><span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span></span>,
<span class="text 3">c<span class="small-caps" style="font-variant: small-caps">c</span>c</span>,]
Awesome ideas, guys. Thanks a ton!
Not a cool one-liner, but the following should work:
text_list = []
current = []
for line in line_list:
    if line.get('id', '').startswith('en'):
        if current:
            text_list.append(' '.join(current))
        current = []
    current.append(line.text)
if current:
    text_list.append(' '.join(current))
For example, adding this code after a sample test-start of
import bs4
content = '''
<span class='text' class='indent-1' id='en00'>And one</span>
<span class='text' class='indent-1'>And two</span>
<span class='text' class='indent-1'>And three</span>
<span class='text' class='indent-1' id='en01'>And four</span>
<span class='text' class='indent-1'>And five</span>
'''
soup = bs4.BeautifulSoup(content, 'html.parser')
filtered_soup = soup.find_all("span", {"class":["text", "indent-1"]})
line_list = [line for line in filtered_soup]
and then finishing with
for x in text_list:
    print(x)
will display
And one And two And three
And four And five
which seems to match the desired result.
Added: here's an arguably better-factored solution, which does however end up being more verbose:
def has_id_en(elem):
    return elem.get('id', '').startswith('en')

def segment(sequence, is_head):
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
            current = []
        current.append(x)
    if current:
        yield current

text_list = [' '.join(e.text for e in bunch)
             for bunch in segment(line_list, has_id_en)]
At least, this way, the segment logic is reusable for similar tasks where the items in the sequence need not be bs4 objects, and/or the way to determine whether an item needs to "head" a subsequence is different than in this specific problem.
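To illustrate that reusability, here's a rough sketch that runs the same segment generator over plain strings shaped like the [A, a, a, a, B, b, b, ...] pattern from the question, and then builds the dictionary layout the question asked about. The letters and the str.isupper head test are stand-ins I made up for the demo; they are not taken from the real page.

```python
def segment(sequence, is_head):
    """Yield sub-lists of sequence, starting a new one at each head item."""
    current = []
    for x in sequence:
        if is_head(x):
            if current:
                yield current
            current = []
        current.append(x)
    if current:
        yield current

# Stand-in data (an assumption for this demo): upper-case letters play the
# role of the spans that carry an id="en..." attribute.
line_list = ['A', 'a', 'a', 'a', 'B', 'b', 'b', 'C', 'c', 'c', 'c', 'c']

groups = list(segment(line_list, str.isupper))

# The head of each group becomes the key, the rest of the group the value.
line_dic = {bunch[0]: bunch[1:] for bunch in groups}

print(groups)    # [['A', 'a', 'a', 'a'], ['B', 'b', 'b'], ['C', 'c', 'c', 'c', 'c']]
print(line_dic)  # {'A': ['a', 'a', 'a'], 'B': ['b', 'b'], 'C': ['c', 'c', 'c', 'c']}
```

With real bs4 tags you would probably want something like bunch[0]['id'] as the key rather than the tag itself. Note also that if the sequence does not start with a head item, the first group's "key" will just be an ordinary item, so that case may need special handling.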