Replace <br> with a space in BeautifulSoup output
Question
I am scraping a few links with BeautifulSoup; however, it seems to completely ignore <br> tags.
Here is the relevant portion of source code of the URL I am scraping:
<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span id="something"></span></h1>
Here is my BeautifulSoup code (relevant part only) to get the text within the h1 tag:
soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.text.strip()
print(title)
This gives the following output:
A quick brown fox jumps overthe lazy dog
Whereas I am expecting:
A quick brown fox jumps over the lazy dog
How can I replace the <br> with a space in my code?
Answers
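How about using .get_text() with the separator parameter? It joins the text fragments found inside the element with the string you pass, so the text on either side of the <br> ends up separated by a space.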
from bs4 import BeautifulSoup
page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''
soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.get_text(separator=" ").strip()
print(title)
Output:
A quick brown fox jumps over the lazy dog
some stuff here