How to extract raw text (including comments) from an HTML page with BeautifulSoup in Python?
Answer a question
Let's say I have the following piece of HTML:
<html>
<body>
<p>This is a paragraph <!-- and a comment --></p>
</body>
</html>
I want to extract the whole text of the <p> tag including <!-- and a comment -->. Using .get_text() returns only "This is a paragraph".
I want the whole raw text like this: This is a paragraph <!-- and a comment -->.
How can this be achieved with beautifulsoup4?
Answers
Use decode_contents() (see the Beautiful Soup documentation), which returns the tag's inner markup as a string, comments included:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p>This is a paragraph <!-- and a comment --></p>
</body>
</html>
"""
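# The "html5lib" parser must be installed separately (pip install html5lib)
soup = BeautifulSoup(html, "html5lib")
for para_tag in soup.find_all('p'):
    # decode_contents() renders everything inside the tag, comments included
    print(para_tag.decode_contents())
    # This is a paragraph <!-- and a comment -->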
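A rough sketch of a variation, not part of the original answer: if html5lib is not installed, Python's built-in "html.parser" should give the same result for this input. The comment nodes can also be picked out on their own, since bs4 stores them as Comment objects. The variable names (raw_inner, comment_texts) are just illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<html><body><p>This is a paragraph <!-- and a comment --></p></body></html>"

# Assumption: "html.parser" (built into Python) is used here instead of html5lib
soup = BeautifulSoup(html, "html.parser")
para_tag = soup.find("p")

# Inner markup of the <p> tag as a string, with the comment markers preserved
raw_inner = para_tag.decode_contents()

# Comment nodes are children of the tag; str() gives the text between <!-- and -->
comment_texts = [str(node) for node in para_tag.children if isinstance(node, Comment)]

# get_text() skips Comment nodes, which is why the comment goes missing there
plain_text = para_tag.get_text()

print(raw_inner)
print(comment_texts)
print(repr(plain_text))
```

If the enclosing <p> tag itself is wanted as well, str(para_tag) returns the outer markup instead of just the contents.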