Answer a question

Let's say I have the following piece of HTML:

<html>
<body>
<p>This is a paragraph <!-- and a comment --></p>
</body>
</html>

I want to extract the whole text of the <p> tag including <!-- and a comment -->. Using .get_text() returns only "This is a paragraph".

I want the whole raw text like this: This is a paragraph <!-- and a comment -->.

How can this be achieved with beautifulsoup4?

Answers

Use decode_contents(), which returns the tag's inner markup, comments included, as a string:

from bs4 import BeautifulSoup

html = """
<html>
<body>
<p>This is a paragraph <!-- and a comment --></p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html5lib")
for para_tag in soup.find_all('p'):
    print(para_tag.decode_contents())
    # This is a paragraph <!-- and a comment -->
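
For comparison, here is a minimal sketch of how get_text(), decode_contents() and str() differ on the same tag. It assumes the built-in "html.parser" backend rather than html5lib (so nothing beyond bs4 needs installing), and the variable names are only illustrative. It also shows one way to pull out the comment node by itself, which the question did not ask for but which may be useful.

```python
from bs4 import BeautifulSoup, Comment

html = "<html><body><p>This is a paragraph <!-- and a comment --></p></body></html>"

# Assumption: "html.parser" is used here because it ships with Python.
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

text_only = p.get_text()            # visible text only; the comment is skipped
inner_markup = p.decode_contents()  # everything inside <p>, comment included
outer_markup = str(p)               # same, but wrapped in the <p> tag itself

# Comments are a NavigableString subclass, so they can be matched by type.
comment = p.find(string=lambda s: isinstance(s, Comment))

print(repr(text_only))
print(inner_markup)
print(outer_markup)
print(repr(comment))
```

If only the comment text is needed, the Comment node can be used directly as a string; decode_contents() remains the simplest choice when the raw inner markup is wanted as-is.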