Beautiful Soup Fails to Remove ALL Script Tags
I'm playing around with bs4 and tried to scrape the following website: https://pythonbasics.org/selenium-get-html/. I wanted to remove all of the script tags from the HTML.
To remove script tags I used functions like:
for script in soup("script"):
    script.decompose()
or
[s.extract() for s in soup.findAll('script')]
and many others I found online. They all serve the same purpose, but they fail to remove script tags such as:
<script src="/lib/jquery.js"></script>
<script src="/lib/waves.js"></script>
<script src="/lib/jquery-ui.js"></script>
<script src="/lib/jquery.tocify.js"></script>
<script src="/js/main.js"></script>
<script src="/lib/toc.js"></script>
or
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables
*/
var disqus_config = function () {
this.page.url = 'https://pythonbasics.org/selenium-get-html/'; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = '_posts/selenium-get-html.md'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = '//https-pythonbasics-org.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>
What's going on here? I found some related questions:
beautifulsoup remove all the internal javascript
BeatifulSoup4 get_text still has javascript
But the answers recommend the same approaches I already tried without success, and other people in the comments got stuck just like me.
I also looked into nltk's older functions, but it seems they are no longer valid. Do you have any ideas? Why do these functions fail to remove all script tags, and what can we do without regex?
Answers
This is happening because some of the <script> tags are within HTML comments (<!-- ... -->).
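Beautiful Soup parses a comment as a text node of type Comment rather than as a tag, so searching for "script" tags never reaches it. You can remove these HTML comments by checking whether a node is of type Comment: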
from bs4 import BeautifulSoup, Comment

# `html` is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")

# Find all comments on the page and remove them; most of them contain `script` tags
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Find all other `script` tags and remove them
for tag in soup.find_all("script"):
    tag.extract()

print(soup.prettify())
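To see the effect in isolation, here is a minimal sketch on a small made-up document instead of the real page. SAMPLE_HTML and the helper names remove_script_tags and remove_comments are hypothetical and exist only for this illustration. It assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: one live <script> tag and one commented-out <script> tag.
SAMPLE_HTML = """<html><body>
<p>Hello</p>
<script src="/js/main.js"></script>
<!-- <script src="/lib/jquery.js"></script> -->
</body></html>"""


def remove_script_tags(soup):
    """Remove every real <script> element from the tree (illustrative helper)."""
    for tag in soup.find_all("script"):
        tag.decompose()


def remove_comments(soup):
    """Remove every HTML comment node from the tree (illustrative helper)."""
    for comment in soup.find_all(string=lambda node: isinstance(node, Comment)):
        comment.extract()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: only tags are removed; the commented-out script is still in the output.
remove_script_tags(soup)
tags_only = str(soup)

# Step 2: removing the comment nodes gets rid of the remaining script markup.
remove_comments(soup)
cleaned = str(soup)

print("jquery.js" in tags_only)
print("jquery.js" in cleaned)
```

On this sample the first check should print True and the second False: the commented-out script survives the tag removal and disappears only once the comment nodes are extracted.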