Answer a question

I'm playing around with bs4 and tried to scrape the following website: https://pythonbasics.org/selenium-get-html/. I wanted to remove all of the script tags from the HTML.

To remove the script tags, I used snippets like:

for script in soup("script"):
    script.decompose()

or

[s.extract() for s in soup.findAll('script')]

and many others I found online. They all serve the same purpose; however, they fail to remove script tags such as:

<script src="/lib/jquery.js"></script>
<script src="/lib/waves.js"></script>
<script src="/lib/jquery-ui.js"></script>
<script src="/lib/jquery.tocify.js"></script>

<script src="/js/main.js"></script>
<script src="/lib/toc.js"></script>

or

<div id="disqus_thread"></div>
    <script>
        /**
         *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
         *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables
         */
        
        var disqus_config = function () {
            this.page.url = 'https://pythonbasics.org/selenium-get-html/';  // Replace PAGE_URL with your page's canonical URL variable
            this.page.identifier = '_posts/selenium-get-html.md'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
        };
        
        (function() {  // DON'T EDIT BELOW THIS LINE
            var d = document, s = d.createElement('script');
            
            s.src = '//https-pythonbasics-org.disqus.com/embed.js';
            
            s.setAttribute('data-timestamp', +new Date());
            (d.head || d.body).appendChild(s);
        })();
    </script>
    <noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>

What's going on here? I found some related questions:

beautifulsoup remove all the internal javascript

BeautifulSoup4 get_text still has javascript

However, the answers there recommend the same approaches I tried, which failed to clean these scripts, and other people in the comments got stuck just like me.

I also looked into the functions nltk used to provide for this, but it seems they are no longer available. Do you have any ideas? Why do these functions fail to remove all script tags, and what can we do without regex?

Answers

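This happens because some of the <script> tags are inside HTML comments (<!-- ... -->). BeautifulSoup parses the contents of a comment as a single Comment string rather than as tags, so searching for "script" tags never matches them, yet they still appear when the soup is printed.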
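To make the cause concrete, here is a minimal sketch on a made-up snippet (the `sample_html` string is an assumption for illustration, not the real page source). It shows that a commented-out script is a Comment node and survives a tag-based removal:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: one live script tag and one commented-out script tag.
sample_html = """
<html><body>
<script src="/js/main.js"></script>
<!-- <script src="/lib/jquery.js"></script> -->
<p>Hello</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only the live tag is a Tag object that find_all("script") can match.
live_scripts = soup.find_all("script")

# The commented-out tag is just text inside a Comment node.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Remove the live script tags, as the snippets in the question do.
for tag in live_scripts:
    tag.decompose()

# The commented-out script is still part of the serialized document.
still_present = "<script" in str(soup)

print(len(live_scripts), len(comments), still_present)
```

The final check stays true because the comment, and the script text inside it, is written back out unchanged when the soup is serialized.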

You can remove these HTML comments by checking whether a node is of type Comment:

from bs4 import BeautifulSoup, Comment

# `html` is assumed to already hold the page source as a string
soup = BeautifulSoup(html, "html.parser")

# Remove every HTML comment; most of them contain commented-out `script` tags
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Remove all remaining `script` tags
for tag in soup.find_all("script"):
    tag.decompose()

print(soup.prettify())
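If you need this in more than one place, the two steps can be wrapped in a small helper. The following is only a sketch: `strip_scripts` and `sample_html` are hypothetical names introduced here, and the sample markup is made up so that the result can be checked without fetching the page.

```python
from bs4 import BeautifulSoup, Comment


def strip_scripts(html: str) -> str:
    """Return the markup with HTML comments and script tags removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop comments first, since they may hide commented-out script tags.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Then drop the script tags that are real elements.
    for tag in soup.find_all("script"):
        tag.decompose()

    return str(soup)


# Made-up input containing a live script, a commented-out script and an inline script.
sample_html = (
    '<div id="disqus_thread"></div>'
    '<script>var d = document;</script>'
    '<!-- <script src="/lib/toc.js"></script> -->'
    '<p>Hello</p>'
)

cleaned = strip_scripts(sample_html)
print(cleaned)
```

Note that this removes all comments, not only those containing scripts; if other comments must be kept, the lambda could additionally test whether the comment text contains "<script".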