Beautiful Soup Fails to Remove ALL Script Tags

Mangs · Posted 2022-08-24 00:55:17

Question

I'm playing around with bs4. I tried to scrape the following website: https://pythonbasics.org/selenium-get-html/ and I wanted to remove all of the script tags from the HTML.

To remove the script tags I used snippets like:

for script in soup("script"):
    script.decompose()

[s.extract() for s in soup.findAll('script')]

and many others I found online. They all serve the same purpose, but they fail to remove script tags such as:

<script src="/lib/jquery.js"></script>
<script src="/lib/waves.js"></script>
<script src="/lib/jquery-ui.js"></script>
<script src="/lib/jquery.tocify.js"></script>

<script src="/js/main.js"></script>
<script src="/lib/toc.js"></script>

<div id="disqus_thread"></div>
    <script>
        /**
         *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
         *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables
         */
        
        var disqus_config = function () {
            this.page.url = 'https://pythonbasics.org/selenium-get-html/';  // Replace PAGE_URL with your page's canonical URL variable
            this.page.identifier = '_posts/selenium-get-html.md'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
        };
        
        (function() {  // DON'T EDIT BELOW THIS LINE
            var d = document, s = d.createElement('script');
            
            s.src = '//https-pythonbasics-org.disqus.com/embed.js';
            
            s.setAttribute('data-timestamp', +new Date());
            (d.head || d.body).appendChild(s);
        })();
    </script>
    <noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>

What's going on here? I found some related questions:

beautifulsoup remove all the internal javascript

BeautifulSoup4 get_text still has javascript

But the answers there recommend the same approaches I already tried, and they failed. Other people in the comments got stuck just like me.

I also looked at nltk's older HTML-cleaning functions, but it seems they are no longer available. Do you have any ideas? Why do these functions fail to remove all script tags, and what can we do without resorting to regex?

Answers

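This is happening because some of the <script> tags are inside HTML comments (<!-- ... -->). Beautiful Soup parses a comment as a single text node, so markup inside it never becomes a tag that a search for "script" can match.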
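A quick way to see this for yourself is a small, made-up snippet (the sample HTML and variable names below are illustrative assumptions, not taken from the page in the question). It sketches how a commented-out script survives the usual removal loop:

```python
from bs4 import BeautifulSoup, Comment

# Made-up sample: one live <script> and one commented-out <script>.
sample_html = """
<html><body>
<script src="/js/live.js"></script>
<!-- <script src="/lib/jquery.js"></script> -->
<p>Hello</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only the live tag is parsed as a Tag; the other one is just comment text.
live_scripts = soup.find_all("script")
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(len(live_scripts))                   # number of real <script> tags found
print([str(c).strip() for c in comments])  # raw text held inside each comment

# Remove every tag the search can see, then check the serialized output.
for tag in live_scripts:
    tag.decompose()

still_has_script = "<script" in str(soup)
print(still_has_script)
```

If the last line prints True, the remaining "<script" text is coming from the comment node, not from a tag.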

You can remove these HTML comments by selecting the text nodes whose type is Comment:

from bs4 import BeautifulSoup, Comment

# `html` is the page source as a string (for example, the body of the HTTP response)
soup = BeautifulSoup(html, "html.parser")

# Find all comments on the page and remove them; most of them contain `script` tags
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Find all remaining `script` tags and remove them
for tag in soup.find_all("script"):
    tag.decompose()

print(soup.prettify())
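If you would rather keep ordinary comments and drop only the ones that hide script markup, a variation on the code above might look like the following. This is a sketch, not part of the original answer: the helper name strip_scripts and the sample HTML are hypothetical, and the substring check for "<script" is a simple heuristic rather than a full parse of the comment body.

```python
from bs4 import BeautifulSoup, Comment


def strip_scripts(html: str) -> str:
    """Remove <script> tags and any HTML comment whose text contains "<script"."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop only the comments that hide script markup; other comments are kept.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "<script" in comment.lower():
            comment.extract()

    # Remove the script tags that were parsed as real tags.
    for tag in soup.find_all("script"):
        tag.decompose()

    return str(soup)


# Made-up input: a harmless comment, a commented-out script, and a live script.
sample_html = (
    "<div><!-- keep me -->"
    "<!-- <script src='/lib/jquery.js'></script> -->"
    "<script>var x = 1;</script><p>Hello</p></div>"
)
cleaned = strip_scripts(sample_html)
print(cleaned)
```

Removing all comments, as in the answer, is simpler and is usually fine when you only need the visible text; the filtered version is worth the extra check only if the comments themselves matter to you.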
