Removing em dashes from a string

Mangs · Posted 2022-08-28 19:14:55

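Question: Removing em dashes from a string

I'm trying to read HTML content from a website into Python so I can analyze the text and determine which category it belongs to. I have a problem with long dashes (em dashes): when I try to work with the text, I get a NoneType error. I've tried several fixes suggested on this site, but none of them worked.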

```
from bs4 import BeautifulSoup
import re
import urllib.request
response = urllib.request.urlopen('website-im-opening')
content = response.read().decode('utf-8')
# this does not work
content = content.translate({0x2014: None})
content = re.sub(u'\u2014','',content)
# this is the other part of the code
htmlcontent = BeautifulSoup(content,"html.parser")

for cont in htmlcontent.select('p'):
    if cont.has_attr('class') == False:
        print(cont.strip()) # Returns an error as text contains long dash
```

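Any ideas on how to filter the em dashes out of the string so I can work with the rest of the text? I could replace them with a short dash or remove them entirely; they aren't important to me.

Thanks!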

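Answer

You should clean the data after extracting it with bs4, not before:

BS4 converts some HTML entities for you, so you don't need to do that yourself.
BS4 will also decode the document for you, so you can pass it the raw bytes.

Also note that the error in the question probably does not come from the dash at all. A bs4 tag has no strip() method, so cont.strip is treated as a lookup for a child tag named strip, which returns None, and calling None fails. Get the string first with cont.text (or cont.get_text()) and call string methods on that.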

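```
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('website-im-opening')
content = response.read()  # pass the raw bytes; bs4 works out the encoding
htmlcontent = BeautifulSoup(content, "html.parser")

# class_=False is meant to match only <p> tags without a class attribute
for cont in htmlcontent.find_all('p', class_=False):
    print(cont.text)
```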
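
Once each paragraph's text is an ordinary str, the dash itself can be handled with plain string methods. The following is only a minimal sketch of that cleaning step: the helper name `normalize_dashes` and its default replacement are assumptions for illustration and do not come from the original answer.

```python
# Hypothetical helper: the name and the default replacement are assumptions.
EM_DASH = "\u2014"


def normalize_dashes(text, replacement="-"):
    """Replace every em dash in text; pass replacement="" to delete them."""
    # str.translate takes a mapping from code point to replacement string
    return text.translate({ord(EM_DASH): replacement})


sample = "first \u2014 second"
print(normalize_dashes(sample))      # em dash swapped for a hyphen
print(normalize_dashes(sample, ""))  # em dash removed entirely
```

Inside the loop, this would be applied to each paragraph's extracted text rather than to the raw HTML.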
