Formatting text output with Scrapy in Python

Mangs · Posted 2022-09-24 21:51:27

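Question: Formatting text output with Scrapy in Python

I'm trying to use a Scrapy spider to crawl pages and then save those pages to a .txt file in a readable form. The code I'm using to do this is: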

def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url) 

        hxs = HtmlXPathSelector(response)

        title = hxs.select('/html/head/title/text()').extract() 
        content = hxs.select('//*[@id="content"]').extract() 

        texts = "%s\n\n%s" % (title, content) 

        soup = BeautifulSoup(''.join(texts)) 

        strip = ''.join(BeautifulSoup(pretty).findAll(text=True)) 

        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title) 
        filly = open(filename, "w")
        filly.write(strip)

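I've combined BeautifulSoup here because the body text contains a lot of HTML that I don't want in the final product (primarily links), so I use BS to strip out the HTML and leave only the text that is of interest.

This gives me output that looks like: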

[u"School, Chandler's Ford (Hansard, 30 November 1961)"]

[u'

 \n      \n

  HC Deb 30 November 1961 vol 650 cc608-9

 \n

  608

 \n

  \n


  \n

   \n

    \xa7

   \n

    28.

   \n


     Dr. King


   \n

    \n            asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n

   \n

  \n

 \n      \n

  \n


  \n

   \n

    \xa7

   \n


     Sir D. Eccles


   \n

    \n            I understand that the authority has paid \xa375,000 for this site.\n            \n

whereas I would like the output to look like:

    School, Chandler's Ford (Hansard, 30 November 1961)

          HC Deb 30 November 1961 vol 650 cc608-9

          608

            28.

Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.

Sir D. Eccles I understand that the authority has paid £75,000 for this site.

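So I'm basically looking for how to remove the newline \n characters, tighten everything up, and convert any special characters to their normal form.

Answer

My answer is in the code comments: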

import re
import codecs

# ...
# ...
# extract() returns a list, so take its first element
title = hxs.select('/html/head/title/text()').extract()[0]
content = hxs.select('//*[@id="content"]')
# instead of using BeautifulSoup for this task, you can use the following
content = content.select('string()').extract()[0]

# collapse each run of repeated spaces and newlines into its first character;
# you may need to adjust this expression
# (the ur'' prefix is Python 2 syntax; in Python 3 write r'' instead)
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE | re.UNICODE)

texts = "%s\n\n%s" % (title, cleaned_content)

# the filename creation looks like it contains a typo
# filename ....

# and my preferred way to write a file with an explicit encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
    output.write(texts)
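
The cleanup steps in the comments above can be tried without Scrapy. The following is a minimal Python 3 sketch, not the answer's original Python 2 code. The names `title_list`, `content_list`, `collapse_whitespace`, and `example.txt` are hypothetical, and the sample strings are abbreviated from the output shown in the question. It shows why formatting the lists directly produces the bracketed, escaped output, then applies the same regular expression to a plain string and writes the result as UTF-8.

```python
import os
import re
import tempfile

# Hypothetical stand-ins for the one-element lists that extract() returned
# in the question; the text is abbreviated from the output shown there.
title_list = ["School, Chandler's Ford (Hansard, 30 November 1961)"]
content_list = [
    "\n      \n  HC Deb 30 November 1961 vol 650 cc608-9\n   \n  608\n\n"
    "   \xa7\n   Sir D. Eccles   \n    I understand that the authority"
    " has paid \xa375,000 for this site.\n      \n"
]

# Formatting the lists themselves embeds their repr: brackets, quotes,
# and escaped newlines, which is the noise seen in the question.
noisy = "%s\n\n%s" % (title_list, content_list)

# Taking the first element of each list gives plain strings instead.
title = title_list[0]
content = content_list[0]


def collapse_whitespace(text):
    """Replace every run of two or more whitespace characters with its first character."""
    return re.sub(r"(\s)\s+", r"\1", text)


# strip() additionally drops the leading and trailing whitespace.
cleaned = collapse_whitespace(content).strip()
texts = "%s\n\n%s" % (title, cleaned)

# Write as UTF-8 so characters such as the pound sign survive.
# A temporary directory is used only to keep this sketch self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w", encoding="utf-8") as output:
        output.write(texts)
    with open(path, encoding="utf-8") as saved:
        round_trip = saved.read()

print(texts)
```

In Python 3 the flags can be dropped: `str` patterns already match Unicode whitespace, and `re.MULTILINE` only changes the behaviour of `^` and `$`, which this pattern does not use. Because a run is replaced by its first character, a run that starts with a space and contains a newline collapses to a single space, which joins a speaker's name to the text that follows.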
