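Question: Formatting text output with Scrapy in Python

I am trying to scrape pages with a Scrapy spider and then save those pages to a .txt file in readable form. The code I am using to do this is: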

def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url) 

        hxs = HtmlXPathSelector(response)

        title = hxs.select('/html/head/title/text()').extract() 
        content = hxs.select('//*[@id="content"]').extract() 

        texts = "%s\n\n%s" % (title, content) 

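        soup = BeautifulSoup(''.join(texts))

        # `pretty` was never defined in the post; presumably it was the prettified soup
        pretty = soup.prettify()

        strip = ''.join(BeautifulSoup(pretty).findAll(text=True))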

        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title) 
        filly = open(filename, "w")
        filly.write(strip) 

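I combined BeautifulSoup here because the body text contains a lot of HTML that I don't want in the final product (mostly links), so I use BS to strip the HTML and leave only the text of interest.

This gives me output that looks like this: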

[u"School, Chandler's Ford (Hansard, 30 November 1961)"]

[u'

 \n      \n

  HC Deb 30 November 1961 vol 650 cc608-9

 \n

  608

 \n

  \n


  \n

   \n

    \xa7

   \n

    28.

   \n


     Dr. King


   \n

    \n            asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n

   \n

  \n

 \n      \n

  \n


  \n

   \n

    \xa7

   \n


     Sir D. Eccles


   \n

    \n            I understand that the authority has paid \xa375,000 for this site.\n            \n

whereas I would like the output to look like this:

    School, Chandler's Ford (Hansard, 30 November 1961)

          HC Deb 30 November 1961 vol 650 cc608-9

          608

            28.

Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.

Sir D. Eccles I understand that the authority has paid £75,000 for this site.

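So basically I am looking for how to remove the \n newline markers, tighten everything up, and convert any special characters to their normal form.

Answer

My answer is in the code comments: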

import re
import codecs

#...
#...
# extract() returns a list, so you need to take the first element
title = hxs.select('/html/head/title/text()').extract()[0]
content = hxs.select('//*[@id="content"]')
# instead of using BeautifulSoup for this task, you can use the following
content = content.select('string()').extract()[0]

# simply collapse repeated spaces and newlines; you may need to adjust this expression
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE | re.UNICODE)

texts = "%s\n\n%s" % (title, cleaned_content) 

# looks like there is a typo in the filename creation
# filename ....

# and my preferred way to write a file with an encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
    output.write(texts)
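
The code above targets Python 2 (the `ur''` prefix is a syntax error in Python 3). For readers on Python 3, the whitespace-collapsing step might be sketched roughly as follows. The function name `collapse_whitespace` and the sample string are illustrative assumptions, not part of the original answer, and the expression may still need adjusting for a particular page.

```python
import re


def collapse_whitespace(text):
    """Collapse each run of two or more whitespace characters to its first character."""
    # Same regex idea as in the answer: capture the first whitespace character
    # of a run and drop the rest. In Python 3, str patterns are Unicode-aware
    # by default, so no re.UNICODE flag is needed.
    return re.sub(r'(\s)\s+', r'\1', text).strip()


# Hypothetical sample modelled on the output shown in the question.
sample = "\n      \n  HC Deb 30 November 1961 vol 650 cc608-9\n \n  608\n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Characters such as `\xa7` and `\xa3` appear as escapes in the question's output only because a list was formatted into the string, which prints the `repr` of each element. Once the first element is taken with `[0]` and the text is written as UTF-8, they come out as `§` and `£`. In Python 3 the built-in `open(filename, 'w', encoding='utf-8')` can be used in place of `codecs.open`.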