Python 3 BeautifulSoup: scraping county names from a gov.uk web page

Mangs · posted 2022-08-28 17:36:21

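Question: Python 3 BeautifulSoup: scraping county names from a gov.uk web page

Any help would be greatly appreciated!

I'm trying to scrape the county names on this web page (https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area) into four corresponding lists: Tier 1, Tier 2, Tier 3 and Tier 4.

The problem is how to navigate the page. This is how I set up the soup: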

import requests
from bs4 import BeautifulSoup

url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
# The header values were elided in the original post.
headers = {...}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")

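I've tried finding the h2s and then iterating over their siblings, using find_all_next, and so on, but I haven't had any luck.

End state: I'm trying to put each county into a CSV that looks like the one below. (The color mapping is Tier 1: green, Tier 2: yellow, Tier 3: amber, Tier 4: red.)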

County            Country   Tier   Color
Isles of Scilly   UK        1      Green
Rutland           UK        3      Amber
etc.

Update: a minimal example of the data to be extracted:

from bs4 import BeautifulSoup
html = '''<div class="govspeak">
<ul>
  <li>case detection rates in all age groups</li>
</ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul>
  <li>Isles of Scilly</li>
</ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul>
  <li>Rutland</li>
</ul>
<h3 id="north-west">North West</h3>
<ul>
  <li>Liverpool City Region</li>
</ul>
</div>'''

soup = BeautifulSoup(html, "lxml")
h2 = soup.find_all('h2')
# What's the best way to find the related li tags?

Answer

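The problem is that the h2 and ul elements in this HTML form a flat structure: they are siblings, not nested inside each other, so a list is not a descendant of its tier heading. There are many ways to extract the data, for example a for loop over each element.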
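
One way to make the "loop over each element" idea concrete is to walk the direct children of the container in document order and remember the most recent h2. The sketch below runs against the minimal HTML from the question rather than the live page, whose markup may differ. The helper name group_by_tier and the use of the built-in "html.parser" (instead of lxml) are assumptions made for this example, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Minimal HTML copied from the question's update.
html = '''<div class="govspeak">
<ul>
  <li>case detection rates in all age groups</li>
</ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul>
  <li>Isles of Scilly</li>
</ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul>
  <li>Rutland</li>
</ul>
<h3 id="north-west">North West</h3>
<ul>
  <li>Liverpool City Region</li>
</ul>
</div>'''


def group_by_tier(markup):
    """Hypothetical helper: map each h2 heading to the li texts that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", {"class": "govspeak"})
    tiers = {}
    current = None  # text of the most recent h2; None before the first one
    # recursive=False yields only direct children: the flat h2/h3/ul/p siblings.
    for child in container.find_all(True, recursive=False):
        if child.name == "h2":
            current = child.get_text(strip=True)
            tiers[current] = []
        elif child.name == "ul" and current is not None:
            tiers[current].extend(
                li.get_text(strip=True) for li in child.find_all("li")
            )
    return tiers


tiers = group_by_tier(html)
for heading, counties in tiers.items():
    print(heading, counties)
```

The first ul is skipped because it appears before any h2, and Tier 2 ends up with an empty list, which matches the "No areas are currently in Tier 2" paragraph in the sample.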

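soup.find('div', {"class": "govspeak"}) - finds the parent div (which contains the h2 and li elements).
container.find_all('li') - finds every li.
x.fetchPrevious('h2')[0].text.strip() - takes the first ([0]) h2 that precedes the li, i.e. the nearest one, and strips leading and trailing whitespace. fetchPrevious is a legacy alias of find_all_previous; find_previous('h2') returns the same element directly.
if x.fetchPrevious('h2')[0].findParent('div', {"class": "govspeak"}) - filters out any li whose nearest h2 is not inside the parent div (fetchPrevious searches backwards through the whole document, not just the container). findParent is likewise the older spelling of find_parent.
namedtuple (which I call CountyTierModel) - stores the scraped data as a list of records.
re.search(r"(?<=Tier )\d(?=:)", x.tier) - a regular expression that pulls the tier number out of the h2 title.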
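
To see what the lookbehind and lookahead in that regular expression actually match, here is a small standalone sketch using only the standard library. The helper name tier_number is hypothetical, and returning None for a non-tier heading is a choice made for this example; the sample strings are headings from the question's minimal HTML.

```python
import re

# "(?<=Tier )" requires the text "Tier " just before the match and
# "(?=:)" requires a colon just after it; only the digit itself is matched.
TIER_NUMBER = re.compile(r"(?<=Tier )\d(?=:)")


def tier_number(heading):
    """Hypothetical helper: return the tier as an int, or None if the heading is not a tier title."""
    m = TIER_NUMBER.search(heading)
    return int(m.group(0)) if m else None


print(tier_number("Tier 1: Medium alert"))
print(tier_number("Tier 3: Very High alert"))
print(tier_number("South West"))
```

Checking for None before calling group(0) avoids an AttributeError when a heading does not follow the "Tier N:" pattern.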

Example of scraping the data:

from collections import namedtuple
import re

import requests
from bs4 import BeautifulSoup

CountyTierModel = namedtuple('CountyTierModel', ['tier', 'county'])

url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
# Parent div that holds both the h2 headings and the li items.
container = soup.find('div', {"class": "govspeak"})

results = []
for li in container.find_all('li'):
    # Nearest h2 before this li (same element as the legacy fetchPrevious('h2')[0]).
    h2 = li.find_previous('h2')
    # Skip items whose nearest h2 lies outside the govspeak div.
    if h2 and h2.find_parent('div', {"class": "govspeak"}):
        results.append(CountyTierModel(h2.text.strip(), li.text.strip()))

# Here you can write your code to convert to CSV and provide the mapping for Country and Color.
for x in results:
    # Regex to extract the number from an h2 title of the form 'Tier ' + number + ':'
    m = re.search(r"(?<=Tier )\d(?=:)", x.tier)
    if m:  # ignore headings that are not tier titles
        print(f"{m.group(0)} - {x.county}")

Output:

1 - Isles of Scilly
3 - Rutland
3 - Liverpool City Region
3 - Bath and North East Somerset
3 - Bristol
3 - Cornwall
3 - Devon, Plymouth and Torbay
3 - Dorset
3 - North Somerset
3 - South Gloucestershire
3 - Wiltshire
3 - Herefordshire
3 - Shropshire, and Telford and Wrekin
3 - Worcestershire
3 - City of York and North Yorkshire
3 - The Humber: East Riding of Yorkshire, Kingston upon Hull/Hull, North East Lincolnshire and North Lincolnshire
3 - South Yorkshire (Barnsley, Doncaster, Rotheram, Sheffield)
3 - West Yorkshire (Bradford, Calderdale, Kirklees, Leeds, Wakefield)
4 - Derby and Derbyshire
4 - Leicester City and Leicestershire
4 - Lincolnshire
4 - Northamptonshire
4 - Nottingham and Nottinghamshire
4 - Bedford, Central Bedfordshire, Luton and Milton Keynes
4 - Cambridgeshire
4 - Essex, Southend-on-Sea and Thurrock
4 - Hertfordshire
4 - Norfolk
4 - Peterborough
4 - Suffolk
4 - All 32 London boroughs plus City of London
4 - North East Combined Authority (this area includes the local authorities of County Durham, Gateshead, South Tyneside and Sunderland)
4 - North of Tyne Combined Authority (this area includes the local authorities of Newcastle-upon-Tyne, North Tyneside and Northumberland)
4 - Tees Valley Combined Authority (this area includes the local authorities of Darlington, Hartlepool, Middlesbrough, Redcar and Cleveland, and Stockton-on-Tees)
4 - Cumbria
4 - Greater Manchester
4 - Lancashire, Blackburn with Darwen, and Blackpool
4 - Warrington and Cheshire Region
4 - Berkshire
4 - Brighton and Hove, East Sussex and West Sussex
4 - Buckinghamshire
4 - Hampshire, Southampton and Portsmouth
4 - Isle of Wight
4 - Kent and Medway
4 - Oxfordshire
4 - Surrey
4 - Bournemouth, Christchurch and Poole
4 - Gloucestershire (Cheltenham, Cotswold, Forest of Dean, Gloucester City, Stroud and Tewkesbury)
4 - Somerset (Mendip, Sedgemoor, Somerset West and Taunton, and South Somerset)
4 - Swindon
4 - Birmingham, Dudley, Sandwell, Walsall and Wolverhampton
4 - Coventry
4 - Solihull
4 - Staffordshire and Stoke-on-Trent
4 - Warwickshire

Note: to keep the question focused, I have only added the code for scraping. Exporting to CSV should be a separate question.
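
Although CSV export is left out of scope above, the end state described in the question can be sketched with the standard csv module. This is only an illustration under stated assumptions: the helper name to_csv, the (tier number, county) input shape, and the "UK" country value are made up for the example, while the color mapping is the one given in the question. The sample rows are taken from the output listed above.

```python
import csv
import io

# Color mapping from the question: Tier 1 green, 2 yellow, 3 amber, 4 red.
TIER_COLORS = {1: "Green", 2: "Yellow", 3: "Amber", 4: "Red"}


def to_csv(rows, country):
    """Hypothetical helper: rows is an iterable of (tier_number, county) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["County", "Country", "Tier", "Color"])
    for tier, county in rows:
        writer.writerow([county, country, tier, TIER_COLORS[tier]])
    return buffer.getvalue()


# A few (tier, county) pairs taken from the scraped output above.
sample = [(1, "Isles of Scilly"), (3, "Rutland"), (3, "Devon, Plymouth and Torbay")]
print(to_csv(sample, country="UK"))
```

Using csv.writer rather than joining strings by hand matters here because several area names contain commas, and the writer quotes those fields automatically.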
