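Python 3 BeautifulSoup: scraping county names from a gov.uk web page
Question
Any help would be much appreciated!
I am trying to scrape the county names on this page (https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area) into four corresponding lists: Tier 1, Tier 2, Tier 3 and Tier 4.
The problem is how to navigate the page. This is how I set up the soup: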
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
headers = {...}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
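I have tried finding the h2 elements and then iterating over their siblings, using find_all_next, and so on, but without any luck.
End state: I am trying to put each county into a CSV that looks like the following (color mapping: Tier 1: green, Tier 2: yellow, Tier 3: amber, Tier 4: red):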
County            Country   Tier   Color
Isles of Scilly   UK        1      Green
Rutland           UK        3      Amber
etc.
Update: a minimal example of the data to extract:
from bs4 import BeautifulSoup
html = '''<div class="govspeak">
<ul>
<li>case detection rates in all age groups</li>
</ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul>
<li>Isles of Scilly</li>
</ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul>
<li>Rutland</li>
</ul>
<h3 id="north-west">North West</h3>
<ul>
<li>Liverpool City Region</li>
</ul>
</div>'''
soup = BeautifulSoup(html, "lxml")
h2 = soup.find_all('h2')
# What's the best way to find the related li tags?
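Answer
The difficulty is that the HTML is flat: the h2 and ul elements are siblings, so the list items are not nested under their tier heading. There are many ways to extract the data. One is to loop over every li element and look backwards for its heading: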
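- soup.find('div', {"class": "govspeak"}): find the parent div (which contains the h2 and li elements).
- container.find_all('li'): find every li.
- x.fetchPrevious('h2')[0].text.strip(): take the first ([0]) h2 that precedes the li and strip the surrounding whitespace. (fetchPrevious is a legacy alias of find_all_previous.)
- if x.fetchPrevious('h2')[0].findParent('div', {"class": "govspeak"}): filter out any h2 that does not sit inside the parent div, because fetchPrevious searches backwards through the whole document, not just the container.
- namedtuple (which I called CountyTierModel): stores each scraped row; the rows are collected in a list.
- re.search(r"(?<=Tier )\d(?=:)", x.tier): a regex that pulls the tier number out of the h2 title.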
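The steps above can be tried offline against the minimal HTML from the question. The following is only a sketch under a few assumptions: it uses the standard-library "html.parser" instead of "lxml", it uses find_previous and find_parent (the current names for the legacy calls), and the variable names html and pairs are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the minimal HTML from the question.
html = """<div class="govspeak">
<ul><li>case detection rates in all age groups</li></ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul><li>Isles of Scilly</li></ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul><li>Rutland</li></ul>
<h3 id="north-west">North West</h3>
<ul><li>Liverpool City Region</li></ul>
</div>"""

# Assumption: the stdlib parser is used here so the sketch has no lxml dependency.
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "govspeak"})

pairs = []
for li in container.find_all("li"):
    # Nearest <h2> before this <li> in document order, or None if there is none.
    heading = li.find_previous("h2")
    if heading is None or heading.find_parent("div", {"class": "govspeak"}) is None:
        continue  # no heading, or the heading lies outside the govspeak container
    match = re.search(r"(?<=Tier )\d(?=:)", heading.get_text(strip=True))
    if match:
        pairs.append((int(match.group(0)), li.get_text(strip=True)))

print(pairs)
```

The introductory list item ("case detection rates in all age groups") is skipped because no h2 precedes it, and Tier 2 contributes nothing because it has no list items.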
Example that scrapes the data:
from collections import namedtuple
import re
import requests
from bs4 import BeautifulSoup

CountyTierModel = namedtuple('CountyTierModel', ['tier', 'county'])

url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
container = soup.find('div', {"class": "govspeak"})

# Pair each li with the nearest preceding h2, keeping only h2s inside the govspeak div.
results = [CountyTierModel(x.fetchPrevious('h2')[0].text.strip(), x.text.strip())
           for x in container.find_all('li')
           if x.fetchPrevious('h2') and x.fetchPrevious('h2')[0].findParent('div', {"class": "govspeak"})]

# Here you can write your code to convert to CSV & provide mapping for Country & Color.
for x in results:
    # Regex to extract number from H2 title based on starting with 'Tier ' + number + ':'
    m = re.search(r"(?<=Tier )\d(?=:)", x.tier)
    if m:  # skip list items whose heading is not a tier heading
        print(f"{m.group(0)} - {x.county}")
Output:
1 - Isles of Scilly
3 - Rutland
3 - Liverpool City Region
3 - Bath and North East Somerset
3 - Bristol
3 - Cornwall
3 - Devon, Plymouth and Torbay
3 - Dorset
3 - North Somerset
3 - South Gloucestershire
3 - Wiltshire
3 - Herefordshire
3 - Shropshire, and Telford and Wrekin
3 - Worcestershire
3 - City of York and North Yorkshire
3 - The Humber: East Riding of Yorkshire, Kingston upon Hull/Hull, North East Lincolnshire and North Lincolnshire
3 - South Yorkshire (Barnsley, Doncaster, Rotheram, Sheffield)
3 - West Yorkshire (Bradford, Calderdale, Kirklees, Leeds, Wakefield)
4 - Derby and Derbyshire
4 - Leicester City and Leicestershire
4 - Lincolnshire
4 - Northamptonshire
4 - Nottingham and Nottinghamshire
4 - Bedford, Central Bedfordshire, Luton and Milton Keynes
4 - Cambridgeshire
4 - Essex, Southend-on-Sea and Thurrock
4 - Hertfordshire
4 - Norfolk
4 - Peterborough
4 - Suffolk
4 - All 32 London boroughs plus City of London
4 - North East Combined Authority (this area includes the local authorities of County Durham, Gateshead, South Tyneside and Sunderland)
4 - North of Tyne Combined Authority (this area includes the local authorities of Newcastle-upon-Tyne, North Tyneside and Northumberland)
4 - Tees Valley Combined Authority (this area includes the local authorities of Darlington, Hartlepool, Middlesbrough, Redcar and Cleveland, and Stockton-on-Tees)
4 - Cumbria
4 - Greater Manchester
4 - Lancashire, Blackburn with Darwen, and Blackpool
4 - Warrington and Cheshire Region
4 - Berkshire
4 - Brighton and Hove, East Sussex and West Sussex
4 - Buckinghamshire
4 - Hampshire, Southampton and Portsmouth
4 - Isle of Wight
4 - Kent and Medway
4 - Oxfordshire
4 - Surrey
4 - Bournemouth, Christchurch and Poole
4 - Gloucestershire (Cheltenham, Cotswold, Forest of Dean, Gloucester City, Stroud and Tewkesbury)
4 - Somerset (Mendip, Sedgemoor, Somerset West and Taunton, and South Somerset)
4 - Swindon
4 - Birmingham, Dudley, Sandwell, Walsall and Wolverhampton
4 - Coventry
4 - Solihull
4 - Staffordshire and Stoke-on-Trent
4 - Warwickshire
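Since the answer notes that there are many ways to extract the data, here is one alternative as a rough sketch: walk the h2 and li tags once in document order and remember the most recent tier heading. The function name tiers_by_area and the sample string are hypothetical, and "html.parser" is assumed in place of "lxml".

```python
import re
from bs4 import BeautifulSoup


def tiers_by_area(html):
    """Return (tier_number, area) pairs from a single forward pass over the page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="govspeak")
    current_tier = None
    rows = []
    # find_all with a list of names yields matching tags in document order.
    for tag in container.find_all(["h2", "li"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            m = re.match(r"Tier (\d):", text)
            # A non-tier heading resets the state so its list items are ignored.
            current_tier = int(m.group(1)) if m else None
        elif current_tier is not None:
            rows.append((current_tier, text))
    return rows


# Hypothetical sample shaped like the minimal example in the question.
sample = """<div class="govspeak">
<ul><li>case detection rates in all age groups</li></ul>
<h2>Tier 1: Medium alert</h2>
<h3>South West</h3>
<ul><li>Isles of Scilly</li></ul>
<h2>Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2>Tier 3: Very High alert</h2>
<h3>East Midlands</h3>
<ul><li>Rutland</li></ul>
</div>"""

rows = tiers_by_area(sample)
print(rows)
```

This avoids a backward search for every li, at the cost of relying on the headings and lists appearing in order inside the container.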
Note: to keep the question focused, I have only added the scraping code. Exporting to CSV should be a separate question.
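For readers who want a starting point anyway, the CSV step could look roughly like this, assuming the scraped results have already been reduced to (tier number, county) pairs. The color mapping comes from the question. The constant COUNTRY, the function name write_tier_csv and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Color mapping as stated in the question.
TIER_COLORS = {1: "Green", 2: "Yellow", 3: "Amber", 4: "Red"}
# Assumption: every row uses the same country value, as in the question's sample table.
COUNTRY = "UK"


def write_tier_csv(rows, fileobj):
    """Write (tier, county) pairs as County,Country,Tier,Color rows."""
    writer = csv.writer(fileobj)
    writer.writerow(["County", "Country", "Tier", "Color"])
    for tier, county in rows:
        writer.writerow([county, COUNTRY, tier, TIER_COLORS[tier]])


# Illustrative rows; in practice these would come from the scraping step.
rows = [(1, "Isles of Scilly"), (3, "Rutland")]
buffer = io.StringIO()
write_tier_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="", encoding="utf-8"). The csv module quotes county names that contain commas automatically.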