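Question: Python 3 BeautifulSoup: scraping county names from a gov.uk web page

Any help would be greatly appreciated!

I'm trying to scrape the county names from this web page (https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area) into four corresponding lists: Tier 1, Tier 2, Tier 3 and Tier 4.

The problem is how to navigate the page. This is how I set up the soup: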

import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
headers = {...}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")

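I've tried finding the h2s and then iterating through their siblings, find_all_next, etc., but I haven't had any luck.

End state: I'm trying to get each county into a CSV like the one below. (The colour mapping is Tier 1: green, Tier 2: yellow, Tier 3: amber, Tier 4: red.)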

County           Country  Tier  Colour
Isles of Scilly  England  1     Green
Rutland          England  3     Amber
etc.

Update: a minimal example of the data to extract:

from bs4 import BeautifulSoup
html = '''<div class="govspeak">
<ul>
  <li>case detection rates in all age groups</li>
</ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul>
  <li>Isles of Scilly</li>
</ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul>
  <li>Rutland</li>
</ul>
<h3 id="north-west">North West</h3>
<ul>
  <li>Liverpool City Region</li>
</ul>
</div>'''

soup = BeautifulSoup(html, "lxml")
h2 = soup.find_all('h2')
# What's the best way to find the related li tags?

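Answer

The problem is that the HTML is a flat structure: the h2 headings and the ul lists are siblings rather than nested inside one another. There are many ways to extract the data, for example by looping over each element.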

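  • soup.find('div', {"class": "govspeak"}) - finds the parent div (which contains the h2 and li elements).

  • container.find_all('li') - finds all li elements.

  • x.fetchPrevious('h2')[0].text.strip() - takes the first ([0]) h2 before the li, i.e. the nearest preceding one, and strips the surrounding whitespace. (fetchPrevious is a legacy alias of find_all_previous.)

  • if x.fetchPrevious('h2')[0].findParent('div', {"class": "govspeak"}) - filters out any li whose preceding h2 is not inside the parent div (because fetchPrevious literally finds whatever comes before, anywhere in the document).

  • namedtuple (which I've called CountyTierModel) - holds each scraped tier/county pair; the results are collected in a list.

  • re.search(r"(?<=Tier )\d(?=:)", x.tier) - a regex that extracts the tier number from the h2 heading.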
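To make the regex step concrete, here is a minimal sketch of what the lookbehind and lookahead do. The helper name tier_number is hypothetical and not part of the answer's code; the heading strings are taken from the question's minimal example.

```python
import re

# (?<=Tier ) is a lookbehind and (?=:) a lookahead, so the match itself
# is only the single digit that sits between "Tier " and ":".
TIER_NUMBER = re.compile(r"(?<=Tier )\d(?=:)")


def tier_number(heading):
    """Return the tier digit from an h2 heading as an int, or None if absent.

    Hypothetical helper, for illustration only.
    """
    match = TIER_NUMBER.search(heading)
    return int(match.group(0)) if match else None


print(tier_number("Tier 1: Medium alert"))     # 1
print(tier_number("Tier 3: Very High alert"))  # 3
print(tier_number("Contents"))                 # None
```

Returning None for non-matching headings is a design choice in this sketch: it lets the caller skip any h2 that is not a tier heading instead of failing on m.group(0).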

Example of scraping the data:

from collections import namedtuple
import re
import requests
from bs4 import BeautifulSoup

CountyTierModel = namedtuple('CountyTierModel', ['tier', 'county'])

url = "https://www.gov.uk/guidance/full-list-of-local-restriction-tiers-by-area"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
container = soup.find('div', {"class": "govspeak"})

# For every li, take the nearest preceding h2 as its tier heading, and keep the
# li only if that h2 sits inside the govspeak div.
# find_all_previous / find_parent are the current bs4 names for the legacy
# aliases fetchPrevious / findParent.
results = [CountyTierModel(x.find_all_previous('h2')[0].text.strip(), x.text.strip())
           for x in container.find_all('li')
           if x.find_all_previous('h2') and x.find_all_previous('h2')[0].find_parent('div', {"class": "govspeak"})]

# Here you can write your code to convert to CSV and provide the mapping for Country and Colour.
for x in results:
    # Regex to extract the number from the h2 title: 'Tier ' + digit + ':'
    m = re.search(r"(?<=Tier )\d(?=:)", x.tier)
    if m:  # skip any heading that is not of the form 'Tier N: ...'
        print(f"{m.group(0)} - {x.county}")

Output:

1 - Isles of Scilly
3 - Rutland
3 - Liverpool City Region
3 - Bath and North East Somerset
3 - Bristol
3 - Cornwall
3 - Devon, Plymouth and Torbay
3 - Dorset
3 - North Somerset
3 - South Gloucestershire
3 - Wiltshire
3 - Herefordshire
3 - Shropshire, and Telford and Wrekin
3 - Worcestershire
3 - City of York and North Yorkshire
3 - The Humber: East Riding of Yorkshire, Kingston upon Hull/Hull, North East Lincolnshire and North Lincolnshire
3 - South Yorkshire (Barnsley, Doncaster, Rotheram, Sheffield)
3 - West Yorkshire (Bradford, Calderdale, Kirklees, Leeds, Wakefield)
4 - Derby and Derbyshire
4 - Leicester City and Leicestershire
4 - Lincolnshire
4 - Northamptonshire
4 - Nottingham and Nottinghamshire
4 - Bedford, Central Bedfordshire, Luton and Milton Keynes
4 - Cambridgeshire
4 - Essex, Southend-on-Sea and Thurrock
4 - Hertfordshire
4 - Norfolk
4 - Peterborough
4 - Suffolk
4 - All 32 London boroughs plus City of London
4 - North East Combined Authority (this area includes the local authorities of County Durham, Gateshead, South Tyneside and Sunderland)
4 - North of Tyne Combined Authority (this area includes the local authorities of Newcastle-upon-Tyne, North Tyneside and Northumberland)
4 - Tees Valley Combined Authority (this area includes the local authorities of Darlington, Hartlepool, Middlesbrough, Redcar and Cleveland, and Stockton-on-Tees)
4 - Cumbria
4 - Greater Manchester
4 - Lancashire, Blackburn with Darwen, and Blackpool
4 - Warrington and Cheshire Region
4 - Berkshire
4 - Brighton and Hove, East Sussex and West Sussex
4 - Buckinghamshire
4 - Hampshire, Southampton and Portsmouth
4 - Isle of Wight
4 - Kent and Medway
4 - Oxfordshire
4 - Surrey
4 - Bournemouth, Christchurch and Poole
4 - Gloucestershire (Cheltenham, Cotswold, Forest of Dean, Gloucester City, Stroud and Tewkesbury)
4 - Somerset (Mendip, Sedgemoor, Somerset West and Taunton, and South Somerset)
4 - Swindon
4 - Birmingham, Dudley, Sandwell, Walsall and Wolverhampton
4 - Coventry
4 - Solihull
4 - Staffordshire and Stoke-on-Trent
4 - Warwickshire
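
As an alternative to looking backwards from every li, the same pairing can be done in a single forward pass that remembers the most recent h2. The following is a minimal sketch run against the question's minimal HTML rather than the live page. The function name counties_by_heading is hypothetical, and it uses the built-in "html.parser" instead of "lxml" only so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

html = """<div class="govspeak">
<ul>
  <li>case detection rates in all age groups</li>
</ul>
<h2 id="tier-1-medium-alert">Tier 1: Medium alert</h2>
<h3 id="south-west">South West</h3>
<ul>
  <li>Isles of Scilly</li>
</ul>
<h2 id="tier-2-high-alert">Tier 2: High alert</h2>
<p>No areas are currently in Tier 2.</p>
<h2 id="tier-3-very-high-alert">Tier 3: Very High alert</h2>
<h3 id="east-midlands">East Midlands</h3>
<ul>
  <li>Rutland</li>
</ul>
<h3 id="north-west">North West</h3>
<ul>
  <li>Liverpool City Region</li>
</ul>
</div>"""


def counties_by_heading(markup):
    """Pair each li with the h2 that most recently preceded it (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", class_="govspeak")
    rows = []
    current = None  # text of the last h2 seen so far
    # find_all with a list of names returns matching tags in document order.
    for tag in container.find_all(["h2", "li"]):
        if tag.name == "h2":
            current = tag.get_text(strip=True)
        elif current is not None:  # ignore li items that appear before any h2
            rows.append((current, tag.get_text(strip=True)))
    return rows


rows = counties_by_heading(html)
for heading, county in rows:
    print(f"{heading} - {county}")
```

Because the loop only looks inside the container, there is no need to check which div a heading belongs to. On the live page it would probably still be worth filtering on headings that start with "Tier".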

Note: to keep the question focused, I've only added the code for scraping. Exporting to CSV should be a separate question.
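
For completeness, the CSV step could look roughly like the sketch below. Everything in it is an assumption rather than part of the answer: the column names and the fixed country value are guessed from the table in the question, the colours follow the mapping given there, and write_tier_csv and TIER_COLOURS are hypothetical names. It takes (tier number, county) pairs, which the scraping loop could produce from int(m.group(0)) and x.county.

```python
import csv
import io

# Colour mapping as stated in the question.
TIER_COLOURS = {1: "Green", 2: "Yellow", 3: "Amber", 4: "Red"}


def write_tier_csv(rows, out, country="England"):
    """Write (tier, county) pairs as CSV rows to a file-like object.

    Hypothetical helper; the header names and the default country are assumptions.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["County", "Country", "Tier", "Colour"])
    for tier, county in rows:
        writer.writerow([county, country, tier, TIER_COLOURS[tier]])


# Sample rows standing in for the scraped results.
sample = [(1, "Isles of Scilly"), (3, "Rutland")]
buffer = io.StringIO()
write_tier_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("tiers.csv", "w", newline="") instead of the StringIO buffer; the file name here is only an example.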
