Question: BeautifulSoup - collect href links and create a list of links

I am trying to collect all the links in a list of guns (two pages in this case) and print 1) the number of links and 2) the links themselves.

I get the error: list object has no attribute select

from bs4 import BeautifulSoup
import requests
import csv
import pandas
from pandas import DataFrame
import re
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"

page = 1
all_links = []
url="https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
  while True:
    print(url.format(page))
    res=session.get(url.format(page))
    soup=BeautifulSoup(res.content,'html.parser')
    gun_details = soup.select('div.details')
    for link in gun_details.select('a'):
     all_links.append("https://www.gunstar.co.uk" + link['href'])
    if len(soup.select(".nav_next"))==0:
        break
    page += 1

If I remove .content from the response, I get "object of type Response has no len()".

If I add .text to soup.select('div.details'), I get a similar error to the one above.

I'm sure I've gone wrong somewhere fairly simple, I just can't seem to see it - why don't select and findAll work when I try to target a specific part of the HTML?
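For reference, the error happens because soup.select() returns a ResultSet (a list-like collection of Tags), not a single Tag, so calling .select() on it fails; a minimal reproduction using a small inline HTML snippet (illustrative only, not the real page):

```python
from bs4 import BeautifulSoup

# Tiny inline HTML standing in for the real listing page (illustration only)
html = ('<div class="details"><a href="/gun/1">one</a></div>'
        '<div class="details"><a href="/gun/2">two</a></div>')
soup = BeautifulSoup(html, 'html.parser')

details = soup.select('div.details')   # a ResultSet: behaves like a list of Tags
print(type(details).__name__)          # ResultSet

try:
    details.select('a')                # the list itself has no .select method
except AttributeError as exc:
    print(exc)                         # reproduces the reported error

# Calling select() on each Tag inside the ResultSet works fine:
hrefs = [a['href'] for d in details for a in d.select('a')]
print(hrefs)                           # ['/gun/1', '/gun/2']
```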

Answer

Try the code below.

from bs4 import BeautifulSoup
import requests

page = 1

url = "https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
    while True:
        all_links = []  # reset per page so each page's links print separately
        print(url.format(page))
        res = session.get(url.format(page))
        soup = BeautifulSoup(res.content, 'html.parser')
        gun_details = soup.select('div.details')
        for detail in gun_details:
            # select_one() is called on each individual Tag, not on the ResultSet
            all_links.append("https://www.gunstar.co.uk" + detail.select_one('a')['href'])
        print(all_links)
        if len(soup.select(".nav_next")) == 0:
            break
        page += 1

Output:

https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page=1
['https://www.gunstar.co.uk/mauser-m96-lightning-hunter-straight-pull-270-rifles/rifles/1083802', 'https://www.gunstar.co.uk/magtech-586-12-bore-gauge-pump-action/Shotguns/1083784', 'https://www.gunstar.co.uk/merkel-kr1-bolt-action-308-rifles/rifles/1083786', 'https://www.gunstar.co.uk/christensen-arms-r93-carbon-bolt-action-7-mm-rifles/rifles/1083788', 'https://www.gunstar.co.uk/voere-lbw-luxus-bolt-action-308-rifles/rifles/1083792', 'https://www.gunstar.co.uk/voere-2155-bolt-action-243-rifles/rifles/1083797', 'https://www.gunstar.co.uk/voere-2155-2155-synthetic-bolt-action-308-rifles/rifles/1083798', 'https://www.gunstar.co.uk/mauser-m96-lightning-hunter-straight-pull-7-mm-rifles/rifles/1083799', 'https://www.gunstar.co.uk/blaser-lrs2-straight-pull-308-rifles/rifles/1084397', 'https://www.gunstar.co.uk/remington-700-s-s-barrel-only-bolt-action-300-win-mag-rifles/rifles/1084432']
https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page=2
['https://www.gunstar.co.uk/pfeiffer-waffen-handy-hunter-sr2-single-shot-300-win-mag-rif/rifles/1084433', 'https://www.gunstar.co.uk/sabatti-10-22-mod-sporter-semi-auto-22-rifles/rifles/1084442', 'https://www.gunstar.co.uk/voere-lbw-m-sniper-rifle-bolt-action-308-rifles/rifles/1084454', 'https://www.gunstar.co.uk/snipersystems-zoom-gun-light-kit-lamping/Accessories/1130763']
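One caveat worth noting: select_one() returns the first matching element, or None if there is no match, and indexing None['href'] raises a TypeError. A defensive variant of the loop above, using inline HTML for illustration (the details block without a link is a hypothetical edge case, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Inline HTML for illustration; the second block deliberately has no <a> tag
html = ('<div class="details"><a href="/gun/1">one</a></div>'
        '<div class="details">no link here</div>')
soup = BeautifulSoup(html, 'html.parser')

all_links = []
for div in soup.select('div.details'):
    a = div.select_one('a')          # first matching <a>, or None if absent
    if a is not None:                # guard: None['href'] would raise TypeError
        all_links.append("https://www.gunstar.co.uk" + a['href'])
print(all_links)                     # ['https://www.gunstar.co.uk/gun/1']
```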

Another way to get all the links.

from bs4 import BeautifulSoup
import requests

page = 1
all_links = []
url = "https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
    while True:
        print(url.format(page))
        res = session.get(url.format(page))
        soup = BeautifulSoup(res.content, 'html.parser')
        # the child combinator 'div.details > a' selects the <a> tags directly
        gun_details = soup.select('div.details > a')
        for link in gun_details:
            all_links.append("https://www.gunstar.co.uk" + link['href'])

        if len(soup.select(".nav_next")) == 0:
            break
        page += 1

print(all_links)
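As a side note, string concatenation works here because every href is site-relative, but urllib.parse.urljoin handles relative and absolute hrefs uniformly; a sketch of the same extraction with urljoin (inline HTML for illustration):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782"
# Inline HTML standing in for one page of the listing (illustration only)
html = '<div class="details"><a href="/gun/1">one</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# urljoin resolves each href against the page URL, whether relative or absolute
links = [urljoin(base, a['href']) for a in soup.select('div.details > a')]
print(links)  # ['https://www.gunstar.co.uk/gun/1']
```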