How do I extract multiple rows from a container while using a loop over multiple containers in BS4?
I'm trying to get the current NBA injury data from http://www.rotoworld.com/teams/injuries/nba/all/. I wrote a Python script (below) that correctly pulls the team name and the first row of data from each team container, but not every player in each container. I'm very new to Python, and although I've spent a lot of time looking for a solution, I unfortunately haven't found anything that fixes the problem. I hope this isn't too much of a newbie question!
Can someone help me extract every player's data for each team?
Also, if you have any other suggestions for improving my script, please let me know! I'm excited to finally be working in Python!
Thanks in advance!
import requests
from bs4 import BeautifulSoup as bs

# Define URL to fetch
url = 'http://www.rotoworld.com/teams/injuries/nba/all/'
# To force American English (en-US) when necessary
headers = {"Accept-Language": "en-US, en;q=0.5"}
# Make request (the headers must be passed here to take effect)
data = requests.get(url, headers=headers)
# Create BeautifulSoup object
soup = bs(data.text, 'html.parser')
# Lists to store scraped data
teams = []
players = []
reports = []
return_dates = []
injury_dates = []
injuries = []
positions = []
statuses = []
# Find each team's container (selector reconstructed -- the original
# definition was lost from the post; this is an assumption)
team_containers = soup.find_all('div', {'class': 'player'})

# Extract data from individual containers
for container in team_containers:
    # Team Name
    team = container.a.text
    teams.append(team)
    # Player Name [First, Last]
    player = container.table.a.text
    players.append(player)
    # Player Report
    report = container.find('div', attrs={'class': 'report'}).text
    reports.append(report)
    # Player Return
    return_date = container.find('div', attrs={'class': 'impact'}).text
    return_dates.append(return_date)
    # Player Injury Dates
    injury_date = container.find('div', attrs={'class': 'date'}).text
    injury_dates.append(injury_date)
    # Player Injury Details
    injury = container.find('div', attrs={'class': 'playercard'}).span.text
    injuries.append(injury)
    # Player Position
    position = container.table.find_all('td')[9].text
    positions.append(position)
    # Player Status
    status = container.table.find_all('td')[10].text
    statuses.append(status)
import pandas as pd

test_df = pd.DataFrame({'team': teams,
                        'player': players,
                        'report': reports,
                        'return_date': return_dates,
                        'injury_date': injury_dates,
                        'injury': injuries,
                        'position': positions,
                        'status': statuses})
print(test_df.info())
test_df
Current result:
- 27 containers, one per team (when a team has more than one injury), each containing only the first player from that team's table
- Name, Report, POS, Date, Injury, Returns as the fields of that single record

Expected result:
- 27 containers, one per team (when a team has more than one injury), each containing every player in that team's table
- Name, Report, POS, Date, Injury, Returns as the header row and the fields of each record
Answer
Once you notice that the HTML has <table> tags, you can let Pandas do most of the work with .read_html().
So that's how I solved it: I used Pandas to grab the tables. The only problem I ran into was getting the team names that way. So I used BeautifulSoup to put the team names into a list in the same order the tables appear, and then matched that list up against the list of dataframes Pandas returns.
I'll put both versions here: 1) without team names, and 2) with team names.
Without team names
import pandas as pd

url = 'http://www.rotoworld.com/teams/injuries/nba/all/'
# Get All Tables
tables = pd.read_html(url)

results = pd.DataFrame()
for table in tables:
    temp_df = table[1:].copy()        # all rows except the embedded header row
    temp_df.columns = table.iloc[0]   # promote the first row to column headers
    temp_df = temp_df.dropna(axis=1, how='all')
    # DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement
    results = pd.concat([results, temp_df], ignore_index=True)
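One note: pd.read_html() needs an HTML parser backend installed — lxml, or html5lib together with BeautifulSoup — or it will raise an ImportError.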
Output:
print(results)
0 Name ... Returns
0 Kent Bazemore ... Targeting mid-January
1 Taurean Prince ... Day-to-day
2 Rondae Hollis-Jefferson ... Day-to-day
3 Dzanan Musa ... Day-to-day
4 Allen Crabbe ... Targeting mid-January
5 Caris LeVert ... Targeting February
6 Marcus Morris ... Day-to-day
7 Kyrie Irving ... Day-to-day
8 Robert Williams ... day-to-day
9 Aron Baynes ... Targeting MLK Day
10 Malik Monk ... Day-to-day
11 Cody Zeller ... Targeting All-Star break
12 Jeremy Lamb ... day-to-day
13 Bobby Portis ... Targeting mid-January
14 Denzel Valentine ... Out for season
15 Matthew Dellavedova ... Day-to-day
16 Ante Zizic ... Day-to-day
17 David Nwaba ... Day-to-day
18 J.R. Smith ... Out indefinitely
19 John Henson ... Out Indefinitely
20 Kevin Love ... Targeting mid-January
21 Will Barton ... week-to-week
22 Jarred Vanderbilt ... Out indefinitely
23 Michael Porter Jr. ... Out indefinitely
24 Isaiah Thomas ... Targeting December
25 Zaza Pachulia ... Day-to-day
26 Ish Smith ... Out Indefinitely
27 Henry Ellenson ... Out indefinitely
28 Damian Jones ... Out Indefinitely
29 DeMarcus Cousins ... Targeting January?
.. ... ... ...
[74 rows x 6 columns]
With a column for the team names
import bs4
import requests
import pandas as pd

url = 'http://www.rotoworld.com/teams/injuries/nba/all/'
# To force American English (en-US) when necessary
headers = {"Accept-Language": "en-US, en;q=0.5"}
# Make request (the headers must be passed here to take effect)
data = requests.get(url, headers=headers)
# Create BeautifulSoup object
soup = bs4.BeautifulSoup(data.text, 'html.parser')
When I "inspected" the page's HTML, I noticed that the team names appear under a <div class="player"> tag just before each <table> tag. That tells me each team name immediately precedes its own table (which is also apparent from how the site displays them), so I find all the <div class="player"> tags and store them in team_containers:
team_containers = soup.find_all('div', {'class':'player'})
If I print the length with print(len(team_containers)), I see there are 27 elements. If I check the length of tables after running tables = pd.read_html(url), I also get 27. That may not hold in every case, but it gives me confidence in my assumption: team_containers and tables each have 27 elements, so they should match up one-to-one and be in the same order.
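To make that assumption explicit in code, you can check the two lengths before relying on positional matching — a small sanity check, assuming tables = pd.read_html(url) has already been run:

# Sanity check: one team-name <div> per scraped <table>, in the same order
assert len(team_containers) == len(tables), \
    f"got {len(team_containers)} team names but {len(tables)} tables"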
So next, I loop over team_containers to extract the text and put it into a list. I used a list comprehension, but you could use a for loop:
teams = [ team.text for team in team_containers ]
which is the same as:
teams = []
for team in team_containers:
    teams.append(team.text)
This gives me a list of elements:
['Atlanta Hawks', 'Brooklyn Nets', 'Boston Celtics', 'Charlotte Hornets', 'Chicago Bulls', 'Cleveland Cavaliers', 'Denver Nuggets', 'Detroit Pistons', 'Golden State Warriors', 'Houston Rockets', 'Indiana Pacers', 'Los Angeles Clippers', 'Los Angeles Lakers', 'Memphis Grizzlies', 'Miami Heat', 'Minnesota Timberwolves', 'Milwaukee Bucks', 'New Orleans Pelicans', 'New York Knicks', 'Oklahoma City Thunder', 'Orlando Magic', 'Philadelphia 76ers', 'San Antonio Spurs', 'Sacramento Kings', 'Toronto Raptors', 'Utah Jazz', 'Washington Wizards']
Keep in mind that each element in a list has an index/position, starting at 0. So 'Atlanta Hawks' is teams[0], 'Brooklyn Nets' is teams[1], and so on.
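For example:

print(teams[0])   # Atlanta Hawks
print(teams[1])   # Brooklyn Nets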
I initialize a final results dataframe and start my index/position counter at 0; as I iterate over tables, I'll walk through my teams list in step and append every table into the one results dataframe.
results = pd.DataFrame()
idx = 0
Then I iterate over my tables:
for table in tables:
Right now my index is 0 (which is 'Atlanta Hawks'). I store that in a variable named team:
    team = teams[idx]
I want all the rows of my table except the first one, which is the header. I store that as a temporary dataframe I call temp_df so I can work with it and then append it to my results:
    temp_df = table[1:].copy()   # .copy() avoids SettingWithCopyWarning when adding the Team column
I set temp_df's headers/columns to the first row of table:
    temp_df.columns = table.iloc[0]
I rename temp_df's second column because it comes through as NaN (on the site it's actually where the report sits, but read_html doesn't pull a header for it), and I figure I'll use it as my "Team" column:
    temp_df = temp_df.rename(columns={temp_df.columns[1]: "Team"})
I assign 'Atlanta Hawks' as the value so the "Team" column is filled with 'Atlanta Hawks':
    temp_df['Team'] = team
Now that I'm done with this team, on the next iteration I want the next position in my teams list, so I increment the counter by 1; the next pass will use teams[1], which is the 'Brooklyn Nets' table:
    idx += 1
Finally, I append that temporary dataframe to my final results. Then the whole process repeats on the next element of tables, now with my index set to 1 so that 'Brooklyn Nets' fills my team variable:
    results = pd.concat([results, temp_df], ignore_index=True)   # pd.concat() replaces the removed DataFrame.append()
So the full code:
import bs4
import requests
import pandas as pd

url = 'http://www.rotoworld.com/teams/injuries/nba/all/'
# To force American English (en-US) when necessary
headers = {"Accept-Language": "en-US, en;q=0.5"}
# Make request (the headers must be passed here to take effect)
data = requests.get(url, headers=headers)
# Create BeautifulSoup object
soup = bs4.BeautifulSoup(data.text, 'html.parser')

# Get Team Names in Order as Tables Appear
team_containers = soup.find_all('div', {'class': 'player'})
teams = [team.text for team in team_containers]

# Get All Tables
tables = pd.read_html(url)

results = pd.DataFrame()
idx = 0
for table in tables:
    team = teams[idx]
    temp_df = table[1:].copy()
    temp_df.columns = table.iloc[0]
    temp_df = temp_df.rename(columns={temp_df.columns[1]: "Team"})
    temp_df['Team'] = team
    idx += 1
    results = pd.concat([results, temp_df], ignore_index=True)
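As a side note, the manual idx counter isn't strictly needed: zip() pairs each team name with its table directly, and collecting the frames for a single pd.concat at the end is the more common pattern. A sketch of the same loop:

# Same loop, letting zip() pair team names with tables
frames = []
for team, table in zip(teams, tables):
    temp_df = table[1:].copy()        # drop the embedded header row
    temp_df.columns = table.iloc[0]   # promote the first row to headers
    temp_df = temp_df.rename(columns={temp_df.columns[1]: "Team"})
    temp_df['Team'] = team            # fill the team column
    frames.append(temp_df)
results = pd.concat(frames, ignore_index=True)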
Output:
print(results)
0 Name ... Returns
0 Kent Bazemore ... Targeting mid-January
1 Taurean Prince ... Day-to-day
2 Rondae Hollis-Jefferson ... Day-to-day
3 Dzanan Musa ... Day-to-day
4 Allen Crabbe ... Targeting mid-January
5 Caris LeVert ... Targeting February
6 Marcus Morris ... Day-to-day
7 Kyrie Irving ... Day-to-day
8 Robert Williams ... day-to-day
9 Aron Baynes ... Targeting MLK Day
10 Malik Monk ... Day-to-day
11 Cody Zeller ... Targeting All-Star break
12 Jeremy Lamb ... day-to-day
13 Bobby Portis ... Targeting mid-January
14 Denzel Valentine ... Out for season
15 Matthew Dellavedova ... Day-to-day
16 Ante Zizic ... Day-to-day
17 David Nwaba ... Day-to-day
18 J.R. Smith ... Out indefinitely
19 John Henson ... Out Indefinitely
20 Kevin Love ... Targeting mid-January
21 Will Barton ... week-to-week
22 Jarred Vanderbilt ... Out indefinitely
23 Michael Porter Jr. ... Out indefinitely
24 Isaiah Thomas ... Targeting December
25 Zaza Pachulia ... Day-to-day
26 Ish Smith ... Out Indefinitely
27 Henry Ellenson ... Out indefinitely
28 Damian Jones ... Out Indefinitely
29 DeMarcus Cousins ... Targeting January?
.. ... ... ...
[74 rows x 7 columns]
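And to answer the original question directly: the reason your BeautifulSoup script only captured one player per team is that container.table.a and the container-level find() calls return only the first match inside each container. Iterating over every row of each team's table fixes that. A rough sketch, assuming your team_containers and lists are set up as in the question (the per-row cell layout is an assumption you'd want to verify against the live HTML):

# Sketch: walk every data row in each team's table, not just the first link
for container in team_containers:
    team = container.a.text
    for row in container.table.find_all('tr')[1:]:   # [1:] skips the header row
        if row.a is None:                            # skip rows without a player link
            continue
        teams.append(team)                           # repeat the team once per player
        players.append(row.a.text)
        # ...pull report, date, injury, position, and status from this row's
        # cells (e.g. row.find_all('td')) the same way as in the original script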