Scraping home teams
Question
I'm working on a project where I want to scrape NBA match statistics for the 2019/20 season from https://www.basketball-reference.com/leagues/NBA_2020_games.html for the months of October to August.
I only care about match outcomes for the Home and Away teams rather than detailed player/team stats, so I need the box score data for every match from the "Basic Box Score Stats" tables.
Problem: When scraping the box scores I only manage to gather the data for Away teams. The Away table is always the first table on the box score page, so I can simply select it with the index [0] (it's static). For the Home team, the table index seems to change depending on whether there was overtime (OT) or not, and sometimes for other reasons I can't identify (it's somewhat dynamic).
Question: How can I best use a loop to gather box scores for both Away and Home teams in every month? Or, how do I collect data for the Home team in each box score?
Example of a box score page for a match without overtime: https://www.basketball-reference.com/boxscores/201910220LAC.html
Example of a box score page for a match with overtime: https://www.basketball-reference.com/boxscores/201910220TOR.html
In the latter example, the table index for the Home team shifts with the number of preceding tables (e.g. tables containing overtime data). Without OT it's usually the 8th table; with OT it's different.
My code that successfully (and consistently) gets the data for Away teams is the following:
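import pandas as pd

# Must be a list: looping over a plain string would iterate its characters
box_score_example_url = ['https://www.basketball-reference.com/boxscores/201910230POR.html']
dfbox = []
for eachBox in box_score_example_url:
    dfz = pd.read_html(eachBox)  # every table found on the page
    dfbox.append(dfz[0])         # the first table is the Away team
boxbox_awayteam = pd.concat(dfbox)
boxbox_awayteam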
I'm out of ideas for this one since no table seems to have a specific id or class in the HTML code. This is my first web scraping project and my first question posted on Stack Overflow, so bear with me.
Answers
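You can use BeautifulSoup with the CSS selector [id$="-game-basic"] table to select only the two full-game basic tables, no matter how many other tables (e.g. for overtime) the page contains. The $= operator matches elements whose id ends with "-game-basic", and the trailing table picks the tables nested inside them. Then load these tables with pd.read_html():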
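from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/boxscores/201910220TOR.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Only the two full-game basic tables match: Away team first, then Home team
my_tables = soup.select('[id$="-game-basic"] table')

# Recent pandas versions expect a file-like object rather than a raw HTML string
df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)
df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)
print(df_1)
print(df_2)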
Prints:
Starters MP ... PTS +/-
0 Jrue Holiday 41:05 ... 13 -14
1 Brandon Ingram 35:06 ... 22 -19
2 J.J. Redick 27:03 ... 16 -14
3 Lonzo Ball 24:50 ... 8 -7
4 Derrick Favors 20:46 ... 6 -12
5 Reserves MP ... PTS +/-
6 Josh Hart 28:10 ... 15 -1
7 Nicolò Melli 19:37 ... 14 +11
8 Kenrich Williams 18:02 ... 3 +11
9 Frank Jackson 13:51 ... 9 +7
10 Jahlil Okafor 12:29 ... 8 -7
11 E'Twaun Moore 12:06 ... 5 -1
12 Nickeil Alexander-Walker 11:55 ... 3 +6
13 Jaxson Hayes Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 122 NaN
[15 rows x 21 columns]
Starters MP ... PTS +/-
0 Kyle Lowry 44:59 ... 22 -1
1 Fred VanVleet 44:21 ... 34 +18
2 Pascal Siakam 38:09 ... 34 +5
3 OG Anunoby 35:48 ... 11 +12
4 Marc Gasol 31:55 ... 6 -2
5 Reserves MP ... PTS +/-
6 Norman Powell 28:38 ... 5 +2
7 Serge Ibaka 26:00 ... 13 +6
8 Terence Davis 15:10 ... 5 0
9 Matt Thomas Did Not Play ... Did Not Play Did Not Play
10 Chris Boucher Did Not Play ... Did Not Play Did Not Play
11 Stanley Johnson Did Not Play ... Did Not Play Did Not Play
12 Malcolm Miller Did Not Play ... Did Not Play Did Not Play
13 Dewan Hernandez Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 130 NaN
[15 rows x 21 columns]
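To see why the selector keeps working when extra tables appear, here is a minimal offline sketch. The HTML is made up: the ids (box-AWAY-game-basic, box-HOME-ot1-basic and so on) only mimic the naming pattern the selector relies on and are an assumption, not a copy of the real page. The helper names section and basic_game_labels are hypothetical as well.

```python
from bs4 import BeautifulSoup


def section(div_id, label):
    """Build one made-up wrapper <div> that holds a single one-cell table."""
    return f'<div id="{div_id}"><table><tr><td>{label}</td></tr></table></div>'


def basic_game_labels(html):
    """Return the cell text of every table matched by the selector from the answer."""
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.select('[id$="-game-basic"] table')
    return [t.get_text(strip=True) for t in tables]


# Hypothetical page without overtime (ids are assumptions for illustration)
regulation = ''.join([
    section('box-AWAY-game-basic', 'away game'),
    section('box-AWAY-q1-basic', 'away q1'),
    section('box-AWAY-game-advanced', 'away advanced'),
    section('box-HOME-game-basic', 'home game'),
    section('box-HOME-q1-basic', 'home q1'),
])

# Same page with extra overtime tables placed before the home team's table
overtime = ''.join([
    section('box-AWAY-game-basic', 'away game'),
    section('box-AWAY-q1-basic', 'away q1'),
    section('box-AWAY-ot1-basic', 'away ot1'),
    section('box-AWAY-game-advanced', 'away advanced'),
    section('box-HOME-game-basic', 'home game'),
    section('box-HOME-ot1-basic', 'home ot1'),
])

print(basic_game_labels(regulation))  # ['away game', 'home game']
print(basic_game_labels(overtime))    # ['away game', 'home game']
```

In both cases exactly two tables come back, in document order, so index 0 stays the Away team and index 1 the Home team regardless of how many tables sit in between.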
EDIT: To put this function in a loop, you can use this example:
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2020_games.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def get_tables(url):
    """Return (away, home) basic box score DataFrames for one box score page."""
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    my_tables = soup.select('[id$="-game-basic"] table')
    # Recent pandas versions expect a file-like object rather than a raw HTML string
    df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)
    df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)
    return df_1, df_2

# Outer loop: one schedule page per month
for a in soup.select('.filter a'):
    u = 'https://www.basketball-reference.com' + a['href']
    print(u)
    soup2 = BeautifulSoup(requests.get(u).content, 'html.parser')
    # Inner loop: every box score linked from that month's schedule
    for a2 in soup2.select('td a[href^="/boxscores/"]'):
        u2 = 'https://www.basketball-reference.com' + a2['href']
        t1, t2 = get_tables(u2)
        print(u2)
        print(t1)
        print(t2)
        print('-' * 80)
Prints:
https://www.basketball-reference.com/leagues/NBA_2020_games-october.html
https://www.basketball-reference.com/boxscores/201910220TOR.html
Starters MP ... PTS +/-
0 Jrue Holiday 41:05 ... 13 -14
1 Brandon Ingram 35:06 ... 22 -19
2 J.J. Redick 27:03 ... 16 -14
3 Lonzo Ball 24:50 ... 8 -7
4 Derrick Favors 20:46 ... 6 -12
5 Reserves MP ... PTS +/-
6 Josh Hart 28:10 ... 15 -1
7 Nicolò Melli 19:37 ... 14 +11
8 Kenrich Williams 18:02 ... 3 +11
9 Frank Jackson 13:51 ... 9 +7
10 Jahlil Okafor 12:29 ... 8 -7
11 E'Twaun Moore 12:06 ... 5 -1
12 Nickeil Alexander-Walker 11:55 ... 3 +6
13 Jaxson Hayes Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 122 NaN
[15 rows x 21 columns]
Starters MP ... PTS +/-
0 Kyle Lowry 44:59 ... 22 -1
1 Fred VanVleet 44:21 ... 34 +18
2 Pascal Siakam 38:09 ... 34 +5
3 OG Anunoby 35:48 ... 11 +12
4 Marc Gasol 31:55 ... 6 -2
5 Reserves MP ... PTS +/-
6 Norman Powell 28:38 ... 5 +2
7 Serge Ibaka 26:00 ... 13 +6
8 Terence Davis 15:10 ... 5 0
9 Matt Thomas Did Not Play ... Did Not Play Did Not Play
10 Chris Boucher Did Not Play ... Did Not Play Did Not Play
11 Stanley Johnson Did Not Play ... Did Not Play Did Not Play
12 Malcolm Miller Did Not Play ... Did Not Play Did Not Play
13 Dewan Hernandez Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 130 NaN
[15 rows x 21 columns]
--------------------------------------------------------------------------------
https://www.basketball-reference.com/boxscores/201910220LAC.html
Starters MP ... PTS +/-
0 Anthony Davis 37:22 ... 25 +3
1 LeBron James 36:00 ... 18 -8
2 Danny Green 32:20 ... 28 +7
...and so on.
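If you also want to label each pair of tables, the box score URLs printed by the loop seem to encode the game date and the Home team's abbreviation (TOR in the first URL, where the second table lists Toronto's players). The sketch below rests on that assumption, which is inferred from the example URLs and not from any documented scheme. The name parse_boxscore_url is hypothetical.

```python
import re
from datetime import date

# Assumed pattern: .../boxscores/YYYYMMDD0TTT.html, with TTT taken to be the
# Home team's abbreviation (an inference from the example URLs, not a documented rule).
BOX_URL = re.compile(r'/boxscores/(\d{4})(\d{2})(\d{2})\d([A-Z]{3})\.html$')


def parse_boxscore_url(url):
    """Split a box score URL into (game date, assumed Home team abbreviation)."""
    match = BOX_URL.search(url)
    if match is None:
        raise ValueError(f'unexpected box score URL: {url}')
    year, month, day, home = match.groups()
    return date(int(year), int(month), int(day)), home


game_date, home = parse_boxscore_url(
    'https://www.basketball-reference.com/boxscores/201910220TOR.html'
)
print(game_date, home)  # 2019-10-22 TOR
```

Inside the loop you could call parse_boxscore_url(u2) and attach the result to the Home table, for example with t2.assign(team=home, side='home'), before appending both tables to a list for pd.concat.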