Why does parsing a table with Python's BeautifulSoup not work as intended on this website?
I am stuck on this website. Over the past week I've written some small scripts to learn BeautifulSoup, researched how to use it, and read the official documentation. I've also gone through tutorials and videos on how to parse a table from a website. I've parsed data from tables using the methods soup.find() and soup.select() on several websites, such as:
- Games engine website
- MLB stats website
- Wikipedia
For example, for the MLB stats website I used the following code:
from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as bs

def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return soup

soup = connection('https://baseballsavant.mlb.com/team/146')
table = soup.findAll("div", {"class": "table-savant"})  # <-- using method soup.find()
# table = soup.select("div.table-savant")  # <-- using method soup.select()
for n in range(len(table)):
    if n == 9:
        break
    content = table[n]
    columns = content.find("thead").find_all("th")
    column_names = [str(c.string).strip() for c in columns]
    table_rows = soup.findAll("tbody")[n].find_all("tr")
    l = []
    for tr in table_rows:
        td = tr.find_all("td")
        row = [str(cell.text).strip() for cell in td]
        l.append(row)
    print(l)
I then convert the results into a data frame. But there is one particular website whose tables I cannot retrieve. I've tried just printing the content with find():
def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return soup

soup = connection('https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4')
table = soup.findAll("div", {"class": "table-savant"})  # <-- using method soup.find()
print(table)
result: []
With select():
table = soup.select("div.table-savant")
print(table)
result: []
With select() using CSS path from this post:
table = soup.select('#preview > div:nth-of-type(1) > div:nth-of-type(2) > div:nth-of-type(3) > table:nth-of-type(1) > tbody:nth-of-type(2) > tr:nth-of-type(2) > td:nth-of-type(3)')
print(table)
result: []
I want to retrieve the stats from the players, but I'm lost. Any suggestion will be highly appreciated. Thank you.
Answers
Problem: The page uses JavaScript to fetch and display its content, so you cannot just use urllib, requests, or similar libraries: they only download the initial HTML and never execute the JavaScript code.
Solution: Use Selenium to load the page, then parse the rendered content with BeautifulSoup.
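One rough way to check this for yourself is to count how many elements in the raw, unrendered HTML match the selector you are after. The sketch below is only an illustration: the helper names count_matches and fetch_static_html are made up for this example and are not part of BeautifulSoup or urllib.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def count_matches(html, selector):
    """Return how many elements in the given HTML match a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


def fetch_static_html(url):
    """Download the HTML exactly as the server sends it (no JavaScript is run)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def report(url, selector="table"):
    """Print how many elements match the selector in the unrendered HTML."""
    html = fetch_static_html(url)
    print(selector, "matches in static HTML:", count_matches(html, selector))
```

Calling report() with the preview URL from the question and a selector such as "div.table-savant" should print 0 if the tables really are added by JavaScript after the page loads, which would be consistent with the empty lists shown in the question.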
Sample code here:
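from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4'
d = webdriver.Chrome()
d.get(url)  # the browser executes the page's JavaScript
soup = BeautifulSoup(d.page_source, "html.parser")
d.quit()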
To use webdriver.Chrome you will also have to download chromedriver from here and put the executable in your project folder or somewhere on your PATH.
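Because the content is loaded asynchronously, page_source may be read before the tables exist. A slightly fuller sketch, under some assumptions, waits for an element to appear and then turns every table into a list of rows. The function names render_page and extract_tables are hypothetical, and so is the default "table" selector: I have not checked the rendered markup of this page, so adjust the selector to whatever the browser's inspector shows.

```python
from bs4 import BeautifulSoup


def render_page(url, wait_for="table", timeout=15):
    """Load url in Chrome, wait for an element matching wait_for, return the HTML."""
    # Imported here so the parsing helper below also works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until at least one matching element is present in the DOM
        # (raises TimeoutException if none appears within `timeout` seconds).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_tables(html):
    """Return every <table> in html as a list of rows (lists of cell strings)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables
```

With these, extract_tables(render_page(url)) would give one list of rows per table, header row first when the table has one, which can then be passed on to a data frame as in the question.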