Answer a question

I am stuck on this website. Over the past week I've written some small scripts to learn BeautifulSoup, read the official documentation, and gone through tutorials and videos on how to parse tables from websites. I've parsed table data with the methods soup.find() and soup.select() from several websites, such as:

  1. Games engine website
  2. MLB stats website
  3. Wikipedia

For example, for the MLB stats website I used the following code:

from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as bs
    
def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return(soup)


soup = connection('https://baseballsavant.mlb.com/team/146')
   
table = soup.findAll("div", {"class": "table-savant"})  # using soup.findAll()
#table = soup.select("div.table-savant")  # alternative: using soup.select()

for n in range(len(table)):
    if (n==9): break 
    content = table[n]
    columns = content.find("thead").find_all("th")    
    column_names = [str(c.string).strip() for c in columns] 
    table_rows = soup.findAll("tbody")[n].find_all("tr")
    l = [] 
    for tr in table_rows:
        td = tr.find_all("td")
        row = [str(cell.text).strip() for cell in td]  # 'cell' avoids shadowing the outer 'tr'
        l.append(row)
    print(l) 

I then convert them into a data frame. But there is one particular website whose tables I cannot retrieve. I've tried just printing the content with find():

def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return(soup) 

soup = connection('https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4')
   
table = soup.findAll("div", {"class": "table-savant"})  # using soup.findAll()
print(table)

result: []

With select():

table = soup.select("div.table-savant") 
print(table)

result: []

With select(), using a CSS path from this post:

table = soup.select('#preview > div:nth-of-type(1) > div:nth-of-type(2) > div:nth-of-type(3) > table:nth-of-type(1) > tbody:nth-of-type(2) > tr:nth-of-type(2) > td:nth-of-type(3)')
print(table)
    
result: []

I want to retrieve the stats from the players, but I'm lost. Any suggestion will be highly appreciated. Thank you.

Answers

Problem: The page uses JavaScript to fetch and display its content, so you cannot just use urllib, requests, or similar libraries: they download the initial HTML but never execute the JavaScript that fills in the tables.
Solution: Use Selenium to load the page in a real browser, then parse the rendered content with BeautifulSoup.
Sample code:

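from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4'
d = webdriver.Chrome()
d.get(url)  # the browser loads the page and runs its JavaScript
soup = BeautifulSoup(d.page_source, "html.parser")  # parse the rendered HTML
d.quit()  # close the browser once the HTML has been captured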
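One caveat with the sample above: page_source can be read before the page's JavaScript has finished inserting the tables, in which case the result is still empty. A minimal sketch of one way around that is to wait explicitly for an element before parsing. The helper names render_page and extract_rows are made up for this illustration, and the selector "div.table-savant" is only an assumption carried over from the team page in the question; the preview page may use different class names.

```python
from bs4 import BeautifulSoup


def render_page(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the HTML once css_selector appears."""
    # Selenium is imported here so the parsing helper below works without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until a matching element exists; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait fails


def extract_rows(html, css_selector):
    """Return the cell text of every body row inside elements matching css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for container in soup.select(css_selector):
        for tr in container.select("tbody tr"):
            rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


# Offline check on a made-up snippet; the names and numbers are placeholders.
sample = """
<div class="table-savant"><table>
  <thead><tr><th>Player</th><th>PA</th></tr></thead>
  <tbody><tr><td> Player A </td><td>10</td></tr></tbody>
</table></div>
"""
print(extract_rows(sample, "div.table-savant"))  # [['Player A', '10']]
```

Against the live page, the two helpers would be combined as extract_rows(render_page(url, "div.table-savant"), "div.table-savant"). If the selector does not match anything on that page, the wait times out instead of silently returning an empty list, which makes a wrong selector easier to spot.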

To use webdriver.Chrome you will also have to download chromedriver from here and put the executable in the same folder as your project or somewhere on your PATH.
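To check whether some other page has the same problem before setting up a browser, one rough heuristic is to download the raw HTML and look for the class name you intend to select. The sketch below assumes the marker "table-savant" from the question; the function names are made up for this illustration, and the two snippets are invented stand-ins, not real page content.

```python
from urllib.request import urlopen


def fetch_raw_html(url):
    """Download the HTML exactly as the server sends it (no JavaScript is run)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def marker_in_html(html, marker="table-savant"):
    """Return True if the marker text occurs anywhere in the raw HTML."""
    return marker in html


# Offline demonstration with two made-up snippets.
static_page = '<div class="table-savant"><table></table></div>'
js_page = '<div id="preview"></div><script src="app.js"></script>'
print(marker_in_html(static_page))  # True
print(marker_in_html(js_page))      # False
```

For a live check, call marker_in_html(fetch_raw_html(url)) with the URL in question. A False result is consistent with find() and select() returning empty lists. A True result is weaker evidence, since the marker could also appear inside a script or stylesheet rather than in rendered markup.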
