Why does parsing a table with Python's BeautifulSoup not work as intended on this website?

Mangs · posted 2022-08-24 20:48:14

Answer a question

I am stuck on this website. Over the past week I've written some small scripts to learn BeautifulSoup, researched how to use it, and read the official documentation. I've also gone through tutorials and videos on how to parse a table from a website. I've successfully parsed data from tables using soup.find() and soup.select() on several websites, such as:

Games engine website
MLB stats website
Wikipedia

For example, for the MLB stats website I used the following code:

from urllib.request import urlopen as ureq
from bs4 import BeautifulSoup as bs
    
def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return(soup)


soup = connection('https://baseballsavant.mlb.com/team/146')
   
table = soup.findAll("div", {"class": "table-savant"})  # using soup.findAll()
#table = soup.select("div.table-savant")  # alternative: using soup.select()

for n in range(len(table)):
    if n == 9:
        break  # stop after the first nine tables
    content = table[n]
    columns = content.find("thead").find_all("th")
    column_names = [str(c.string).strip() for c in columns]
    table_rows = soup.findAll("tbody")[n].find_all("tr")
    rows = []
    for tr in table_rows:
        cells = tr.find_all("td")
        # one list of stripped cell texts per table row
        row = [str(cell.text).strip() for cell in cells]
        rows.append(row)
    print(rows)

I then convert the results into a data frame. But there is one particular page whose tables I cannot retrieve. I've tried just printing the content with findAll():

def connection(url):
    uclient = ureq(url)
    page_html = uclient.read()
    uclient.close()
    soup = bs(page_html, "html.parser")
    return(soup) 

soup = connection('https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4')
   
table = soup.findAll("div", {"class": "table-savant"})  # using soup.findAll()
print(table)

result: []

With select():

table = soup.select("div.table-savant") 
print(table)

result: []

With select() using CSS path from this post:

table = soup.select('#preview > div:nth-of-type(1) > div:nth-of-type(2) > div:nth-of-type(3) > table:nth-of-type(1) > tbody:nth-of-type(2) > tr:nth-of-type(2) > td:nth-of-type(3)')
print(table)
    
result: []

I want to retrieve the players' stats, but I'm lost. Any suggestions would be highly appreciated. Thank you.

Answers

Problem: The page uses JavaScript to fetch and display its content, so you cannot rely on urllib, requests, or similar libraries alone, because they only download the initial HTML and do not execute JavaScript.
Solution: use Selenium to load the page in a real browser, then parse the rendered content with BeautifulSoup.
Sample code:

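from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4'
d = webdriver.Chrome()
d.get(url)  # the browser executes the page's JavaScript
soup = BeautifulSoup(d.page_source, "html.parser")
d.quit()  # close the browser once the HTML has been captured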
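
One thing to watch for: page_source may be read before the page's JavaScript has finished inserting the tables. Below is a slightly fuller sketch that waits explicitly for the element first. It assumes the rendered page really does use the div.table-savant class from the question (not verified here), and the helper names fetch_rendered_html and parse_tables are made up for illustration.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_selector="div.table-savant", timeout=15):
    """Load url in Chrome and return the HTML once css_selector is present."""
    # Selenium is imported lazily so parse_tables() below can be used
    # without a browser or chromedriver installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element;
        # raises TimeoutException if nothing appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_tables(html, css_selector="div.table-savant"):
    """Return one (column_names, rows) pair per element matching css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for container in soup.select(css_selector):
        thead = container.find("thead")
        headers = []
        if thead is not None:
            headers = [th.get_text(strip=True) for th in thead.find_all("th")]
        rows = []
        for tbody in container.find_all("tbody"):
            for tr in tbody.find_all("tr"):
                cells = [td.get_text(strip=True) for td in tr.find_all("td")]
                if cells:  # skip rows that contain no <td> cells
                    rows.append(cells)
        tables.append((headers, rows))
    return tables


# Usage (needs Chrome and a matching driver):
# html = fetch_rendered_html(
#     "https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4"
# )
# for headers, rows in parse_tables(html):
#     print(headers, rows)
```

Keeping the parsing separate from the browser step means the same parse_tables call should also work on HTML fetched with urlopen for pages that do not need JavaScript.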

To use webdriver.Chrome you will also have to download chromedriver from here and put the executable in the same folder as your project or somewhere on your PATH.
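
Before setting up a browser, it can be worth confirming that JavaScript is really the cause. A rough check, using only the standard library, is to count how many tags in the raw (unrendered) HTML carry the class you are selecting on. The names ClassCounter, count_class and count_class_at below are hypothetical helpers for this sketch.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many tags in the html string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


def count_class_at(url, class_name):
    """Download url without running any JavaScript and count class_name."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return count_class(html, class_name)


# Usage: compare a page that parses fine with the one that does not.
# print(count_class_at("https://baseballsavant.mlb.com/team/146", "table-savant"))
# print(count_class_at(
#     "https://baseballsavant.mlb.com/preview?game_pk=634607&game_date=2021-4-4",
#     "table-savant",
# ))
```

If the count is zero for the raw HTML but the elements show up in the browser's inspector, the content is most likely being added by JavaScript after the page loads, which is the situation the answer describes.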
