BeautifulSoup and Pandas read_html is not pulling all of the rows in a table
When I am scraping a table from a website, it is missing the bottom 5 rows of data and I do not know how to pull them. I am using a combination of BeautifulSoup and Selenium. I thought that they were not loading, so I tried scrolling to the bottom with Selenium, but that still did not work.
Code trials:
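import bs4 as bs
import pandas as pd
from selenium import webdriver

site = 'https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One'
PATH = my_path  # path to the ChromeDriver executable
driver = webdriver.Chrome(PATH)
driver.get(site)
# Parse the rendered page source and locate the schedule table by its class list
webpage = bs.BeautifulSoup(driver.page_source, features='html.parser')
table = webpage.find('table', {'class': 'stats_table sortable min_width now_sortable'})
print(table.prettify())
# read_html returns a list of DataFrames; take the first one
df = pd.read_html(str(table))[0]
print(df.tail())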
Please could you help with scraping the full table?
Answers
To pull all the rows from the table using only Selenium, induce WebDriverWait for visibility_of_element_located(), take the table's outerHTML, and pass it to read_html() from Pandas to build the DataFrame. You can use either of the following Locator Strategies:
- Using CSS_SELECTOR:
    tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.stats_table.sortable.min_width.now_sortable"))).get_attribute("outerHTML")
    tabledf = pd.read_html(tabledata)
    print(tabledf)
- Using XPATH:
    driver.get('https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One')
    data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='stats_table sortable min_width now_sortable']"))).get_attribute("outerHTML")
    df = pd.read_html(data)
    print(df)
- Note: You have to add the following imports:
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
- Console Output:
    [              Round   Wk  Day  ...             Referee  Match Report                         Notes
    0    Regular Season    1  Sat  ...  Charles Breakspear  Match Report                           NaN
    1    Regular Season    1  Sat  ...       Andrew Davies  Match Report                           NaN
    2    Regular Season    1  Sat  ...       Kevin Johnson  Match Report                           NaN
    3    Regular Season    1  Sat  ...   Anthony Backhouse  Match Report                           NaN
    4    Regular Season    1  Sat  ...        Marc Edwards  Match Report                           NaN
    ..              ...  ...  ...  ...                 ...           ...                           ...
    685     Semi-finals  NaN  Tue  ...       Robert Madley  Match Report                    Leg 1 of 2
    686     Semi-finals  NaN  Wed  ...         Craig Hicks  Match Report                    Leg 1 of 2
    687     Semi-finals  NaN  Fri  ...        Keith Stroud  Match Report     Leg 2 of 2; Blackpool won
    688     Semi-finals  NaN  Sat  ...   Michael Salisbury  Match Report  Leg 2 of 2; Lincoln City won
    689           Final  NaN  Sun  ...     Tony Harrington  Match Report                           NaN

    [690 rows x 13 columns]]
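To confirm that no rows were dropped, one option is to count the body rows in the outerHTML string and compare that number with the length of the resulting DataFrame. The following is only a minimal sketch using the Python standard library. BodyRowCounter and count_body_rows are hypothetical helper names, and the sample markup is made up for illustration rather than taken from the site.

```python
from html.parser import HTMLParser


class BodyRowCounter(HTMLParser):
    """Count <tr> elements that appear inside a <tbody> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


def count_body_rows(table_html):
    """Return the number of <tr> rows found inside <tbody> in an HTML string."""
    parser = BodyRowCounter()
    parser.feed(table_html)
    parser.close()
    return parser.rows


# Made-up sample table: one header row and two body rows
sample = (
    "<table><thead><tr><th>Wk</th><th>Day</th></tr></thead>"
    "<tbody><tr><td>1</td><td>Sat</td></tr>"
    "<tr><td>2</td><td>Tue</td></tr></tbody></table>"
)
print(count_body_rows(sample))  # 2
```

With the snippets above, count_body_rows(tabledata) could be compared with len(tabledf[0]). The two numbers would be expected to agree when every body row is a data row. A lower DataFrame count would suggest rows were lost between the page and the parser.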