Python Web Scraper/Crawler - HTML Tables to Excel Spreadsheet

Mangs · Posted 2022-08-24 20:11:07

Answer a question

I'm trying to make a web scraper that will pull tables from a website and then paste them into an Excel spreadsheet. I'm an EXTREME beginner at Python (and coding in general) - literally started learning a couple of days ago.

So, how do I make this web scraper/crawler? Here's the code that I have:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.techpowerup.com/gpudb/?mobile=0&released%5B%5D=y14_c&released%5B%5D=y11_14&generation=&chipname=&interface=&ushaders=&tmus=&rops=&memsize=&memtype=&buswidth=&slots=&powerplugs=&sort=released&q='
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'processors'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./GPU.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Product Name", "GPU Chip", "Released", "Bus", "Memory", "GPU clock", "Memory clock", "Shaders/TMUs/ROPs"])
writer.writerows(list_of_rows)

Now the program WORKS for the website present in the code above.

Now, I want to scrape the tables from the following website: https://www.techpowerup.com/gpudb/2990/radeon-rx-560d

Note that there are several tables on this page. What should I add/change to get the program to work on this page? I'm trying to get all of the tables, but if anyone could help me even get one of them, I would appreciate it so much!

Answers

Essentially, you just need to modify the code in your question to account for the fact that the page has several tables!

What is really neat (or, dare I say, beautiful) about BeautifulSoup (BS) is the findAll method! It returns a list-like collection of every matching element, which you can iterate over. (In bs4, find_all is the preferred spelling; findAll still works as an alias.)

So, say you have 5 tables in your source. You could run tables = soup.findAll("table"), which would return a list of every table element in the page's source. You could then iterate over that list and pull information out of each respective table.
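To see that behaviour in isolation, here is a minimal sketch on a made-up HTML snippet. The markup and values are placeholders I invented, not copied from the TechPowerUp page, and I use Python's built-in "html.parser" so the sketch does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Made-up markup with two tables, standing in for a real page.
html = """
<table><tr><th>Name</th><th>Value</th></tr>
       <tr><td>Bus</td><td>PCIe 3.0 x8</td></tr></table>
<table><tr><th>Name</th><th>Value</th></tr>
       <tr><td>Memory</td><td>4 GB</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")  # list-like result, one entry per <table>

all_tables = []
for table in tables:
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row of each table
        rows.append([cell.get_text(strip=True) for cell in row.find_all("td")])
    all_tables.append(rows)

print(len(tables))
print(all_tables)
```

Each entry of all_tables is the list of rows for one table, so the tables stay separate instead of being mixed into one long list.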

Your code could look something like this:

import csv
import requests
import bs4

url = 'https://www.techpowerup.com/gpudb/2990/radeon-rx-560d'
response = requests.get(url)
html = response.content

soup = bs4.BeautifulSoup(html, "lxml")

tables = soup.findAll("table")

tableMatrix = []
for table in tables:
    #Here you can do whatever you want with the data! You can findAll table row headers, etc...
    list_of_rows = []
    for row in table.findAll('tr')[1:]:
        list_of_cells = []
        for cell in row.findAll('td'):
            # bs4 decodes &nbsp; to the character '\xa0', so remove that instead
            text = cell.text.replace('\xa0', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append(list_of_rows)
print(tableMatrix)

This code works, though note that I did not add any of the CSV output that the original code had! You'll have to redesign that however it works for you. I left a comment at the spot where you are free to do whatever you please with each table in the source. You could findAll("th") elements in each table object and populate your CSV file from those, or you could extract the information from the cells themselves. Right now I collect each table's rows in a list and append that list to tableMatrix, so tableMatrix ends up holding one list of rows per table.
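As for the CSV part I left out, here is one rough way to put it back, assuming you end up with a list that holds one list of rows per table. The function name write_tables_to_csv, the table_N.csv file names, and the sample data are all my own invention for illustration. It also assumes Python 3, where the csv module wants the file opened in text mode with newline="" rather than "wb":

```python
import csv
import os
import tempfile


def write_tables_to_csv(tables, out_dir, prefix="table"):
    """Write each table (a list of rows) to its own CSV file; return the paths."""
    paths = []
    for index, rows in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"{prefix}_{index}.csv")
        # newline="" lets the csv module control line endings (Python 3).
        with open(path, "w", newline="", encoding="utf-8") as outfile:
            csv.writer(outfile).writerows(rows)
        paths.append(path)
    return paths


# Made-up data shaped like the scraper's result: one list of rows per table.
sample_tables = [
    [["Bus Interface", "PCIe 3.0 x8"]],
    [["Memory Size", "4 GB"], ["Memory Type", "GDDR5"]],
]

out_dir = tempfile.mkdtemp()  # temporary folder, just for this demo
paths = write_tables_to_csv(sample_tables, out_dir)
print(paths)
```

Excel opens CSV files directly, so one file per table is probably the simplest way to keep the tables apart. If you would rather have everything in one file, you could write a blank row between tables instead.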

I hope this helps you on your Python and BeautifulSoup adventure!

