使用beautifulsoup从维基百科表中获取列

Mangs

0人浏览 · 2022-08-28 09:31:08

Mangs · 2022-08-28 09:31:08 发布

问题:使用beautifulsoup从维基百科表中获取列

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")

我正在尝试从Taylor Swift 的唱片的“单曲列表”表中获取歌曲名称列表

该表没有唯一的类或 ID。我能想到的唯一独特之处是“单曲列表...”周围的标题标签

主要艺人单曲列表,包括选定的排行榜位置、销售数据和证书

我试过了:

table = soup.find_all("caption")

但它什么也没返回,我假设标题不是 bs4 中的可识别标签?

解答

它实际上与findAll()和find_all()无关。findAll()在BeautifulSoup3中使用,并留在BeautifulSoup4中_出于兼容性原因_,引用自bs4的源代码:

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)

findAll = find_all       # BS3

而且,还有一种更好的方法来获取单曲列表,它依赖于带有id="Singles"的span元素,它表示Singles段落的开始。然后,使用find_next_sibling()得到span标签的父标签之后的第一个表。然后,用scope="row"获取所有th元素:

from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content)

table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)

印刷:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"

Python

Python社区为您提供最前沿的新闻资讯和知识内容

更多推荐

求助！为什么用InsCode部署会出现无限重定向？

Python

如何重塑熊猫。系列

问题:如何重塑熊猫。系列在我看来,它就像 pandas.Series 中的一个错误。 a = pd.Series([1,2,3,4]) b = a.reshape(2,2) b b 有类型 Series 但无法显示,最后一条语句给出异常,非常冗长,最后一行是“TypeError: %d format: a number is required, not numpy.ndarray”。 b.sha

Python

在哪里可以找到有关 Keras 中默认权重初始化器的文档? [复制]

问题:在哪里可以找到有关 Keras 中默认权重初始化器的文档? [复制] 我刚刚在这里](https://keras.io/initializers/)中阅读了有关[中的 Keras 权重初始化器的信息。在文档中,只介绍了不同的初始化程序。如: model.add(Dense(64, kernel_initializer='random_normal')) 当我没有指定kernel_initia