BeautifulSoup: Is there a way to set the starting point of find_all() method?

Mangs · Posted 2022-08-25 01:18:11

Question

Given a soup I need to get n elements with class="foo".

This can be done by:

soup.find_all(class_='foo', limit=n)

However, this is a slow process, as the elements I'm trying to find are located at the very bottom of the document.

Here is my code:

    main_num = 1
    main_page = 'https://rawdevart.com/search/?page={p_num}&ctype_inc=0'
    # get_soup returns bs4 soup of a link
    main_soup = get_soup(main_page.format(p_num=main_num))
    
    # get_last_page returns the number of pages which is 64
    last_page_num = get_last_page(main_soup) 
    for sub_num in range(1, last_page_num+1):
        sub_soup = get_soup(main_page.format(p_num=sub_num))
        arr_links = sub_soup.find_all(class_='head')
        # process arr_links

Answers

On this page, head is a class on the a tags, so I assume you want to collect every follow link while moving through all the search pages.

Here's how you might want to get that done:

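import requests
from bs4 import BeautifulSoup

base_url = "https://rawdevart.com"

# Fetch the first results page; the third whitespace-separated token of the
# <small class="d-block text-muted"> text is taken as the total page count.
total_pages = BeautifulSoup(
    requests.get(f"{base_url}/search/?page=1&ctype_inc=0").text,
    "html.parser",
).find(
    "small",
    class_="d-block text-muted",
).get_text().split()[2]

# Build the URL of every search results page.
pages = [
    f"{base_url}/search/?page={n}&ctype_inc=0"
    for n in range(1, int(total_pages) + 1)
]

all_follow_links = []

# Only the first two pages are fetched here to keep the demo short.
for page in pages[:2]:
    r = requests.get(page).text
    all_follow_links.extend(
        [
            f'{base_url}{a["href"]}' for a in
            BeautifulSoup(r, "html.parser").find_all("a", class_="head")
        ]
    )

# Print one link per line.
print("\n".join(all_follow_links))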

Output:

https://rawdevart.com/comic/my-death-flags-show-no-sign-ending/
https://rawdevart.com/comic/tsuki-ga-michibiku-isekai-douchuu/
https://rawdevart.com/comic/im-not-a-villainess-just-because-i-can-control-darkness-doesnt-mean-im-a-bad-person/
https://rawdevart.com/comic/tensei-kusushi-wa-isekai-wo-meguru/
https://rawdevart.com/comic/iceblade-magician-rules-over-world/
https://rawdevart.com/comic/isekai-demo-bunan-ni-ikitai-shoukougun/
https://rawdevart.com/comic/every-class-has-been-mass-summoned-i-strongest-under-disguise-weakest-merchant/
https://rawdevart.com/comic/isekai-onsen-ni-tensei-shita-ore-no-kounou-ga-tondemosugiru/
https://rawdevart.com/comic/kubo-san-wa-boku-mobu-wo-yurusanai/
https://rawdevart.com/comic/gabriel-dropout/
and more ...
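
The loop above builds a full tree for every page even though only the a tags with class head are used. If that turns out to be the slow part, one thing that may help is Beautiful Soup's SoupStrainer, passed as parse_only, so that only matching tags are kept while parsing. The sketch below is a minimal illustration on a made-up HTML snippet (the markup and the variable names are assumptions, not the real page), and whether it is noticeably faster on the real site has not been measured here.

```python
from bs4 import BeautifulSoup, SoupStrainer

base_url = "https://rawdevart.com"

# Hypothetical stand-in for one downloaded search page.
html = """
<html><body>
  <nav><a class="menu" href="/home">Home</a></nav>
  <div class="list">
    <a class="head" href="/comic/one/">One</a>
    <a class="head" href="/comic/two/">Two</a>
  </div>
</body></html>
"""

# Keep only <a class="head"> tags while parsing; everything else is skipped.
only_head_links = SoupStrainer("a", attrs={"class": "head"})
soup = BeautifulSoup(html, "html.parser", parse_only=only_head_links)

links = [f'{base_url}{a["href"]}' for a in soup.find_all("a", class_="head")]
print("\n".join(links))
```

In the loop, the same parse_only argument could be added to the BeautifulSoup(r, "html.parser") call.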

Note: to get all the pages just remove the slicing from this line:

for page in pages[:2]:
    # the rest of the loop body

So it looks like this:

for page in pages:
    # the rest of the loop body
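
As for the question in the title: find_all() itself has no "start here" argument, but it can be called on any tag rather than on the whole soup, and find_all_next() searches forward in document order from a given element. The following is a minimal sketch on a made-up document; the id "results" and the snippet itself are assumptions for illustration only. Note that locating the starting element still means searching from the top, and parsing the page is often the larger cost, so the gain may be small.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document: the wanted "foo" elements sit near the bottom.
html = """
<html><body>
  <div id="top"><p class="foo">skip me</p><p>intro</p></div>
  <div id="results">
    <a class="foo" href="/a">A</a>
    <a class="foo" href="/b">B</a>
    <a class="foo" href="/c">C</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
n = 2

# Option 1: search only inside a known container close to the targets.
container = soup.find(id="results")
inside = container.find_all(class_="foo", limit=n)

# Option 2: search forward in document order, starting at a chosen element.
after = container.find_all_next(class_="foo", limit=n)

inside_hrefs = [tag["href"] for tag in inside]
after_hrefs = [tag["href"] for tag in after]
print(inside_hrefs)
print(after_hrefs)
```

Both lists skip the "foo" paragraph in the top div, because the search starts at the results container instead of at the document root.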
