Scrolling to the bottom of a div in puppeteer not working

Mangs · posted 2022-09-08 02:19:37

Question

I'm trying to scrape all the concerts in the boxed-off area in the picture below:

https://i.stack.imgur.com/7QIMM.jpg

The problem is that the list only shows the first 10 results until you scroll to the bottom of that specific div. It then loads more dynamically, and keeps doing so until there are no more results. I tried following the answer to the question linked below, but couldn't get the div to scroll far enough to show all the concerts:

How to scroll inside a div with Puppeteer?

Here's my basic code:

const browser = await puppeteerExtra.launch({ args: [                
    '--no-sandbox'                                                  
    ]});

async function functionName() {
    const page = await browser.newPage();
    await preparePageForTests(page);
    page.once('load', () => console.log('Page loaded!'));
    await page.goto(`https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail`);   

    const resultList = await page.waitForSelector(".odIJnf"); 
    const scrollableSection = await page.waitForSelector("#Q5Vznb");    //I think this is the div that contains all the concert items.
    const results = await page.$$(".odIJnf");  //this needs to be iterable to be used in the for loop

//this is where I'd like to scroll down the div all the way to the bottom

    for (let i = 0; i < results.length; i++) {
      const result = await (await results[i].getProperty('innerText')).jsonValue();
      console.log(result)
    }
}

Answers

As you mention in your question, page.$$ returns an array of ElementHandle objects. From Puppeteer's documentation:

ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$ method.

This means you can iterate over them, but to read anything from the underlying DOM element you have to call evaluate() or $eval() on each handle (getProperty(), as in your loop, also works).
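As a minimal sketch of that point, the helper below calls evaluate() on each handle and collects the text. collectInnerText is a hypothetical name, not part of Puppeteer; it only assumes that each handle has an evaluate(fn) method, which ElementHandle provides.

```javascript
// Hypothetical helper: read innerText from every ElementHandle in an array.
// evaluate(fn) runs fn inside the page with the DOM node as its argument
// and resolves to the (serializable) return value.
async function collectInnerText(handles) {
  const texts = []
  for (const handle of handles) {
    texts.push(await handle.evaluate(node => node.innerText))
  }
  return texts
}
```

With the selector from the question this would be called as await collectInnerText(await page.$$(".odIJnf")). page.$$eval(".odIJnf", nodes => nodes.map(n => n.innerText)) should do the same work in a single call.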

I see from your snippet that you are trying to access the parent div that handles the list's scroll event. The problem is that this page seems to use auto-generated classes and ids, which can change at any time and make your code brittle or break it outright. It is safer to select by structure instead: the ul, li, and div elements directly.

I've created this snippet, which collects at least ITEMS concerts from the site (50 by default):

const puppeteer = require('puppeteer')

/**
 * Constants
 */
const ITEMS = Number(process.env.ITEMS) || 50
const URL   = process.env.URL     || "https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail"

/**
 * Main
 */
main()
  .then( ()    => console.log("Done"))
  .catch((err) => console.error(err))

/**
 * Functions
 */
async function main() {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] })
  const page = await browser.newPage()
  
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
  await page.goto(URL)
 
  const results = await getResults(page)
  console.log(results)
  
  await browser.close()
}

async function getResults(page) {
  await page.waitForSelector("ul")
  const ul  = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]
  const results = []
  
  const recurse = async (idleRounds = 0) => {
    // Exit once we have enough items, or when several consecutive
    // scrolls produced nothing new (the list is probably exhausted).
    if (results.length >= ITEMS || idleRounds >= 20) {
      return
    }

    const $lis = await page.$$("li")
    // Skip the items already collected so results are not duplicated.
    const lis = $lis.slice(results.length)
    for (let li of lis) {
      const result = await li.evaluate(node => node.innerText)
      results.push(result)
    }
    // Move the scroll of the parent-parent-parent div to the bottom
    await div.evaluate(node => node.scrollTo(0, node.scrollHeight))
    // Give the page a moment to load more items before checking again
    await new Promise(resolve => setTimeout(resolve, 250))
    await recurse(lis.length === 0 ? idleRounds + 1 : 0)
  }
  // Start the recursive function
  await recurse()
 
  return results
}
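After each scroll the page needs some time to load more items, so it can help to wait explicitly until the number of li elements grows. The helper below is a rough sketch of that idea in plain JavaScript. waitForCountToGrow, timeoutMs, and intervalMs are hypothetical names, and the default values are arbitrary assumptions, not tuned for this page.

```javascript
// Hypothetical helper: poll getCount() until it returns more than `previous`,
// or give up once timeoutMs has passed. Resolves to the latest count either way.
async function waitForCountToGrow(getCount, previous, { timeoutMs = 3000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs
  let count = await getCount()
  while (count <= previous && Date.now() < deadline) {
    // Sleep for one polling interval before asking again
    await new Promise(resolve => setTimeout(resolve, intervalMs))
    count = await getCount()
  }
  return count
}
```

Inside getResults it could be called right after the scrollTo line, for example as await waitForCountToGrow(async () => (await page.$$("li")).length, $lis.length). If the returned count equals $lis.length, the list has stopped growing and the recursion can end. Puppeteer's page.waitForFunction is a built-in alternative for the same kind of wait.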

Studying the page structure shows that the ul holding the list sits three levels below the div that handles the scroll. The page has only two ul elements, and the first is the one we want. These lines select both:

  const ul  = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]

Called on an ElementHandle, $x evaluates the XPath expression relative to that element*, so "../../.." walks up three levels from the ul to the div that handles the scroll. We then run a recursive function that collects the items and scrolls the div until we have as many as we want.

* Taken from the docs.
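The same ancestor lookup can be sketched without XPath. This may be useful because, as far as I know, newer Puppeteer releases no longer provide $x and expect selectors with an xpath/ prefix instead; treat that as an assumption to check against your installed version. nthAncestor is a hypothetical helper name; it relies only on evaluateHandle, which passes extra arguments through to the callback running in the page.

```javascript
// Hypothetical helper: return a handle to the ancestor `levels` steps above
// the element behind `handle`. The callback runs inside the page; if the
// tree is shallower than `levels`, it returns null.
async function nthAncestor(handle, levels) {
  return handle.evaluateHandle((node, n) => {
    let current = node
    for (let i = 0; i < n && current; i++) {
      current = current.parentElement
    }
    return current
  }, levels)
}
```

In the snippet above, const div = await nthAncestor(ul, 3) would stand in for the ul.$x("../../..") line. evaluateHandle resolves to a JSHandle, so calling asElement() on the result and checking for null is a reasonable guard before scrolling.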
