Scrolling to the bottom of a div in puppeteer not working
So I'm trying to scrape all the concerts in the boxed off area in the picture below:
https://i.stack.imgur.com/7QIMM.jpg
The problem is that the list only shows the first 10 entries until you scroll to the bottom of that specific div; it then loads more dynamically until there are no more results. I tried following the answer to the question linked below, but couldn't get it to scroll down and reveal all the concerts:
How to scroll inside a div with Puppeteer?
Here's my basic code:
const browser = await puppeteerExtra.launch({ args: ['--no-sandbox'] });

async function functionName() {
  const page = await browser.newPage();
  await preparePageForTests(page);
  page.once('load', () => console.log('Page loaded!'));
  await page.goto(`https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail`);
  const resultList = await page.waitForSelector(".odIJnf");
  const scrollableSection = await page.waitForSelector("#Q5Vznb"); // I think this is the div that contains all the concert items.
  const results = await page.$$(".odIJnf"); // this needs to be iterable to be used in the for loop
  // this is where I'd like to scroll down the div all the way to the bottom
  for (let i = 0; i < results.length; i++) {
    const result = await (await results[i].getProperty('innerText')).jsonValue();
    console.log(result)
  }
}
Answers
As you mention in your question, when you run page.$$, you get back an array of ElementHandle. From Puppeteer's documentation:
ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$ method.
This means you can iterate over them, but you also have to run evaluate() or $eval() over each element to access the DOM element.
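As a minimal sketch of that pattern, the text of every handle returned by page.$$ could be collected as shown here. The helper name textsOf is made up for illustration, and it only assumes that each handle exposes Puppeteer's evaluate(fn) method:

```javascript
// Hypothetical helper: read the visible text of each ElementHandle.
// `handles` is expected to be the array returned by page.$$(selector).
async function textsOf(handles) {
  const texts = []
  for (const handle of handles) {
    // evaluate() runs the callback inside the page, with the DOM node as its argument
    texts.push(await handle.evaluate(node => node.innerText))
  }
  return texts
}

// Usage inside an async function, assuming `page` is an open Puppeteer page:
// const texts = await textsOf(await page.$$("li"))
```

Awaiting each handle in turn keeps the results in document order; the calls could also be run in parallel with Promise.all if ordering by completion is not a concern.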
I see from your snippet that you are trying to access the parent div that handles the list's scroll event. The problem is that this page seems to use auto-generated classes and ids, which can make your code brittle or stop it from working altogether. It would be better to select the ul, li, and div elements directly.
I've created this snippet that collects at least ITEMS concerts from the site:
const puppeteer = require('puppeteer')

/**
 * Constants
 */
// Environment variables are strings, so convert ITEMS to a number
const ITEMS = Number(process.env.ITEMS) || 50
const URL = process.env.URL || "https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail"

/**
 * Main
 */
main()
  .then(() => console.log("Done"))
  .catch((err) => console.error(err))

/**
 * Functions
 */
async function main() {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] })
  const page = await browser.newPage()
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
  await page.goto(URL)
  const results = await getResults(page)
  console.log(results)
  await browser.close()
}

async function getResults(page) {
  await page.waitForSelector("ul")
  const ul = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]
  const results = []
  const recurse = async () => {
    // Recurse exit clause
    if (ITEMS <= results.length) {
      return
    }
    const $lis = await page.$$("li")
    // Slicing this way will avoid duplicating the result. It also has
    // the benefit of not having to handle the refresh interval until
    // new concerts are available.
    const lis = $lis.slice(results.length)
    for (let li of lis) {
      const result = await li.evaluate(node => node.innerText)
      results.push(result)
    }
    // Move the scroll of the parent-parent-parent div to the bottom
    await div.evaluate(node => node.scrollTo(0, node.scrollHeight))
    await recurse()
  }
  // Start the recursive function
  await recurse()
  return results
}
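Note that the loop above exits only once ITEMS results have been collected, so it would keep scrolling indefinitely if the page offers fewer concerts than that. One way to bound it, sketched here as an iterative variant, is to give up after a few rounds in which nothing new appears. The function name scrollUntilLoaded and its parameters (getCount, scrollToBottom, maxIdleRounds, wait) are assumptions made for this illustration, not part of Puppeteer:

```javascript
// Hypothetical sketch: scroll until `limit` items are present or nothing new loads.
// `getCount`, `scrollToBottom`, and `wait` are injected, so the loop itself does
// not depend on any particular page structure.
async function scrollUntilLoaded({
  getCount,
  scrollToBottom,
  limit,
  maxIdleRounds = 5,
  wait = () => new Promise(resolve => setTimeout(resolve, 500)),
}) {
  let seen = await getCount()
  let idleRounds = 0
  while (seen < limit && idleRounds < maxIdleRounds) {
    await scrollToBottom()
    await wait()
    const now = await getCount()
    // Reset the idle counter whenever new items showed up
    idleRounds = now > seen ? 0 : idleRounds + 1
    seen = now
  }
  return seen
}

// Usage with the handles from getResults (assumed names: page, div, ITEMS):
// await scrollUntilLoaded({
//   getCount: async () => (await page.$$("li")).length,
//   scrollToBottom: () => div.evaluate(node => node.scrollTo(0, node.scrollHeight)),
//   limit: ITEMS,
// })
```

Once it returns, the li elements can be read in a single pass. The 500 ms pause and the five idle rounds are arbitrary defaults and would likely need tuning for a real page.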
By studying the page structure, we see that the ul for the list is nested three levels below the div that handles the scroll. We also know that there are only two uls on the page, and the first is the one we want. That is what these lines rely on:
const ul = (await page.$$("ul"))[0]
const div = (await ul.$x("../../.."))[0]
The $x function evaluates the XPath expression relative to the element handle it is called on (here, the ul)*. The expression ../../.. therefore walks three levels up the DOM tree to the div that we need. We then run a recursive function until we have collected the items that we want.
* Per the docs for elementHandle.$x.
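Hard-coding ../../.. ties the script to the page's current nesting depth. One alternative, sketched here under the assumption that the scroll container is the nearest ancestor whose content overflows its visible box, is to walk up from the ul inside the page. The function name findScrollableAncestor is hypothetical, and a stricter check might also inspect the computed overflow-y style:

```javascript
// Hypothetical helper, meant to run in the page context: returns the nearest
// ancestor whose content is taller than its visible box, or null if there is none.
function findScrollableAncestor(node) {
  let current = node.parentElement
  while (current) {
    if (current.scrollHeight > current.clientHeight) {
      return current
    }
    current = current.parentElement
  }
  return null
}

// Usage, assuming `ul` is the ElementHandle from getResults:
// const div = await ul.evaluateHandle(findScrollableAncestor)
```

The trade-off is that this relies on layout rather than structure, so it is worth logging which element it picks before trusting it on a given page.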