Puppeteer.js - scraping list of domains and storing data
I have some problems scraping a list of URLs and storing the data from them in an array. I think my main problem is with the Puppeteer context versus the browser context.
I want to check anchor texts, but only for anchors whose href attribute contains a specific string.
Steps:
- Initialize the Chromium settings
- Loop over the array of URLs with a for loop
- Inside the loop, select all anchors pointing to such URLs and scrape their anchor texts
- This is where I have a problem: storing the results in a variable whose scope extends beyond the for loop and which lives in the Puppeteer context
I tried several methods (page.evaluate(), page.evaluateHandle(), page.$$(), page.$$eval()), but my issues are:
- Going to a new page seems to overwrite the previous value, because Chromium reloads
- I don't have a clear idea of how to use the spread operator or push to a cloned array in the Puppeteer context.
I think the ideal way is to pass the data from browser-context variables to a Puppeteer-context variable on every iteration.
I would be glad for any tips or solutions :) Code below.
Index.js file:
const puppeteer = require("puppeteer");
const jsonFile = require("./example.json");
const numberOfUrls = jsonFile.urls.length;
const urlsArray = jsonFile.urls;
(async () => {
  try {
    // initial settings for Chromium
    const browser = await puppeteer.launch({
      defaultViewport: null,
      headless: false,
      devtools: true,
    });
    const page = await browser.newPage();
    await page.setUserAgent(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
    );
    await page.setViewport({ width: 0, height: 0, deviceScaleFactor: 1 });
    // for loop on urls list
    for (let i = 0; i < numberOfUrls; i++) {
      await page.goto(urlsArray[i]);
      await page.waitFor(1000);
      const elements = await page.$$eval(
        `a[href*="https://mysuperdomain.com/"]`,
        elements => elements.map(el => el.innerText)
      );
      console.log(await { url: urlsArray[i], urlsTexts: elements });
    }
    //end for loop
  } catch (error) {
    console.log(`Caught error: ${error}`);
  }
})();
example.json file:
{
  "urls": [
    "https://exampledomain1.com/something/",
    "https://exampledomain2.com/something/",
    "https://exampledomain3.com/something/"
  ]
}
Preferred output:
[
  {
    url: 'https://exampledomain1.com/something/',
    urlsTexts: [ 'learn more', 'go to our partner' ]
  },
  {
    url: 'https://exampledomain2.com/something/',
    urlsTexts: [ 'go to mysuperdomain', 'check on mysuperdomain.com' ]
  }
]
Answers
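You just need a few corrections: declare the result array before the loop, in the Node.js (Puppeteer) context, and push each page's data into it. page.$$eval() already returns the mapped values to Node.js, so navigating to the next page does not affect what has been stored: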
// for loop on urls list
const result = []; // lives in the Node.js context, outside the loop
for (let i = 0; i < numberOfUrls; i++) {
  await page.goto(urlsArray[i]);
  await page.waitFor(1000);
  const elements = await page.$$eval(
    `a[href*="https://mysuperdomain.com/"]`,
    elements => elements.map(el => el.innerText)
  );
  result.push({ url: urlsArray[i], urlsTexts: elements });
}
//end for loop
console.log(result);
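The same idea can be wrapped in a small reusable helper. The following is only a sketch under some assumptions: the names collectAnchorTexts and hrefFragment are made up for illustration, the page argument is expected to be a Puppeteer Page (the helper uses only its goto() and $$eval() methods), and hrefFragment is assumed not to contain double quotes, since it is placed directly into the CSS selector. As an addition that is not in the code above, a failed navigation is recorded for that URL instead of aborting the whole run.

```javascript
// Sketch: collect anchor texts for each URL into an array kept in Node.js.
// collectAnchorTexts and hrefFragment are hypothetical names for illustration.
async function collectAnchorTexts(page, urls, hrefFragment) {
  // Declared outside the loop, in the Node.js context, so it survives
  // every navigation of the browser page.
  const results = [];
  // Assumes hrefFragment contains no double quotes.
  const selector = `a[href*="${hrefFragment}"]`;

  for (const url of urls) {
    try {
      await page.goto(url);
      // The callback runs in the browser; its serializable return value
      // (an array of strings) is copied back to Node.js.
      const urlsTexts = await page.$$eval(selector, (anchors) =>
        anchors.map((a) => a.innerText)
      );
      results.push({ url, urlsTexts });
    } catch (error) {
      // Assumption: keep going and note the failure for this URL.
      results.push({ url, urlsTexts: [], error: String(error) });
    }
  }
  return results;
}
```

With the page object from the question, it could be called inside the async function as console.log(await collectAnchorTexts(page, urlsArray, "https://mysuperdomain.com/")). The sketch leaves out the fixed page.waitFor(1000) delay; page.waitFor() is deprecated in later Puppeteer versions, so if a pause is still needed, awaiting a setTimeout-based promise is one option.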