
I have some problems with scraping a list of URLs and storing the data from them in an array. I think my main problem is understanding the difference between the Puppeteer (Node.js) context and the browser context.

I want to check anchor texts, but only for anchors whose href attribute contains a specific string.

Steps:

  1. Initialize the Chromium settings
  2. Run a for loop over the array of URLs
  3. Inside the loop, select all anchors matching the href string and scrape their texts
  4. Store the results in a variable that lives in the Puppeteer context and whose scope extends beyond the for loop (this is where I am stuck)

I tried several methods: page.evaluate(), page.evaluateHandle(), page.$$(), and page.$$eval(), but my issues are:

  1. Going to a new page seems to overwrite the previous value, because Chromium reloads
  2. I don't have a good idea of how to use the spread operator or push to a cloned array in the Puppeteer context.

I think the ideal way is to pass the data from browser-context variables to a Puppeteer-context variable on every iteration.

Would be glad for any tips / solutions :) Code below

Index.js file:

const puppeteer = require("puppeteer");
const jsonFile = require("./example.json");
const numberOfUrls = jsonFile.urls.length;
const urlsArray = jsonFile.urls;

(async () => {
  try {
    // initial settings for Chromium
    const browser = await puppeteer.launch({
      defaultViewport: null,
      headless: false,
      devtools: true,
    });
    const page = await browser.newPage();
    await page.setUserAgent(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
    );

    await page.setViewport({ width: 0, height: 0, deviceScaleFactor: 1 });

    // for loop on urls list
    for (let i = 0; i < numberOfUrls; i++) {
      await page.goto(urlsArray[i]);
      await page.waitFor(1000);
      const elements = await page.$$eval( `a[href*="https://mysuperdomain.com/"]`,  elements => elements.map(el => el.innerText));
      console.log(await {url: urlsArray[i],
         urlsTexts: elements});

    }
    //end for loop
  } catch (error) {
    console.log(`Catched error: ${error}`);
  }

})();

example.json file:

{
    "urls": [
        "https://exampledomain1.com/something/",
        "https://exampledomain2.com/something/",
        "https://exampledomain3.com/something/"
    ]
}

Preferred output:

[{
  url: 'https://exampledomain1.com/something/',
  urlsTexts: [ 'learn more', 'go to our partner' ]
},
{
  url: 'https://exampledomain2.com/something/',
  urlsTexts: [ 'go to mysuperdomain', 'check on mysuperdomain.com' ]
}]

Answers

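You just need a few corrections. The value returned by page.$$eval() is already passed back from the browser context to the Puppeteer (Node.js) context, so declare an array before the loop and push one object per URL into it: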

const result = [];

// for loop on urls list
for (let i = 0; i < numberOfUrls; i++) {
  await page.goto(urlsArray[i]);
  await page.waitFor(1000);
  const elements = await page.$$eval( `a[href*="https://mysuperdomain.com/"]`,  elements => elements.map(el => el.innerText));
  result.push({ url: urlsArray[i], urlsTexts: elements });
}
//end for loop

console.log(result);
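
If it helps, the same loop can be wrapped in a small reusable function. This is only a sketch under a few assumptions: collectAnchorTexts is a hypothetical name (it is not part of Puppeteer), page is expected to behave like a Puppeteer Page with goto() and $$eval(), and the per-URL try/catch is an addition of mine so that one failing URL does not discard the results already collected.

```javascript
// Hypothetical helper, not part of the Puppeteer API.
// `page` is assumed to expose goto() and $$eval() like a Puppeteer Page.
async function collectAnchorTexts(page, urls, selector) {
  // This array lives in the Node.js context, so page navigations
  // inside the browser cannot overwrite it.
  const results = [];

  for (const url of urls) {
    try {
      await page.goto(url);
      // The callback runs in the browser; its serializable return value
      // (an array of strings) is handed back to Node.js.
      const urlsTexts = await page.$$eval(selector, (anchors) =>
        anchors.map((a) => a.innerText.trim())
      );
      results.push({ url, urlsTexts });
    } catch (error) {
      // Assumption: keep going and record the failure for this URL.
      results.push({ url, urlsTexts: [], error: String(error) });
    }
  }

  return results;
}
```

Inside the existing async function it could then be called as const result = await collectAnchorTexts(page, urlsArray, 'a[href*="https://mysuperdomain.com/"]'); and logged once after the loop has finished.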