Python: Why Selenium is not scraping the last webpage in the loop with Regex?
I am building a simple Selenium scraper. It should check for the existence of a "contact" link and, if one exists, open it and parse it for emails with a regex. If not, it should parse the page Selenium has already landed on. The problem is that for the first three (randomly chosen) websites the program gets the available emails, but for the last one it not only fails to scrape the page for emails, it does not even close the browser. The loop seems to come to an end anyway, as the program finishes successfully. What am I doing wrong, and why is it not scraping the last page in the dicti_pretty_links list? Code and output below:
import re
from selenium import webdriver
from bs4 import BeautifulSoup
import time, random

global scrapedEmails
scrapedEmails = []

#dicti_pretty_links = ['http://ayuda.ticketea.com/en/contact-us/','https://www.youtube.com/t/contact_us','http://www.haysplc.com/','http://madrid.usembassy.gov']
#http://www.iberia.com, http://madrid.usembassy.gov
dicti_pretty_links = ['http://www.haysplc.com/','https://www.youtube.com/t/contact_us','http://madrid.usembassy.gov','http://ayuda.ticketea.com/en/contact-us/',]

for el in dicti_pretty_links: #This converts page into Selenium object
    browser = webdriver.Firefox()
    page = browser.get(el)
    time.sleep(random.uniform(0.5,1.5))

    try: #Tries to open "contact" link
        contact_link = browser.find_element_by_partial_link_text('ontact')
        if contact_link:
            contact_link.click()
    except:
        continue

    html = browser.page_source #Loads up the page for Regex search
    soup = BeautifulSoup(html,'lxml')
    time.sleep(random.uniform(0.5,1.5))

    emailRegex = re.compile(r'([a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+.+]+)', re.VERBOSE)
    mo = emailRegex.findall(html)
    print('THIS BELOW IS SEL_emails_MO for',el)
    print(mo)

    for el in mo:
        if el not in scrapedEmails: #Checks if emails is/adds to ddbb
            scrapedEmails.append(el)

    browser.close()

print(100*'-')
print('This below is scrappedEmails list')
print(scrapedEmails)
And this is the output of running the program above:
C:\Users\SK\AppData\Local\Programs\Python\Python35-32\python.exe C:/Users/SK/PycharmProjects/untitled/temperase
THIS BELOW IS SEL_emails_MO for http://www.haysplc.com/
['customerservice@hays.com', 'customerservice@hays.com', 'ir@hays.com', 'ir@hays.com', 'cosec@hays.com', 'cosec@hays.com', 'hays@team365.co.uk', 'hays@team365.co.uk']
THIS BELOW IS SEL_emails_MO for https://www.youtube.com/t/contact_us
['press@youtube.com.']
THIS BELOW IS SEL_emails_MO for http://madrid.usembassy.gov
['visasmadrid@state.gov', 'visasmadrid@state.gov', 'visasmadrid@state.gov', 'ivmadrid@state.gov', 'ivmadrid@state.gov', 'ivmadrid@state.gov', 'askACS@state.gov', 'askacs@state.gov', 'askACS@state.gov']
----------------------------------------------------------------------------------------------------
This below is scrappedEmails list
['customerservice@hays.com', 'ir@hays.com', 'cosec@hays.com', 'hays@team365.co.uk', 'press@youtube.com.', 'visasmadrid@state.gov', 'ivmadrid@state.gov', 'askACS@state.gov', 'askacs@state.gov']
Process finished with exit code 0
Answers
The problem is that on the http://ayuda.ticketea.com/en/contact-us/ page there is no link (a element) whose text contains "ontact". The browser.find_element_by_partial_link_text() call raises a NoSuchElementException, the bare except catches it, and continue jumps straight to the next iteration. That skips everything after the try/except block, including the regex search and the browser.close() call, which is why the last page is never scraped and the browser window is never closed.
If you don't want to skip the rest of the loop body when no link is found, but instead search for email addresses on the page Selenium is already on, handle the exception without calling continue:
# at the top of the script: from selenium.common.exceptions import NoSuchElementException
try:
    contact_link = browser.find_element_by_partial_link_text('ontact')
    if contact_link:
        contact_link.click()
except NoSuchElementException:
    print("No Contact link found")