Question

I am building a simple Selenium scraper. It should check whether a "contact" link exists and, if it does, click through and parse that page for emails with a regex. If not, it should parse the very page Selenium lands on. The problem is that for the first three (randomly chosen) websites the program collects the available emails, but for the last one it not only fails to scrape the page for emails, it doesn't even close the browser. The loop still seems to run to completion, though, since the process finishes successfully. What am I doing wrong, and why is it not scraping the last page in the dicti_pretty_links list? Code and output below:

import re
from selenium import webdriver
from bs4 import BeautifulSoup
import time, random

global scrapedEmails
scrapedEmails = []

#dicti_pretty_links = ['http://ayuda.ticketea.com/en/contact-us/','https://www.youtube.com/t/contact_us','http://www.haysplc.com/','http://madrid.usembassy.gov']
#http://www.iberia.com, http://madrid.usembassy.gov
dicti_pretty_links = ['http://www.haysplc.com/','https://www.youtube.com/t/contact_us','http://madrid.usembassy.gov','http://ayuda.ticketea.com/en/contact-us/',]

for el in dicti_pretty_links:   #This converts page into Selenium object
    browser = webdriver.Firefox()
    page = browser.get(el)
    time.sleep(random.uniform(0.5,1.5))
    try:                                #Tries to open "contact" link
        contact_link = browser.find_element_by_partial_link_text('ontact')
        if contact_link:
            contact_link.click()
    except:
        continue
    html = browser.page_source          #Loads up the page for Regex search
    soup = BeautifulSoup(html,'lxml')
    time.sleep(random.uniform(0.5,1.5))
    emailRegex = re.compile(r'([a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+.+]+)', re.VERBOSE)
    mo = emailRegex.findall(html)
    print('THIS BELOW IS SEL_emails_MO for',el)
    print(mo)
    for el in mo:
        if el not in scrapedEmails:     #Checks if emails is/adds to ddbb
            scrapedEmails.append(el)
    browser.close()
print(100*'-')
print('This below is scrappedEmails list')
print(scrapedEmails)

And this is the output of running the program above:

C:\Users\SK\AppData\Local\Programs\Python\Python35-32\python.exe C:/Users/SK/PycharmProjects/untitled/temperase
THIS BELOW IS SEL_emails_MO for http://www.haysplc.com/
['customerservice@hays.com', 'customerservice@hays.com', 'ir@hays.com', 'ir@hays.com', 'cosec@hays.com', 'cosec@hays.com', 'hays@team365.co.uk', 'hays@team365.co.uk']
THIS BELOW IS SEL_emails_MO for https://www.youtube.com/t/contact_us
['press@youtube.com.']
THIS BELOW IS SEL_emails_MO for http://madrid.usembassy.gov
['visasmadrid@state.gov', 'visasmadrid@state.gov', 'visasmadrid@state.gov', 'ivmadrid@state.gov', 'ivmadrid@state.gov', 'ivmadrid@state.gov', 'askACS@state.gov', 'askacs@state.gov', 'askACS@state.gov']
----------------------------------------------------------------------------------------------------
This below is scrappedEmails list
['customerservice@hays.com', 'ir@hays.com', 'cosec@hays.com', 'hays@team365.co.uk', 'press@youtube.com.', 'visasmadrid@state.gov', 'ivmadrid@state.gov', 'askACS@state.gov', 'askacs@state.gov']

Process finished with exit code 0

Answers

The problem is that on the http://ayuda.ticketea.com/en/contact-us/ page there is no link (a element) whose text contains "ontact". The browser.find_element_by_partial_link_text() call therefore raises a NoSuchElementException, the bare except catches it, and continue jumps straight to the next iteration, skipping both the email scraping and the browser.close() call. That is why no emails are printed for that URL and the last browser window is never closed, even though the loop itself finishes and the process exits normally.

If you don't want to skip the rest of the loop body when no link is found, but instead search the current page for email addresses, handle the exception (here just logging a message) without continuing the loop:

from selenium.common.exceptions import NoSuchElementException

try:                                #Tries to open "contact" link
    contact_link = browser.find_element_by_partial_link_text('ontact')
    contact_link.click()
except NoSuchElementException:
    print("No Contact link found")  # fall through and scrape the current page