Python: Why Selenium is not scraping the last webpage in the loop with Regex?
I am building a simple Selenium scraper. It should check for the existence of a "contact" link and, if one exists, open it and parse it for emails with a regex. If not, it should parse the page Selenium has already landed on. The problem is that for the first three (randomly chosen) websites the program gets the available emails, but for the last one it not only fails to scrape the page for emails, it does not even close the browser. The loop seems to come to an end anyway, as the program finishes successfully. What am I doing wrong, and why is it not scraping the last page in the dicti_pretty_links list? Code and output below:
import re
from selenium import webdriver
from bs4 import BeautifulSoup
import time, random

global scrapedEmails
scrapedEmails = []

#dicti_pretty_links = ['http://ayuda.ticketea.com/en/contact-us/','https://www.youtube.com/t/contact_us','http://www.haysplc.com/','http://madrid.usembassy.gov']
#http://www.iberia.com, http://madrid.usembassy.gov
dicti_pretty_links = ['http://www.haysplc.com/','https://www.youtube.com/t/contact_us','http://madrid.usembassy.gov','http://ayuda.ticketea.com/en/contact-us/',]

for el in dicti_pretty_links: #This converts page into Selenium object
    browser = webdriver.Firefox()
    page = browser.get(el)
    time.sleep(random.uniform(0.5,1.5))

    try: #Tries to open "contact" link
        contact_link = browser.find_element_by_partial_link_text('ontact')
        if contact_link:
            contact_link.click()
    except:
        continue

    html = browser.page_source #Loads up the page for Regex search
    soup = BeautifulSoup(html,'lxml')
    time.sleep(random.uniform(0.5,1.5))

    emailRegex = re.compile(r'([a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+.+]+)', re.VERBOSE)
    mo = emailRegex.findall(html)
    print('THIS BELOW IS SEL_emails_MO for',el)
    print(mo)

    for el in mo:
        if el not in scrapedEmails: #Checks if emails is/adds to ddbb
            scrapedEmails.append(el)

    browser.close()

print(100*'-')
print('This below is scrappedEmails list')
print(scrapedEmails)
And this is the output of running the program above:
C:\Users\SK\AppData\Local\Programs\Python\Python35-32\python.exe C:/Users/SK/PycharmProjects/untitled/temperase
THIS BELOW IS SEL_emails_MO for http://www.haysplc.com/
['customerservice@hays.com', 'customerservice@hays.com', 'ir@hays.com', 'ir@hays.com', 'cosec@hays.com', 'cosec@hays.com', 'hays@team365.co.uk', 'hays@team365.co.uk']
THIS BELOW IS SEL_emails_MO for https://www.youtube.com/t/contact_us
['press@youtube.com.']
THIS BELOW IS SEL_emails_MO for http://madrid.usembassy.gov
['visasmadrid@state.gov', 'visasmadrid@state.gov', 'visasmadrid@state.gov', 'ivmadrid@state.gov', 'ivmadrid@state.gov', 'ivmadrid@state.gov', 'askACS@state.gov', 'askacs@state.gov', 'askACS@state.gov']
----------------------------------------------------------------------------------------------------
This below is scrappedEmails list
['customerservice@hays.com', 'ir@hays.com', 'cosec@hays.com', 'hays@team365.co.uk', 'press@youtube.com.', 'visasmadrid@state.gov', 'ivmadrid@state.gov', 'askACS@state.gov', 'askacs@state.gov']
Process finished with exit code 0
Answers
The problem is that on the http://ayuda.ticketea.com/en/contact-us/ page there is no link (a element) whose text contains "ontact". The browser.find_element_by_partial_link_text() call raises a NoSuchElementException, the bare except catches it, and continue jumps straight to the next iteration. That skips everything after the try/except block, including the regex search and the browser.close() call, which is why the last page is never scraped and the browser window is never closed.
If you don't want to skip the rest of the loop body when no link is found, but instead search for email addresses on the page Selenium is already on, handle the exception without calling continue:
# at the top of the script: from selenium.common.exceptions import NoSuchElementException
try:
    contact_link = browser.find_element_by_partial_link_text('ontact')
    if contact_link:
        contact_link.click()
except NoSuchElementException:
    print("No Contact link found")