Integrating Selenium ChromeDriver with Scrapy
References:
Official chromedriver site
chromedriver-downloads
Running Selenium Headless with Chrome
I. Install the Chrome browser
1. Windows
You can check the installed Chrome version via Help -> About Google Chrome.
2. Linux
TODO
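II. Download chromedriver
Download link:
https://sites.google.com/a/chromium.org/chromedriver/downloads
Mirror for users in mainland China - http://npm.taobao.org/mirrors/chromedriver/
1. Choose the version that matches your Chrome
2. Choose the build for your operating system
For example, the win32 build after downloading and extracting:
For example, the linux64 build after downloading and extracting: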
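The two selection steps above can be sketched in a few lines of Python. This is only an illustration: the archive names are assumed from the win32 and linux64 builds mentioned above, the version string is a made-up example, and both helper names are hypothetical. It also assumes the usual rule that the chromedriver major version should match the Chrome major version.

```python
import platform

# Archive names assumed from the win32 / linux64 builds mentioned above.
ARCHIVES = {
    "Windows": "chromedriver_win32.zip",
    "Linux": "chromedriver_linux64.zip",
}


def chromedriver_archive(system=None):
    """Return the archive name for an OS (defaults to the current one)."""
    system = system or platform.system()
    try:
        return ARCHIVES[system]
    except KeyError:
        raise ValueError(f"no chromedriver archive known for {system!r}") from None


def chrome_major_version(version):
    """Extract the major version, the part before the first dot."""
    return version.strip().split(".")[0]


# Example version string (hypothetical); read the real one from
# Help -> About Google Chrome.
print(chrome_major_version("92.0.4515.107"))
print(chromedriver_archive("Linux"))
```

Use the major version to find the matching release on the download page, then pick the archive for your operating system.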
III. Test chromedriver
First, install selenium:
pip install selenium
Testing chromedriver on Windows:
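from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Headless mode (no browser window)
chrome_options.add_argument("--headless")
# Selenium 4 style: the driver path goes through Service and the
# options through `options=` (the older `executable_path=` and
# `chrome_options=` keywords belong to Selenium 3)
browser = webdriver.Chrome(
    service=Service(r"D:\programs\chromedriver_win32\chromedriver.exe"),
    options=chrome_options,
)
browser.get("https://www.baidu.com/")
print("Title: %s" % browser.title)
browser.quit()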
Output:
Title: 百度一下,你就知道
Note:
If you comment out the chrome_options.add_argument("--headless") statement, a Chrome window pops up; it closes automatically after browser.quit().
IV. Parsing JSON with chromedriver
https://stackoverflow.com/questions/37121843/how-to-get-a-json-response-from-a-google-chrome-selenium-webdriver-client
In short, Chrome wraps a JSON response in body > pre by default:
<html>
<head>
<style></style>
<script src="chrome-extension://mooikfkahbdckldjjndioackbalphokd/assets/prompt.js"></script>
</head>
<body>
<pre>json content...</pre>
...
</body>
</html>
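As a rough sketch of how that wrapper can be unwrapped, the snippet below pulls the text out of the first pre element and parses it as JSON. It uses only the standard library and works on a page-source string (for example, the value of driver.page_source). The class and function names are hypothetical, and the sample document is made up for illustration.

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text inside the first <pre> element."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._done = True

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def json_from_page_source(page_source):
    """Return the JSON wrapped in <pre>, or None if there is none."""
    parser = PreTextExtractor()
    parser.feed(page_source)
    parser.close()
    text = "".join(parser.chunks).strip()
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The page has a <pre>, but it does not hold JSON.
        return None


sample = '<html><head></head><body><pre>{"code": 0, "data": [1, 2]}</pre></body></html>'
print(json_from_page_source(sample))
```

Returning None for non-JSON pages lets the caller fall back to treating the page as ordinary HTML.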
V. Image-free mode in chromedriver
Method 1: https://tarunlalwani.com/post/selenium-disable-image-loading-different-browsers/
from selenium import webdriver
option = webdriver.ChromeOptions()
chrome_prefs = {
    # 1 - Allow all images
    # 2 - Block all images
    # 3 - Block 3rd party images
    "profile.default_content_settings": {"images": 2},
    "profile.managed_default_content_settings": {"images": 2},
}
option.add_experimental_option("prefs", chrome_prefs)
# Selenium 4 uses `options=` in place of the older `chrome_options=` keyword
driver = webdriver.Chrome(options=option)
driver.get("http://www.baidu.com")
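In testing, method 1 had no effect in headless mode, but it did work once the headless option was removed (i.e. with a visible browser window).
Method 2 (recommended): https://stackoverflow.com/questions/48773031/how-to-prevent-chrome-headless-from-loading-images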
from selenium import webdriver
option = webdriver.ChromeOptions()
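# Headless mode (no browser window)
# option.add_argument('--headless')
# Disable image loading
option.add_argument('--blink-settings=imagesEnabled=false')
# Selenium 4 uses `options=` in place of the older `chrome_options=` keyword
driver = webdriver.Chrome(options=option)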
driver.get("http://www.baidu.com")
In testing, method 2 worked in both headless and windowed modes.
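Since the same two switches are reused in the Scrapy middleware, one option is to build the argument list in a single place. The helper below is a hypothetical sketch (the function name is not part of Selenium); it only assembles the switch strings already shown in this post and leaves applying them to the caller.

```python
def chrome_arguments(headless=True, block_images=True):
    """Build the list of Chrome command-line switches used in this post."""
    args = []
    if headless:
        args.append("--headless")
    if block_images:
        args.append("--blink-settings=imagesEnabled=false")
    return args


# Usage with Selenium (not executed here):
#   option = webdriver.ChromeOptions()
#   for arg in chrome_arguments():
#       option.add_argument(arg)
print(chrome_arguments())
```

Passing headless=False keeps the image switch while showing the browser window, which is handy when debugging a page.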
VI. Integrating Selenium + ChromeDriver with Scrapy
1. Edit settings.py:
# Path to the ChromeDriver executable
CHROME_DRIVER_PATH = 'D:/programs/chromedriver_win32/chromedriver.exe'
# Downloader middleware that integrates ChromeDriver
# (see the implementation below)
DOWNLOADER_MIDDLEWARES = {
    'mx_crawl_spider.middlewares.MxCrawlSpiderDownloaderMiddleware': 543,
}
2. Downloader middleware implementation that integrates ChromeDriver:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class MxCrawlSpiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        # Also quit Chrome when the spider closes, so no process is left behind
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            # Load the page in Chrome
            spider.logger.info(f"Chrome driver get: {request.url}")
            self.driver.get(request.url)
            # self.driver.execute_script("scroll(0, 1000);")
            # time.sleep(1)
            # Return the rendered content as the response
            return HtmlResponse(url=request.url,
                                body=self.convert_resp_body(request, spider),
                                request=request,
                                encoding='utf-8',
                                status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, request=request, encoding='utf-8', status=500)
        finally:
            spider.logger.info('Chrome driver end...')

    def convert_resp_body(self, request, spider):
        # Extract the JSON content (wrapped in body > pre), or fall back to the HTML
        try:
            json_text = self.driver.find_element(By.CSS_SELECTOR, "body > pre").text
            spider.logger.info(f"convert {request.url} to json resp")
            return json_text
        except NoSuchElementException:
            return self.driver.page_source

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info(f'Spider opened: {spider.name}')
        options = webdriver.ChromeOptions()
        # Headless mode (no browser window)
        options.add_argument('--headless')
        # Disable image loading
        options.add_argument('--blink-settings=imagesEnabled=false')
        # Start the Chrome driver
        chrome_driver_path = spider.settings.get("CHROME_DRIVER_PATH")
        self.driver = webdriver.Chrome(options=options, service=Service(chrome_driver_path))

    def spider_closed(self, spider):
        self.driver.quit()
Dealing with the Cloudflare firewall
References:
https://stackoverflow.com/questions/33247662/how-to-bypass-cloudflare-bot-ddos-protection-in-scrapy
https://stackoverflow.com/questions/55480924/how-to-enable-javascript-in-selenium-webdriver-chrome-using-python
https://stackoverflow.com/questions/64842858/selenium-app-redirect-to-cloudflare-page-when-hosted-on-heroku
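In the newly opened Chrome window, check whether JavaScript is enabled:
chrome://settings/content/javascript
For reference, the Cloudflare challenge page that was returned: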
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<title>Just a moment...</title>
<style type="text/css">
html, body {width: 100%; height: 100%; margin: 0; padding: 0;}
body {background-color: #ffffff; color: #000000; font-family:-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Helvetica Neue",Arial, sans-serif; font-size: 16px; line-height: 1.7em;-webkit-font-smoothing: antialiased;}
h1 { text-align: center; font-weight:700; margin: 16px 0; font-size: 32px; color:#000000; line-height: 1.25;}
p {font-size: 20px; font-weight: 400; margin: 8px 0;}
p, .attribution, {text-align: center;}
#spinner {margin: 0 auto 30px auto; display: block;}
.attribution {margin-top: 32px;}
@keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
@-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
#cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
#cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
#cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
.bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
a { color: #2c7cb0; text-decoration: none; -moz-transition: color 0.15s ease; -o-transition: color 0.15s ease; -webkit-transition: color 0.15s ease; transition: color 0.15s ease; }
a:hover{color: #f4a15d}
.attribution{font-size: 16px; line-height: 1.5;}
.ray_id{display: block; margin-top: 8px;}
#cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
#cf-hcaptcha-container { text-align:center;}
#cf-hcaptcha-container iframe { display: inline-block;}
</style>
<meta http-equiv="refresh" content="12">
<script type="text/javascript">
//<![CDATA[
(function(){
window._cf_chl_opt={
cvId: "2",
cType: "non-interactive",
cNounce: "17094",
cRay: "6785deb3cd503aec",
cHash: "1ca301a3b67a5a3",
cFPWv: "g",
cTTimeMs: "4000",
cRq: {
ru: "aHR0cHM6Ly9jbi5pbnZlc3RpbmcuY29tL2luc3RydW1lbnRzL0hpc3RvcmljYWxEYXRhQWpheA==",
ra: "Y3VybC83LjU1LjE=",
rm: "UE9TVA==",
d: "joJnQtza+iRnbn5WVV6f6IRmR9EmES9mV6n7g4Yw1ovTCDkJNh3sAWK5tMO4VX1WOCbHGXNhKqplhgQc+BsUDYg8ug29nFRVt+Szx4Ms6zp0KTDtlJnycpjOZXua7InOYjEgD03WfZHGzUiXMRSLKgbzLYXOZNBUj2g428irH0Ldhe6d89LPYRSYDuXWtvI7YO5FL7n550l8HVbC13Wyi9FnMbsnfwn4ZvLn952yZQgwwxcP5vKN+dvhIPJF6nPUGJlcfTJNiPd4dRqE+0cC0YBmlsrKFvh0iuDcJVRGD04gtHhDB6kApcAaeCTwn8V+OXd/xb90k9UsjqbMyVvA4z+Ji5UxdMFYvHh3keWhhpAWCX8METoZ0z5t8f2dYEmXO8GW0CSdc8w0hedlJK1CalDKbwiz8ZdEhfmoop5JAKiQH4vwBDzOOK2Bc9j5p30anU8WdafUHOxnFJJ8SBaOUROJb+XodTY59hv/KCTGjX7aqsd0tMnbwX1NA+BzHCvQtRHxH/SWha9SNIH4M8nb6T7I2qSbo3xBiLKn0UL9hdSQePW9Om++oeRyBPHPJNxokd5tTOTC6yTx/sa0e03hePhc1tXZQhfD1Uk69I4d4SlKVDB04Urs3KFRMIMGeeQC+88SlKbkSlqL1su0WwNeFbCPmKndkyVpNpYEUfMPD6JSfE0lO1E5PUesflYCqiOG", t: "MTYyNzg5MjI0Ny42NjEwMDA=",
m: "K+89ac3LYJNEWJFtk9ohWKrbb2Ovzl8Oo2w81fr8HPA=",
i1: "/3OUSzHGzSBW2Er1g1CHDw==",
i2: "OHyyOn1ApyQjIyRm+kGpsA==",
zh: "JJQg2KI/+bPgJbLHlLjmrs/mnno8aAGH5k3tm8QDk4c=",
uh: "yJo4Yz2g40fRnUbkl+3xumbT2Zvi1Q9/8tEG1FzQ5ro=",
hh: "T2hi97JJ3TXBbbaDfe4fVaGfimFjucUPtz+gmsc9Zq0=",
}
}
window._cf_chl_enter = function(){window._cf_chl_opt.p=1};
})();
//]]>
</script>
</head>
<body>
<table width="100%" height="100%" cellpadding="20">
<tr>
<td align="center" valign="middle">
<div class="cf-browser-verification cf-im-under-attack">
<noscript>
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
<div id="cf-content" style="display:none">
<div id="cf-bubbles">
<div class="bubbles"></div>
<div class="bubbles"></div>
<div class="bubbles"></div>
</div>
<h1><span data-translate="checking_browser">Checking your browser before accessing</span> investing.com.</h1>
<div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds…</p>
<p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting…</p>
</div>
<form class="challenge-form" id="challenge-form" action="/instruments/HistoricalDataAjax?__cf_chl_jschl_tk__=pmd_b44e6894f26381ec65b7ce23b86e8129a364b849-1627892247-0-gqNtZGzNAjijcnBszQbO" method="POST" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="md" value="451c537d80e1bea80c69dc737c5769b09b670734-1627892247-0-AaEUgkAqbIiw24JsE46HbzTyxeOEhQmi8EbAWnTlRKLCuY_9e8BK9vsyTztGDpOgOKGeHJ6c2nGZlaC9GD7yXQgT3yayH4hyysTGh0qAA8Cohn_rXsoVKEy5sQELH-n4w4O7ueEyvl-qpKfI1OcD-NwUfpABwRYqlgZ8IFAtYpWfWSWJZV6-a_jc_KTmZWYEsQgcrL7ymfTZ9GRGWrWe0gXh6Nd_3Lnix8qjaq-D2PfTLLdnMMK6dR2QRfEGsDndMrYjd0wITStlRWRn-vIbWqlDgUe6ZquuYNAinXRsaRS0pFGBkrmgLAgCEWD2Qucurcw5chay51tk_bKrFsSuAKEf2j8EG4x47F_QKQjhrCpKwdAGymcrxlrzuKY4iTL4RVfXOD-1oG6OkC-hLxfUriL41rHF7n069gS_7CPGruyQudaZT-G7JSP6ziEB7ewhlg_0wesnWQvLRS-38NXv4FKxPXFh-y7yVf4M5CW1qsEudiiYr7IllPoURvr-jEmMVg" />
<input type="hidden" name="r" value="79ae4aef921ee4d89184402e18dc0eef7ef2799e-1627892247-0-ATyPW3cpfiZHnKMzYYz6kWN5wLP8bh69u0RAqk0c0NVdhf5rZYvFYEI1XErZIsXclkv+OiQk3wyP4UCWpqGdHkl34vCx28J66C3QHxcXHWmltizpewOrPNzIV39l0t3tos+LRohlQVGEd7CD2DN+2w3eNIzmj8IcxhRDIWa4kXvLdGF10zxehh9dB/zaRLJtUPnPk3fKshXcQbRTT9Uz207nUrk6N3qoCh6baJwAccK6tYPAsuf8jYesH+oKWGT1ZavzujhFvaPAMqEmOELZGRCq/Cq9s3HJdg3njBknHmKIkBoYaecpaewpGqBIZeXPOvgdr11FEPhvamjJATEyhss6r8/P3UooX3OiPgix0ePXcIzhtXaNTY5bftVmVyTiHKcwpLQx2SQH4lKzC273CxsIGQH91SI0/TmFqJx+e1cz8K9SrPcDq4nBJX9NuwP56NM9jkZA2hPjOkf9kNOD/sF9KAEXPIumhP1/k5Gnyp5O+u3gxiAfJHHeAclxUsFLXA/TBGdP4+qXb/2D6/wRPoBQKuXPK3QaiBf6KhzmvaEhktTgXqBX7E8m7tWetIlpXSsNGW7oVEabVd14BJfXNh7wnFhbxadYBrL7jPs8F7fiTlyekw4omUcTq52kpgW5KanEcuTksGn4yldb4O9C086LTasGLPkd0Qz25RIuGvXUmXfwnfoJQ8pg2mNV0GykXcyIVs09r0Wz5+IgclodF4WhZ2GMaL1ZDeGJbRQQHrmsF8a74cEgm4/HaZW0rn0xrAxNovhwPkWTaz7UxMNQaxVa0uoJ7c1g5j83wHkrKwnX12P9TSI8X375B8l7P4PS56i3iDd7edcsyqgtq8F8FKfh3BLl+MU4ZNJu/nKa43GqlD+YEkfK2aK7MXfuhT+vuVF5fSUkOo05TQ+td8VxYDmwTjN8vl5XXTEgLCKBIN1QTaxQN+YSwbGsmHy9brqRKcvCTFzJjneQZ8XDWuLh2FUIeqD6N8viDlFcJB4VMT9p5hNEKQLV7bEg84o3UKtOtBYj3M4k+hfz664fGg/giI7gYhQ7l9W+FbT3zKnni6wlxjgCWwo8h78b/S/4UPXqaT9p8eKIpOZVpIe9AXQRQNtl7uQntf1xiW00XupiAHC0N95rRQy5KrSAtwMiuFbxPP+ttfodwESvakbQ0rzQk5t2huYKljcNF4rzdexJe4c1iPetB97VmOo9vXixvwwQGds5iQIMqSrRxi/PCjowSK8JjReH1qLeGU0I/9RKJZA30Sz8jMQM7S2FC/kqSsr2rUAH7Ku/UIjblXUpaoCxEsC57YnhhMUxC5f8wNWuxMZz2+IfZfHqZXMtv2L6APd2LEnOaSk9DIlUectu4kwhsQ+59x9zWicVOPDoPQJ5DvB0TP5BYj+jvxmXUi3vls3TsMNaDSzhxv6IOQorhkBVuu5Md973zksLLUy7kH9E9ffH1jyEo/4G6/MkDTFpfqRBcES7s6zdHJgyuFO5rVC72SWTmj7bWvKZHLs3UFA7vWyOxdBSbuYidY6fKS/qgR1CkcipEy+5YsPYkSTpqqllGJrKV6fWWBT2hLG2DFkoBfgSKdoPEJ/pvMia1qBOI7t4G5Cfvcb+G4j4g3AdPUF1axJwPPXjQeHK7xpzJ9baNE5gGzxLihp3JUcdoFmdZaZ5Sv1qMq5UdKrKqZBSQ1j52Bo="/>
<input type="hidden" value="62f859a005e28f79ef2069d9de198947" id="jschl-vc" name="jschl_vc"/>
<!-- <input type="hidden" value="" id="jschl-vc" name="jschl_vc"/> -->
<input type="hidden" name="pass" value="1627892251.661-yYtA6C6xg3"/>
<input type="hidden" id="jschl-answer" name="jschl_answer"/>
</form>
<script type="text/javascript">
//<![CDATA[
(function(){
var a = document.getElementById('cf-content');
a.style.display = 'block';
var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
var trkjs = isIE ? new Image() : document.createElement('img');
trkjs.setAttribute("src", "/cdn-cgi/images/trace/jschal/js/transparent.gif?ray=6785deb3cd503aec");
trkjs.id = "trk_jschal_js";
trkjs.setAttribute("alt", "");
document.body.appendChild(trkjs);
var cpo=document.createElement('script');
cpo.type='text/javascript';
cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/jsch/v1?ray=6785deb3cd503aec";
document.getElementsByTagName('head')[0].appendChild(cpo);
}());
//]]>
</script>
<div id="trk_jschal_nojs" style="background-image:url('/cdn-cgi/images/trace/jschal/nojs/transparent.gif?ray=6785deb3cd503aec')"> </div>
</div>
<div class="attribution">
DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
<br />
<span class="ray_id">Ray ID: <code>6785deb3cd503aec</code></span>
</div>
</td>
</tr>
</table>
</body>
</html>
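A crawler can at least detect when it has been handed this challenge page instead of real content. The check below is a minimal sketch that relies only on markers visible in the dump above (the "Just a moment..." title and the cf-browser-verification / challenge-form identifiers). Cloudflare may change these at any time, so treat them as assumptions; the function name is hypothetical.

```python
import re

# Markers taken from the challenge page shown above; they may change.
CHALLENGE_TITLE = re.compile(
    r"<title>\s*Just a moment\.\.\.\s*</title>", re.IGNORECASE
)
CHALLENGE_MARKERS = ("cf-browser-verification", "challenge-form")


def looks_like_cloudflare_challenge(html):
    """Heuristically decide whether `html` is a Cloudflare challenge page."""
    if CHALLENGE_TITLE.search(html):
        return True
    return all(marker in html for marker in CHALLENGE_MARKERS)


challenge = (
    "<html><head><title>Just a moment...</title></head>"
    '<body><div class="cf-browser-verification"></div></body></html>'
)
normal = "<html><head><title>Quotes</title></head><body>ok</body></html>"
print(looks_like_cloudflare_challenge(challenge))
print(looks_like_cloudflare_challenge(normal))
```

In a downloader middleware, a positive result could be used to retry the request or mark the response as failed rather than passing the challenge HTML on to the spider.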
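VII. From getting started to giving up 💔
The selenium + chromedriver combination handles page rendering (JavaScript execution) well,
but using selenium + chromedriver inside Scrapy has the following problems:
(1) The Python + scrapy + selenium + chromedriver + chrome environment is tedious to set up;
(2) Scrapy's thread is blocked: HTTP requests run serially, so crawling is far too slow 😭 and Scrapy's performance goes unused;
(3) Running several spiders at once starts several Chrome instances, which puts too much load on the system;
In summary, given the current requirement to crawl 500+ sites at the same time, I ended up dropping the Selenium + ChromeDriver combination 😓
After looking into it further, I decided to use a Scrapy + Splash architecture…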
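To put problem (2) in perspective, here is a back-of-the-envelope estimate of crawl time with serial versus concurrent page loads. Every number in it is an illustrative assumption, not a measurement from this project, and the model is idealised: it ignores network limits and rendering overhead.

```python
def crawl_hours(pages, seconds_per_page, concurrency=1):
    """Idealised crawl time in hours for a given number of parallel loads."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return pages * seconds_per_page / concurrency / 3600


# Assumed workload: 500 sites, 100 pages each, 2 seconds per rendered page.
total_pages = 500 * 100
print(f"serial:        {crawl_hours(total_pages, 2.0):.1f} h")
print(f"16 concurrent: {crawl_hours(total_pages, 2.0, concurrency=16):.1f} h")
```

With a single blocking driver, the middleware is stuck in the serial case no matter how Scrapy's own concurrency settings are tuned.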