用 Python 抓取 ResearchGate 个人资料页面
会刮什么

先决条件
使用 CSS 选择器抓取基本知识
CSS 选择器声明样式适用于标记的哪一部分,从而允许从匹配的标签和属性中提取数据。
如果你还没有使用 CSS 选择器,我有一篇专门的博客文章关于如何在 web 抓取时使用 CSS 选择器涵盖了它是什么,优点和缺点,以及为什么它们对网络很重要-抓取透视图并展示在网页抓取时使用 CSS 选择器的最常见方法。
减少被屏蔽的机会
请求可能会被阻止。看看如何减少网页抓取时被阻止的机会,有十一种方法可以绕过大多数网站的阻止。
安装库:
pip install parsel playwright
完整代码
from parsel import Selector
from playwright.sync_api import sync_playwright
import json, re
def scrape_researchgate_profile(profile: str):
with sync_playwright() as p:
profile_data = {
"basic_info": {},
"about": {},
"co_authors": [],
"publications": [],
}
browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
page.goto(f"https://www.researchgate.net/profile/{profile}")
selector = Selector(text=page.content())
profile_data["basic_info"]["name"] = selector.css(".nova-legacy-e-text.nova-legacy-e-text--size-xxl::text").get()
profile_data["basic_info"]["institution"] = selector.css(".nova-legacy-v-institution-item__stack-item a::text").get()
profile_data["basic_info"]["department"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__meta-data-item:nth-child(1)").xpath("normalize-space()").get()
profile_data["basic_info"]["current_position"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__info-section-list-item").xpath("normalize-space()").get()
profile_data["basic_info"]["lab"] = selector.css(".nova-legacy-o-stack__item .nova-legacy-e-link--theme-bare b::text").get()
profile_data["about"]["number_of_publications"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(1)").xpath("normalize-space()").get()).group()
profile_data["about"]["reads"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(2)").xpath("normalize-space()").get()).group()
profile_data["about"]["citations"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(3)").xpath("normalize-space()").get()).group()
profile_data["about"]["introduction"] = selector.css(".nova-legacy-o-stack__item .Linkify").xpath("normalize-space()").get()
profile_data["about"]["skills"] = selector.css(".nova-legacy-l-flex__item .nova-legacy-e-badge ::text").getall()
for co_author in selector.css(".nova-legacy-c-card--spacing-xl .nova-legacy-c-card__body--spacing-inherit .nova-legacy-v-person-list-item"):
profile_data["co_authors"].append({
"name": co_author.css(".nova-legacy-v-person-list-item__align-content .nova-legacy-e-link::text").get(),
"link": co_author.css(".nova-legacy-l-flex__item a::attr(href)").get(),
"avatar": co_author.css(".nova-legacy-l-flex__item .lite-page-avatar img::attr(data-src)").get(),
"current_institution": co_author.css(".nova-legacy-v-person-list-item__align-content li").xpath("normalize-space()").get()
})
for publication in selector.css("#publications+ .nova-legacy-c-card--elevation-1-above .nova-legacy-o-stack__item"):
profile_data["publications"].append({
"title": publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get(),
"date_published": publication.css(".nova-legacy-v-publication-item__meta-data-item span::text").get(),
"authors": publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall(),
"publication_type": publication.css(".nova-legacy-e-badge--theme-solid::text").get(),
"description": publication.css(".nova-legacy-v-publication-item__description::text").get(),
"publication_link": publication.css(".nova-legacy-c-button-group__item .nova-legacy-c-button::attr(href)").get(),
})
print(json.dumps(profile_data, indent=2, ensure_ascii=False))
browser.close()
scrape_researchgate_profile(profile="Agnis-Stibe")
代码说明
导入库:
from parsel import Selector
from playwright.sync_api import sync_playwright
import re, json, time
代码
解释
parsel
解析 HTML/XML 文档。支持 XPath。
playwright
使用浏览器实例呈现页面。
re
用正则表达式匹配部分数据。
json
将 Python 字典转换为 JSON 字符串。
定义一个函数:
def scrape_researchgate_profile(profile: str):
# ...
代码
解释
profile: str
告诉 Pythonprofile应该是str。
使用上下文管理器打开playwright:
with sync_playwright() as p:
# ...
定义提取数据的结构:
profile_data = {
"basic_info": {},
"about": {},
"co_authors": [],
"publications": [],
}
午餐浏览器实例,打开和goto页面并将响应传递给 HTML/XML 解析器:
browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
page.goto(f"https://www.researchgate.net/profile/{profile}")
selector = Selector(text=page.content())
代码
解释
p.chromium.launch()
启动 Chromium 浏览器实例。
headless
明确告诉playwright在无头模式下运行,即使它是默认值。
slow_mo
告诉playwright减慢执行速度。
browser.new_page()
打开新页面。user_agent用于充当真实用户从浏览器发出请求。如果不使用,它将默认为playwright值,即None。检查你的用户代理是什么。
更新basic_info字典键,创建新键,并分配提取的数据:
profile_data["basic_info"]["name"] = selector.css(".nova-legacy-e-text.nova-legacy-e-text--size-xxl::text").get()
profile_data["basic_info"]["institution"] = selector.css(".nova-legacy-v-institution-item__stack-item a::text").get()
profile_data["basic_info"]["department"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__meta-data-item:nth-child(1)").xpath("normalize-space()").get()
profile_data["basic_info"]["current_position"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__info-section-list-item").xpath("normalize-space()").get()
profile_data["basic_info"]["lab"] = selector.css(".nova-legacy-o-stack__item .nova-legacy-e-link--theme-bare b::text").get()
代码
解释
profile_data["basic_info"]["name"]
访问之前创建的basic_info密钥,然后创建一个新的["name"]密钥,并分配提取的数据。
css()
从传递的 CSS 选择器中解析数据。每个CSS 查询都使用csselect包转换为 XPath。
::text
从节点中提取文本数据。
get()
从匹配节点获取实际数据
zoz100077
也解析空白文本节点。默认情况下,XPath 会跳过空白文本节点。
更新about字典键,创建新键,并分配提取的数据:
profile_data["about"]["number_of_publications"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(1)").xpath("normalize-space()").get()).group()
profile_data["about"]["reads"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(2)").xpath("normalize-space()").get()).group()
profile_data["about"]["citations"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(3)").xpath("normalize-space()").get()).group()
profile_data["about"]["introduction"] = selector.css(".nova-legacy-o-stack__item .Linkify").xpath("normalize-space()").get()
profile_data["about"]["skills"] = selector.css(".nova-legacy-l-flex__item .nova-legacy-e-badge ::text").getall()
代码
解释
profile_data["basic_info"]["name"]
访问之前创建的basic_info密钥,然后创建一个新的["name"]密钥,并分配提取的数据。
re.search(r"\d+", <returned_data_from_parsel>).group()
通过re.search()正则表达式\d+从返回的字符串中提取数字数据。group()是提取正则表达式匹配的子字符串。
css()
从传递的 CSS 选择器中解析数据。每个CSS 查询都使用csselect包转换为 XPath。
::text
从节点中提取文本数据。
get()/getall()
从匹配的节点获取实际数据,或者从](https://github.com/scrapy/parsel/blob/90397dcd0b2c1cbb91e44f65c50f9e11628ba028/parsel/selector.py#L447-L451)节点获取匹配数据的[的list。
xpath("normalize-space()")
也解析空白文本节点。默认情况下,XPath 会跳过空白文本节点。
遍历共同作者并提取单个共同作者,并附加到临时列表中:
for co_author in selector.css(".nova-legacy-c-card--spacing-xl .nova-legacy-c-card__body--spacing-inherit .nova-legacy-v-person-list-item"):
profile_data["co_authors"].append({
"name": co_author.css(".nova-legacy-v-person-list-item__align-content .nova-legacy-e-link::text").get(),
"link": co_author.css(".nova-legacy-l-flex__item a::attr(href)").get(),
"avatar": co_author.css(".nova-legacy-l-flex__item .lite-page-avatar img::attr(data-src)").get(),
"current_institution": co_author.css(".nova-legacy-v-person-list-item__align-content li").xpath("normalize-space()").get()
})
代码
解释
::attr(attribute)
从节点中提取属性数据。
接下来是遍历所有出版物并提取单个出版物,并附加到一个临时列表:
for publication in selector.css("#publications+ .nova-legacy-c-card--elevation-1-above .nova-legacy-o-stack__item"):
profile_data["publications"].append({
"title": publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get(),
"date_published": publication.css(".nova-legacy-v-publication-item__meta-data-item span::text").get(),
"authors": publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall(),
"publication_type": publication.css(".nova-legacy-e-badge--theme-solid::text").get(),
"description": publication.css(".nova-legacy-v-publication-item__description::text").get(),
"publication_link": publication.css(".nova-legacy-c-button-group__item .nova-legacy-c-button::attr(href)").get(),
})
打印提取的数据,close浏览器实例:
print(json.dumps(profile_data, indent=2, ensure_ascii=False))
browser.close()
# call function. "profiles" could be a list of authors.
# author name should be with a "-", otherwise ResearchGate doesn't recognize it.
scrape_researchgate_profile(profile="Agnis-Stibe")
JSON 输出的一部分:
{
"basic_info": {
"name": "Agnis Stibe",
"institution": "EM Normandie Business School",
"department": "Supply Chain Management & Decision Sciences",
"current_position": "Artificial Inteligence Program Director",
"lab": "Riga Technical University"
},
"about": {
"number_of_publications": "71",
"reads": "40",
"citations": "572",
"introduction": "4x TEDx speaker, MIT alum, YouTube creator. Globally recognized corporate consultant and scientific advisor at AgnisStibe.com. Provides a science-driven STIBE method and practical tools for hyper-performance. Academic Director on Artificial Intelligence and Professor of Transformation at EM Normandie Business School. Paris Lead of Silicon Valley founded Transformative Technology community. At the renowned Massachusetts Institute of Technology, he established research on Persuasive Cities.",
"skills": [
"Social Influence",
"Behavior Change",
"Persuasive Design",
"Motivational Psychology",
"Artificial Intelligence",
"Change Management",
"Business Transformation"
]
},
"co_authors": [
{
"name": "Mina Khan",
"link": "profile/Mina-Khan-2",
"avatar": "https://i1.rgstatic.net/ii/profile.image/387771463159814-1469463329918_Q64/Mina-Khan-2.jpg",
"current_institution": "Massachusetts Institute of Technology"
}, ... other co-authors
],
"publications": [
{
"title": "Change Masters: Using the Transformation Gene to Empower Hyper-Performance at Work",
"date_published": "May 2020",
"authors": [
"Agnis Stibe"
],
"publication_type": "Article",
"description": "Achieving hyper-performance is an essential aim not only for organizations and societies but also for individuals. Digital transformation is reshaping the workplace so fast that people start falling behind, with their poor attitudes remaining the ultimate obstacle. The alignment of human-machine co-evolution is the only sustainable strategy for the...",
"publication_link": "https://www.researchgate.net/publication/342716663_Change_Masters_Using_the_Transformation_Gene_to_Empower_Hyper-Performance_at_Work"
}, ... other publications
]
}
加入我们Twitter|YouTube
添加功能请求💫 或错误🐞
更多推荐

所有评论(0)