🔨 What's needed: an understanding of loops, data structures, and exception handling, plus the serpapi, pandas, and urllib libraries.

⏱️ How long it takes: about 10-20 minutes to read and implement.


  • Intro

  • What will be scraped

  • Prerequisites

  • Process

      • Scrape and save case law results

  • Links

  • Outro


Intro

This blog post walks you through scraping Google Scholar Case Law results from all available pages for a given search query, using the SerpApi google-search-results library.

What will be scraped

[Image: scrape_google_scholar_case_law_what_will_be_scraped_01]

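Prerequisites

Separate virtual environment

If you haven't worked with a virtual environment before, have a look at my dedicated [Python virtual environments tutorial using Virtualenv and Poetry](https://serpapi.com/blog/python-virtual-environments-using-virtualenv-and-poetry/) blog post to get familiar.

In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system. This prevents library or Python version conflicts.

Install libraries: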

pip install pandas
pip install google-search-results

Process

If you don't need an explanation, try the code in the online IDE.

Scrape Google Scholar case law results and save them to CSV

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd

def case_law_results():

    print("Extracting case law results..")

    params = {
        "api_key": os.getenv("API_KEY"),  # SerpApi API key
        "engine": "google_scholar",       # Google Scholar search results
        "q": "minecraft education",       # search query
        "hl": "en",                       # language
        "start": "0",                     # first page
        "as_sdt": "6"                     # case law results. Weird, huh? Try without it.
    }
    search = GoogleSearch(params)

    case_law_results_data = []

    loop_is_true = True
    while loop_is_true:
      results = search.get_dict()

      print(f"Currently extracting page №{results['serpapi_pagination']['current']}..")

      for result in results["organic_results"]:
        title = result["title"]
        publication_info_summary = result["publication_info"]["summary"]
        result_id = result["result_id"]
        link = result["link"]
        result_type = result.get("type")
        snippet = result["snippet"]

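        # "resources" is optional: fall back to None when it is missing or empty
        try:
          file_title = result["resources"][0]["title"]
        except (KeyError, IndexError):
          file_title = None

        try:
          file_link = result["resources"][0]["link"]
        except (KeyError, IndexError):
          file_link = None

        try:
          file_format = result["resources"][0]["file_format"]
        except (KeyError, IndexError):
          file_format = None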

        cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
        cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
        cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
        total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
        all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
        all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})

        case_law_results_data.append({
          "page_number": results['serpapi_pagination']['current'],
          "position": result["position"] + 1,
          "result_type": result_type,
          "title": title,
          "link": link,
          "result_id": result_id,
          "publication_info_summary": publication_info_summary,
          "snippet": snippet,
          "cited_by_count": cited_by_count,
          "cited_by_link": cited_by_link,
          "cited_by_id": cited_by_id,
          "total_versions": total_versions,
          "all_versions_link": all_versions_link,
          "all_versions_id": all_versions_id,
          "file_format": file_format,
          "file_title": file_title,
          "file_link": file_link,
        })

      # check for the next page once per page, after all of its results are processed
      if "next" in results["serpapi_pagination"]:
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
      else:
        loop_is_true = False

    return case_law_results_data


def save_case_law_results_to_csv():
    print("Waiting for case law results to save..")
    pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)

    print("Case Law Results Saved.")


# run the scraper when the file is executed as a script
if __name__ == "__main__":
    save_case_law_results_to_csv()

Scrape and save case law results: code explanation

Import libraries:

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd
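  • pandas is used to easily save the extracted data to a CSV file.

  • urllib is used in the pagination process.

  • os is used to return the value of the SerpApi API key environment variable.

Create the search parameters that are passed to SerpApi, and create a temporary list() to store the extracted data: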

params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # Google Scholar search results
    "q": "minecraft education",       # search query
    "hl": "en",                       # language
    "start": "0",                     # first page
    "as_sdt": "6"                     # case law results
}
search = GoogleSearch(params)

case_law_results_data = []

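as_sdt determines and filters which courts are targeted in the API call. Refer to the list of SerpApi supported Google Scholar courts, or select a court on Google Scholar and pass it to the as_sdt parameter.

Note: to search results from the Missouri Court of Appeals, the as_sdt parameter becomes as_sdt=4,204. Pay attention to the leading "4,": without it, article results appear instead.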
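As a rough illustration of the note above, the as_sdt value can be assembled from court IDs. The helper name build_as_sdt is hypothetical and not part of the SerpApi library; 204 is the Missouri Court of Appeals ID from the note, and joining several IDs with commas is an assumption you should check against the SerpApi courts list.

```python
def build_as_sdt(court_ids):
    """Join court IDs into an as_sdt value, keeping the required leading 4.

    Hypothetical helper for illustration only.
    """
    return ",".join(["4", *[str(court_id) for court_id in court_ids]])


params = {
    "engine": "google_scholar",
    "q": "minecraft education",
    "hl": "en",
    "start": "0",
    "as_sdt": build_as_sdt([204]),  # Missouri Court of Appeals
}

print(params["as_sdt"])
```

Keeping the "4," prefix inside the helper makes it harder to forget, which is the mistake the note warns about.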

Set up a while loop and add an if statement to be able to exit the loop:

loop_is_true = True
while loop_is_true:
    results = search.get_dict()

    # extraction code here... 

    # if next page is present -> update previous results to new page results.
    # if next page is not present -> exit the while loop.
    if "next" in results["serpapi_pagination"]:
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        loop_is_true = False

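search.params_dict.update() splits the next page URL and passes the updated search parameter values as a dictionary to the GoogleSearch object (search), so the following search.get_dict() call requests the next page.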
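To see what the URL splitting step produces without calling the API, here is a minimal offline sketch. The URL below is made up in the shape of a serpapi_pagination "next" link, and next_page_params is a hypothetical helper name; only urlsplit and parse_qsl come from the tutorial code.

```python
from urllib.parse import urlsplit, parse_qsl


def next_page_params(next_url):
    """Turn the query string of a next-page URL into a dict of search parameters."""
    query = urlsplit(next_url).query
    return dict(parse_qsl(query))


# Made-up URL for illustration; a real one comes from results["serpapi_pagination"]["next"].
example_next_url = (
    "https://serpapi.com/search.json"
    "?as_sdt=6&engine=google_scholar&hl=en&q=minecraft+education&start=10"
)

current_params = {
    "engine": "google_scholar",
    "q": "minecraft education",
    "hl": "en",
    "start": "0",
    "as_sdt": "6",
}

# Same idea as search.params_dict.update(...): overwrite old values with the next page's values.
current_params.update(next_page_params(example_next_url))

print(current_params["start"])
```

Only the values that differ between pages (here, start) effectively change; the rest are overwritten with identical values.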

Extract the results in a for loop and handle exceptions:

for result in results["organic_results"]:
    title = result["title"]
    publication_info_summary = result["publication_info"]["summary"]
    result_id = result["result_id"]
    link = result["link"]
    result_type = result.get("type")
    snippet = result["snippet"]

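    # "resources" is optional: fall back to None when it is missing or empty
    try:
      file_title = result["resources"][0]["title"]
    except (KeyError, IndexError):
      file_title = None

    try:
      file_link = result["resources"][0]["link"]
    except (KeyError, IndexError):
      file_link = None

    try:
      file_format = result["resources"][0]["file_format"]
    except (KeyError, IndexError):
      file_format = None

    # each .get() falls back to an empty {} dict when a key is missing, so the chain never raises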
    cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
    cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
    cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
    total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
    all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
    all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})
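The two patterns used in the extraction step (try/except for list-based fields and chained .get() calls for nested dictionaries) can be tried on hand-written data. The sample results below are invented for illustration and only mimic the shape of an organic result; first_resource_field is a hypothetical helper, and this sketch returns None rather than {} for a missing final value.

```python
def first_resource_field(result, field):
    """Return a field of the first resource, or None if there are no resources."""
    try:
        return result["resources"][0][field]
    except (KeyError, IndexError):
        return None


def cited_by_total(result):
    """Walk nested dicts safely; each .get() falls back to {} until the last step."""
    return result.get("inline_links", {}).get("cited_by", {}).get("total")


# Invented sample data, not real API output.
result_with_extras = {
    "title": "Case A",
    "resources": [{"title": "example.org", "file_format": "PDF", "link": "https://example.org/a.pdf"}],
    "inline_links": {"cited_by": {"total": 12}},
}
result_bare = {"title": "Case B", "resources": []}

print(first_resource_field(result_with_extras, "file_format"), cited_by_total(result_with_extras))
print(first_resource_field(result_bare, "file_format"), cited_by_total(result_bare))
```

Catching only KeyError and IndexError keeps unrelated errors visible instead of silently turning them into None.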

Append the results as a dictionary {} to the temporary list():

case_law_results_data.append({
    "page_number": results['serpapi_pagination']['current'],
    "position": result["position"] + 1,
    "result_type": result_type,
    "title": title,
    "link": link,
    "result_id": result_id,
    "publication_info_summary": publication_info_summary,
    "snippet": snippet,
    "cited_by_count": cited_by_count,
    "cited_by_link": cited_by_link,
    "cited_by_id": cited_by_id,
    "total_versions": total_versions,
    "all_versions_link": all_versions_link,
    "all_versions_id": all_versions_id,
    "file_format": file_format,
    "file_title": file_title,
    "file_link": file_link,
})

Return the extracted data:

return case_law_results_data

Save the data returned by case_law_results() with to_csv():

pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)
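  • The data argument of DataFrame is your data.

  • The encoding='utf-8' argument just makes sure everything is saved correctly. I used it explicitly even though it's the default value.

  • The index=False argument drops the default pandas row numbers.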
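The effect of index=False is easy to check in memory. This is a small sketch with invented rows; it relies on to_csv() returning the CSV text as a string when no file path is given.

```python
import pandas as pd

# Invented rows in the same shape as the scraper's dictionaries.
rows = [
    {"title": "Case A", "file_format": "PDF"},
    {"title": "Case B", "file_format": None},  # missing value becomes an empty cell
]

# No path given, so to_csv() returns the CSV text instead of writing a file.
csv_text = pd.DataFrame(data=rows).to_csv(index=False)

print(csv_text)
```

With index=True (the default), every line would start with an extra row-number column.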


Links

  • Code in the online IDE

  • Google Scholar Organic Results API

  • SerpApi supported Google Scholar courts

  • Google Scholar list of courts


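Outro

If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at @dimitryzub or @serp_api.

Yours, Dimitry, and the rest of the SerpApi team.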


Join us on Reddit | Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
