  • What will be scraped

  • Prerequisites

  • Full Code

  • Code Explanation

  • Links

  • Outro


Intro

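You can use the official Google Play Developer API, which has a default limit of 200,000 requests per day, plus a limit of 60 requests per hour for retrieving the list of reviews and individual reviews, which works out to roughly [1 request per minute](https://stackoverflow.com/a/55241419/15164646).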

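You can also use a complete third-party Google Play Store app scraping solution: google-play-scraper for Python, which has no external dependencies, or google-play-scraper for JavaScript. Third-party solutions are usually used to get around the quota limits.

You don't really need to read this post unless you want a step-by-step explanation that doesn't rely on browser automation such as playwright or selenium, since you can look at how the Python google-play-scraper regex solution works, how it scrapes app results, and how it scrapes review results.

This blog post is meant to give an idea and a practical step-by-step example of how to scrape a Google Play Store app using beautifulsoup and regular expressions.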

What will be scraped

[Image]

[Image]

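Prerequisites

Separate virtual environment

In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist with each other on the same system, which prevents library or Python version conflicts.

If you haven't worked with a virtual environment before, have a look at my dedicated [Python virtual environments tutorial using Virtualenv and Poetry](https://serpapi.com/blog/python-virtual-environments-using-virtualenv-and-poetry/) blog post to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries: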

pip install requests lxml beautifulsoup4

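Reduce the chance of being blocked

There's a chance that a request will be blocked. Have a look at how to reduce the chance of being blocked while web scraping; there are eleven methods for bypassing blocks from most websites. This blog post covers only the simplest one: the user-agent header.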


Full Code

from bs4 import BeautifulSoup
import requests, lxml, re, json
from datetime import datetime

# user-agent headers to act as a "real" user visit
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
}

# search query params
params = {
    "id": "com.nintendo.zara",  # app ID (package name)
    "gl": "RU"  # country
}


def scrape_google_store_app():
    html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=10).text
    soup = BeautifulSoup(html, "lxml")

    # where all app data will be stored
    app_data = []

    # The position of this <script> tag does not change, which is why index [12] is used;
    # the positions of the other <script> tags do change.
    # Index [12] holds the basic app information.
    # https://regex101.com/r/DrK0ih/1
    basic_app_info = json.loads(re.findall(r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>",
                                           str(soup.select("script")[12]), re.DOTALL)[0])

    app_name = basic_app_info["name"]
    app_type = basic_app_info["@type"]
    app_url = basic_app_info["url"]
    app_description = basic_app_info["description"].replace("\n", "")  # remove newline characters
    app_category = basic_app_info["applicationCategory"]
    app_operating_system = basic_app_info["operatingSystem"]
    app_main_thumbnail = basic_app_info["image"]

    app_content_rating = basic_app_info["contentRating"]
    app_rating = round(float(basic_app_info["aggregateRating"]["ratingValue"]), 1)  # 4.287856 -> 4.3
    app_reviews = basic_app_info["aggregateRating"]["ratingCount"]

    app_author = basic_app_info["author"]["name"]
    app_author_url = basic_app_info["author"]["url"]

    # https://regex101.com/r/VX8E7U/1
    app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))
    # keep only links that occur exactly once (links that repeat are dropped entirely)
    app_images = [item for item in app_images_data if app_images_data.count(item) == 1]

    # User comments
    app_user_comments = []

    # https://regex101.com/r/SrP5DS/1
    app_user_reviews_data = re.findall(r"(\[\"gp.*?);</script>",
                                       str(soup.select("script")), re.DOTALL)

    for review in app_user_reviews_data:
        # https://regex101.com/r/M24tiM/1
        user_name = re.findall(r"\"gp:.*?\",\s?\[\"(.*?)\",", str(review))
        # https://regex101.com/r/TGgR45/1
        user_avatar = [avatar.replace('"', "") for avatar in re.findall(r"\"gp:.*?\"(https.*?\")", str(review))]

        # replace single/double quotes at the start/end of a string
        # https://regex101.com/r/iHPOrI/1
        user_comments = [comment.replace('"', "").replace("'", "") for comment in
                        re.findall(r"gp:.*?https:.*?]]],\s?\d+?,.*?,\s?(.*?),\s?\[\d+,", str(review))]

        # comment utc timestamp
        # use datetime.utcfromtimestamp(int(date)).date() to have only a date
        user_comment_date = [str(datetime.utcfromtimestamp(int(date))) for date in re.findall(r"\[(\d+),", str(review))]

        # https://regex101.com/r/GrbH9A/1
        user_comment_id = [ids.replace('"', "") for ids in re.findall(r"\[\"(gp.*?),", str(review))]
        # https://regex101.com/r/jRaaQg/1
        user_comment_likes = re.findall(r",?\d+\],?(\d+),?", str(review))
        # https://regex101.com/r/Z7vFqa/1
        user_comment_app_rating = re.findall(r"\"gp.*?https.*?\],(.*?)?,", str(review))

        for name, avatar, comment, date, comment_id, likes, user_app_rating in zip(user_name,
                                                                                   user_avatar,
                                                                                   user_comments,
                                                                                   user_comment_date,
                                                                                   user_comment_id,
                                                                                   user_comment_likes,
                                                                                   user_comment_app_rating):
            app_user_comments.append({
                "user_name": name,
                "user_avatar": avatar,
                "comment": comment,
                "user_app_rating": user_app_rating,
                "user__comment_likes": likes,
                "user_comment_published_at": date,
                "user_comment_id": comment_id
            })

    # build the app record once, after all reviews have been processed
    app_data.append({
        "app_name": app_name,
        "app_type": app_type,
        "app_url": app_url,
        "app_main_thumbnail": app_main_thumbnail,
        "app_description": app_description,
        "app_content_rating": app_content_rating,
        "app_category": app_category,
        "app_operating_system": app_operating_system,
        "app_rating": app_rating,
        "app_reviews": app_reviews,
        "app_author": app_author,
        "app_author_url": app_author_url,
        "app_screenshots": app_images
    })

    return {"app_data": app_data, "app_user_comments": app_user_comments}


print(json.dumps(scrape_google_store_app(), indent=2))

# output:

{
  "app_data": [
    {
      "app_name": "Super Mario Run",
      "app_type": "SoftwareApplication",
      "app_url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_US&gl=US",
      "app_main_thumbnail": "https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E",
      "app_description": "A new kind of Mario game that you can play with one hand.You control Mario by tapping as he constantly runs forward. You time your taps to pull off stylish jumps, midair spins, and wall jumps to gather coins and reach the goal!Super Mario Run can be downloaded for free and after you purchase the game, you will be able to play all the modes with no additional payment required. You can try out all four modes before purchase: World Tour, Toad Rally, Remix 10, and Kingdom Builder.\u25a0World TourRun and jump with style to rescue Princess Peach from Bowser\u2019s clutches! Travel through plains, caverns, ghost houses, airships, castles, and more.Clear the 24 exciting courses to rescue Princess Peach from Bowser, waiting in his castle at the end. There are many ways to enjoy the courses, such as collecting the 3 different types of colored coins or by competing for the highest score against your friends. You can try courses 1-1 to 1-4 for free.After rescuing Princess Peach, a nine-course special world, World Star, will appear.\u25a0Remix 10Some of the shortest Super Mario Run courses you'll ever play!This mode is Super Mario Run in bite-sized bursts! You'll play through 10 short courses one after the other, with the courses changing each time you play. Daisy is lost somewhere in Remix 10, so try to clear as many courses as you can to find her!\u25a0Toad RallyShow off Mario\u2019s stylish moves, compete against your friends, and challenge people from all over the world.In this challenge mode, the competition differs each time you play.Compete against the stylish moves of other players for the highest score as you gather coins and get cheered on by a crowd of Toads. Fill the gauge with stylish moves to enter Coin Rush Mode to get more coins. If you win the rally, the cheering Toads will come live in your kingdom, and your kingdom will grow. 
\u25a0Kingdom BuilderGather coins and Toads to build your very own kingdom.Combine different buildings and decorations to create your own unique kingdom. There are over 100 kinds of items in Kingdom Builder mode. If you get more Toads in Toad Rally, the number of buildings and decorations available will increase. With the help of the friendly Toads you can gradually build up your kingdom.\u25a0What You Can Do After Purchasing All Worlds\u30fb All courses in World Tour are playableWhy not try out the bigger challenges and thrills available in all courses?\u30fb Easier to get Rally TicketsIt's easier to get Rally Tickets that are needed to play Remix 10 and Toad Rally. You can collect them in Kingdom Builder through Bonus Game Houses and ? Blocks, by collecting colored coins in World Tour, and more.\u30fb More playable charactersIf you rescue Princess Peach by completing course 6-4 and build homes for Luigi, Yoshi, and Toadette in Kingdom Builder mode, you can get them to join your adventures as playable characters. They play differently than Mario, so why not put their special characteristics to good use in World Tour and Toad Rally?\u30fb More courses in Toad RallyThe types of courses available in Toad Rally will increase to seven different types of courses, expanding the fun! Along with the new additions, Purple and Yellow Toads may also come to cheer for you.\u30fb More buildings and decorations in Kingdom BuilderThe types of buildings available will increase, so you'll be able to make your kingdom even more lively. You can also place Rainbow Bridges to expand your kingdom.\u30fb Play Remix 10 without having to waitYou can play Remix 10 continuously, without having to wait between each game.*Internet connectivity required to play. Data charges may apply. May contain advertisements.",
      "app_content_rating": "Everyone",
      "app_category": "GAME_ACTION",
      "app_operating_system": "ANDROID",
      "app_rating": 3.9,
      "app_reviews": "1615781",
      "app_author": "Nintendo Co., Ltd.",
      "app_author_url": "https://supermariorun.com/",
      "app_screenshots": [
        "https://play-lh.googleusercontent.com/dcv6Z-pr3MsSvxYh_UiwvJem8fktDUsvvkPREnPaHYienbhT31bZ2nUqHqGpM1jdal8",
        "https://play-lh.googleusercontent.com/SVYZCU-xg-nvaBeJ-rz6rHSSDp20AK-5AQPfYwI38nV8hPzFHEqIgFpc3LET-Dmu-Q",
        "https://play-lh.googleusercontent.com/Nne-dalTl8DJ9iius5oOLmFe-4DnvZocgf92l8LTV0ldr9JVQ2BgeW_Bbjb5nkVngrQ",
        "https://play-lh.googleusercontent.com/yIqljB_Jph_T_ITmVFTpmDV0LKXVHWmsyLOVyEuSjL2794nAhTBaoeZDpTZZLahyRsE",
        "https://play-lh.googleusercontent.com/5HdGRlNsBvHTNLo-vIsmRLR8Tr9degRfFtungX59APFaz8OwxTnR_gnHOkHfAjhLse7e",
        "https://play-lh.googleusercontent.com/bPhRpYiSMGKwO9jkjJk1raR7cJjMgPcUFeHyTg_I8rM7_6GYIO9bQm6xRcS4Q2qr6mRx",
        "https://play-lh.googleusercontent.com/7DOCBRsIE5KncQ0AzSA9nSnnBh0u0u804NAgux992BhJllLKGNXkMbVFWH5pwRwHUg",
        "https://play-lh.googleusercontent.com/PCaFxQba_CvC2pi2N9Wuu814srQOUmrW42mh-ZPCbk_xSDw3ubBX7vOQeY6qh3Id3YE",
        "https://play-lh.googleusercontent.com/fQne-6_Le-sWScYDSRL9QdG-I2hWxMbe2QbDOzEsyu3xbEsAb_f5raRrc6GUNAHBoQ",
        "https://play-lh.googleusercontent.com/ql7LENlEZaTq2NaPuB-esEPDXM2hs1knlLa2rWOI3uNuQ77hnC1lLKNJrZi9XKZFb4I",
        "https://play-lh.googleusercontent.com/UIHgekhfttfNCkd5qCJNaz2_hPn67fOkv40_5rDjf5xot-QhsDCo2AInl9036huUtCwf",
        "https://play-lh.googleusercontent.com/7iH7-GjfS_8JOoO7Q33JhOMnFMK-O8k7jP0MUI75mYALK0kQsMsHpHtIJidBZR46sfU",
        "https://play-lh.googleusercontent.com/czt-uL-Xx4fUgzj_JbNA--RJ3xsXtjAxMK7Q_wFZdoMM6nL_g-4S5bxxX3Di3QTCwgw",
        "https://play-lh.googleusercontent.com/e5HMIP0FW9MCoAEGYzji9JsrvyovpZ3StHiIANughp3dovUxdv_eHiYT5bMz38bowOI",
        "https://play-lh.googleusercontent.com/nv2BP1glvMWX11mHC8GWlh_UPa096_DFOKwLZW4DlQQsrek55pY2lHr29tGwf2FEXHM",
        "https://play-lh.googleusercontent.com/xwWDr_Ib6dcOr0H0OTZkHupwSrpBoNFM6AXNzNO27_RpX_BRoZtKIULKEkigX8ETOKI",
        "https://play-lh.googleusercontent.com/AxHkW996UZvDE21HTkGtQPU8JiQLzNxp7yLoQiSCN29Y54kZYvf9aWoR6EzAlnoACQ",
        "https://play-lh.googleusercontent.com/xFouF73v1_c5kS-mnvQdhKwl_6v3oEaLebsZ2inlJqIeF2eenXjUrUPJsjSdeAd41w",
        "https://play-lh.googleusercontent.com/a1pta2nnq6f_b9uV0adiD9Z1VVQrxSfX315fIQqgKDcy8Ji0BRC1H7z8iGnvZZaeg80",
        "https://play-lh.googleusercontent.com/SDAFLzC8i4skDJ2EcsEkXidcAJCql5YCZI76eQB15fVaD0j-ojxyxea00klquLVtNAw",
        "https://play-lh.googleusercontent.com/H7BcVUoygPu8f7oIs2dm7g5_vVt9N9878f-rGd0ACd-muaDEOK2774okryFfsXv9FaI",
        "https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E",
        "https://play-lh.googleusercontent.com/DGQjTn_Hp32i88g2YrbjrCwl0mqCPCzDjTwMkECh3wXyTv4y6zECR5VNbAH_At89jGgSJDQuSKsPSB-wVQ",
        "https://play-lh.googleusercontent.com/pzvdI66OFjncahvxJN714Tu5pHUJ_nJK--vg0tv5cpgaGNvjfwsxC-SKxoQh9_n_wEcCdSQF9FeuZeI"
      ]
    }
  ],
  "app_user_comments": [
    {
      "user_name": "Misha t",
      "user_avatar": "https://play-lh.googleusercontent.com/a/AATXAJxvYKOfPVaqDZg0FOUjJOV-W3qR6r_cMAz0XMgU\\u003dmo",
      "comment": "Fun game, but it does not warns you that only World 1 (out of 6) is free, others are behind paywall. Dont spend your time if you want full game for free",
      "user_app_rating": "1",
      "user__comment_likes": "9",
      "user_comment_published_at": "2021-09-07 19:01:46",
      "user_comment_id": "gp:AOqpTOFb_Fc_r33sWOwoSN8Zq4DDV7C9xuPTLUatfAplVowAb0NJbym2jv3j2DHBjT1o89y4z4vNZPblN60png"
    },
    {
      "user_name": "Dangan Carlo",
      "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14GiC291ZXTKihUQukmruZDx2MJjb5-tMmeCC7Ag3KQ",
      "comment": "The game is fun but you have to pay for the World 1 Boss Fight or do some challenges, and after that the rest of the game needs to be purchased. 10 dollars is too much for a game that can be beat under an hour and especially if its a mobile game. I doubt youll take down the 10 dollar price tag but itd be nice if you could.",
      "user_app_rating": "3",
      "user__comment_likes": "316",
      "user_comment_published_at": "2022-01-04 22:31:34",
      "user_comment_id": "gp:AOqpTOHLIo3dK33ItsiFuHzWadDIbu25RfHoTp8SGcZJTlY3BWMJk4X7FtJwABPSaq2tOd-LEo_3h7qnXOroJg"
    },
    {
      "user_name": "Marcus Hughes",
      "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gisjmr1druQ7QamKHqg0N9qq5ahGmaMQNhS2a35jw",
      "comment": "Amazing! Only one thing would make this game better.... ENDLESS MODE! It would be VERY awesome if you created an endless mode where you can endlessly repeat the same stage, only it gets more difficult as you progress, such as new enemies/obstacles, but its one stage.... Preferably itd be cool if every stage had this.",
      "user_app_rating": "5",
      "user__comment_likes": "193",
      "user_comment_published_at": "2022-01-09 19:36:23",
      "user_comment_id": "gp:AOqpTOEgyCt_NTr2otrPyb2w-l3j8xmROvlkx-xeadRCV2aAt1X9HOwXG7v2f1S0K5vvBM2d3JcY2qAvhSJ80Q"
    }, # OTHER USER COMMENTS
  ]
}
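
The \u25a0, \u2019 and \u30fb sequences in the output above appear because json.dumps() escapes non-ASCII characters by default. A small sketch with a shortened, hand-made description (not the real scraped value) shows the difference that ensure_ascii=False makes:

```python
import json

# Shortened, hand-made sample; the real description is much longer.
sample = {"app_description": "\u25a0World Tour: rescue Princess Peach from Bowser\u2019s clutches!"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample, indent=2)

# ensure_ascii=False keeps the characters exactly as they are.
readable = json.dumps(sample, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same Python dictionary, so the choice only affects how the printed or saved JSON looks.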

Code Explanation

Import libraries:

from bs4 import BeautifulSoup
import requests, lxml, re, json
from datetime import datetime
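  • BeautifulSoup and lxml to parse HTML.

  • requests to make a request to the website.

  • re to match, via regular expressions, the parts of the HTML where the needed data is located.

  • json to convert the parsed JSON data to a Python dictionary, and for pretty printing.

  • datetime to convert a UTC timestamp to a human-readable date format.

Create global request headers and search query params: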

# user-agent headers to act as a "real" user visit
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
}

# search query params
params = {
    "id": "com.nintendo.zara",  # app ID (package name)
    "gl": "RU"  # country
}
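  • user-agent is used to pretend that the request comes from a real user visiting in an actual browser, so that the website assumes it is not a bot sending the request.

Pass params and headers to the request: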

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=10).text
  • The timeout argument tells requests to stop waiting for a response after 10 seconds.
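
To see roughly what requests does with params, the query string can be built by hand. This is a minimal sketch using only the standard library's urllib.parse.urlencode; the names base_url and full_url are illustrative and not part of the original script.

```python
from urllib.parse import urlencode

base_url = "https://play.google.com/store/apps/details"
params = {
    "id": "com.nintendo.zara",  # app ID (package name)
    "gl": "RU",                 # country
}

# urlencode() turns the dict into "key=value" pairs joined by "&",
# which is the kind of query string that gets appended to the URL.
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
# https://play.google.com/store/apps/details?id=com.nintendo.zara&gl=RU
```

Passing a dict through params keeps the URL readable in code and leaves the escaping of special characters to the library.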

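Create a BeautifulSoup object from the returned HTML and pass it an HTML parser, in this case lxml:

soup = BeautifulSoup(html, "lxml")

Create a temporary list() to store the extracted app data:

app_data = []

Match the basic app information via regular expression: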

# https://regex101.com/r/DrK0ih/1
basic_app_info = json.loads(re.findall(r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>",
                                       str(soup.select("script")[12]), re.DOTALL)[0])
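  • re.findall() finds all matches of the pattern in the HTML. Follow the link in the comment to better understand what the regular expression matches.

  • (.*?) is a regex capture group (...), and .*? is a pattern that lazily captures everything.

  • str(soup.select("script")[12]) is the second re.findall() argument, which:

    • tells soup to grab all found <script> tags,

    • then grabs only index [12] from the returned <script> tags,

    • converts it to a string so the re module can process it.

  • re.DOTALL tells re to let . match everything, including newlines.

  • re.findall()[0] accesses the first index of the returned list of matches, which is the only match in this case, and so converts the type from list to str.

  • json.loads() converts the parsed JSON into a Python dictionary.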
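
The steps listed above can be tried without touching the network. The sketch below runs the same pattern against a tiny hand-written script tag; sample_script is an assumption standing in for the much larger JSON-LD block on the real page.

```python
import json
import re

# Hand-written stand-in for the real <script> tag (assumption: the real page
# embeds a much larger JSON-LD object in the same kind of tag).
sample_script = '''<script nonce="abc123" type="application/ld+json">{
  "@type": "SoftwareApplication",
  "name": "Example App"
}</script>'''

pattern = r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>"

# re.DOTALL lets "." match the newlines inside the JSON body.
matches = re.findall(pattern, sample_script, re.DOTALL)

# findall() returns a list of captured strings; [0] picks the only match.
basic_app_info = json.loads(matches[0])

print(type(matches).__name__, len(matches))  # list 1
print(basic_app_info["name"])                # Example App
```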

Access the data from the parsed JSON, now converted to a dictionary:

app_name = basic_app_info["name"]
app_type = basic_app_info["@type"]
app_url = basic_app_info["url"]
app_description = basic_app_info["description"].replace("\n", "")  # replace new line character with nothing
app_category = basic_app_info["applicationCategory"]
app_operating_system = basic_app_info["operatingSystem"]
app_main_thumbnail = basic_app_info["image"]

app_content_rating = basic_app_info["contentRating"]
app_rating = round(float(basic_app_info["aggregateRating"]["ratingValue"]), 1)  # 4.287856 -> 4.3
app_reviews = basic_app_info["aggregateRating"]["ratingCount"]

app_author = basic_app_info["author"]["name"]
app_author_url = basic_app_info["author"]["url"]
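
Indexing with [...] raises a KeyError if a field is absent. Whether Google Play ever omits fields such as aggregateRating or author is an assumption here, not something the original script checks; the sketch below shows how dict.get() could be used defensively on a hand-made dictionary.

```python
# Hand-made sample that mimics the shape of basic_app_info; the "author"
# key is left out on purpose to show the fallback behaviour.
basic_app_info = {
    "name": "Example App",
    "aggregateRating": {"ratingValue": "4.287856", "ratingCount": "1615781"},
}

rating_info = basic_app_info.get("aggregateRating", {})
raw_rating = rating_info.get("ratingValue")

# Same rounding as in the script: 4.287856 -> 4.3
app_rating = round(float(raw_rating), 1) if raw_rating is not None else None

# Chained .get() calls return None instead of raising KeyError.
app_author = basic_app_info.get("author", {}).get("name")

print(app_rating)  # 4.3
print(app_author)  # None
```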

Match the app screenshot data via regular expression:

# https://regex101.com/r/VX8E7U/1
# Follow the link to better understand what regular expression is matching.
app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))

Iterate over the matched images and filter out repeated links from the matched data:

app_images = [item for item in app_images_data if app_images_data.count(item) == 1]

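  • count(item) returns the number of occurrences of a value, in this case a link.

  • If a link occurs exactly once, it is added to the list; if it occurs more than once, it is skipped entirely, so this is not deduplication that keeps one copy.

📌Note: the count() method is not well suited to filtering elements of the same iterable it is called on, as it may lead to unwanted results.

Keep in mind that this is only an example: there are multiple ways to filter duplicates, so use whichever you are confident in. In this case the count() approach filters 71 links down to 23: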

app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))

app_images_not_filtered = [item for item in app_images_data]
app_images_filtered = [item for item in app_images_data if app_images_data.count(item) == 1]

print(len(app_images_not_filtered))
print(app_images_not_filtered)

print(len(app_images_filtered))
print(app_images_filtered)

'''
71
['https://play-lh.googleusercontent.com/dcv6Z-pr3MsSvxYh_UiwvJem8fktDUsvvkPREnPaHYienbhT31bZ2nUqHqGpM1jdal8', 'https://play-lh.googleusercontent.com/SVYZCU-xg-nvaBeJ-rz6rHSSDp20AK-5AQPfYwI38nV8hPzFHEqIgFpc3LET-Dmu-Q', 'https://play-lh.googleusercontent.com/Nne-dalTl8DJ9iius5oOLmFe-4DnvZocgf92l8LTV0ldr9JVQ2BgeW_Bbjb5nkVngrQ', 'https://play-lh.googleusercontent.com/yIqljB_Jph_T_ITmVFTpmDV0LKXVHWmsyLOVyEuSjL2794nAhTBaoeZDpTZZLahyRsE', 'https://play-lh.googleusercontent.com/5HdGRlNsBvHTNLo-vIsmRLR8Tr9degRfFtungX59APFaz8OwxTnR_gnHOkHfAjhLse7e', 'https://play-lh.googleusercontent.com/bPhRpYiSMGKwO9jkjJk1raR7cJjMgPcUFeHyTg_I8rM7_6GYIO9bQm6xRcS4Q2qr6mRx', 'https://play-lh.googleusercontent.com/7DOCBRsIE5KncQ0AzSA9nSnnBh0u0u804NAgux992BhJllLKGNXkMbVFWH5pwRwHUg', 'https://play-lh.googleusercontent.com/PCaFxQba_CvC2pi2N9Wuu814srQOUmrW42mh-ZPCbk_xSDw3ubBX7vOQeY6qh3Id3YE', 'https://play-lh.googleusercontent.com/fQne-6_Le-sWScYDSRL9QdG-I2hWxMbe2QbDOzEsyu3xbEsAb_f5raRrc6GUNAHBoQ', 'https://play-lh.googleusercontent.com/ql7LENlEZaTq2NaPuB-esEPDXM2hs1knlLa2rWOI3uNuQ77hnC1lLKNJrZi9XKZFb4I', 'https://play-lh.googleusercontent.com/UIHgekhfttfNCkd5qCJNaz2_hPn67fOkv40_5rDjf5xot-QhsDCo2AInl9036huUtCwf', 'https://play-lh.googleusercontent.com/7iH7-GjfS_8JOoO7Q33JhOMnFMK-O8k7jP0MUI75mYALK0kQsMsHpHtIJidBZR46sfU', 'https://play-lh.googleusercontent.com/czt-uL-Xx4fUgzj_JbNA--RJ3xsXtjAxMK7Q_wFZdoMM6nL_g-4S5bxxX3Di3QTCwgw', 'https://play-lh.googleusercontent.com/e5HMIP0FW9MCoAEGYzji9JsrvyovpZ3StHiIANughp3dovUxdv_eHiYT5bMz38bowOI', 'https://play-lh.googleusercontent.com/nv2BP1glvMWX11mHC8GWlh_UPa096_DFOKwLZW4DlQQsrek55pY2lHr29tGwf2FEXHM', 'https://play-lh.googleusercontent.com/xwWDr_Ib6dcOr0H0OTZkHupwSrpBoNFM6AXNzNO27_RpX_BRoZtKIULKEkigX8ETOKI', 'https://play-lh.googleusercontent.com/AxHkW996UZvDE21HTkGtQPU8JiQLzNxp7yLoQiSCN29Y54kZYvf9aWoR6EzAlnoACQ', 'https://play-lh.googleusercontent.com/xFouF73v1_c5kS-mnvQdhKwl_6v3oEaLebsZ2inlJqIeF2eenXjUrUPJsjSdeAd41w', 
'https://play-lh.googleusercontent.com/a1pta2nnq6f_b9uV0adiD9Z1VVQrxSfX315fIQqgKDcy8Ji0BRC1H7z8iGnvZZaeg80', 'https://play-lh.googleusercontent.com/SDAFLzC8i4skDJ2EcsEkXidcAJCql5YCZI76eQB15fVaD0j-ojxyxea00klquLVtNAw', 'https://play-lh.googleusercontent.com/H7BcVUoygPu8f7oIs2dm7g5_vVt9N9878f-rGd0ACd-muaDEOK2774okryFfsXv9FaI', 'https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E', 'https://play-lh.googleusercontent.com/iTZtyWYr4T-slu1nifgRqEhtMLmxcNagc2rDAyiWntDQWCVLlGR7rDvx0uK6z-zLujwv', 'https://play-lh.googleusercontent.com/iTZtyWYr4T-slu1nifgRqEhtMLmxcNagc2rDAyiWntDQWCVLlGR7rDvx0uK6z-zLujwv', 'https://play-lh.googleusercontent.com/EbEX3AN4FC4pu3lsElAHCiksluOVU8OgkgtWC43-wmm_aHVq2D65FmEM97bPexilUAvlAY5_4ARH8Tb3RxQ', 'https://play-lh.googleusercontent.com/_re6mcALPaqotePA0WkgYeOQ6TighHRUS62FRmREPEhyZPdGM3QmRjcSpiMt6Pz1O-WZyvEIy4mtGHj9zw', 'https://play-lh.googleusercontent.com/SI2XoFyY35xzlMz3cZdTH7SSxMDfJTJjKNtbso33YIyYknmxJnBLrfLPJ131gz3O259sB9gP9dcmSvRNFw', 'https://play-lh.googleusercontent.com/SI2XoFyY35xzlMz3cZdTH7SSxMDfJTJjKNtbso33YIyYknmxJnBLrfLPJ131gz3O259sB9gP9dcmSvRNFw', 'https://play-lh.googleusercontent.com/pzvdI66OFjncahvxJN714Tu5pHUJ_nJK--vg0tv5cpgaGNvjfwsxC-SKxoQh9_n_wEcCdSQF9FeuZeI', 'https://play-lh.googleusercontent.com/EbEX3AN4FC4pu3lsElAHCiksluOVU8OgkgtWC43-wmm_aHVq2D65FmEM97bPexilUAvlAY5_4ARH8Tb3RxQ', 'https://play-lh.googleusercontent.com/_re6mcALPaqotePA0WkgYeOQ6TighHRUS62FRmREPEhyZPdGM3QmRjcSpiMt6Pz1O-WZyvEIy4mtGHj9zw', 'https://play-lh.googleusercontent.com/_re6mcALPaqotePA0WkgYeOQ6TighHRUS62FRmREPEhyZPdGM3QmRjcSpiMt6Pz1O-WZyvEIy4mtGHj9zw', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 
'https://play-lh.googleusercontent.com/kMnXlmzr3b8Tbzs_xDy3vq12fnZ6PM4LVlxPlFMKf_VZVkk1v7xeAUpJxW6iYab9m_w', 'https://play-lh.googleusercontent.com/kMnXlmzr3b8Tbzs_xDy3vq12fnZ6PM4LVlxPlFMKf_VZVkk1v7xeAUpJxW6iYab9m_w', 'https://play-lh.googleusercontent.com/kMnXlmzr3b8Tbzs_xDy3vq12fnZ6PM4LVlxPlFMKf_VZVkk1v7xeAUpJxW6iYab9m_w', 'https://play-lh.googleusercontent.com/5BFo6FvdAn0c10xiDKO_GZMtHmn-4qxHTtF6rarC162rCNqnA7jub30CYWmzC_DZ1l4', 'https://play-lh.googleusercontent.com/5BFo6FvdAn0c10xiDKO_GZMtHmn-4qxHTtF6rarC162rCNqnA7jub30CYWmzC_DZ1l4', 'https://play-lh.googleusercontent.com/5BFo6FvdAn0c10xiDKO_GZMtHmn-4qxHTtF6rarC162rCNqnA7jub30CYWmzC_DZ1l4', 'https://play-lh.googleusercontent.com/9EYrR6ilBWJFLt_LE_QniHZjdYlG9on6PzTOqR9tBf3SKiWU4lIDOXq-kXrrSKyyEg', 'https://play-lh.googleusercontent.com/9EYrR6ilBWJFLt_LE_QniHZjdYlG9on6PzTOqR9tBf3SKiWU4lIDOXq-kXrrSKyyEg', 'https://play-lh.googleusercontent.com/9EYrR6ilBWJFLt_LE_QniHZjdYlG9on6PzTOqR9tBf3SKiWU4lIDOXq-kXrrSKyyEg', 'https://play-lh.googleusercontent.com/pyJ4VUsm75cc800LalWMZupRMG7o-JictgTeSIUKni2Hn2ncR4m22hgV_LhTahsRt0U', 'https://play-lh.googleusercontent.com/pyJ4VUsm75cc800LalWMZupRMG7o-JictgTeSIUKni2Hn2ncR4m22hgV_LhTahsRt0U', 'https://play-lh.googleusercontent.com/pyJ4VUsm75cc800LalWMZupRMG7o-JictgTeSIUKni2Hn2ncR4m22hgV_LhTahsRt0U', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/H8AjtJR4LviiM3M8dg1BrS7_XzHBziG91Cn-udo8w44fRPo5mwj6NL683JBJQslpnZOY', 'https://play-lh.googleusercontent.com/H8AjtJR4LviiM3M8dg1BrS7_XzHBziG91Cn-udo8w44fRPo5mwj6NL683JBJQslpnZOY', 'https://play-lh.googleusercontent.com/H8AjtJR4LviiM3M8dg1BrS7_XzHBziG91Cn-udo8w44fRPo5mwj6NL683JBJQslpnZOY', 
'https://play-lh.googleusercontent.com/DGP09C5sfjxaawTV0JUIFTDKJ0579kmss59AkjHzvz6ry6FSjTzjHGO8GiB3BwglPI5g', 'https://play-lh.googleusercontent.com/DGP09C5sfjxaawTV0JUIFTDKJ0579kmss59AkjHzvz6ry6FSjTzjHGO8GiB3BwglPI5g', 'https://play-lh.googleusercontent.com/DGP09C5sfjxaawTV0JUIFTDKJ0579kmss59AkjHzvz6ry6FSjTzjHGO8GiB3BwglPI5g', 'https://play-lh.googleusercontent.com/tUB0fBVdN0pLn_wpfKRETC3jcraaYc7nFEDCOFsE7SUK0WCKpUWO0k3pOi-x-bPkIAo', 'https://play-lh.googleusercontent.com/tUB0fBVdN0pLn_wpfKRETC3jcraaYc7nFEDCOFsE7SUK0WCKpUWO0k3pOi-x-bPkIAo', 'https://play-lh.googleusercontent.com/tUB0fBVdN0pLn_wpfKRETC3jcraaYc7nFEDCOFsE7SUK0WCKpUWO0k3pOi-x-bPkIAo', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/z6aS2wnyp16KA9CFEep7HvZd2DmwRfoR9NWm9oHWRw-tuXLE_CPbnb1OL39-a456EgA', 'https://play-lh.googleusercontent.com/H8AjtJR4LviiM3M8dg1BrS7_XzHBziG91Cn-udo8w44fRPo5mwj6NL683JBJQslpnZOY', 'https://play-lh.googleusercontent.com/H8AjtJR4LviiM3M8dg1BrS7_XzHBziG91Cn-udo8w44fRPo5mwj6NL683JBJQslpnZOY', 'https://play-lh.googleusercontent.com/H8AjtJR4LviiM3M8dg1BrS7_XzHBziG91Cn-udo8w44fRPo5mwj6NL683JBJQslpnZOY', 'https://play-lh.googleusercontent.com/DGP09C5sfjxaawTV0JUIFTDKJ0579kmss59AkjHzvz6ry6FSjTzjHGO8GiB3BwglPI5g', 'https://play-lh.googleusercontent.com/DGP09C5sfjxaawTV0JUIFTDKJ0579kmss59AkjHzvz6ry6FSjTzjHGO8GiB3BwglPI5g', 'https://play-lh.googleusercontent.com/DGP09C5sfjxaawTV0JUIFTDKJ0579kmss59AkjHzvz6ry6FSjTzjHGO8GiB3BwglPI5g', 'https://play-lh.googleusercontent.com/tUB0fBVdN0pLn_wpfKRETC3jcraaYc7nFEDCOFsE7SUK0WCKpUWO0k3pOi-x-bPkIAo', 'https://play-lh.googleusercontent.com/tUB0fBVdN0pLn_wpfKRETC3jcraaYc7nFEDCOFsE7SUK0WCKpUWO0k3pOi-x-bPkIAo', 'https://play-lh.googleusercontent.com/tUB0fBVdN0pLn_wpfKRETC3jcraaYc7nFEDCOFsE7SUK0WCKpUWO0k3pOi-x-bPkIAo']
23
['https://play-lh.googleusercontent.com/dcv6Z-pr3MsSvxYh_UiwvJem8fktDUsvvkPREnPaHYienbhT31bZ2nUqHqGpM1jdal8', 'https://play-lh.googleusercontent.com/SVYZCU-xg-nvaBeJ-rz6rHSSDp20AK-5AQPfYwI38nV8hPzFHEqIgFpc3LET-Dmu-Q', 'https://play-lh.googleusercontent.com/Nne-dalTl8DJ9iius5oOLmFe-4DnvZocgf92l8LTV0ldr9JVQ2BgeW_Bbjb5nkVngrQ', 'https://play-lh.googleusercontent.com/yIqljB_Jph_T_ITmVFTpmDV0LKXVHWmsyLOVyEuSjL2794nAhTBaoeZDpTZZLahyRsE', 'https://play-lh.googleusercontent.com/5HdGRlNsBvHTNLo-vIsmRLR8Tr9degRfFtungX59APFaz8OwxTnR_gnHOkHfAjhLse7e', 'https://play-lh.googleusercontent.com/bPhRpYiSMGKwO9jkjJk1raR7cJjMgPcUFeHyTg_I8rM7_6GYIO9bQm6xRcS4Q2qr6mRx', 'https://play-lh.googleusercontent.com/7DOCBRsIE5KncQ0AzSA9nSnnBh0u0u804NAgux992BhJllLKGNXkMbVFWH5pwRwHUg', 'https://play-lh.googleusercontent.com/PCaFxQba_CvC2pi2N9Wuu814srQOUmrW42mh-ZPCbk_xSDw3ubBX7vOQeY6qh3Id3YE', 'https://play-lh.googleusercontent.com/fQne-6_Le-sWScYDSRL9QdG-I2hWxMbe2QbDOzEsyu3xbEsAb_f5raRrc6GUNAHBoQ', 'https://play-lh.googleusercontent.com/ql7LENlEZaTq2NaPuB-esEPDXM2hs1knlLa2rWOI3uNuQ77hnC1lLKNJrZi9XKZFb4I', 'https://play-lh.googleusercontent.com/UIHgekhfttfNCkd5qCJNaz2_hPn67fOkv40_5rDjf5xot-QhsDCo2AInl9036huUtCwf', 'https://play-lh.googleusercontent.com/7iH7-GjfS_8JOoO7Q33JhOMnFMK-O8k7jP0MUI75mYALK0kQsMsHpHtIJidBZR46sfU', 'https://play-lh.googleusercontent.com/czt-uL-Xx4fUgzj_JbNA--RJ3xsXtjAxMK7Q_wFZdoMM6nL_g-4S5bxxX3Di3QTCwgw', 'https://play-lh.googleusercontent.com/e5HMIP0FW9MCoAEGYzji9JsrvyovpZ3StHiIANughp3dovUxdv_eHiYT5bMz38bowOI', 'https://play-lh.googleusercontent.com/nv2BP1glvMWX11mHC8GWlh_UPa096_DFOKwLZW4DlQQsrek55pY2lHr29tGwf2FEXHM', 'https://play-lh.googleusercontent.com/xwWDr_Ib6dcOr0H0OTZkHupwSrpBoNFM6AXNzNO27_RpX_BRoZtKIULKEkigX8ETOKI', 'https://play-lh.googleusercontent.com/AxHkW996UZvDE21HTkGtQPU8JiQLzNxp7yLoQiSCN29Y54kZYvf9aWoR6EzAlnoACQ', 'https://play-lh.googleusercontent.com/xFouF73v1_c5kS-mnvQdhKwl_6v3oEaLebsZ2inlJqIeF2eenXjUrUPJsjSdeAd41w', 
'https://play-lh.googleusercontent.com/a1pta2nnq6f_b9uV0adiD9Z1VVQrxSfX315fIQqgKDcy8Ji0BRC1H7z8iGnvZZaeg80', 'https://play-lh.googleusercontent.com/SDAFLzC8i4skDJ2EcsEkXidcAJCql5YCZI76eQB15fVaD0j-ojxyxea00klquLVtNAw', 'https://play-lh.googleusercontent.com/H7BcVUoygPu8f7oIs2dm7g5_vVt9N9878f-rGd0ACd-muaDEOK2774okryFfsXv9FaI', 'https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E', 'https://play-lh.googleusercontent.com/pzvdI66OFjncahvxJN714Tu5pHUJ_nJK--vg0tv5cpgaGNvjfwsxC-SKxoQh9_n_wEcCdSQF9FeuZeI']
'''
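
The count() == 1 filter and plain deduplication give different results, which is easy to see on a small list. The sketch below uses short placeholder strings instead of real URLs, and dict.fromkeys() is shown as one common alternative, not as the method used in this post.

```python
# Short stand-in values instead of real screenshot URLs.
links = ["a", "b", "b", "c", "c", "c", "d"]

# Approach used in this post: keep only values that occur exactly once.
occurs_once = [item for item in links if links.count(item) == 1]

# Alternative: keep the first copy of every value, preserving order.
# dict keys are unique and remember insertion order (Python 3.7+).
deduplicated = list(dict.fromkeys(links))

print(occurs_once)   # ['a', 'd']
print(deduplicated)  # ['a', 'b', 'c', 'd']
```

Which one is appropriate depends on whether repeated links are unwanted noise (as with the repeated images on this page) or legitimate items that should be kept once.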

Create a temporary list() for the user comments data:

app_user_comments = []

Match the user comments data using a regular expression:

# https://regex101.com/r/SrP5DS/1
# Follow the link to better understand what the regular expression is matching.
app_user_reviews_data = re.findall(r"(\[\"gp.*?);</script>",
                                   str(soup.select("script")), re.DOTALL)
  • re.DOTALL tells re to let . match everything, including newlines.
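
The effect of re.DOTALL is easiest to see when the text contains a newline. The sample string below is a hand-made stand-in for a review payload, not real page data.

```python
import re

# Hand-made stand-in for a review payload that spans two lines.
sample = '<script>data = ["gp:AOqpTO_example",\n["Some User"]];</script>'
pattern = r"(\[\"gp.*?);</script>"

# Without DOTALL, "." stops at the newline, so the pattern cannot reach ";</script>".
without_dotall = re.findall(pattern, sample)

# With DOTALL, "." also matches the newline and the whole payload is captured.
with_dotall = re.findall(pattern, sample, re.DOTALL)

print(without_dotall)    # []
print(len(with_dotall))  # 1
```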

Iterate over app_user_reviews_data to extract all available reviews, matching the appropriate data with regular expressions:

# Follow the links to better understand what regular expressions are matching.

for review in app_user_reviews_data:
    # https://regex101.com/r/M24tiM/1
    user_name = re.findall(r"\"gp:.*?\",\s?\[\"(.*?)\",", str(review))

    # https://regex101.com/r/TGgR45/1
    user_avatar = [avatar.replace('"', "") for avatar in re.findall(r"\"gp:.*?\"(https.*?\")", str(review))]

    # replace single/double quotes at the start/end of a string
    # https://regex101.com/r/iHPOrI/1
    user_comment = [comment.replace('"', "").replace("'", "") for comment in
                    re.findall(r"gp:.*?https:.*?]]],\s?\d+?,.*?,\s?(.*?),\s?\[\d+,", str(review))]

    # comment utc timestamp
    # use datetime.utcfromtimestamp(int(date)).date() to have only a date: 2022-01-09
    user_comment_date = [str(datetime.utcfromtimestamp(int(date))) for date in re.findall(r"\[(\d+),", str(review))]

    # https://regex101.com/r/GrbH9A/1
    user_comment_id = [ids.replace('"', "") for ids in re.findall(r"\[\"(gp.*?),", str(review))]

    # https://regex101.com/r/jRaaQg/1
    user_comment_likes = re.findall(r",?\d+\],?(\d+),?", str(review))

    # https://regex101.com/r/Z7vFqa/1
    user_comment_app_rating = re.findall(r"\"gp.*?https.*?\],(.*?)?,", str(review))
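
The timestamp conversion can be checked in isolation. datetime.utcfromtimestamp() is deprecated as of Python 3.12, so this sketch uses the timezone-aware datetime.fromtimestamp(..., tz=timezone.utc) instead; the timestamp is a sample value chosen for illustration, not one copied from the page.

```python
from datetime import datetime, timezone

timestamp = "1641756983"  # sample value; the regex returns timestamps as strings

# Convert the Unix timestamp (seconds since 1970-01-01 UTC) to an aware datetime.
published_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)

print(published_at.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-01-09 19:36:23
print(published_at.date())                          # 2022-01-09
```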

Create another for loop to iterate over all the user comments data in parallel:

for name, avatar, comment, date, comment_id, likes, user_app_rating in zip(user_name,
                                                                           user_avatar,
                                                                           user_comment,
                                                                           user_comment_date,
                                                                           user_comment_id,
                                                                           user_comment_likes,
                                                                           user_comment_app_rating):
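  • zip() takes iterables and aggregates them element by element into tuples, which it returns as an iterator. In this case the number of values is expected to be the same for every list: for example, if there are 30 names, there will also be 30 comment IDs, 30 comment dates, and so on. Note that zip() stops at the shortest iterable, so a list with a missing value would silently shorten the result.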
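
How zip() behaves when the lists have different lengths can be sketched with two short hand-made lists. The length check at the end is one possible guard and is not part of the original script.

```python
names = ["Misha t", "Dangan Carlo", "Marcus Hughes"]
ratings = ["1", "3"]  # one rating is missing on purpose

# zip() stops at the shortest input, so the third name is silently dropped.
pairs = list(zip(names, ratings))
print(pairs)  # [('Misha t', '1'), ('Dangan Carlo', '3')]

# A simple guard: compare the lengths before zipping.
lengths_match = len({len(names), len(ratings)}) == 1
print(lengths_match)  # False
```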

Append the user comments data to the temporary list:

# for name, ... in zip(...) is here

app_user_comments.append({
    "user_name": name,
    "user_avatar": avatar,
    "comment": comment,
    "user_app_rating": user_app_rating,
    "user__comment_likes": likes,
    "user_comment_published_at": date,
    "user_comment_id": comment_id
})

Append the app information data to the temporary list:

app_data.append({
    "app_name": app_name,
    "app_type": app_type,
    "app_url": app_url,
    "app_main_thumbnail": app_main_thumbnail,
    "app_description": app_description,
    "app_content_rating": app_content_rating,
    "app_category": app_category,
    "app_operating_system": app_operating_system,
    "app_rating": app_rating,
    "app_reviews": app_reviews,
    "app_author": app_author,
    "app_author_url": app_author_url,
    "app_screenshots": app_images
})

Return the app and user comments data as a dict:

return {"app_data": app_data, "app_user_comments": app_user_comments}

An example of accessing the extracted user comments data:

for review in scrape_google_store_app()["app_user_comments"]:
    print(review["user_name"],
          review["user_avatar"],
          review["comment"],
          review["user_app_rating"],
          review["user__comment_likes"], sep='\n')  # key is spelled with a double underscore in the returned dict

# part of the output (in this case 40 comments in total):
'''
Misha t
https://play-lh.googleusercontent.com/a/AATXAJxvYKOfPVaqDZg0FOUjJOV-W3qR6r_cMAz0XMgU\u003dmo
Fun game, but it does not warns you that only World 1 (out of 6) is free, others are behind paywall. Dont spend your time if you want full game for free
1
9
Dangan Carlo
https://play-lh.googleusercontent.com/a-/AOh14GiC291ZXTKihUQukmruZDx2MJjb5-tMmeCC7Ag3KQ
The game is fun but you have to pay for the World 1 Boss Fight or do some challenges, and after that the rest of the game needs to be purchased. 10 dollars is too much for a game that can be beat under an hour and especially if its a mobile game. I doubt youll take down the 10 dollar price tag but itd be nice if you could.
3
316
Marcus Hughes
https://play-lh.googleusercontent.com/a-/AOh14Gisjmr1druQ7QamKHqg0N9qq5ahGmaMQNhS2a35jw
Amazing! Only one thing would make this game better.... ENDLESS MODE! It would be VERY awesome if you created an endless mode where you can endlessly repeat the same stage, only it gets more difficult as you progress, such as new enemies/obstacles, but its one stage.... Preferably itd be cool if every stage had this.
5
193
'''
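
The first avatar URL in the output ends with \u003dmo, a JavaScript-style escape for = that survives because the value was cut out of a script tag as raw text. One possible clean-up, sketched with the standard library only (decode_unicode_escapes is a hypothetical helper, not part of the original script):

```python
import re


def decode_unicode_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they stand for."""
    return re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)


# Raw string, so the backslash stays a literal backslash as in the scraped value.
avatar = r"https://play-lh.googleusercontent.com/a/AATXAJxvYKOfPVaqDZg0FOUjJOV-W3qR6r_cMAz0XMgU\u003dmo"

print(decode_unicode_escapes(avatar))
# https://play-lh.googleusercontent.com/a/AATXAJxvYKOfPVaqDZg0FOUjJOV-W3qR6r_cMAz0XMgU=mo
```

The helper handles only four-digit \uXXXX escapes one at a time, which is enough for cases like \u003d; strings without such escapes pass through unchanged.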

Links

  • Code in the online IDE

  • GitHub repository

  • Python google-play-scraper

  • JavaScript google-play-scraper


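Outro

With a scraper like this you can build a dataset of an app's competitors, of apps in a specific category, of the most downloaded or top-rated apps, or for any other useful analysis.

If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at @dimitryzub.

Yours, Dmitriy, and the rest of the SerpApi Team.

Join us on Reddit | Twitter | YouTube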
