How to scrape data from Shopee using BeautifulSoup
I'm a student currently learning BeautifulSoup, and my lecturer asked me to scrape data from Shopee, but I cannot scrape the details of the products. I'm trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales, and I only want the name and price of each product. Can someone tell me why I cannot scrape the data using BeautifulSoup?
Here is my code:
from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Answers
This question is a bit tricky for Python beginners because it involves a combination of Selenium (for browser automation) and BeautifulSoup (for HTML data extraction). Moreover, the problem is harder than it looks because the Document Object Model (DOM) is built by JavaScript: the product markup is not in the raw HTML that requests downloads. We know JavaScript is involved because we get an empty result when the page is accessed with BeautifulSoup alone:

for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
    print(item_n.get_text())
Therefore, to extract data from a webpage whose DOM is controlled by a scripting language, we have to drive a real browser with Selenium (so the website renders the page just as it would for a human visitor). We also need some sort of delay so the page has time to finish rendering before we read it. The WebDriverWait() class from the Selenium library helps with this (see the sketch below for how to pair it with an expected condition).
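A word of caution about WebDriverWait: on its own, WebDriverWait(browser, delay) does not actually block; it only becomes a real wait when paired with .until() and an expected condition. Here is a minimal sketch of the idiomatic explicit wait — it assumes the browser and delay objects created further below, and reuses the product-name classes from my selectors, which may have changed since:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# block for up to `delay` seconds until a product-name div is attached to the DOM,
# instead of sleeping for a fixed amount of time
wait = WebDriverWait(browser, delay)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div._1NoI8_._16BAGk')))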
I now present snippets of code that explain the process.
First, import the requisite libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
Next, initialize the settings for the automated browser. I'm using Chrome.
# create an object for Chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'

# set Chrome driver options to disable any pop-ups from the website;
# to find the local path of your Chrome profile, open the browser
# and type "chrome://version" in the address bar
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument('--start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# disable the message "Chrome is being controlled by automated test software"
chrome_options.add_argument('--disable-infobars')
# pass 1 to allow notifications and 2 to block them
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})

# invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)
delay = 5  # seconds
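A side note: on Selenium 4 the executable_path keyword is deprecated and later removed, so the invocation above may raise a TypeError. The equivalent call uses the Service class (same placeholder path as above):

from selenium.webdriver.chrome.service import Service

service = Service(r'C:/Users/username/Documents/playground_python/chromedriver.exe')
browser = webdriver.Chrome(service=service, options=chrome_options)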
Next, I declare empty list variables to hold the data.
# declare empty lists
item_cost, item_init_cost, item_loc = [],[],[]
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)  # give the JavaScript time to render the product grid
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        # print(html)
        soup = BeautifulSoup(html, "html.parser")
        # find_all() returns a list of elements; we loop through it,
        # select each element we need, and call get_text() on it
        # find the item names
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)
        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)
        # find the initial item cost
        for item_ic in soup.find_all('div', class_='_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)
        # find the total number of items sold per month
        for items_s in soup.find_all('div', class_='_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)  # was item_ic.text, a copy-paste bug
        # find the item discount percent
        for dp in soup.find_all('span', class_='percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)
        # find the item location
        for il in soup.find_all('div', class_='_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)
        break  # leave the loop once the elements have been scraped
    except TimeoutException:
        print("Loading took too much time! - Try again")
Thereafter, I use the zip function to combine the different list items.
rows = zip(item_name, item_init_cost, discount_percent, item_cost, items_sold, item_loc)
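One caveat with zip: it stops at the shortest input, so if one selector matched fewer elements than the others (for example, when some products carry no discount), whole rows are silently dropped. If you would rather keep ragged rows and pad the gaps, itertools.zip_longest is a drop-in alternative:

from itertools import zip_longest

# pads missing values with '' instead of truncating to the shortest list
rows = zip_longest(item_name, item_init_cost, discount_percent,
                   item_cost, items_sold, item_loc, fillvalue='')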
Finally, I write this data to disk:
import csv

newFilePath = 'shopee_item_list.csv'
# newline='' stops the csv module from writing blank lines between rows on Windows
with open(newFilePath, "w", newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
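If you would like column headers in the file, write one header row before the data rows. The column labels below are my own, not taken from the page:

with open(newFilePath, "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'initial_cost', 'discount_percent', 'cost', 'sold_per_month', 'location'])
    writer.writerows(rows)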
As good practice, it's wise to close the automated browser once the task is complete, so I end with:
# close the automated browser
browser.close()
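Note that browser.close() only closes the current window, while browser.quit() ends the whole session and shuts down the chromedriver process. A common pattern is to wrap the scraping in try/finally so the browser is shut down even if an exception escapes:

try:
    pass  # ... the scraping loop from above ...
finally:
    browser.quit()  # ends the session and the chromedriver process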
Result
Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor
Note to the readers
The OP brought to my attention that the selectors were not working as given in my answer. I checked the website again after two days and noticed a strange phenomenon: the class_ attribute of the div elements had indeed changed. I found a similar question, but it did not help much. So for now, I conclude that the div attributes on the Shopee website can change again, and I leave this as an open problem to solve later.
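Until the hashed class names stabilise, one way to reduce breakage is to match on a stable fragment of the class attribute instead of the full generated class list. The selector below is purely illustrative (I have not verified it against Shopee's current markup), but it shows the CSS attribute-contains idea that BeautifulSoup's select() supports:

# match any <div> whose class attribute contains the (hypothetical) stable substring
for item_n in soup.select('div[class*="_1NoI8_"]'):
    print(item_n.get_text())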
Note to the OP
Ana, the above code works for just one page, i.e. only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by working out how to scrape data from the multiple pages under the sales tag. Your hint is the 1/9 shown at the top right of the page and/or the 1 2 3 4 5 links at the bottom of the page (a starter sketch follows below). Another hint is to look at urljoin in the urllib.parse module. I hope this gets you started.
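To make that hint concrete, here is a minimal sketch of how the page URLs could be generated. It assumes the page= query parameter simply increments from 0, which you should verify in the browser:

base_url = 'https://shopee.com.my/shop/13377506/search'
for page in range(9):  # the "1/9" indicator suggests nine pages; confirm this yourself
    url = f'{base_url}?page={page}&sortBy=sales'
    browser.get(url)
    # ... re-run the extraction loop from above on each page ...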
Helpful resources
- XPATH tutorial