Question: CNN scraper only works intermittently in Python

I am trying to build a web scraper for CNN. My goal is to scrape all the news articles returned by a search query. Sometimes I get output for a few of the scraped pages, and sometimes it does not work at all.

I am using the selenium and BeautifulSoup packages in a Jupyter Notebook, and I page through the results via the URL parameters &page={}&from={}. I previously tried By.XPATH and simply clicking the "Next" button at the end of the page, but that gave me the same result.
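For context, a minimal sketch of what an explicit wait would look like here (an illustration only, assuming the cnn-search__results-list class used in the code below; not the exact code I ran):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get('https://edition.cnn.com/search?q=coronavirus&size=100&page=1&from=0')

# Block (up to 30 s) until the JS-rendered results list actually exists,
# instead of hoping an implicit wait has let the page finish loading
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'cnn-search__results-list'))
)
html = browser.page_source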

Here is the code I am using:

#0 ------------import libraries
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

#3 ------------CNN SCRAPER
#3.1 ----------Define Function
def CNN_Scraper(max_pages):
    base = "https://edition.cnn.com/"
    browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
    browser.implicitly_wait(30)  # returns None, so there is nothing to store
    base_url = 'https://edition.cnn.com/search?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100'
    
 #-------------Define empty lists to be scraped
    CNN_title = []
    CNN_date = []
    CNN_article = []
    CNN_link = []      # was missing, but is needed for the 'link' column below
    article_count = 0
        

 #-------------iterate over pages and extract   
    for page in range(1, max_pages + 1):
        print("Page %d" % page)
        
        url= base_url + "&page=%d&from=%d" % (page, article_count)
        browser.get(url)
        # NOTE: the implicit wait only applies to find_element* calls, so
        # page_source can be read before the JS-rendered results exist and
        # search_results below then comes back as None
        soup = BeautifulSoup(browser.page_source, 'lxml')
        search_results = soup.find('div', {'class':'cnn-search__results-list'})
        contents = search_results.find_all('div', {'class':'cnn-search__result-contents'})

        for content in contents:
            try:
                title = content.find('h3').text
                print(title)
                link = content.find('a')
                link_url = link['href']    

                date = content.find('div',{'class':'cnn-search__result-publish-date'}).text.strip()
                article = content.find('div',{'class':'cnn-search__result-body'}).text
            except (AttributeError, TypeError):
                # this result is missing one of the expected tags; skip it
                print("loser")
                continue
            CNN_title.append(title)
            CNN_date.append(date)
            CNN_article.append(article)
            CNN_link.append(link_url)
            
        article_count += 100   
        print("-----")
        
 #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = CNN_title
    df['date'] = CNN_date
    df['article'] = CNN_article
    df['link'] = CNN_link

    browser.quit()  # quit before returning; placed after return it never runs
    return df
    
#3.2 ----------Call Function - Scrape CNN and save pickled data
CNN_data = CNN_Scraper(2)
#CNN_data.to_pickle("CNN_data")

Answer

Call the backend API directly. See my previous answer for more details.

import requests


def main(url):
    with requests.Session() as req:
        # "from" steps by 100 to match the size=100 results per request
        for item in range(1, 1000, 100):
            r = req.get(url.format(item)).json()
            for a in r['result']:
                print("Headline: {}, Url: {}".format(
                    a['headline'], a['url']))


main("https://search.api.cnn.io/content?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100&from={}")