Question

I am trying to extract some information with BeautifulSoup, but when I do, the result is full of very strange symbols. When I open the page directly in the browser everything looks fine, and the page has the tag <meta charset="utf-8">.

my code is:

import requests
import bs4

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'referrer': 'https://google.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = bs4.BeautifulSoup(r.text, "html.parser")
print(soup)

Nevertheless, the result I get is this:

J{$%X Àà’8}±ŸÅ

I guess it's something to do with the encoding, but I don't understand why, since the page is UTF-8.

It is worth clarifying that this only happens with some pages; with others I can extract the information without any problems.

Edit: updated with a sample URL.

Edit 2: added the headers dictionary, which is what triggers the problem.

Answers

The problem is the Accept-Encoding HTTP header. You have br in there, which means the Brotli compression method. The requests module cannot decode that out of the box, so the Brotli-compressed body ends up in r.text as gibberish. Remove br, and the server will respond with a compression method that requests can handle.
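You can confirm that it is a compression problem rather than a charset problem by checking what the server reports before parsing. A minimal check along these lines (nothing here is specific to this particular site):

import requests

url = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
# Keep 'br' in Accept-Encoding to reproduce the problem.
r = requests.get(url, headers={'Accept-Encoding': 'gzip, deflate, br'})

# Charset declared by the server in the Content-Type header, e.g. 'UTF-8'.
print(r.encoding)
# If this prints 'br', the body is Brotli-compressed; requests leaves it
# undecoded, which is what produces the strange symbols in r.text.
print(r.headers.get('Content-Encoding'))

With br removed, the full example looks like this: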

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'referrer': 'https://google.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',  # <-- 'br' removed
    'Accept-Language': 'en-US,en;q=0.9',
    'Pragma': 'no-cache',
}

urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)

Prints:

<!DOCTYPE html>

<html lang="fr-FR">
<head><style>img.lazy{min-height:1px}</style><link as="script" href="https://www.jcchouinard.com/wp-content/plugins/w3-total-cache/pub/js/lazyload.min.js?x73818" rel="preload"/>

...and so on.
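
Alternatively, if you want to keep br in the Accept-Encoding header, you can install Brotli support (the brotli or brotlicffi package); recent versions of urllib3 (1.25 or newer) then decode br responses transparently, so requests works as usual. A short sketch, assuming such a urllib3 version is in place:

# pip install brotli   (or: pip install brotlicffi)
import requests
from bs4 import BeautifulSoup

# With Brotli support installed, the 'br'-compressed response is decoded
# transparently, so r.text contains readable HTML again.
r = requests.get(
    'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/',
    headers={'Accept-Encoding': 'gzip, deflate, br'},
)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.title)

Another option is simply not to set Accept-Encoding at all and let requests fill it in, since it only advertises the compression methods it can actually decode.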