How to prevent beautifulsoap from extracting the information as strange symbols?
Answer a question
I am trying to extract an information with beautifulsoap, however when I do it it extracts it with very rare symbols. But when I enter directly to the page everything looks good and the page has the label <meta charset="utf-8">
my code is:
HEADERS = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = bs4.BeautifulSoup(r.text, "html.parser")
print (soup)
Nevertheless, the result I get is this:
J{$%X Àà’8}±ŸÅ
I guess it's something with the encoding, however I don't understand why since the page is utf-8.
It is worth clarifying that this only happens in some cases, since with others I manage to extract the information without problems.
Edit: updated with a sample url.
Edit2: added the headers dictionary, which is the one that generates the problem.
Answers
The problem is Accept-Encoding HTTP header. There you have br specified, which means brotli compression method. requests module cannot handle that. Remove br and the server responds without this compression method.
import requests
from bs4 import BeautifulSoup
HEADERS = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'referrer': 'https://google.com',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate', # <-- remove br
'Accept-Language': 'en-US,en;q=0.9',
'Pragma': 'no-cache',
}
urls = 'https://www.jcchouinard.com/web-scraping-with-python-and-requests-html/'
r = requests.get(urls, headers=HEADERS)
soup = BeautifulSoup(r.text, "html.parser")
print (soup)
Prints:
<!DOCTYPE html>
<html lang="fr-FR">
<head><style>img.lazy{min-height:1px}</style><link as="script" href="https://www.jcchouinard.com/wp-content/plugins/w3-total-cache/pub/js/lazyload.min.js?x73818" rel="preload"/>
...and so on.
更多推荐

所有评论(0)