Can't scrape all the company names from a webpage
·
Answer a question
I'm trying to parse all the company names from this webpage. There are around 2431 companies in there. However, the way I've tried below can fetches me 1000 results.
This is what I can see about the number of results in response while going through dev tools:
hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431 <------------------------
nbPages: 1
page: 0
How can I get the rest of the results using requests?
I've tried so far:
import requests
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'
params = {
'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
with requests.Session() as s:
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
r = s.post(url,params=params,json=payload)
print(len(r.json()['results'][0]['hits']))
Answers
As a workaround you can simulate search using alphabet as a search pattern. Using code below you will get all 2431 companies as dictionary with ID as a key and full company data dictionary as a value.
import requests
import string
params = {
'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
print(letter)
payload = {
"requests": [{
"indexName": "YCCompany_production",
"params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
}]
}
r = requests.post(url, params=params, json=payload)
result.update({h['id']: h for h in r.json()['results'][0]['hits']})
print(len(result))
更多推荐

所有评论(0)