Scrape top 100 job results from Indeed using BeautifulSoup in Python
Answer a question
I'm new to web scraping in Python and would like to scrape the top 100 job results from Indeed, but I'm only able to scrape the first page of results, i.e. the top 10. I'm using the BeautifulSoup library. This is my code; can anyone help me with this problem?
import urllib2
from bs4 import BeautifulSoup
import json

URL = "https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru%2C+Karnataka"
soup = BeautifulSoup(urllib2.urlopen(URL).read(), 'html.parser')
results = soup.find_all('div', attrs={'class': 'jobsearch-SerpJobCard'})
for x in results:
    company = x.find('span', attrs={"class":"company"})
    print 'company:', company.text.strip()
    job = x.find('a', attrs={'data-tn-element': "jobTitle"})
    print 'job:', job.text.strip()
Answers
Do it in batches of 10 by changing the start value in the URL. You can loop, incrementing start by 10 on each pass and inserting it into the URL:
https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru%2C+Karnataka&start=0
https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=10
E.g.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

results = []
url = 'https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start={}'

with requests.Session() as s:
    # start=0, 10, ..., 90 covers the top 100 results at 10 per page
    for start in range(0, 100, 10):
        res = s.get(url.format(start))
        soup = bs(res.content, 'lxml')
        titles = [item.text.strip() for item in soup.select('[data-tn-element=jobTitle]')]
        companies = [item.text.strip() for item in soup.select('.company')]
        # pair each job title with the company at the same position on the page
        data = list(zip(titles, companies))
        results.append(data)

# flatten the per-page lists into one list of (title, company) tuples
newList = [item for sublist in results for item in sublist]
df = pd.DataFrame(newList)
df.to_json(r'C:\Users\User\Desktop\data.json')
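The paging idea in this answer can also be checked without hitting the site. Below is a minimal sketch that only builds the list of page URLs; the helper name page_urls and the constants BASE_URL and PAGE_SIZE are made up for illustration, and it assumes each results page holds 10 listings and that start is the offset of the first listing on the page.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.co.in/jobs"  # search endpoint taken from the question
PAGE_SIZE = 10  # assumption: one results page holds 10 listings


def page_urls(query, location, total=100, page_size=PAGE_SIZE):
    """Return one search URL per results page needed to cover `total` listings."""
    urls = []
    # start is the offset of the first listing on a page: 0, 10, 20, ...
    for start in range(0, total, page_size):
        params = urlencode({"q": query, "l": location, "start": start})
        urls.append(f"{BASE_URL}?{params}")
    return urls


urls = page_urls("software developer", "Bengaluru, Karnataka")
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL in the list can then be passed to s.get(...) in the loop above. Building the query string with urlencode also takes care of escaping the space and comma in the search terms.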