Question

I found some really nice code on https://towardsdatascience.com/ for web scraping and I'm trying to adapt it for my own use.

https://ingatlan.com/lista/elado+lakas+ii-ker?page=1 is a Hungarian real estate website. To start with, I just want to grab the prices of the listings, but when I run my code I don't get any results: the number of items found is 0.

import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet= 1

upperframe=[]  
for page in range(1,pagesToGet+1):
    print('processing page :', page)
    url = 'https://ingatlan.com/lista/elado+lakas+ii-ker?page='+str(page)
    print(url)
    
    
    try:
        page=requests.get(url)                            
    
    except Exception as e:                                   
        error_type, error_obj, error_info = sys.exc_info()     
        print ('ERROR FOR LINK:',url)                          
        print (error_type, 'Line:', error_info.tb_lineno)     
        continue                                              
    time.sleep(2)   
    soup=BeautifulSoup(page.text,'html.parser')
    frame=[]
    links=soup.find_all('div',attrs={'class':'listing js-listing '})
    print(len(links))
    filename="NEWS.csv"
    f=open(filename,"w", encoding = 'utf-8')
    headers="Price\n"
    f.write(headers)
    
for j in links:
    Price = j.find("div", attrs={'class': 'price'})
    frame.append(Price)
upperframe.extend(frame)
f.close()
data = pd.DataFrame(upperframe, columns=['Price'])
data.head()

What could I be doing wrong? There are sites where this works, such as Myprotein, but there are others where it does not.

Answers

Only the price has been extracted here, since that is all you asked for.

Without a User-Agent header the request returns a 403 Forbidden error.
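You can verify this quickly by comparing the status codes with and without the header (a minimal check; 'XYZ/3.0' is just the arbitrary User-Agent string used below):

import requests

url = "https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"

# without a User-Agent header the site rejects the request with 403 Forbidden
print(requests.get(url).status_code)

# with a User-Agent header the same request is accepted
print(requests.get(url, headers={'User-Agent': 'XYZ/3.0'}).status_code)

The full extraction: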

import requests
from bs4 import BeautifulSoup
import pandas as pd

start_url="https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"
page_data=requests.get(start_url, headers={'User-Agent': 'XYZ/3.0'})
soup=BeautifulSoup(page_data.content,"html.parser")

# for i in soup:   # I was first just checking the HTTP status here;
#     print(i)     # without a User-Agent I got 403 as the response
    
Price=[]

for job_tag in soup.find_all("div", class_="resultspage__content"):
    for job_tag2 in job_tag.find_all("div", class_="listing js-listing"):
        for job_tag3 in job_tag2.find_all("div", class_="price__container js-has-sqm-price-info-tooltip"):
            price = job_tag3.find("div", class_="price")
            Price.append(price.text.strip())

data = pd.DataFrame(Price, columns=["price"])
print(data)

Output of the pandas DataFrame:

         price
0    31.5 M Ft
1    77.9 M Ft
2      62 M Ft
3   129.5 M Ft
4     125 M Ft
5    95.9 M Ft
6    46.9 M Ft
7    45.9 M Ft
8    59.9 M Ft
9     109 M Ft
10     48 M Ft
11     87 M Ft
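If you want to plug the fix back into the multi-page loop from the question, a minimal sketch along these lines should work (it reuses the class names from the answer above, so it assumes the site's markup has not changed; the CSV output and the 2-second pause are kept from the original script):

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

pages_to_get = 1
prices = []

for page in range(1, pages_to_get + 1):
    url = 'https://ingatlan.com/lista/elado+lakas+ii-ker?page=' + str(page)
    print('processing page:', url)

    # the User-Agent header is what avoids the 403 Forbidden response
    response = requests.get(url, headers={'User-Agent': 'XYZ/3.0'})
    soup = BeautifulSoup(response.content, 'html.parser')

    # same class names as in the answer above
    for listing in soup.find_all('div', class_='listing js-listing'):
        container = listing.find('div', class_='price__container js-has-sqm-price-info-tooltip')
        if container:
            price_div = container.find('div', class_='price')
            if price_div:
                prices.append(price_div.text.strip())

    time.sleep(2)  # pause between pages, as in the original script

data = pd.DataFrame(prices, columns=['Price'])
data.to_csv('NEWS.csv', index=False, encoding='utf-8')
print(data.head())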