Question

I found some really nice code on https://towardsdatascience.com/ for web scraping and I'm trying to adapt it for my own use.

https://ingatlan.com/lista/elado+lakas+ii-ker?page=1 is a Hungarian real estate website. To start with, I just want to grab the prices of the listings, but when I run my code I don't get any results: the number of items found is 0.

import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet= 1

upperframe=[]  
for page in range(1,pagesToGet+1):
    print('processing page :', page)
    url = 'https://ingatlan.com/lista/elado+lakas+ii-ker?page='+str(page)
    print(url)
    
    
    try:
        page=requests.get(url)                            
    
    except Exception as e:                                   
        error_type, error_obj, error_info = sys.exc_info()     
        print ('ERROR FOR LINK:',url)                          
        print (error_type, 'Line:', error_info.tb_lineno)     
        continue                                              
    time.sleep(2)   
    soup=BeautifulSoup(page.text,'html.parser')
    frame=[]
    links=soup.find_all('div',attrs={'class':'listing js-listing '})
    print(len(links))
    filename="NEWS.csv"
    f=open(filename,"w", encoding = 'utf-8')
    headers="Price\n"
    f.write(headers)
    
for j in links:
    Price = j.find("div", attrs={'class': 'price'})
    frame.append(Price)
upperframe.extend(frame)
f.close()
data = pd.DataFrame(upperframe, columns=['Price'])
data.head()

What could I be doing wrong? There are sites where this works, such as Myprotein, but there are others where it does not.

Answers

Only the price has been extracted here, since that is all you asked for.

Without a User-Agent header the request returns a 403 Forbidden error.
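You can verify this quickly by comparing the status codes with and without the header (a minimal check; 'XYZ/3.0' is just the arbitrary User-Agent string used below):

import requests

url = "https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"

# without a User-Agent header the site rejects the request with 403 Forbidden
print(requests.get(url).status_code)

# with a User-Agent header the same request is accepted
print(requests.get(url, headers={'User-Agent': 'XYZ/3.0'}).status_code)

The full extraction: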

import requests
from bs4 import BeautifulSoup
import pandas as pd

start_url="https://ingatlan.com/lista/elado+lakas+ii-ker?page=1"
page_data=requests.get(start_url, headers={'User-Agent': 'XYZ/3.0'})
soup=BeautifulSoup(page_data.content,"html.parser")

# for i in soup:   # I was first just checking the HTTP status here;
#     print(i)     # without a User-Agent I got 403 as the response
    
Price=[]

for job_tag in soup.find_all("div", class_="resultspage__content"):
    for job_tag2 in job_tag.find_all("div", class_="listing js-listing"):
        for job_tag3 in job_tag2.find_all("div", class_="price__container js-has-sqm-price-info-tooltip"):
            price = job_tag3.find("div", class_="price")
            Price.append(price.text.strip())

data = pd.DataFrame(Price, columns=["price"])
print(data)

Output of the pandas DataFrame:

         price
0    31.5 M Ft
1    77.9 M Ft
2      62 M Ft
3   129.5 M Ft
4     125 M Ft
5    95.9 M Ft
6    46.9 M Ft
7    45.9 M Ft
8    59.9 M Ft
9     109 M Ft
10     48 M Ft
11     87 M Ft
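If you want to plug the fix back into the multi-page loop from the question, a minimal sketch along these lines should work (it reuses the class names from the answer above, so it assumes the site's markup has not changed; the CSV output and the 2-second pause are kept from the original script):

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

pages_to_get = 1
prices = []

for page in range(1, pages_to_get + 1):
    url = 'https://ingatlan.com/lista/elado+lakas+ii-ker?page=' + str(page)
    print('processing page:', url)

    # the User-Agent header is what avoids the 403 Forbidden response
    response = requests.get(url, headers={'User-Agent': 'XYZ/3.0'})
    soup = BeautifulSoup(response.content, 'html.parser')

    # same class names as in the answer above
    for listing in soup.find_all('div', class_='listing js-listing'):
        container = listing.find('div', class_='price__container js-has-sqm-price-info-tooltip')
        if container:
            price_div = container.find('div', class_='price')
            if price_div:
                prices.append(price_div.text.strip())

    time.sleep(2)  # pause between pages, as in the original script

data = pd.DataFrame(prices, columns=['Price'])
data.to_csv('NEWS.csv', index=False, encoding='utf-8')
print(data.head())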