JSoup doesn't load the whole HTML [duplicate]

Mangs · posted 2022-09-08 04:15:43

Question

I want to scrape a website, but when I connect to it using Jsoup.connect(url), only part of the page is loaded.

When I downloaded the page as HTML, I saw that one part of the page contains only a loader icon, so I concluded that this part is loaded afterwards from some other source.

The funny thing is that "inspect element" shows the missing HTML and "view page source" doesn't. The HTML loaded by Jsoup is basically the same as what "view page source" shows.

Is there a way to get around this and load the whole page as it is displayed in the browser?

The page in question is this: https://www.oddsportal.com/tennis/australia/atp-australian-open-2017/results/page/1/

Ask for any additional information I could provide.

===============

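EDIT: I am connecting to the URL like this:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = null;
try {
    doc = Jsoup.connect(url).get(); // fetches the HTML as sent by the server; no JavaScript is run
} catch (IOException e) {
    e.printStackTrace();
}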

I am getting this div using a CSS selector:

Elements tournamentTable = doc.select("div[id=tournamentTable]");

The content of tournamentTable is <div id="tournamentTable"></div>

Answers

It seems the content of the element with id=tournamentTable is generated dynamically by JavaScript. Jsoup does not execute JavaScript, so you would have to use a library such as HtmlUnit, which does. For example:

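import com.gargoylesoftware.htmlunit.BrowserVersion; // HtmlUnit 2.x; 3.x and later use the org.htmlunit package
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
    webClient.getOptions().setThrowExceptionOnScriptError(false); // continue even if a script throws an error
    HtmlPage page = webClient.getPage(url);
    webClient.waitForBackgroundJavaScript(5000); // important! call this after getPage; waits up to 5 s for background scripts

    page.getElementById("tournamentTable"); // the element as populated by the page's scripts
}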
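
Before switching libraries, it can help to confirm that the container really is empty in the raw server response, which is what Jsoup and "view page source" both see. The following is only a rough sketch using the standard library: the class name PlaceholderCheck and the method isEmptyDiv are made up for illustration, and the regular expression is a quick heuristic for this one pattern, not a general HTML parser.

```java
import java.util.regex.Pattern;

public class PlaceholderCheck {

    /**
     * Hypothetical helper: returns true if rawHtml contains a div with the
     * given id whose opening tag is followed directly by its closing tag,
     * with nothing but whitespace in between.
     */
    public static boolean isEmptyDiv(String rawHtml, String id) {
        Pattern emptyDiv = Pattern.compile(
                "<div\\b[^>]*\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"'][^>]*>\\s*</div>",
                Pattern.CASE_INSENSITIVE);
        return emptyDiv.matcher(rawHtml).find();
    }

    public static void main(String[] args) {
        // What the server sends (same as "view page source"): an empty placeholder.
        String staticSource =
                "<html><body><div id=\"tournamentTable\"></div></body></html>";
        // What "inspect element" shows after scripts ran (sample markup, not the real page).
        String renderedDom =
                "<html><body><div id=\"tournamentTable\"><table><tr><td>match</td></tr></table></div></body></html>";

        System.out.println(isEmptyDiv(staticSource, "tournamentTable"));
        System.out.println(isEmptyDiv(renderedDom, "tournamentTable"));
    }
}
```

If the check reports an empty placeholder for the raw response while the browser shows content, the content is most likely filled in by a script, and a JavaScript-capable client such as HtmlUnit is needed.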
