Julia 0.4.0-dev+7053: HTML parsing is extremely fast
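Question: Julia 0.4.0-dev+7053: HTML parsing is extremely fast
Edit: ...well, with the generous help of @Ismael VC it became fast. The solution was to first wipe my Julia v0.4, reinstall it from a recent nightly, and then do a certain amount of package juggling: Pkg.init(), Pkg.add("Gumbo"). Adding Gumbo first produces a build error: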
INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo
WARNING: deprecated syntax "[a=>b, ...]" at /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================
LoadError: failed process: Process(`curl -f -o /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19
================================================================================
================================[ BUILD ERRORS ]================================
WARNING: Gumbo had build errors.
- packages with build errors remain installed in /Users/szalmaf/.julia/v0.4
- build the package(s) and all dependencies with `Pkg.build("Gumbo")`
- build a single package by running its `deps/build.jl` script
================================================================================
INFO: Package database updated
so you need to get the latest Gumbo from the master branch, Pkg.update(), Pkg.build("Gumbo"), which in turn yields a blazingly fast Gumbo parsehtml.
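Note: the problem was not what the commenters (who did not read the earlier comments carefully enough) mentioned, namely the claim that the JIT compiler makes "it" slow. If you read the back-and-forth between me and @Ismael VC, you will see that I ran his exact test code just as he did, and that I got the results in the first two comments, which really were too slow with my original installation. In any case, the important thing is that parsehtml is now fast, thanks to Ismael's help in our private chat. Thanks again!
Original post:
Julia 0.4.0-dev+7053: HTML parsing extremely slow?
Although the Julia language is sold as being fast at many things, it looks slow at basic things in life such as parsing a web page.
Take the http://julialang.org web page, which shows how fast Julia is against C, Fortran, R, Matlab, and so on.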
# using HTTPClient, Gumbo
julia_url = "http://julialang.org"
println(" scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println(" scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println(" parsed: ", Dates.unix2datetime(time()))
gives
scrape start: 2015-09-05T16:47:03.843
scrape end: 2015-09-05T16:47:04.044
parsed: 2015-09-05T16:47:04.41
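This shows that fetching the page takes about 200 ms, which is reasonable on my wifi connection, but parsing this simple page takes about 400 ms, which sounds prohibitive by today's standards.
Running the same test on a more complex web page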
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println(" scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println(" scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println(" parsed: ", Dates.unix2datetime(time()))
gives
scrape start: 2015-09-05T16:57:52.054
scrape end: 2015-09-05T16:57:52.736
parsed: 2015-09-05T16:57:53.699
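Parsing takes almost a full second.
I may be missing something, but is there a better or faster way in Julia to parse a web page or to get HTML elements out of it? If so, how?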
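The intervals quoted for both pages come from subtracting the printed timestamps. A minimal sketch of that arithmetic, assuming a current Julia release with the Dates standard library (the names run1, run2, fetch_ms and parse_ms are illustrative and do not appear in the original post):

```julia
using Dates

# Timestamps copied from the two outputs above (last argument is milliseconds).
run1 = (DateTime(2015, 9, 5, 16, 47, 3, 843),   # scrape start
        DateTime(2015, 9, 5, 16, 47, 4, 44),    # scrape end
        DateTime(2015, 9, 5, 16, 47, 4, 410))   # parsed
run2 = (DateTime(2015, 9, 5, 16, 57, 52, 54),   # scrape start
        DateTime(2015, 9, 5, 16, 57, 52, 736),  # scrape end
        DateTime(2015, 9, 5, 16, 57, 53, 699))  # parsed

# Subtracting two DateTime values yields a Millisecond period;
# Dates.value extracts the integer number of milliseconds.
fetch_ms(run) = Dates.value(run[2] - run[1])
parse_ms(run) = Dates.value(run[3] - run[2])

for (name, run) in (("julialang.org", run1), ("quora.com", run2))
    println(name, ": fetch ", fetch_ms(run), " ms, parse ", parse_ms(run), " ms")
end
```

Timing with wall-clock timestamps like this measures everything that happens between the two println calls, including any one-time compilation work.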
Answer
First of all, have you read the performance tips in the manual? Which Julia version are you using? (versioninfo())
- http://julia.readthedocs.org/en/latest/manual/performance-tips/
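You could read that first and then put your code into a function, as the documentation suggests. There is also the @time macro, which additionally reports memory allocation, like this:
Julia v0.3.11
Tested at: https://juliabox.org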
using HTTPClient, Gumbo

function test(url::String)
    @show url
    print("Scraping: ")
    @time page = get(url)
    print("Parsing: ")
    @time page = parsehtml(bytestring(page.body))
end

let
    gc_disable()
    url = "http://julialang.org"
    println("First run:")
    test(url) # first run JITed
    println("\nSecond run:")
    test(url)
    url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
    println("\nThird run:")
    test(url)
    println("\nFourth run:")
    test(url)
    gc_enable()
end
First run:
url => "http://julialang.org"
Scraping: elapsed time: 0.248092469 seconds (3971912 bytes allocated)
Parsing: elapsed time: 0.850927483 seconds (27207516 bytes allocated)
Second run:
url => "http://julialang.org"
Scraping: elapsed time: 0.055722638 seconds (73952 bytes allocated)
Parsing: elapsed time: 0.005446998 seconds (821800 bytes allocated)
Third run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.282382774 seconds (619324 bytes allocated)
Parsing: elapsed time: 0.227427243 seconds (9728620 bytes allocated)
Fourth run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.288903961 seconds (400272 bytes allocated)
Parsing: elapsed time: 0.017787089 seconds (1516560 bytes allocated)
Here are the timings for your code with @time:
julia_url = "http://julialang.org"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
First run:
elapsed time: 0.361194892 seconds (11108960 bytes allocated)
elapsed time: 0.996812988 seconds (34546156 bytes allocated, 4.04% gc time)
Second run:
elapsed time: 0.018920084 seconds (77952 bytes allocated)
elapsed time: 0.006632215 seconds (823256 bytes allocated)
julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
First run:
elapsed time: 0.33795947 seconds (535916 bytes allocated)
elapsed time: 0.224386491 seconds (9729852 bytes allocated)
Second run:
elapsed time: 0.276848452 seconds (584944 bytes allocated)
elapsed time: 0.018806686 seconds (1517856 bytes allocated)
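The pattern in these timings, a slow first call followed by much faster repeats, matches the "first run JITed" comment in the test code: Julia compiles a method the first time it is called with a given argument type. A minimal sketch that reproduces the effect without network access or packages, assuming a current Julia release; count_tags is a hypothetical stand-in for parsehtml and is not part of Gumbo:

```julia
# Hypothetical stand-in for parsehtml: counts '<' characters in a string.
function count_tags(html::AbstractString)
    n = 0
    for c in html
        if c == '<'
            n += 1
        end
    end
    return n
end

html = "<html><body><p>hi</p></body></html>"

# First call: the measured time includes compiling count_tags for String.
t_first = @elapsed count_tags(html)

# Second call: the compiled method is reused, so only run time is measured.
t_second = @elapsed count_tags(html)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

When benchmarking, a common approach is to call the function once as a warm-up and report only the later calls.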
Edit: v0.4.0-dev+7053
On 0.4+, make sure to first run Pkg.checkout("Gumbo") to get the latest commits and then Pkg.build("Gumbo"). In JuliaBox I get:
http://nbviewer.ipython.org/gist/Ismael-VC/4c241228f04ed54c70e2
First run:
url = "http://julialang.org"
Scraping: 0.227681 seconds (85.11 k allocations: 3.585 MB)
Parsing: 0.696063 seconds (799.12 k allocations: 29.450 MB)
Second run:
url = "http://julialang.org"
Scraping: 0.018953 seconds (571 allocations: 69.344 KB)
Parsing: 0.007132 seconds (15.91 k allocations: 916.313 KB)
Third run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: 0.313128 seconds (4.86 k allocations: 608.850 KB)
Parsing: 0.196110 seconds (270.17 k allocations: 10.356 MB)
Fourth run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: 0.307949 seconds (1.41 k allocations: 470.953 KB)
Parsing: 0.019801 seconds (23.82 k allocations: 1.627 MB)