How can I do web scraping in Julia?
I want to extract the names of universities and their websites from this site into lists.
In Python I did it with BeautifulSoup v4:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
content = BeautifulSoup(page.text, 'html.parser')
college_name = []
college_link = []
college_name_list = content.find_all('h3',class_='college')
for college in college_name_list:
    if college.find('a'):
        college_name.append(college.find('a').text)
        college_link.append(college.find('a')['href'])
I really like programming in Julia and since it's very similar to Python, I wanted to know if I can do web scraping in Julia too. Any help would be appreciated.
Answers
Your Python code doesn't quite work anymore. I guess the website has been updated recently, and as far as I can tell they have removed the links. Here is a similar example using Gumbo.jl and Cascadia.jl that collects each university's name and location instead.
I am using the built-in download function to fetch the web page. It writes the page to a temporary file on disk, which I then read into a String. It might be cleaner to use HTTP.jl, which could read the response straight into a String, but for this simple example it's fine.
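For reference, the HTTP.jl alternative mentioned above could look roughly like the sketch below. It assumes the HTTP.jl package is installed; `fetch_page` is a hypothetical helper name introduced here for illustration, not part of HTTP.jl or Gumbo.jl.

```julia
using HTTP
using Gumbo

# Fetch a page over HTTP and parse it, without going through a temp file.
# `fetch_page` is a hypothetical helper name, not part of either package.
function fetch_page(url::AbstractString)
    response = HTTP.get(url)        # throws by default on a non-2xx status
    html = String(response.body)    # response body bytes as a String
    return parsehtml(html)
end
```

With this helper, `page = fetch_page(url)` would stand in for `parsehtml(read(download(url), String))`. The example that follows sticks with the built-in download.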
using Gumbo
using Cascadia
url = "https://thebestschools.org/features/best-computer-science-programs-in-the-world/"
page = parsehtml(read(download(url), String))
college_name = String[]
college_location = String[]
# Look at every <section> element on the page.
sections = eachmatch(sel"section", page.root)
for section in sections
    # Skip sections that do not contain a college heading.
    maybe_col_heading = eachmatch(sel"h3.college", section)
    if length(maybe_col_heading) == 0
        continue
    end
    col_heading = first(maybe_col_heading)
    # The name is the text of the heading's last child node.
    name = strip(text(last(col_heading.children)))
    push!(college_name, name)
    loc = first(eachmatch(sel".school-location", section))
    push!(college_location, text(loc[1]))
end
[college_name college_location]
Outputs
julia> [college_name college_location]
51×2 Array{String,2}:
"Massachusetts Institute of Technology (MIT)" "Cambridge, Massachusetts"
"Massachusetts Institute of Technology (MIT)" "Cambridge, Massachusetts"
"Stanford University" "Stanford, California"
"Carnegie Mellon University" "Pittsburgh, Pennsylvania"
⋮
"Shanghai Jiao Tong University" "Shanghai, China"
"Lomonosov Moscow State University" "Moscow, Russia"
"City University of Hong Kong" "Hong Kong"
It seems MIT got listed twice. The filtering code in my demo probably isn't quite right. But :shrug: MIT is a great university, I hear. Julia was invented there :joy:
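Without knowing exactly why the page yields the first entry twice, one simple workaround is to drop repeated rows after scraping. The sketch below uses only Base Julia and a three-row sample copied from the output above, so it does not touch the network. It removes the symptom only and does not fix the selector logic itself.

```julia
# Sample rows taken from the output above, with MIT appearing twice.
college_name = ["Massachusetts Institute of Technology (MIT)",
                "Massachusetts Institute of Technology (MIT)",
                "Stanford University"]
college_location = ["Cambridge, Massachusetts",
                    "Cambridge, Massachusetts",
                    "Stanford, California"]

# Build the same two-column matrix as before, then keep only distinct rows.
table = [college_name college_location]
deduped = unique(table, dims=1)
```

Note that this drops any row whose name and location both repeat, so it would also hide a genuine duplicate if the site ever listed one.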