Python crawler actual combat, pyecharts module, python realizes China Metro data visualization

preface Using Python to realize the visualization of China Metro data. No more nonsense. Let's start happily~ development tool Python version: 3.6.4 Related modules: requests module; wordcloud module

Mangs

13人浏览 · 2022-08-12 16:51:45

Mangs · 2022-08-12 16:51:45 发布

preface

Using Python to realize the visualization of China Metro data. No more nonsense.

Let's start happily~

development tool

Python version: 3.6.4

Related modules:

requests module;

wordcloud module;

pandas module;

numpy module;

jieba module;

Pyecarts module;

matplotlib module;

And some Python built-in modules.

Environment construction

Many people learn Python and don't know where to start.

Many people learn to find python，After mastering the basic grammar, I don't know where to start.

Many people who may already know the case do not learn more advanced knowledge.

For these three types of people, I provide you with a good learning platform, free access to video tutorials, e-books, and the source code of the course!

QQ Group:101677771

Welcome to join us and discuss and study together

Install Python and add it to the environment variable. pip can install the relevant modules required.

This time, through the acquisition of subway line data, the urban distribution data are visually analyzed.

Analysis acquisition

Metro information is obtained from Gaode map.

The above mainly obtains the "id", "cityname" and "name" of the city.

It is used to splice the request website to obtain the specific information of the subway line.

Find the request information and get the details of subway lines and stations in the lines in each city.

get data

Specific code

import json
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

def get_message(ID, cityname, name):
    """
    Metro line information acquisition
    """
    url = 'http://map.amap.com/service/subway?_1555502190153&srhdata=' + ID + '_drw_' + cityname + '.json'
    response = requests.get(url=url, headers=headers)
    html = response.text
    result = json.loads(html)
    for i in result['l']:
        for j in i['st']:
            # Judge whether the subway line is included
            if len(i['la']) > 0:
                print(name, i['ln'] + '(' + i['la'] + ')', j['n'])
                with open('subway.csv', 'a+', encoding='gbk') as f:
                    f.write(name + ',' + i['ln'] + '(' + i['la'] + ')' + ',' + j['n'] + '\n')
            else:
                print(name, i['ln'], j['n'])
                with open('subway.csv', 'a+', encoding='gbk') as f:
                    f.write(name + ',' + i['ln'] + ',' + j['n'] + '\n')

def get_city():
    """
    Urban information acquisition
    """
    url = 'http://map.amap.com/subway/index.html?&1100'
    response = requests.get(url=url, headers=headers)
    html = response.text
    # code
    html = html.encode('ISO-8859-1')
    html = html.decode('utf-8')
    soup = BeautifulSoup(html, 'lxml')
    # City list
    res1 = soup.find_all(class_="city-list fl")[0]
    res2 = soup.find_all(class_="more-city-list")[0]
    for i in res1.find_all('a'):
        # City ID value
        ID = i['id']
        # City Pinyin name
        cityname = i['cityname']
        # City name
        name = i.get_text()
        get_message(ID, cityname, name)
    for i in res2.find_all('a'):
        # City ID value
        ID = i['id']
        # City Pinyin name
        cityname = i['cityname']
        # City name
        name = i.get_text()
        get_message(ID, cityname, name)

if __name__ == '__main__':
    get_city()

Display of data acquisition results

3541 subway stations

Data visualization

Firstly, clean the data to remove the duplicate transfer station information.

from wordcloud import WordCloud, ImageColorGenerator
from pyecharts import Line, Bar
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import jieba

# Set column name to align with data
pd.set_option('display.unicode.ambiguous_as_wide', True)
pd.set_option('display.unicode.east_asian_width', True)
# Show 10 lines
pd.set_option('display.max_rows', 10)
# Read data
df = pd.read_csv('subway.csv', header=None, names=['city', 'line', 'station'], encoding='gbk')
# Subway lines in various cities
df_line = df.groupby(['city', 'line']).count().reset_index()
print(df_line)

By grouping cities and subway lines, the total number of subway lines in China is obtained.

183 subway lines

def create_map(df):
    # draw a map
    value = [i for i in df['line']]
    attr = [i for i in df['city']]
    geo = Geo("Distribution of opened metro cities", title_pos='center', title_top='0', width=800, height=400, title_color="#fff", background_color="#404a59", )
    geo.add("", attr, value, is_visualmap=True, visual_range=[0, 25], visual_text_color="#fff", symbol_size=15)
    geo.render("Distribution of opened metro cities.html")

def create_line(df):
    """
    Number and distribution of generated urban subway lines
    """
    title_len = df['line']
    bins = [0, 5, 10, 15, 20, 25]
    level = ['0-5', '5-10', '10-15', '15-20', '20 above']
    len_stage = pd.cut(title_len, bins=bins, labels=level).value_counts().sort_index()
    # Generate histogram
    attr = len_stage.index
    v1 = len_stage.values
    bar = Bar("Number and distribution of subway lines in each city", title_pos='center', title_top='18', width=800, height=400)
    bar.add("", attr, v1, is_stack=True, is_label_show=True)
    bar.render("Number and distribution of subway lines in each city.html")

# Number of subway lines in each city
df_city = df_line.groupby(['city']).count().reset_index().sort_values(by='line', ascending=False)
print(df_city)
create_map(df_city)
create_line(df_city)

Data of cities that have opened subway, as well as the number of subway lines in each city.

Subways opened in 32 cities

Urban distribution

Most of them are provincial capitals, as well as some cities with strong economic strength.

Number and distribution of lines

It can be seen that most of them are still in the "0-5" stage, of course, at least 1 line.

# Which line has the most subway stations in which city
print(df_line.sort_values(by='station', ascending=False))

Which line has the most subway stations in which city

Beijing line 10 is the first and Chongqing line 3 is the second

Remove data from duplicate transfer stations

# Remove subway data from duplicate transfer stations
df_station = df.groupby(['city', 'station']).count().reset_index()
print(df_station)

Including 3034 subway stations

Nearly 400 subway stations have been reduced

Next, let's see which city has the most subway stations

# Count the number of subway stations included in each city (duplicate transfer stations have been removed)
print(df_station.groupby(['city']).count().reset_index().sort_values(by='station', ascending=False))

There are so many subway stations in Wuhan

Realize the operation in the new weekly to generate the subway noun cloud

def create_wordcloud(df):
    """
    Generate Metro noun cloud
    """
    # participle
    text = ''
    for line in df['station']:
        text += ' '.join(jieba.cut(line, cut_all=False))
        text += ' '
    backgroud_Image = plt.imread('rocket.jpg')
    wc = WordCloud(
        background_color='white',
        mask=backgroud_Image,
        font_path='C:\Windows\Fonts\Huakangli Gold Black W8.TTF',
        max_words=1000,
        max_font_size=150,
        min_font_size=15,
        prefer_horizontal=1,
        random_state=50,
    )
    wc.generate_from_text(text)
    img_colors = ImageColorGenerator(backgroud_Image)
    wc.recolor(color_func=img_colors)
    # Look at those with high word frequency
    process_word = WordCloud.process_text(wc, text)
    sort = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
    print(sort[:50])
    plt.imshow(wc)
    plt.axis('off')
    wc.to_file("Subway noun cloud.jpg")
    print('Word cloud generated successfully!')

create_wordcloud(df_station)

Show word cloud

点击阅读全文

向您推荐>>百度飞桨AI Studio社区

学AI，认准AI Studio！GPU算力，限时免费领，邀请好友解锁更多惊喜福利 >>>

更多推荐

求助！为什么用InsCode部署会出现无限重定向？

Python

如何重塑熊猫。系列

问题:如何重塑熊猫。系列在我看来,它就像 pandas.Series 中的一个错误。 a = pd.Series([1,2,3,4]) b = a.reshape(2,2) b b 有类型 Series 但无法显示,最后一条语句给出异常,非常冗长,最后一行是“TypeError: %d format: a number is required, not numpy.ndarray”。 b.sha

Python

在哪里可以找到有关 Keras 中默认权重初始化器的文档? [复制]

问题:在哪里可以找到有关 Keras 中默认权重初始化器的文档? [复制] 我刚刚在这里](https://keras.io/initializers/)中阅读了有关[中的 Keras 权重初始化器的信息。在文档中,只介绍了不同的初始化程序。如: model.add(Dense(64, kernel_initializer='random_normal')) 当我没有指定kernel_initia