Migrating a CSDN Blog to WordPress

阳光岛主 · Published 2015-12-06 21:33:40 · 6300 views
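I had long wanted to move the posts I wrote on CSDN to my own WordPress blog. The methods described online, such as importing the CSDN blog into cnblogs (博客园), exporting an XML file from there, and then loading that file with the WordPress import tool, involve too many steps.
So one weekend at home I wrote a crawler script in Python that uses WordPress's custom post-submission feature (wp_insert_post) to import the posts into WordPress. It takes just two steps: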
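Step 1: A Python script crawls the CSDN blog, collecting each post's title, body, publication date, view count, and tags.
Step 2: The script submits each post through the WordPress submission form, importing all CSDN posts into WordPress.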
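The two steps can be sketched in a few lines. This is a minimal illustration rather than the script itself: it is written for Python 3 with only the standard library (the full script below targets Python 2 with bs4), the helper names `PostFieldParser`, `extract_fields`, and `build_form_body` are made up for this example, and `SAMPLE_HTML` is a hypothetical fragment shaped like the markup the script expects. The span class names and the `tougao_*` form fields are the ones the script uses.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Span classes the script looks for on a CSDN post page.
FIELD_CLASSES = {"link_title": "title", "link_postdate": "postdate", "link_view": "views"}


class PostFieldParser(HTMLParser):
    """Collect the text inside the spans named in FIELD_CLASSES."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None   # field currently being read, if any
        self._depth = 0        # nesting depth of <span> tags inside that field

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELD_CLASSES:
                self._current = FIELD_CLASSES[cls]
                self._depth = 1
                self.fields[self._current] = ""
                break

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def extract_fields(html):
    """Step 1: pull title, publication date, and view count out of a page."""
    parser = PostFieldParser()
    parser.feed(html)
    parser.close()
    fields = {name: " ".join(text.split()) for name, text in parser.fields.items()}
    if "views" in fields:
        # CSDN shows the count as e.g. "6300人阅读"; keep only the digits.
        digits = "".join(ch for ch in fields["views"] if ch.isdigit())
        fields["views"] = int(digits) if digits else 0
    return fields


def build_form_body(fields, content, cat="5"):
    """Step 2: encode a post as form data for the submission page."""
    return urlencode({
        "tougao_title": fields.get("title", ""),
        "tougao_content": content,
        "tougao_cat": cat,
        "tougao_date": fields.get("postdate", ""),
    })


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = """
<span class="link_title"><a href="/sunboy_2050/article/details/1"> Sample post </a></span>
<span class="link_postdate">2015-12-01 10:30</span>
<span class="link_view" title="views">6300人阅读</span>
"""

if __name__ == "__main__":
    post = extract_fields(SAMPLE_HTML)
    print(post)
    print(build_form_body(post, "<p>body</p>"))
```

Keeping extraction and encoding as pure functions makes both easy to check offline before any request is sent to the live blog.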
Complete Python code:
 
        #!/usr/bin/env python
        # -*- coding:utf8 -*-
        '''
        author  :    yanggang@mimvp.com
        date    :    2015-12-01
        blog    :    http://blog.mimvp.com
        demo    :    http://blog.mimvp.com/category/csdn_blog/

        Required libraries:
        1. MySQLdb, connects to the MySQL database to update view counts
        2. bs4,     parses the HTML and extracts content from CSDN pages

        WordPress post-submission endpoint:
        http://blog.mimvp.com/tougao/

        Find the IDs of automatically submitted posts:
        select id from wp_posts where post_title like '【米扑代理】%'

        Increase the view count of those posts:
        update wp_postmeta set meta_value=meta_value+100 where meta_key='views' and post_id in (select id from wp_posts where post_title like '【米扑代理】%');

        Find published posts with fewer than 100 views:
        select post_id, meta_value from wp_postmeta where  meta_key='views' and meta_value < 100 and post_id in (select id  from wp_posts where post_status = 'publish');
        '''
       
        import time, datetime
        import os, random
        import urllib, urllib2
        import base64
        import bs4
        import MySQLdb
        import sys

        reload(sys)
        sys.setdefaultencoding('utf-8')
       
        # CSDN blog URLs to skip when crawling
        EXCLUDE_BLOG_URL = [
            'http://blog.csdn.net/sunboy_2050/article/details/45868613',
            'http://blog.csdn.net/sunboy_2050/article/details/49449319',
            'http://blog.csdn.net/sunboy_2050/article/details/48622023',
            'http://blog.csdn.net/sunboy_2050/article/details/47858847',
        ]
       
        # Map keywords found in a CSDN post title to WordPress category IDs.
        # Chinese keys are kept as-is because they are matched against Chinese titles.
        CATEGORY_DICT = {
            'Python'        : 1,
            'C/C++'         : 2,
            '网络常识'       : 3,      # networking basics
            'Algorithm'     : 11,
            'Clojure'       : 4,
            'CSDN'          : 5,
            'Git/SVN'       : 6,
            'Go'            : 7,
            'HTML/CSS/JS'   : 8,
            'iOS/Android'   : 9,
            'Java/JSP'      : 113,
            'Linux/Unix'    : 95,
            'MacBook'       : 111,
            'Nginx/Apache'  : 112,
            'PHP'           : 777,
            'SQL/NoSQL'     : 114,
            'Storm/Hadoop'  : 115,
            'WP技巧'         : 12,     # WordPress tips
            '产品经理'       : 116,    # product manager
            '创业邦'         : 3,      # startups
            '理财'           : 117,    # personal finance
            '生活小札'       : 4,      # life notes
            '科技资讯'       : 10,     # tech news
            '米扑代理'       : 118,    # mimvp proxy
            '系统架构'       : 97,     # system architecture
            '软件测试'       : 119,    # software testing

            'leetcode'      : 11,
            '链表'           : 11,     # linked list
            '算法'           : 11,     # algorithm
            'django'        : 1,
            'tornado'       : 1,
            'c语言'          : 2,      # C language
            'c++'           : 2,
            'c#'            : 2,
            'vc'            : 2,
            'qt'            : 2,
            '网络'           : 3,      # network
            'Git'           : 6,
            'SVN'           : 6,
            '版本控制'       : 6,      # version control
            'HTML'          : 8,
            'CSS'           : 8,
            'JS'            : 8,
            'javascript'    : 8,
            'iOS'           : 9,
            'Android'       : 9,
            'Java'          : 113,
            'JSP'           : 113,
            'JVM'           : 113,
            'Spring'        : 113,
            'Eclipse'       : 113,
            'Linux'         : 95,
            'Unix'          : 95,
            'Ubuntu'        : 95,
            'CentOS'        : 95,
            'Shell'         : 95,
            'AWK'           : 95,
            'vim'           : 95,
            'Nginx'         : 112,
            'Apache'        : 112,
            'Tomcat'        : 112,
            '数据库'         : 114,    # database
            'SQL'           : 114,
            'NoSQL'         : 114,
            'MySQL'         : 114,
            'Redis'         : 114,
            'Memcache'      : 114,
            'mongo'         : 114,
            'sqlite'        : 114,
            'WP'            : 12,
            'WordPress'     : 12,
            '软件'           : 119,    # software
            '测试'           : 119,    # testing
            'Storm'         : 115,
            'Hadoop'        : 115,
        }
       
        # urllib2 with these headers could not fetch CSDN pages, so the script uses curl instead
        headers = {
            'User-Agent'        : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36',
            'Cookie'            : 'bdshare_firstime=1430381399166; uuid_tt_dd=-4349350129693538496_20150430; __gads=ID=b549fb1461110ca8:T=1430381399:S=ALNI_MbhiN2nMgWWiXPP25667Pq-BPZf5g; CloudGuest=dbieDkbnW7cz5P9qimYEjpecak8Udv0BPR8Iflg0PlBd3HR1Wj+RyQQksR2cDE9ab/hPXNGjFpuKsRe1dFjVJjY+mf3bfeWiP6kN0TKk1rY6g5SOuowPs/8F5FJJBdddW71JZ7rp4Q9b8DsLk2TASIPHLj3iL599bPUGKga0mRsaTJi0td73QBZaNlY7+VAl; __qca=P0-127195741-1432118980220; lzstat_uv=29735680241046032471|2955225@3582543@2675686@3411160; __utma=1722629.287164464.1430711804.1438215324.143920134.51; __utmz=1722629.14380001130.49.44.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; UN=Sunboy_2050; UE="yanggang_2050@19.com"; __message_district_code=110000; uuid=92c68418-b2bd-41bc-846d-a1e4e9cf094; _ga=GA1.2.287164464.1430711804; scvh=2011-09-09+17%3a11%3a33+003; FullCookie=1; ViewMode=list; avh=8471508%2c298041%2c45868613%2c2659431%2c17398807; __message_sys_msg_id=0; __message_gu_msg_id=0; __message_cnel_msg_id=0; __message_in_school=0; lzstat_ss=3875140043_20_1448553416_2955225; dc_tos=nyewqw; dc_session_id=1448524616939',
            'Host'              : 'blog.csdn.net',
            'DNT'               : '1',
            'Cache-Control'     : 'max-age=0',
            'Connection'        : 'keep-alive',
            'Accept'            : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding'   : 'gzip, deflate, sdch',
            'Accept-Language'   : 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
        }
       
        # hangzhou of 115.29.237.28
        PROXY_MYSQL_SERVER = {
            "host"     : "localhost",
            "port"     : 3306,
            "user"     : "root",
            "passwd"   : "123456",
            "dbname"   : "wp_blog",
        }
       
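        # Return the current date and time as a formatted string
        def get_now_datetime():
            return datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

        # Build a comma-separated list of category IDs from keywords found in the title
        def split_category(blog_title=''):
            cat_keys = CATEGORY_DICT.keys()
            blog_title = blog_title.lower()
            print blog_title

            cat_list = ['5']    # 5 is the CSDN category, always included
            for cat in cat_keys:
                cat_lower = cat.lower()
                if blog_title.find(cat_lower) >= 0:
                    cat_value = str(CATEGORY_DICT.get(cat))
                    # several keywords map to the same ID, so skip repeats
                    if cat_value not in cat_list:
                        cat_list.append(cat_value)
            cat_join = ",".join(cat_list)
            return cat_join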
       
        # Collect the post links: read the page count -> build the list-page URLs -> fetch each list page -> extract the post links
        def spider_blog_url_list(blog_url='http://blog.csdn.net/sunboy_2050/'):
            print("blog_url: " + blog_url)

            ######################### Build the list-page (pagination) URLs ########################
            # Defined before the try block so the second stage still runs if the first one fails
            blog_url_page_list = []
            try:
                # req = urllib2.Request(blog_url, headers=headers)
                # content = urllib2.urlopen(req).read()
                content = os.popen('curl ' + blog_url).read()
                content = bs4.BeautifulSoup(content)
                content = content.prettify()
                content = bs4.BeautifulSoup(content, from_encoding='GB18030')

                # The last pagination link ends with the total number of list pages
                table_soup = content.find('div', {"id": "papelist"}).find_all('a')
                last_a = table_soup[-1]
                last_a_href = last_a['href']
                list_a_prefix, list_num = os.path.split(last_a_href)
                list_num = int(list_num)

                for i in range(1, list_num + 1):
                    page_url = 'http://blog.csdn.net' + list_a_prefix + '/' + str(i)
                    print page_url
                    blog_url_page_list.append(page_url)
            except Exception as ex:
                print("spider_blog_url_list() - error_msg: " + str(ex))

            ######################### Collect the post links on each list page ########################
            blog_url_set = set()
            list_len = len(blog_url_page_list)
            index = 0
            for page_url in blog_url_page_list:
                index += 1
                print "++++ ", index, "/", list_len, "page_url: " + page_url
                try:
                    content = os.popen('curl ' + page_url).read()
                    content = bs4.BeautifulSoup(content)
                    content = content.prettify()
                    content = bs4.BeautifulSoup(content)

                    span_list = content.find_all('span', {"class": "link_title"})
                    for span in span_list:
                        a_list = span.find_all('a')
                        for a in a_list:
                            href = a['href']
                            blog_url = 'http://blog.csdn.net' + href
                            print("blog_url: " + blog_url)
                            blog_url_set.add(blog_url)
                except Exception as ex:
                    print("spider_blog_url_list() - error_msg: " + str(ex))

            # The set removes duplicate links; return it as a list
            blog_url_list = list(blog_url_set)
            return blog_url_list
       
        # Fetch one blog post and submit it to WordPress
        def spider_blog_url(blog_url='http://blog.csdn.net/sunboy_2050/article/details/45868613'):
            print blog_url
            if blog_url in EXCLUDE_BLOG_URL:
                print("blog_url is IN EXCLUDE_BLOG_URL, blog_url: " + blog_url)
                return

            blog_title = ''
            blog_content = ''
            blog_tags = ''
            blog_cat = '5'
            blog_postdate = ''
            blog_viewsCount = 0

            try:
                content = os.popen('curl ' + blog_url).read()
                content = bs4.BeautifulSoup(content)
                content = content.prettify()
                content = bs4.BeautifulSoup(content)

                # Title: drop line breaks and the "[置顶]" (pinned) marker
                blog_title = content.find('span', {"class": "link_title"}).text.strip()
                blog_title = blog_title.replace("\n", "").replace("[置顶]", "").strip()
                blog_cat = split_category(blog_title)

                # Tags: the category links shown on the post
                tags_set = set()
                tags_link = content.find('span', {"class": "link_categories"})
                if tags_link:
                    a_list = tags_link.find_all('a')
                    for a in a_list:
                        tags_set.add(a.text.strip())
                blog_tags = ",".join(tags_set)

                # Publication date: the page gives no seconds, so append a random value (10-59)
                blog_postdate = content.find('span', {"class": "link_postdate"}).text.strip()
                blog_postdate = blog_postdate + ":" + str(random.randint(10, 59))

                # View count: strip the "人阅读" ("views") suffix
                blog_viewsCount = content.find('span', {"class": "link_view"}).text.strip()
                blog_viewsCount = blog_viewsCount.replace("人阅读", "").strip()

                blog_content = content.find('div', {"class": "article_content"})
                # Syntax highlighting, e.g. <pre class="python" ...  ===>  <pre class="brush:python" ...
                blog_content = str(blog_content).replace('<pre class="', '<pre class="brush:')
            except Exception as ex:
                print("spider_blog_url() - error_msg: " + str(ex))

            print blog_title
            print blog_tags
            print blog_postdate
            print blog_viewsCount
            print blog_cat
            # print blog_content

            post_blog(blog_title, blog_tags, blog_content, blog_postdate, blog_viewsCount, blog_cat, blog_url)
       
        # Crawl every post of the blog and submit each one
        def spider_blog(blog_root='http://blog.csdn.net/sunboy_2050/'):
            blog_url_list = spider_blog_url_list(blog_root)
            list_len = len(blog_url_list)
            index = 0
            for blog_url in blog_url_list:
                index += 1
                print "++++++++++++++ ", index, "/", list_len, "++++++++++++++ ", blog_url
                spider_blog_url(blog_url)
       
        # Submit a post to WordPress; cat=5 is the CSDN category
        def post_blog(title='test_title', tags='test_tag', content='test_content', postdate='', viewsCount=0, cat='5', blog_url=''):
            if postdate == '':
                postdate = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

            post_data = {
                'tougao_form'           :   'blog_mimvp',
                'tougao_authorname'     :   'admin',
                'tougao_authoremail'    :   'yanggang@mimvp.com',
                'tougao_authorblog'     :   'blog.mimvp.com',
                'tougao_title'          :   title,
                'tougao_tags'           :   tags,
                'tougao_cat'            :   cat,
                'tougao_content'        :   content,
                'tougao_date'           :   postdate,
            }

            # Wrap the body and append a link back to the original post ("原文" = "Original")
            content_head = "<div style='font-size: 16px;'>"
            content_foot = "</div><div style='margin: 50px auto 50px;'><h3><font color='red'>原文：</font> <a target='_blank' href='{blog_url}'>{blog_title}</a></h3></div>".format(blog_url=blog_url, blog_title=title)
            tougao_content = content_head + str(content) + str(content_foot)

            try:
                POST_URL = 'http://blog.mimvp.com/tougao/'
                post_data['tougao_content'] = tougao_content
                post_data = urllib.urlencode(post_data)
                req = urllib2.Request(POST_URL, data=post_data)
                res = urllib2.urlopen(req).read()
            except Exception as ex:
                print("error_msg: " + str(ex))

            # Wait 3 seconds for the post to be published, then update its view count
            print("sleep 3, then modify post_views...")
            time.sleep(3)
            modify_post_views(title, viewsCount)
       
        # After the post is published, add the original CSDN view count to it
        def modify_post_views(tougao_title='', viewsCount=100):
            if tougao_title == '':
                print("error: no post title")
                return

            # The values are bound as parameters, so quotes in a title cannot break the statement
            sql = "update wp_postmeta set meta_value=meta_value+%s where meta_key='views' and post_id in (select id from wp_posts where post_title = %s);"
            print sql

            try:
                sql_conn = MySQLdb.connect(host=PROXY_MYSQL_SERVER['host'],
                                           port=int(PROXY_MYSQL_SERVER['port']),
                                           user=PROXY_MYSQL_SERVER['user'],
                                           passwd=PROXY_MYSQL_SERVER['passwd'],
                                           db=PROXY_MYSQL_SERVER['dbname'],
                                           charset='utf8')
                sql_cursor = sql_conn.cursor()
                sql_cursor.execute(sql, (int(viewsCount), tougao_title))
                sql_conn.commit()    # MySQLdb does not autocommit by default
                sql_cursor.close()
                sql_conn.close()
            except Exception as ex:
                print("modify_post_views() -- error_msg: %r" % ex)
       
        if __name__ == '__main__':
            spider_blog()
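The view-count update is safest when the title and the count are bound as query parameters, so a title containing quotes cannot break the statement. Below is a minimal sketch of that idea, assuming Python 3 and using the standard-library sqlite3 module as a stand-in for MySQL. The two tables are reduced to the columns the query touches, and `add_views` and `make_demo_db` are hypothetical names for this example. With MySQLdb the placeholder is `%s` rather than `?`.

```python
import sqlite3


def add_views(conn, post_title, views):
    """Add `views` to the 'views' meta value of every post with this exact title."""
    cur = conn.execute(
        "UPDATE wp_postmeta SET meta_value = meta_value + ? "
        "WHERE meta_key = 'views' "
        "AND post_id IN (SELECT id FROM wp_posts WHERE post_title = ?)",
        (int(views), post_title),
    )
    conn.commit()
    return cur.rowcount  # number of rows updated


def make_demo_db():
    """Create an in-memory database with simplified stand-ins for the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wp_posts (id INTEGER PRIMARY KEY, post_title TEXT)")
    conn.execute(
        "CREATE TABLE wp_postmeta (post_id INTEGER, meta_key TEXT, meta_value INTEGER)"
    )
    conn.execute("INSERT INTO wp_posts VALUES (1, ?)", ("It's a 'quoted' title",))
    conn.execute("INSERT INTO wp_postmeta VALUES (1, 'views', 5)")
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(add_views(demo, "It's a 'quoted' title", "6300"))
    print(demo.execute("SELECT meta_value FROM wp_postmeta").fetchone()[0])
```

Returning the row count gives a cheap check that the title actually matched a post; zero means the post was not found, for example because it had not been published yet.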
Migration example
Exported from the original CSDN blog: http://blog.csdn.net/sunboy_2050
Imported into the WordPress blog: http://blog.mimvp.com/category/csdn_blog/