FinGPT Data Engineering Guide: The Full Workflow for Real-Time Financial News Scraping and Cleaning

何红桔Joey

Published 2025-09-26 04:32:59

FinGPT project repository: https://gitcode.com/GitHub_Trending/fi/FinGPT

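Financial markets move fast, and a single news item can trigger sharp swings in asset prices. Faced with a flood of financial information, many investors and analysts struggle to collect, filter, and process the items that matter quickly and accurately. Using the FinGPT project, this article walks through how to scrape and clean financial news in real time with its built-in tools.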

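1. System Architecture Overview

FinGPT's financial news processing system is modular and consists of three main modules: data scraping, data cleaning, and sentiment analysis.

The framework pulls news from multiple financial data sources in real time and processes and analyzes it automatically. The scraping module fetches news content from financial websites, the cleaning module deduplicates and standardizes the raw data, and the sentiment analysis module judges the sentiment of each news text.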
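
To make the three-stage flow concrete, here is a minimal sketch of how scraping, cleaning, and sentiment analysis could be chained. It is not FinGPT's actual code: `NewsItem`, `run_pipeline`, and the three `demo_*` functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class NewsItem:
    """Hypothetical record passed between pipeline stages."""
    url: str
    text: str
    sentiment: Optional[str] = None


def run_pipeline(
    fetch: Callable[[], Iterable[NewsItem]],
    clean: Callable[[List[NewsItem]], List[NewsItem]],
    classify: Callable[[str], str],
) -> List[NewsItem]:
    """Run the three stages in order: scrape -> clean -> sentiment."""
    raw = list(fetch())
    cleaned = clean(raw)
    for item in cleaned:
        item.sentiment = classify(item.text)
    return cleaned


# Toy stand-ins for the three modules (illustrative only).
def demo_fetch():
    return [
        NewsItem("https://example.com/a", "Shares rally after earnings beat"),
        NewsItem("https://example.com/a", "Shares rally after earnings beat"),
        NewsItem("https://example.com/b", "Regulator opens probe into lender"),
    ]


def demo_clean(items):
    # Drop exact repeats of the same (url, text) pair, keeping the first.
    seen, out = set(), []
    for item in items:
        key = (item.url, item.text.strip().lower())
        if key not in seen:
            seen.add(key)
            out.append(item)
    return out


def demo_classify(text):
    # Keyword rule standing in for a real sentiment model.
    return "positive" if "rally" in text.lower() else "negative"


results = run_pipeline(demo_fetch, demo_clean, demo_classify)
```

Passing each stage in as a function keeps the stages independent, so any one of them can be swapped for a real scraper, cleaner, or classifier without touching the others.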

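Source path for the related modules: fingpt/FinGPT_RAG/multisource_retrieval/

2. Environment Setup

Before scraping and cleaning financial news with FinGPT, set up the environment.

First, clone the repository: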

git clone https://gitcode.com/GitHub_Trending/fi/FinGPT.git
cd FinGPT

Then install the required dependencies:

pip install -r fingpt/FinGPT_RAG/requirements.txt

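3. Multi-Source News Scraping

FinGPT supports scraping news from several mainstream financial media outlets, including Seeking Alpha, Bloomberg, and Reuters.

3.1 Scraping Seeking Alpha News

Seeking Alpha is an important financial news platform. FinGPT includes a dedicated scraper that calls the site's API, as well as a web-page scraper, to retrieve its content.

The core code for fetching an article through the Seeking Alpha API is shown below: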

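import requests
from fake_useragent import UserAgent

# base_url is defined elsewhere in the source module; its value is not shown here.


def process_article(number):
    url = f"{base_url}{number}"
    try:
        # Use a random User-Agent on every request.
        headers = {"User-Agent": UserAgent().random}
        # The timeout keeps a stalled connection from hanging the scraper.
        response = requests.get(url, headers=headers, timeout=10)
        success = response.status_code == 200
        # Walk the nested JSON; a missing level yields None instead of raising.
        content = response.json().get("data", {}).get("attributes", {}).get("content", None)
    except Exception:
        # Network errors and non-JSON bodies are both reported as a failed fetch.
        success = False
        content = None

    return {
        "number": number,
        "success": success,
        "content": content
    }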

Source path: fingpt/FinGPT_RAG/multisource_retrieval/scrapers/seeking_alpha/seeking_alpha_scraping.py
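
Because process_article handles one article number at a time and returns a plain dictionary, it lends itself to being run over a range of numbers concurrently. The sketch below shows one way to do that with a thread pool. `fetch_many` and `fake_process_article` are hypothetical names; the fake fetcher replaces the real network call so the example runs offline, and only the return shape (`number`, `success`, `content`) is taken from the article.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(fetch_one, numbers, max_workers=4):
    """Apply fetch_one to each article number concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_one, numbers))
    # Split results into usable articles and the numbers that need a retry.
    succeeded = [r for r in results if r["success"] and r["content"]]
    failed = [r["number"] for r in results if not (r["success"] and r["content"])]
    return succeeded, failed


# Offline stand-in with the same return shape as process_article.
def fake_process_article(number):
    ok = number % 2 == 0
    return {
        "number": number,
        "success": ok,
        "content": f"article {number}" if ok else None,
    }


succeeded, failed = fetch_many(fake_process_article, range(1, 6))
```

Keeping `max_workers` small is a deliberate choice here: a scraper that fires many parallel requests is more likely to be rate-limited or blocked.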

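3.2 General-Purpose Web Scraper

Besides the API-based approach for specific sites, FinGPT also provides a general-purpose web scraper that handles financial news pages with different structures.

The core scraping code is shown below. It dispatches to a site-specific parser based on the domain found in the link: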

import re

from bs4 import BeautifulSoup

# requests_get and the scrape_* helpers are defined elsewhere in the project.


def scraping_by_url(link, subject):
    if "seekingalpha.com" in link:
        print("Found 1 Seeking Alpha link:", link)
        if "xml" not in link:
            print("Non-.xml case of Seeking Alpha")
            url, subject = scrape_seeking_alpha_article_page(link, subject)
            if url != "N/A":
                return url, subject
        else:
            print(".xml case of Seeking Alpha")
            response = requests_get(link)
            soup = BeautifulSoup(response.content, 'lxml-xml')
            hyphenated_subject = "-".join([word.strip("'\"") for word in subject.split()])
            print("Hyphenated subject:", hyphenated_subject)

            # Find the first <loc> whose text contains the hyphenated subject.
            # re.escape keeps punctuation in the subject from being read as regex syntax.
            loc_element = soup.find('loc', string=re.compile(re.escape(hyphenated_subject)))
            if loc_element:
                link = loc_element.text
                print("Found:", link, "from .xml")
                url, subject = scrape_seeking_alpha_article_page(link, subject)
                if url != "N/A":
                    return url, subject
            print("Didn't find from .xml")
    elif "reuters.com" in link:
        print("Found 1 Reuters link:", link)
        url, subject = scrape_reuters(subject)
        if url != "N/A":
            return url, subject
    elif "marketscreener.com" in link:
        print("Found 1 Market Screener link:", link)
        url, subject = scrape_market_screener.scrape_market_screen_article_page(link, subject)
        if url != "N/A":
            return url, subject
    # Handling for other sites...

Source path: fingpt/FinGPT_RAG/multisource_retrieval/news_scraper.py
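
The dispatcher above selects a parser with substring checks such as `"reuters.com" in link`, which would also match a link that merely mentions the domain in its query string. As an alternative sketch (not FinGPT's implementation), the same routing can be written as a lookup table keyed on the parsed hostname. The `handle_*` functions and `pick_handler` are hypothetical placeholders for the real site-specific parsers.

```python
from urllib.parse import urlparse


# Placeholder handlers; real ones would fetch and parse the page.
def handle_seeking_alpha(link, subject):
    return ("seekingalpha", link)


def handle_reuters(link, subject):
    return ("reuters", link)


def handle_market_screener(link, subject):
    return ("marketscreener", link)


HANDLERS = {
    "seekingalpha.com": handle_seeking_alpha,
    "reuters.com": handle_reuters,
    "marketscreener.com": handle_market_screener,
}


def pick_handler(link):
    """Return the handler whose domain matches the link's hostname, else None."""
    host = (urlparse(link).hostname or "").lower()
    for domain, handler in HANDLERS.items():
        # Match the bare domain or any subdomain such as www.reuters.com.
        if host == domain or host.endswith("." + domain):
            return handler
    return None


handler = pick_handler("https://www.reuters.com/markets/some-story")
no_handler = pick_handler("https://example.com/?ref=reuters.com")
```

With a table like this, supporting another site means adding one dictionary entry rather than another `elif` branch.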

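4. Data Cleaning and Standardization

Raw scraped data often contains duplicates and inconsistent formatting, so it needs to be cleaned and standardized.

4.1 Deduplication

FinGPT provides a simple word-overlap similarity score that can be used to flag duplicate or near-duplicate news content: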

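def similarity_score(a, b):
    """Count words of a that contain, or are contained in, some word of b,
    divided by the word count of the shorter text."""
    words_a = a.split()
    words_b = b.split()
    # Guard against division by zero when either text is empty.
    if not words_a or not words_b:
        return 0.0

    matching_words = 0
    for word_a in words_a:
        for word_b in words_b:
            # A substring match in either direction counts once per word of a.
            if word_a in word_b or word_b in word_a:
                matching_words += 1
                break

    # Normalized by the shorter text, so the score can exceed 1.0 when a is longer.
    similarity = matching_words / min(len(words_a), len(words_b))
    return similarity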
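
A score like this becomes a deduplication step once it is paired with a threshold. The sketch below shows one way to do that. The `deduplicate` function and the 0.8 threshold are assumptions made for illustration, not values from FinGPT, and the scoring function is restated in compact form with an empty-input guard so the example is self-contained.

```python
def similarity_score(a, b):
    """Compact restatement of the word-overlap score, with an empty-input guard."""
    words_a, words_b = a.split(), b.split()
    if not words_a or not words_b:
        return 0.0
    matching = sum(
        1 for wa in words_a if any(wa in wb or wb in wa for wb in words_b)
    )
    return matching / min(len(words_a), len(words_b))


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it scores below threshold against every kept text."""
    kept = []
    for text in texts:
        if all(similarity_score(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


headlines = [
    "Fed holds rates steady in September",
    "Fed holds rates steady in September meeting",
    "Oil prices slip on demand worries",
]
unique = deduplicate(headlines)
```

Each new text is compared against every kept text, so the cost grows quadratically with the number of items. That is fine for a few hundred headlines, but larger batches would call for a hashing-based approach.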

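4.2 Text Standardization

FinGPT applies uniform formatting to news text from different sources, including removing special characters and unifying date formats. Source path for the related utilities: fingpt/FinGPT_RAG/multisource_retrieval/utils/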
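
As an illustration of what such standardization can look like, the following sketch normalizes Unicode, strips invisible control characters, collapses whitespace, and converts a few common date layouts to ISO 8601. It uses only the Python standard library and is not taken from FinGPT's utilities: `normalize_text`, `normalize_date`, and the `DATE_FORMATS` list are assumptions.

```python
import re
import unicodedata
from datetime import datetime

# Date layouts assumed for illustration; extend to match the real sources.
DATE_FORMATS = ("%Y-%m-%d", "%d %b %Y", "%B %d, %Y", "%m/%d/%Y")


def normalize_text(text):
    """Normalize Unicode, drop control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Unicode categories starting with "C" cover control and format characters.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    return re.sub(r"\s+", " ", text).strip()


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    raw = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None


clean = normalize_text("Apple\u00a0 beats\u200b  estimates\n")
iso_date = normalize_date("September 26, 2025")
```

Returning `None` for an unparseable date, rather than guessing, lets a later step decide whether to drop the record or fall back to the scrape time.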

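5. Sentiment Analysis and Data Labeling

Once cleaned, the news data can be passed through sentiment analysis to support downstream financial analysis.

FinGPT provides sentiment classification backed by an external LLM, which labels each news text as positive, negative, or neutral: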

# df, sentence_column, classification_prompt and external_LLMs come from the surrounding source module.
# Label encoding: 0 = negative, 1 = positive, 2 = neutral.
for row_index, row in df.iloc[1:].iterrows():  # the first row is skipped, as in the source
    target_sentence = row[sentence_column]
    classification_response = external_LLMs.extract_classification(target_sentence, classification_prompt)
    if "negative" in classification_response:
        label = 0
    elif "positive" in classification_response:
        label = 1
    elif "neutral" in classification_response:
        label = 2
    else:
        # Unrecognized reply: store no label rather than the raw response text.
        label = None
    df.at[row_index, "openai_inferred_sentiment_from_RAG"] = label

Source path: fingpt/FinGPT_RAG/multisource_retrieval/utils/sentiment_classification_by_external_LLMs.py
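
Free-text LLM replies do not always contain exactly one label, and a plain substring check picks whichever label is tested first. The sketch below is one possible stricter parser: it is case-insensitive and returns `None` when the reply names no label or more than one. `parse_sentiment` is a hypothetical helper, and only the 0/1/2 encoding is taken from the code above.

```python
import re

# Same encoding as the loop above: 0 = negative, 1 = positive, 2 = neutral.
LABELS = {"negative": 0, "positive": 1, "neutral": 2}


def parse_sentiment(response):
    """Map an LLM reply to a label code, or None if it is missing or ambiguous."""
    words = re.findall(r"[a-z]+", response.lower())
    found = {w for w in words if w in LABELS}
    if len(found) != 1:
        return None
    return LABELS[found.pop()]


code = parse_sentiment("Sentiment: Positive.")
ambiguous = parse_sentiment("negative or neutral")
```

Rows that come back as `None` can be collected and re-queried, or reviewed by hand, instead of silently receiving the first label that happened to match.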

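6. Practical Applications and Extensions

6.1 Building a Real-Time Monitoring System

FinGPT's scraping and processing features can be combined into a real-time financial news monitoring system. For implementation details, see the community tutorial: README.md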
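
One simple way to approximate real-time monitoring is a polling loop that fetches the latest items at a fixed interval and reacts only to items it has not seen before. The sketch below assumes that design and is not taken from the FinGPT tutorial. `poll_new_items`, `monitor`, and the `(item_id, payload)` tuple shape are hypothetical.

```python
import time


def poll_new_items(fetch_latest, seen):
    """Return items whose id has not been seen yet, recording them in seen."""
    new_items = []
    for item_id, payload in fetch_latest():
        if item_id not in seen:
            seen.add(item_id)
            new_items.append((item_id, payload))
    return new_items


def monitor(fetch_latest, handle, interval_seconds=60, max_rounds=None, sleep=time.sleep):
    """Poll repeatedly; max_rounds=None runs until interrupted."""
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for item_id, payload in poll_new_items(fetch_latest, seen):
            handle(item_id, payload)
        rounds += 1
        # Skip the final sleep when a round limit has been reached.
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return seen


# Two simulated polls; the second repeats one item and adds a new one.
batches = iter([
    [("a1", "Stocks open higher")],
    [("a1", "Stocks open higher"), ("a2", "Bond yields fall")],
])
alerts = []
seen_ids = monitor(
    lambda: next(batches),
    lambda item_id, payload: alerts.append(item_id),
    max_rounds=2,
    sleep=lambda seconds: None,  # no real waiting in the demo
)
```

In a real deployment, `fetch_latest` would wrap the scrapers and `handle` would run cleaning and sentiment classification before storing or alerting on the item.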

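6.2 Adding Custom Data Sources

To scrape news from a site that FinGPT does not yet support, you can extend the existing scrapers with a custom parsing method. Related source directory: fingpt/FinGPT_RAG/multisource_retrieval/scrapers/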
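
The core of a custom parser is usually a function that turns a page's HTML into article text. The sketch below shows that part using only the standard library's `html.parser`. It is a hypothetical illustration: `ParagraphExtractor` and `extract_article_text` are not FinGPT names, and wiring the result into FinGPT's dispatcher is left out because the scrapers' full return contract is not shown in this article.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and collapse internal whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_article_text(html):
    """Return paragraph text joined by newlines, or None if none was found."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n".join(parser.paragraphs) or None


page = "<html><body><h1>Title</h1><p>First  para.</p><div><p>Second <b>bold</b> para.</p></div></body></html>"
article = extract_article_text(page)
```

Real news pages typically need site-specific selectors to skip navigation, ads, and comment sections, which is where a library such as BeautifulSoup, already used by FinGPT's scraper, is more convenient.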

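7. Summary and Outlook

This article described how to use FinGPT to scrape and clean financial news in real time. With the tools FinGPT provides, you can build a complete pipeline from data acquisition to sentiment analysis.

FinGPT continues to evolve as financial markets change. Going forward, the project plans to improve scraping efficiency, add support for more data sources, and raise the accuracy of sentiment analysis.

Readers are encouraged to study the project source code and adapt it to their own needs. If you run into problems, see the project's contribution guide: CONTRIBUTING.md.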
