Regular Expressions in Depth (C++20)

君鼎 · Published 2026-06-28 15:18:41

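1. What Is a Regular Expression

A regular expression (regex) is a compact notation for describing patterns in strings. It is essentially a small domain-specific language whose syntax defines a set of strings. Regular expressions are widely used for:

Input validation (email addresses, phone numbers, URLs, password strength, and so on)
Text search and extraction (log analysis, data scraping)
Find and replace (filtering sensitive words, normalizing formats)
Lexical analysis in compilers, syntax highlighting, and similar tasks

The standard library header <regex>, available since C++11 and still the standard facility in C++20, provides matching, searching, replacing, and iteration.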

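2. Basic Regex Syntax at a Glance

This section uses the ECMAScript grammar (JavaScript style), which is the default for std::regex. Most general regex knowledge applies here, although std::regex follows an older edition of ECMAScript and lacks some newer features such as lookbehind and named groups.

2.1 Ordinary Characters and Metacharacters

Ordinary characters (letters, digits, spaces, and so on) match themselves.
Metacharacters have special meaning: . ^ $ * + ? { } [ ] \ | ( )

To match a metacharacter literally, escape it with a backslash \. In C++ source the backslash itself must be escaped inside an ordinary string literal, so raw string literals R"(...)" are recommended to avoid escaping hell.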
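The escaping rule above can be sketched as follows. The two helper names are hypothetical; both functions compile the same pattern, once as a raw string literal and once as an ordinary literal with doubled backslashes.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` looks like a dotted number such as "1.2".
// The dot is escaped so that it matches a literal '.', not "any character".
bool is_dotted_version(const std::string& text) {
    static const std::regex raw(R"(\d+\.\d+)");   // raw string: single backslashes
    return std::regex_match(text, raw);
}

// The same pattern as an ordinary string literal: every backslash is doubled.
bool is_dotted_version_escaped(const std::string& text) {
    static const std::regex escaped("\\d+\\.\\d+");
    return std::regex_match(text, escaped);
}
```

Both functions should accept "1.2" and reject "1x2", since the escaped dot no longer acts as a wildcard.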

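2.2 Character Classes

Pattern	Description
`[abc]`	Matches any one of a, b, or c
`[^abc]`	Matches any one character other than a, b, or c (negation)
`[a-z]`	Matches any lowercase letter from a to z
`.`	Matches any single character except a line terminator
`\d`	Matches a digit; equivalent to `[0-9]`
`\D`	Matches a non-digit; equivalent to `[^0-9]`
`\w`	Matches a word character (letter, digit, or underscore); equivalent to `[A-Za-z0-9_]` in the default "C" locale
`\W`	Matches a non-word character
`\s`	Matches a whitespace character (space, tab, newline, and so on)
`\S`	Matches a non-whitespace character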
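A few of the classes from the table can be tried out with a minimal sketch like the one below. The helper names are made up for illustration, and ASCII input in the default locale is assumed.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: an identifier is a letter or underscore
// followed by any number of word characters.
bool is_identifier(const std::string& s) {
    static const std::regex re(R"([A-Za-z_]\w*)");
    return std::regex_match(s, re);
}

// Hypothetical helper: true if `s` contains no whitespace character.
bool has_no_whitespace(const std::string& s) {
    static const std::regex re(R"(\S*)");
    return std::regex_match(s, re);
}

// Hypothetical helper: true if `s` contains at least one non-digit.
bool contains_non_digit(const std::string& s) {
    static const std::regex re(R"(\D)");
    return std::regex_search(s, re);
}
```

For example, is_identifier("_tmp1") holds while is_identifier("1abc") does not, because the first character must come from `[A-Za-z_]`.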

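2.3 Quantifiers (Repetition)

Pattern	Description
`*`	The preceding expression occurs 0 or more times
`+`	The preceding expression occurs 1 or more times
`?`	The preceding expression occurs 0 or 1 time
`{n}`	The preceding expression occurs exactly n times
`{n,}`	The preceding expression occurs at least n times
`{n,m}`	The preceding expression occurs between n and m times

Quantifiers are greedy by default; appending ? makes them non-greedy (lazy), as in *?, +?, and ??.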
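The difference between greedy and lazy quantifiers is easiest to see on the same input. The sketch below uses a hypothetical helper, first_match, that compiles the pattern on every call purely for demonstration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring matched by `pattern`,
// or an empty string if nothing matches. Compiling per call is fine for a
// demo but wasteful in real code.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string{};
}
```

With the input "<b>bold</b>", the greedy pattern "<.+>" grabs the whole string, whereas the lazy pattern "<.+?>" stops at the first closing bracket and yields "<b>". Likewise, "a{2,3}" on "aaaa" takes three characters, the most the quantifier allows.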

2.4 Anchors

Pattern	Description
`^`	Matches the start of the string
`$`	Matches the end of the string
`\b`	Matches a word boundary
`\B`	Matches a non-word boundary
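As a rough illustration of the anchors in the table, the following sketch defines three hypothetical predicates. Each one uses std::regex_search, so the anchor alone decides where a match may occur.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if "cat" appears as a whole word.
bool mentions_cat(const std::string& text) {
    static const std::regex re(R"(\bcat\b)");
    return std::regex_search(text, re);
}

// Hypothetical helper: true if the line begins with "ERROR".
bool starts_with_error(const std::string& line) {
    static const std::regex re(R"(^ERROR)");
    return std::regex_search(line, re);
}

// Hypothetical helper: true if the line's last character is ';'.
bool ends_with_semicolon(const std::string& line) {
    static const std::regex re(R"(;$)");
    return std::regex_search(line, re);
}
```

Here mentions_cat("concatenate") is false because the letters around "cat" are word characters, so no word boundary exists at either end.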

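2.5 Grouping and Capturing

(pattern): capturing group; matches and captures the content, accessible by number.
(?:pattern): non-capturing group; groups without capturing, so it creates no backreference.
\1, \2…: backreference; matches the same text that the n-th capturing group matched.
(?'name'pattern) or (?<name>pattern): named capturing groups as found in other regex flavors. std::regex does not support them in any of its grammars; use numbered groups instead.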
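A short sketch of capturing groups, backreferences, and non-capturing groups follows. The function names are hypothetical; mark_count() is the std::basic_regex member that reports the number of capturing groups.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first word that is immediately repeated,
// e.g. "is is". \1 must match exactly what group 1 captured.
std::string first_repeated_word(const std::string& text) {
    static const std::regex re(R"(\b(\w+)\s+\1\b)");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m[1].str();   // the captured word
    }
    return {};
}

// (?:ab) groups without capturing, so only "(c)" counts as a capture group.
unsigned capture_group_count() {
    const std::regex re(R"((?:ab)+(c))");
    return re.mark_count();
}
```

Calling first_repeated_word("this is is a test") should return "is", and capture_group_count() should return 1.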

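2.6 Zero-Width Assertions

Pattern	Description
`(?=p)`	Positive lookahead: requires p to follow, without consuming characters
`(?!p)`	Negative lookahead: requires that p does not follow
`(?<=p)`	Positive lookbehind: requires p to precede (not supported by `std::regex`)
`(?<!p)`	Negative lookbehind: requires that p does not precede (not supported by `std::regex`)

std::regex supports only the two lookahead forms. Its ECMAScript grammar predates lookbehind, so a pattern containing (?<=...) or (?<!...) is expected to fail at construction with std::regex_error.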
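Lookahead is the zero-width assertion that std::regex handles reliably, so the sketch below sticks to it. The password rule and both function names are assumptions chosen for illustration only.

```cpp
#include <regex>
#include <string>

// Hypothetical rule: at least 8 characters, with at least one digit and one
// uppercase letter. Each (?=...) checks a condition without consuming input.
bool is_strong_password(const std::string& pw) {
    static const std::regex re(R"((?=.*\d)(?=.*[A-Z]).{8,})");
    return std::regex_match(pw, re);
}

// Hypothetical helper: returns the first number that is directly followed by
// "px". The unit is required by the lookahead but is not part of the match.
std::string first_px_value(const std::string& css) {
    static const std::regex re(R"(\d+(?=px))");
    std::smatch m;
    return std::regex_search(css, m, re) ? m.str() : std::string{};
}
```

For instance, first_px_value("width: 120px; height: 5em") yields "120" without the trailing "px".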

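3. Core Components of the C++20 Regex Library

3.1 Header and Main Classes

#include <regex>

std::regex: holds a compiled regular expression (an alias for std::basic_regex<char>).
std::wregex: the wide-character counterpart (std::basic_regex<wchar_t>).
std::cmatch / std::smatch: match result sets for C-style strings and std::string, respectively.
std::sub_match: a single sub-match, representing one capturing group.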
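To show how these classes fit together, here is a minimal sketch. NumberHit, find_id, and first_number are hypothetical names; the member functions used (position, prefix, suffix, str) are part of std::match_results and std::sub_match.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct NumberHit {
    bool found = false;
    std::string digits;      // text of capture group 1
    std::size_t offset = 0;  // where the whole match starts
    std::string before;      // text preceding the match
    std::string after;       // text following the match
};

// Looks for "#<digits>" and reports details taken from the std::smatch.
NumberHit find_id(const std::string& text) {
    static const std::regex re(R"(#(\d+))");
    NumberHit hit;
    std::smatch m;                           // results over std::string iterators
    if (std::regex_search(text, m, re)) {
        const std::ssub_match& group = m[1]; // one capturing group
        hit.found = true;
        hit.digits = group.str();
        hit.offset = static_cast<std::size_t>(m.position(0));
        hit.before = m.prefix().str();
        hit.after = m.suffix().str();
    }
    return hit;
}

// The same search on a C string pairs with std::cmatch instead.
std::string first_number(const char* text) {
    static const std::regex re(R"(\d+)");
    std::cmatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string{};
}
```

Index 0 of the result always refers to the whole match, and indices from 1 upward refer to the capturing groups in order of their opening parentheses.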

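3.2 Common Matching Functions

Function	Purpose
`std::regex_match`	Checks whether the entire string matches the regular expression.
`std::regex_search`	Searches the string for a substring that matches the regular expression.
`std::regex_replace`	Replaces matched substrings according to a format string.

All three functions accept std::regex_constants::match_flag_type flags to control their behavior.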
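The three functions can be compared side by side with a sketch along these lines. The wrapper names are hypothetical; format_first_only is one of the standard match_flag_type values and limits replacement to the first match.

```cpp
#include <regex>
#include <string>

// regex_match: the whole string must be digits.
bool whole_is_number(const std::string& s) {
    static const std::regex re(R"(\d+)");
    return std::regex_match(s, re);
}

// regex_search: digits anywhere in the string are enough.
bool contains_number(const std::string& s) {
    static const std::regex re(R"(\d+)");
    return std::regex_search(s, re);
}

// regex_replace: every run of digits becomes '#'.
std::string mask_numbers(const std::string& s) {
    static const std::regex re(R"(\d+)");
    return std::regex_replace(s, re, "#");
}

// regex_replace with a match flag: only the first run of digits is replaced.
std::string mask_first_number(const std::string& s) {
    static const std::regex re(R"(\d+)");
    return std::regex_replace(s, re, "#",
                              std::regex_constants::format_first_only);
}

// In the format string, $1 and $2 refer to the capture groups.
std::string swap_pair(const std::string& s) {
    static const std::regex re(R"((\w+)=(\w+))");
    return std::regex_replace(s, re, "$2=$1");
}
```

Note that "abc123" passes contains_number but fails whole_is_number, which is the most common source of confusion between the first two functions.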

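3.3 Compilation Flags

Syntax options and optimization flags can be passed when constructing a std::regex. Common ones:

std::regex pattern("...", std::regex_constants::ECMAScript | std::regex_constants::optimize);

ECMAScript: the default grammar, similar to JavaScript.
basic, extended, awk, grep, egrep: the other grammar variants.
icase: case-insensitive matching.
optimize: asks the engine to favor matching speed over construction speed; suited to patterns that are matched many times.
multiline (C++17, ECMAScript grammar only): makes ^ and $ match at the start and end of each line rather than only the whole string. Support arrived late in some standard libraries, so check yours before relying on it.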
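Flags are combined with the bitwise OR operator. The sketch below contrasts a case-insensitive pattern with its case-sensitive twin; the function names and the word list are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: accepts "y", "yes", or "ok" in any letter case.
bool is_affirmative(const std::string& answer) {
    static const std::regex re(
        "y|yes|ok",
        std::regex::ECMAScript | std::regex::icase | std::regex::optimize);
    return std::regex_match(answer, re);
}

// The same pattern without icase is case-sensitive.
bool is_affirmative_exact(const std::string& answer) {
    static const std::regex re("y|yes|ok");
    return std::regex_match(answer, re);
}
```

The flag constants are available both as std::regex_constants::icase and as static members such as std::regex::icase; the two spellings refer to the same values.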

3.4 Iterators

std::regex_iterator: iterates over all matches in a string.
std::regex_token_iterator: iterates over matches or selected capturing groups; often used to split strings.
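String splitting with std::regex_token_iterator might look like the sketch below. split_csv is a hypothetical name, and the separator pattern (a comma with optional surrounding whitespace) is an assumption about the input format.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits a line on commas, trimming the whitespace
// around each comma.
std::vector<std::string> split_csv(const std::string& line) {
    static const std::regex separator(R"(\s*,\s*)");
    // Submatch index -1 selects the text between matches rather than
    // the matches themselves.
    std::sregex_token_iterator first(line.begin(), line.end(), separator, -1);
    std::sregex_token_iterator last;   // default-constructed: end of sequence
    return std::vector<std::string>(first, last);
}
```

Passing 0 instead of -1 would iterate over the separators themselves, and a positive index would select that capturing group from each match.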

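4. Guidelines for Safe, Clean C++20 Code

4.1 Write Patterns as Raw String Literals

Regex patterns in C++ contain many backslashes. The traditional form "\\d{3}" is error-prone and hard to maintain, so prefer R"()":

auto phone_pattern = std::regex(R"(\d{3}-\d{4})");   // clear and direct

4.2 Avoid Recompiling Regex Objects

Compiling a pattern (constructing a std::regex) is expensive. A good practice is to declare the regex object static const so that it is compiled only once; initialization of a function-local static is thread-safe since C++11.

static const std::regex email_regex(R"(^[\w.+-]+@[\w-]+\.[\w.-]+$)");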

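4.3 Exception Handling

Syntax errors, unsupported features, and resource exhaustion raise std::regex_error. It can be thrown when the pattern is constructed and also during matching (for example with error_complexity or error_stack). Robust code should catch it:

try {
    static const std::regex re(R"(\d+)");
} catch (const std::regex_error& e) {
    std::cerr << "Regex error: " << e.what() << " (code: " << static_cast<int>(e.code()) << ")\n";
    // handle the error appropriately
}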
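Exception handling matters most when the pattern comes from outside the program, such as a configuration file or user input. One possible shape for that, with hypothetical function names, is to turn the exception into a return value:

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: compiles a run-time pattern and returns std::nullopt
// instead of letting std::regex_error escape.
std::optional<std::regex> try_compile(const std::string& pattern) {
    try {
        return std::regex(pattern);
    } catch (const std::regex_error&) {
        return std::nullopt;
    }
}

// Hypothetical helper: "ok" for a valid pattern, otherwise the library's
// error text. The wording of what() differs between implementations.
std::string compile_status(const std::string& pattern) {
    try {
        const std::regex re(pattern);
        return "ok";
    } catch (const std::regex_error& e) {
        return std::string("regex_error: ") + e.what();
    }
}
```

A caller can then write if (auto re = try_compile(user_input)) and use *re for matching, keeping the try block in one place.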

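4.4 Use `std::format` (C++20) for Output

std::format makes printing results cleaner than chained stream insertions.

#include <format>
#include <iostream>
// ...
std::cout << std::format("Match found at position {}: {}\n", match.position(), match.str());

4.5 Wrap Common Operations in Reusable Functions

For example, wrap validation in a function that returns bool, or extraction in a function that returns std::optional or std::vector. This is both safe and clean.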
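As one way to package extraction behind a function, the sketch below collects every key=value pair from a string into a vector. The name parse_pairs and the assumption that keys and values consist of word characters are illustrative only.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns all "key=value" pairs found in `text`.
// Callers never see the regex or the iterator machinery.
std::vector<std::pair<std::string, std::string>>
parse_pairs(const std::string& text) {
    static const std::regex re(R"((\w+)=(\w+))");
    std::vector<std::pair<std::string, std::string>> pairs;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), re); it != end; ++it) {
        // (*it)[1] is the key group, (*it)[2] the value group.
        pairs.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return pairs;
}
```

An empty vector signals that nothing matched, so no separate error channel is needed for the common case.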

5. Complete Code Examples

5.1 Validating an Email Address

#include <iostream>
#include <regex>
#include <string>
#include <string_view>
#include <format>

bool is_valid_email(std::string_view email) {
    // A general-purpose (simplified) email pattern
    static const std::regex pattern(
        R"(^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$)",
        std::regex_constants::ECMAScript | std::regex_constants::optimize
    );

    try {
        return std::regex_match(email.begin(), email.end(), pattern);
    } catch (const std::regex_error&) {
        // Matching can still throw (e.g. error_complexity or error_stack on
        // pathological input), so treat that case as "not valid"
        return false;
    }
}

int main() {
    std::string test = "user@example.com";
    std::cout << std::format("'{}' is valid: {}\n", test, is_valid_email(test));
    test = "not-an-email";
    std::cout << std::format("'{}' is valid: {}\n", test, is_valid_email(test));
}

5.2 Extracting the Date from a Log Line

Suppose each log line has the form [2026-06-03 14:30:00] ERROR: message, and we want to extract the date part.

#include <iostream>
#include <regex>
#include <string>
#include <optional>
#include <format>

std::optional<std::string> extract_date(const std::string& log_line) {
    // Capture group: the date inside the bracket, in YYYY-MM-DD form
    static const std::regex date_regex(
        R"(\[(\d{4}-\d{2}-\d{2})\s)",
        std::regex_constants::optimize
    );

    std::smatch match;
    if (std::regex_search(log_line, match, date_regex) && match.size() > 1) {
        return match[1].str();   // first capture group
    }
    return std::nullopt;
}

int main() {
    std::string log = "[2026-06-03 14:30:00] ERROR: Disk full";
    if (auto date = extract_date(log)) {
        std::cout << std::format("Extracted date: {}\n", *date);
    } else {
        std::cout << "No date found.\n";
    }
}

5.3 Replacing Sensitive Words

Replace every occurrence of a sensitive word with ***, ignoring case.

#include <iostream>
#include <regex>
#include <string>
#include <format>

std::string censor_text(std::string text, const std::string& forbidden_word) {
    // The regex is built at run time for demonstration; prefer static when the pattern is fixed.
    // Note: forbidden_word is interpreted as a pattern, so metacharacters in it are not matched literally.
    try {
        std::regex word_regex(forbidden_word,
                              std::regex_constants::ECMAScript |
                              std::regex_constants::icase |
                              std::regex_constants::optimize);

        return std::regex_replace(text, word_regex, "***");
    } catch (const std::regex_error& e) {
        std::cerr << std::format("Regex error: {}\n", e.what());
        return text; // on failure, return the original string
    }
}

int main() {
    std::string message = "You are an idiot, IDIOT!";
    std::string clean = censor_text(message, "idiot");
    std::cout << std::format("Censored: {}\n", clean);
}
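Because censor_text passes the word straight to the regex constructor, a word such as "c++" would be read as a pattern rather than as literal text. A possible remedy, sketched here with hypothetical helpers escape_regex and censor_literal, is to escape the ECMAScript metacharacters first:

```cpp
#include <regex>
#include <string>
#include <string_view>

// Hypothetical helper: prefixes every ECMAScript metacharacter with a
// backslash so the result matches `literal` character for character.
std::string escape_regex(std::string_view literal) {
    static constexpr std::string_view metachars = R"(.^$|()[]{}*+?\)";
    std::string escaped;
    escaped.reserve(literal.size() * 2);
    for (char c : literal) {
        if (metachars.find(c) != std::string_view::npos) {
            escaped += '\\';
        }
        escaped += c;
    }
    return escaped;
}

// Hypothetical variant of censor_text that treats the word as plain text.
std::string censor_literal(const std::string& text, const std::string& word) {
    const std::regex re(escape_regex(word), std::regex::icase);
    return std::regex_replace(text, re, "***");
}
```

With this in place, censor_literal("I like C++ and c++", "c++") replaces both spellings, while the unescaped version would treat the plus signs as quantifiers.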

5.4 Iterating over All Matches (Extracting All Numbers)

#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <format>

std::vector<int> extract_all_numbers(const std::string& input) {
    static const std::regex number_regex(R"(\d+)", std::regex_constants::optimize);
    std::vector<int> numbers;

    // regex_iterator walks through every match
    auto begin = std::sregex_iterator(input.begin(), input.end(), number_regex);
    auto end = std::sregex_iterator();

    for (auto it = begin; it != end; ++it) {
        // std::stoi throws std::out_of_range if the value does not fit in an int
        numbers.push_back(std::stoi(it->str()));
    }
    return numbers;
}

int main() {
    std::string data = "Price: 42, Discount: 15, Items: 3.";
    auto nums = extract_all_numbers(data);
    for (size_t i = 0; i < nums.size(); ++i) {
        std::cout << std::format("Number {}: {}\n", i + 1, nums[i]);
    }
}

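6. Caveats and Limitations

6.1 Performance

std::regex is relatively slow in most standard library implementations. The common implementations rely on backtracking, so some patterns can take exponential time (catastrophic backtracking).
For performance-critical work, consider external libraries such as boost::regex or RE2, which generally perform better; RE2 additionally guarantees matching time linear in the input size.
The general advice still holds: compile once and match many times, and consider the optimize flag (it is only a hint, and its effect depends on the implementation).

6.2 Unicode Support

The standard regex library has limited Unicode support: it works on individual char or wchar_t units, and \w and similar classes do not match all Unicode letters. For full Unicode property matching, use a third-party library such as ICU or Boost.Regex.

6.3 Lookbehind Limitations

std::regex does not support lookbehind assertions at all, neither (?<=...) nor (?<!...). Its ECMAScript grammar is based on an edition of the standard that has only lookahead; lookbehind was added to JavaScript later, in ECMAScript 2018. A common workaround is to match the preceding text with an ordinary group and read the part you need from a capturing group that follows it.

6.4 Thread Safety

A const std::regex object can be used by several threads at once for matching, because matching does not modify it.
Result objects such as std::smatch are not safe to share across threads; use a local variable for each match.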

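7. Summary

Regular expressions are the Swiss Army knife of text processing, and the <regex> library in C++20 covers matching, searching, replacing, and iteration. To write safe, clean regex code in C++, keep the following in mind:

Define patterns with raw string literals R"(...)" to avoid escaping problems.
Store compiled patterns in static constants, and add the optimize flag where it helps.
Always be prepared to catch std::regex_error so the program stays robust.
Use std::regex_match, std::regex_search, and std::regex_replace for the common cases.
Combine them with modern C++ features (std::format from C++20, std::optional from C++17, range-based for loops) for clearer code.
Know the standard library's limitations and switch to a more specialized regex engine when necessary.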
