How to use regex in BS4 `find_all` to return matched items with pattern priority?

Mangs · posted 2022-08-24 23:05:44

Question

I have the following regular expression:

import re

re.compile('|'.join([pattern1, pattern2, pattern3]))

I would like it to work in the following way:

Try to match only pattern1; if matched - stop; else - proceed.
Try to match only pattern2; if matched - stop; else - proceed.
Try to match only pattern3; stop.

However, it currently matches all of them.

I found this Q/A, which I thought answered my question, but adding flags=re.I does not change my result.

How can this be achieved (if at all)?

A reproducible example:

import re
from bs4 import BeautifulSoup

xml_doc = """
    <m3_commodity_group commodity3="Oilseeds"><m3_year_group_Collection><m3_year_group market_year3="2011/12"><m3_month_group_Collection><m3_month_group forecast_month3=""><m3_attribute_group_Collection><m3_attribute_group attribute3="Output"><Textbox40><Cell cell_value3="353.93"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Total
    Supply"><Textbox40><Cell cell_value3="429.49"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Trade"><Textbox40><Cell cell_value3="73.59"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Total
    Use  2/"><Textbox40><Cell cell_value3="345.49"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Ending
    Stocks"><Textbox40><Cell cell_value3="59.03"/></Textbox40></m3_attribute_group></m3_attribute_group_Collection><m3_value_group_Collection><m3_value_group><m3_attribute_group_Collection><m3_attribute_group attribute3="Output"><Textbox40><Cell Textbox44="filler"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Total
    Supply"><Textbox40><Cell Textbox44="filler"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Trade"><Textbox40><Cell Textbox44="filler"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Total
    Use  2/"><Textbox40><Cell Textbox44="filler"/></Textbox40></m3_attribute_group><m3_attribute_group attribute3="Ending
    Stocks"><Textbox40><Cell Textbox44="filler"/></Textbox40></m3_attribute_group></m3_attribute_group_Collection></m3_value_group></m3_value_group_Collection></m3_month_group></m3_month_group_Collection></m3_year_group></m3_year_group_Collection></m3_commodity_group>
    """

soup = BeautifulSoup(xml_doc, "xml")

# This gives 11 values.
len(soup.find_all(re.compile('|'.join([
    r'^m[0-9]_commodity_group$',r'^m[0-9]_region_group$',r'^m[0-9]_attribute_group$'
]), flags=re.I)))

# This gives 1 value. It's what I want, but I'd like to get it with the combined regex above (which would also work for other texts).
len(soup.find_all(re.compile('|'.join([
    r'^m[0-9]_commodity_group$'
]), flags=re.I)))

# This gives 10 values, which in this example I'd like to be ignored, since the first regex already gave results.
len(soup.find_all(re.compile('|'.join([
    r'^m[0-9]_attribute_group$'
]), flags=re.I)))

Answers

You could restructure your search:

patterns = [
    r'^m[0-9]_commodity_group$',
    r'^m[0-9]_region_group$',
    r'^m[0-9]_attribute_group$',
]
for pattern in patterns:  # patterns are tried in priority order
    result = soup.find_all(re.compile(pattern, flags=re.I))
    if result:
        break  # stop at the first pattern that matches anything
else:
    result = None  # no pattern matched anything

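That might be more readable than regex magic. It also sidesteps the underlying problem: find_all tests each tag name against the pattern separately, so an alternation cannot give one pattern priority across the whole document. If you will be executing this a lot, you might want to pre-compile your regexes once instead of at every loop iteration.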
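The pre-compilation suggested above could look something like the following sketch. The helper name find_all_by_priority and the module-level PATTERNS list are hypothetical, not part of Beautiful Soup, and soup is assumed to be the BeautifulSoup object from the question. Unlike the loop above, this version returns an empty list rather than None when nothing matches, which is simply a choice made for this sketch.

```python
import re

# Compile once, in priority order (patterns taken from the question).
PATTERNS = [
    re.compile(p, flags=re.I)
    for p in (
        r'^m[0-9]_commodity_group$',
        r'^m[0-9]_region_group$',
        r'^m[0-9]_attribute_group$',
    )
]


def find_all_by_priority(soup, patterns=PATTERNS):
    """Return the find_all() result for the first pattern that matches anything.

    `soup` is assumed to be a BeautifulSoup object (or a Tag). An empty list
    is returned when none of the patterns match.
    """
    for pattern in patterns:
        result = soup.find_all(pattern)
        if result:
            return result  # earlier patterns win; later ones are never tried
    return []
```

With the document from the question, find_all_by_priority(soup) should return only the single m3_commodity_group element, because the first pattern already produces a result and the loop stops there.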
