Python regular expression (regex) match comma separated number - why does this not work?

Question

I am trying to parse transaction letters from my (German) bank. I'd like to extract all the numbers from the following string, which turns out to be harder than I thought. Option 2 does almost what I want, but I now want to modify it so that it also captures plain integers, e.g. 80.

My first try is option 1 which only returns garbage. Why is it returning so many empty strings? It should always have at least a number from the first \d+, no?

Option 3 works (or at least works as expected), so somehow I am answering my own question. I guess I'm mostly banging my head about why option 2 does not work.

# -*- coding: utf-8 -*-
import re


my_str = """
Dividendengutschrift für inländische Wertpapiere

Depotinhaber    : ME

Extag           :  18.04.2013          Bruttodividende
Zahlungstag     :  18.04.2013          pro Stück       :       0,9800 EUR
Valuta          :  18.04.2013

                                       Bruttodividende :        78,40 EUR
                                      *Einbeh. Steuer  :        20,67 EUR
                                       Nettodividende  :        78,40 EUR

                                       Endbetrag       :        57,73 EUR
"""

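# Option 1: integer with an optional decimal part in a capture group
print(re.findall(r'\d+(,\d+)?', my_str))
# Option 2: digits, a mandatory comma, digits
print(re.findall(r'\d+,\d+', my_str))
# Option 3: optionally signed decimal, or a plain integer
print(re.findall(r'[-+]?\d*,\d+|\d+', my_str))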

Output is

['', '', '', '', '', '', ',9800', '', '', '', ',40', ',67', ',40', ',73']
['0,9800', '78,40', '20,67', '78,40', '57,73']
['18', '04', '2013', '18', '04', '2013', '0,9800', '18', '04', '2013', '78,40', '20,67', '78,40', '57,73']

Answer

Option 1 is the most suitable of the three regexes, but it does not work as intended: when a pattern contains a capture group (), findall returns what the group matched, not the complete match.

For example, the first three matches in your example are 18, 04 and 2013. In each case the optional capture group does not take part in the match, so an empty string is added to the result list.
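
To see the difference between the group and the full match, here is a minimal sketch on a single shortened line (the sample string is a cut-down version of the letter, with "Stueck" spelled in ASCII for convenience). It compares findall with finditer, where group(0) always holds the complete match:

```python
import re

# Shortened sample line, assumed for illustration only.
sample = "Zahlungstag : 18.04.2013  pro Stueck : 0,9800 EUR"
pattern = r'\d+(,\d+)?'

# With one capture group in the pattern, findall returns only that group.
# For 18, 04 and 2013 the group is unmatched, which shows up as ''.
groups_only = re.findall(pattern, sample)

# finditer yields match objects; group(0) is the complete match.
full_matches = [m.group(0) for m in re.finditer(pattern, sample)]

print(groups_only)   # ['', '', '', ',9800']
print(full_matches)  # ['18', '04', '2013', '0,9800']
```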

The solution is to make the group non-capturing with (?:...):

r'\d+(?:,\d+)?'
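
As a quick check, the non-capturing version can be tried on a short line. The sample below is made up for illustration (it is not taken from the bank letter) and mixes a decimal amount, a plain integer and a date:

```python
import re

# Made-up sample line, assumed for illustration only.
sample = "Bruttodividende : 78,40 EUR  (80 Stueck, Extag 18.04.2013)"

# (?:...) groups without capturing, so findall returns the whole match.
numbers = re.findall(r'\d+(?:,\d+)?', sample)

print(numbers)  # ['78,40', '80', '18', '04', '2013']
```

Note that the date still comes back as three separate integers, because the pattern only treats a comma as a decimal separator.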

Option 2 fails only in that it won't match numbers that don't contain a comma, such as 18 or 80.

Option 3 isn't great because \d* allows zero digits before the comma, so it will match e.g. +,1.
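
A small sketch of that edge case, using a contrived input string (an assumption for illustration, not something from the letter), shows how option 3 and the non-capturing version of option 1 differ:

```python
import re

option_3 = r'[-+]?\d*,\d+|\d+'
fixed_option_1 = r'\d+(?:,\d+)?'

# Contrived input: fragments with no digits before the comma.
tricky = "+,1 and ,5"

loose = re.findall(option_3, tricky)
strict = re.findall(fixed_option_1, tricky)

print(loose)   # ['+,1', ',5']
print(strict)  # ['1', '5']
```

Option 3 accepts the sign and the bare comma as part of a "number", while the fixed option 1 only picks up the digits.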

Mangs