How to extract the list of text between the pattern using RegEx?

Question

I have text like this:

05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50 

ATVI - 0.00 23.50 (9,425.77)

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16 

AAPL - 0.00 6.16 (9,419.61)

05/28/21 05/28/21 Margin Div/Int - Income STARBUCKS CORP
COM
Payable: 05/28/2021
QUALIFIED DIVIDENDS 18.00 

SBUX - 0.00 18.00 (9,401.61)

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021 

 - - 0.00 (73.03) (9,474.64)

I want to extract the individual records, such as:

05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50 

ATVI - 0.00 23.50 (9,425.77)

and

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16 

AAPL - 0.00 6.16 (9,419.61)

and

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021 

 - - 0.00 (73.03) (9,474.64)

Each record should start with a date (\d+/\d+/\d) and end at (\n\n\d+/\d+/\d).

I tried re.findall(r'\d+/\d+/\d(.*?)\n\n\d+/\d+/\d+', a), but it did not work as expected.
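
A note on the attempt above: by default, `.` in Python's `re` module does not match a newline, so `(.*?)` cannot span the lines of a record. The sketch below runs the same pattern with and without `re.DOTALL` on a two-record sample that is shortened here for illustration.

```python
import re

# Shortened two-record sample in the same layout as the text above.
a = (
    "05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n"
    "\n"
    "COM\n"
    "Payable: 05/06/2021\n"
    "QUALIFIED DIVIDENDS 23.50 \n"
    "\n"
    "ATVI - 0.00 23.50 (9,425.77)\n"
    "\n"
    "05/13/21 05/13/21 Margin Div/Int - Income APPLE INC\n"
    "COM\n"
    "Payable: 05/13/2021\n"
    "QUALIFIED DIVIDENDS 6.16 \n"
    "\n"
    "AAPL - 0.00 6.16 (9,419.61)\n"
)

attempt = r'\d+/\d+/\d(.*?)\n\n\d+/\d+/\d+'

# Without re.DOTALL, "." stops at every newline, so nothing matches.
without_dotall = re.findall(attempt, a)

# With re.DOTALL, "." also matches "\n", so the group can span lines.
with_dotall = re.findall(attempt, a, flags=re.DOTALL)

print(without_dotall)
print(len(with_dotall))
print(repr(with_dotall[0][:12]))
```

Even with `re.DOTALL`, the single `\d` for the year leaves the year's last digit at the start of the captured group, the trailing `\d+/\d+/\d+` is consumed as part of the match, and the last record has no following date to terminate it, so only one match comes back for this sample.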

Answer 1

You can use this as a base and adapt it to get exactly what you need:

\d+\/\d+\/\d+(.*?)\n\n(\s+\d+\/\d+\/\d+|$)

You can try it in the demo.

The changes I made are:

/ is escaped as \/.
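There is a space between \n\n and the date in the sample text. I added it to the regex.
The year part of the date in the regex was missing a +. I added it.
The last record in the sample has no date after it. A check for that case is included.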
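
In Python, one possible adaptation of this idea is sketched below. It is not the exact pattern above. It assumes `re.DOTALL` so that `.` spans newlines, moves the next date into a lookahead so that `re.findall` does not consume it, and uses `\s*` and `\Z` so that both an optional space and the end of the text are accepted. The shortened sample is illustrative.

```python
import re

# Shortened sample in the layout from the question.
text = (
    "05/13/21 05/13/21 Margin Div/Int - Income APPLE INC\n"
    "COM\n"
    "Payable: 05/13/2021\n"
    "QUALIFIED DIVIDENDS 6.16 \n"
    "\n"
    "AAPL - 0.00 6.16 (9,419.61)\n"
    "\n"
    "05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE\n"
    "Payable: 05/28/2021 \n"
    "\n"
    " - - 0.00 (73.03) (9,474.64)\n"
)

record_re = re.compile(
    r"\d+/\d+/\d+"                  # date that opens a record
    r".*?"                          # record body, as short as possible
    r"(?=\n\n\s*\d+/\d+/\d+|\Z)",   # stop before the next record or at the end
    re.DOTALL,
)

# The pattern has no capture groups, so findall returns whole matches.
records = [m.strip() for m in record_re.findall(text)]
for record in records:
    print(repr(record))
```

Because the lookahead does not consume the next date, every record start stays available for the following match.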

Answer 2

You can match

.+?(?=\s*(?:\d{2}\/\d{2}\/\d{2} ){2}|$)

with the 'g' ("global") and 's' ("single-line" or "dot-all") flags set. The 's' flag makes the period match all characters, including line terminators.

Demo

The regular expression can be broken down as follows:

.+?                        # match one or more chars, lazily
(?=                        # begin a positive lookahead
  \s*                      # match zero or more whitespaces
  (?:                      # begin a non-capture group 
    \d{2}\/\d{2}\/\d{2}[ ] # match a date string followed by a space
  ){2}                     # end the non-capture group and execute it twice
|                          # or
  $                        # match the end of the string
)                          # end positive lookahead
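
A minimal Python sketch of the same pattern follows. Python's `re` has no 'g' flag; `re.findall` already returns every non-overlapping match, and `re.DOTALL` plays the role of 's'. Because `.+?` must consume at least one character, the raw result also contains whitespace-only fragments between records, so the sketch filters them out. That filtering step and the shortened sample are assumptions added here, not part of the answer.

```python
import re

# Shortened two-record sample in the layout from the question.
text = (
    "05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n"
    "\n"
    "COM\n"
    "Payable: 05/06/2021\n"
    "QUALIFIED DIVIDENDS 23.50 \n"
    "\n"
    "ATVI - 0.00 23.50 (9,425.77)\n"
    "\n"
    "05/13/21 05/13/21 Margin Div/Int - Income APPLE INC\n"
    "COM\n"
    "Payable: 05/13/2021\n"
    "QUALIFIED DIVIDENDS 6.16 \n"
    "\n"
    "AAPL - 0.00 6.16 (9,419.61)"
)

pattern = r".+?(?=\s*(?:\d{2}/\d{2}/\d{2} ){2}|$)"

# re.DOTALL lets "." cross line breaks, like the 's' flag.
raw_matches = re.findall(pattern, text, flags=re.DOTALL)

# Drop the whitespace-only fragments matched between records.
records = [m.strip() for m in raw_matches if m.strip()]

for record in records:
    print(repr(record))
```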

Answer 3

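You can match a date-like pattern at the start of a line, then repeatedly match all following lines that do not start with a date-like pattern.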

^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*

The pattern matches:

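^ Start of the string (or of each line when re.M is set)
\d+/\d+/\d+  Match a date-like pattern and a space
.* Match the rest of the line
(?: Non-capture group
- \n(?!^\d+/\d+/\d+ ).* Match a newline and the rest of the line, if the line does not start with a date-like pattern
)* Close the non-capture group and optionally repeat it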

See a regex demo and a Python demo.

You can use re.findall to get all the matches:

import re

pattern = r"^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*"
 
s = ("05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n\n....")
 
print(re.findall(pattern, s, re.M))
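
Since the snippet above elides most of the input, here is a self-contained run of the same pattern on a shortened three-record sample. The trailing-whitespace cleanup and the header-line printout are illustrative additions, not part of the answer.

```python
import re

pattern = r"^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*"

# Shortened three-record sample in the layout from the question.
s = (
    "05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n"
    "\n"
    "COM\n"
    "Payable: 05/06/2021\n"
    "QUALIFIED DIVIDENDS 23.50 \n"
    "\n"
    "ATVI - 0.00 23.50 (9,425.77)\n"
    "\n"
    "05/13/21 05/13/21 Margin Div/Int - Income APPLE INC\n"
    "COM\n"
    "Payable: 05/13/2021\n"
    "QUALIFIED DIVIDENDS 6.16 \n"
    "\n"
    "AAPL - 0.00 6.16 (9,419.61)\n"
    "\n"
    "05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE\n"
    "Payable: 05/28/2021 \n"
    "\n"
    " - - 0.00 (73.03) (9,474.64)"
)

# re.M makes "^" match at the start of every line, not only at the start of s.
# A match can end with the blank separator line, so strip trailing whitespace.
records = [m.rstrip() for m in re.findall(pattern, s, re.M)]

for record in records:
    # Print only the header line of each record.
    print(record.splitlines()[0])
```

The "Payable: ..." lines contain a date but do not start with one, so they stay inside their record.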

python

regex

text-extraction

python-3.x