re.findall between two strings (but dismiss numeric digits)

Question

I am trying to parse many txt files. The following text is just one part of a larger txt file.

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0; text-align: justify">Prior to this primary offering, there has
been no public market for our common stock. We anticipate that the public offering price of the shares will be between $5.00 and
.00. We have applied to list our common stock on the Nasdaq Capital Market (&ldquo;Nasdaq&rdquo;) under the symbol &ldquo;HYRE.&rdquo;
If our application is not approved or we otherwise determine that we will not be able to secure the listing of our common stock
on the Nasdaq, we will not complete this primary offering.</P>

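My desired output is: be between $5.00 and .00. So I need to extract everything from be between up to the period that ends the sentence (without stopping at the decimal point in 5.00!). I tried the following (Python 3.7):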

import re
shareprice = re.findall(r"be between\s\$.+?\.", text, re.DOTALL)

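But this code gives me: be between $5. (it stops at the decimal point). I initially added a \s at the end of the pattern to require whitespace after the ., which would keep the decimal point in 5.00 intact, but in many of the other txt files there is no whitespace right after the period that ends the sentence. Is there any way to specify in my pattern that a \. followed by digits should be "skipped"?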

Thanks a lot. I hope this is clear. Best

Answer 1

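After parsing the plain text out of the HTML, you may consider matching any 0+ characters, as few as possible, followed by a . that is not followed by a digit: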

r"be between\s*\$.*?\.(?!\d)"

See the regex demo.
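As a rough sketch of how this could be wired together: the snippet below strips the tags with a crude substitution, decodes the entities, and then applies the pattern with re.DOTALL so that the match can span a line break. The sample string and its prices are made up for illustration, and the tag-stripping regex is only an assumption standing in for whatever HTML-to-text step you already use.

```python
import re
from html import unescape

# Hypothetical sample; the prices are invented for illustration only.
html = (
    '<P STYLE="margin: 0">We anticipate that the public offering price of the '
    'shares will be between $5.00 and\n$6.00. We have applied to list our common '
    'stock on the Nasdaq Capital Market (&ldquo;Nasdaq&rdquo;).</P>'
)

# Crude tag stripping (an assumption for this sketch); a real HTML parser is more robust.
text = unescape(re.sub(r"<[^>]+>", " ", html))

# A dot only ends the match if it is NOT followed by a digit.
pattern = r"be between\s*\$.*?\.(?!\d)"

# re.DOTALL lets .*? cross the line break inside the sentence.
matches = re.findall(pattern, text, re.DOTALL)

# Optionally collapse the line break and any repeated whitespace.
cleaned = [" ".join(m.split()) for m in matches]
print(cleaned)
```

The whitespace cleanup at the end is optional; it is only there because the matched span may contain the newline from the source file.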

Alternatively, if you only want to ignore a dot between two digits, you can use

r"be between\s*\$.*?\.(?!(?<=\d\.)\d)"

See this regex demo. The (?!(?<=\d\.)\d) part makes sure that only a \d\.\d sequence is skipped on the way to the first matching ., not just any \.\d.
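To see where the two patterns differ, here is a small comparison on a made-up sentence in which a period is followed by a digit but not preceded by one (think of a footnote marker glued to the end of a sentence). The names STRICT and DECIMAL_ONLY and the sample text are hypothetical and only serve this sketch.

```python
import re

# Stops at the first dot that is not followed by a digit.
STRICT = r"be between\s*\$.*?\.(?!\d)"
# Skips a dot only when it sits between two digits.
DECIMAL_ONLY = r"be between\s*\$.*?\.(?!(?<=\d\.)\d)"

# Invented sentence; the "2" after "share." stands for a footnote marker.
text = "The price will be between $5.00 and $6.00 per share.2 Trading starts later."

strict = re.findall(STRICT, text, re.DOTALL)
decimal_only = re.findall(DECIMAL_ONLY, text, re.DOTALL)

print(strict)        # runs on past "share.2" to the final period
print(decimal_only)  # stops at "share."
```

Which of the two is preferable depends on your files: the first is simpler, while the second is safer if sentence-ending periods can be followed directly by a digit.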

regex

parsing

findall

python-3.x