Splitting sentences on space that follows a non-fixed length expression

Question

Given the following text:

text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"

I need to split it into sentences, keeping the citation markers attached:

["Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.",
 "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
 "She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]",
 "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]

I tried this, but it does not work:

new_line = re.split('(?<=\.) |(([.?!](\[\d+\])+))\s', text)
print(new_line)

This is the result I get:

['Van der Weyden was preoccupied by commissioned\xa0portraiture\xa0towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.', None, None, None, "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers", '.[2]', '.[2]', '[2]', 'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress', '.[3][4][5]', '.[3][4][5]', '[5]', "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"]

Answer 1

You can use

re.findall(r'(?s)(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text)

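See the regex demo. Details:

(?s) - same as re.S or re.DOTALL; makes . match across line breaks
(.*?(?:\.|[.?!](?:\[\d+\])+)) - Group 1:
- .*? - zero or more characters, as few as possible
- (?:\.|[.?!](?:\[\d+\])+) - a dot, or one of . / ? / ! followed by one or more occurrences of a [ + digits + ] substring
(?:\s+|\s*\Z) - one or more whitespace characters, or zero or more whitespace characters followed by the end of the string.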

See the Python demo:

import re
text = "Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character. In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2] She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5] It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
print( re.findall(r'(.*?(?:\.|[.?!](?:\[\d+\])+))(?:\s+|\s*\Z)', text, re.DOTALL) )

Output:

[
  'Van der Weyden was preoccupied by commissioned portraiture towards the end of his life[1] and was highly regarded by later generations of painters for his penetrating evocations of character.',
  "In this work, the woman's humility and reserved demeanour are conveyed through her fragile physique, lowered eyes and tightly grasped fingers.[2]",
  'She is slender and depicted according to the Gothic ideal of elongated features, indicated by her narrow shoulders, tightly pinned hair, high forehead and the elaborate frame set by the headdress.[3][4][5]',
  "It is the only known portrait of a woman accepted as an autograph work by van der Weyden,[1][3] yet the sitter's name is not recorded and he did not title the work![21][14][5][8][10]"
]
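
For readers who find the one-line pattern hard to scan, here is a minimal sketch of the same regex written with re.VERBOSE so each part carries its own comment. The names SENTENCE_RE and split_sentences and the short sample string are made up for illustration; they are not part of the answer above.

```python
import re

# Hypothetical names; the pattern itself is the one from the answer,
# only spread over several lines with re.VERBOSE.
SENTENCE_RE = re.compile(
    r"""
    (                       # group 1: one sentence
        .*?                 #   as few characters as possible
        (?:
            \.              #   a plain dot
          |                 #   or
            [.?!]           #   sentence punctuation
            (?:\[\d+\])+    #   followed by one or more citation markers
        )
    )
    (?:\s+|\s*\Z)           # whitespace, or optional whitespace at the very end
    """,
    re.DOTALL | re.VERBOSE,
)

def split_sentences(text):
    # findall returns only group 1 because the pattern has a single capturing group
    return SENTENCE_RE.findall(text)

sample = "First one. Second one.[2] Third one![3][4]"
print(split_sentences(sample))
# → ['First one.', 'Second one.[2]', 'Third one![3][4]']
```

Two limitations of this pattern are worth keeping in mind: a bare ? or ! without citation markers is not treated as a sentence end, and trailing text that does not end in one of the recognised endings is silently left out of the findall result.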

Answer 2

You need to use non-capturing groups ((?:...)); otherwise re.split includes the captured parts in the output:

import re
new_line = re.split(r'(?<=\.) |(?:[.?!](?:\[\d+\])+)\s', text)
print(new_line)
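
The effect of capturing groups on re.split can be seen on a short string. One caveat with the split-based pattern above: re.split discards whatever the pattern matches, so the second alternative also removes the punctuation and citation markers from the results. The sketch below shows both effects and one possible workaround that keeps the markers by inserting a newline after each sentence ending and then splitting on it. The sample string and variable names are made up for illustration, and the workaround assumes the text contains no newlines of its own.

```python
import re

sample = "One. Two.[2] Three![3][4]"

# The question's pattern: every capturing group adds an item (or None) per split.
with_groups = re.split(r'(?<=\.) |(([.?!](\[\d+\])+))\s', sample)
print(with_groups)
# → ['One.', None, None, None, 'Two', '.[2]', '.[2]', '[2]', 'Three![3][4]']

# Non-capturing groups: no extra items, but the matched ".[2] " is consumed.
without_groups = re.split(r'(?<=\.) |(?:[.?!](?:\[\d+\])+)\s', sample)
print(without_groups)
# → ['One.', 'Two', 'Three![3][4]']

# Possible workaround (assumes no newlines in the text): put the sentence
# ending back via \1, replace the following whitespace with "\n", then split.
marked = re.sub(r'(\.|[.?!](?:\[\d+\])+)\s+', r'\1\n', sample)
keep_markers = marked.split('\n')
print(keep_markers)
# → ['One.', 'Two.[2]', 'Three![3][4]']
```

The substitution is used here because lookbehind assertions in the re module must have a fixed width, so a variable-length run of citation markers cannot be placed inside (?<=...).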


Tags: python, regex, split, space, positive-lookahead