Python, parse multiple line string extracting characters and digits substring

Question

This is a follow-up to an earlier question. I now understand the problem more clearly and would like some further advice :)

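I have a string, produced by a machine learning algorithm, that generally has the following structure:

At the beginning and at the end, there may be some lines containing no characters (other than whitespace);
In the middle, there should be 2 lines, each containing a name (surname only, or first name and surname, or first-name initial plus surname, ...), followed by some numbers and (sometimes) other characters mixed in among the numbers;
One of the names is usually preceded by a special non-alphanumeric character (>, >>, @, ...).

Like this: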

Connery  3 5 7 @  4
>> R. Moore 4 5 67| 5 [

I need to extract the 2 names and the numbers, and check whether one of the lines starts with a special character, so my output should be:

name_01 = 'Connery'
digits_01 = [3, 5, 7, 4]
name_02 = 'R. Moore'
digits_02 = [4, 5, 67, 5]
selected_line = 2 (anything indicating that it's the second line)

In the linked original question, it was suggested that I use:

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    matches = re.findall(r'\w+', line)
    print(matches)

which produces a result very close to what I want:

['Connery', '3', '5', '7', '4']
['R', 'Moore', '4', '5', '67', '5']

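But I need the first two strings on the second line ('R', 'Moore') to be grouped together (basically, everything before the numbers start should be grouped together). It also skips detection of the special character. Should I fix up this output somehow, or can I tackle the problem in a completely different way?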

Answer 1

I'm not sure which characters you want to keep or remove, but something like the following should work for the example:

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+', line)
    print(matches)

Output:

['Connery', '3', '5', '7', '4']
['R. Moore', '4', '5', '67', '5']

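Note: I included a-z (lower and upper case) and the dot, with optional whitespace in between: [a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.], but you should adjust it to your actual needs. That part of the pattern needs at least three characters, so shorter names fall through to the \w+ alternative.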
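As a minimal sketch of how the token list could be turned into the name and integer list the question asks for, assuming the name is always the first token and every later all-digit token is a number (the helper name `split_tokens` is hypothetical, not part of the answer):

```python
import re

# Same pattern as in the answer: a run of letters, dots and whitespace, or a plain word.
TOKEN = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+')

def split_tokens(line):
    """Hypothetical helper: return (name, digits) for a single line."""
    tokens = TOKEN.findall(line)
    if not tokens:
        return None, []  # blank line or nothing recognisable
    name = tokens[0]  # assumption: the name comes first
    # Keep only the tokens that are purely decimal digits and convert them.
    digits = [int(t) for t in tokens[1:] if t.isdecimal()]
    return name, digits

print(split_tokens('Connery  3 5 7 @  4'))
print(split_tokens('    >> R. Moore 4 5 67| 5 ['))
```

This only post-processes the matches; it does not yet say which line carries the special character.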

Answer 2

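This will also include the special characters (keep in mind they are hard-coded, so you will have to add any missing characters to the [>@]+ part of the regex):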

# reuses `re` and `lines` from the snippet in Answer 1
for line in lines:
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+', line)
    print(matches)
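One thing to watch for with this pattern is that a marker character can also turn up among the numbers (the '@' in the first sample line), so it probably only counts when it is the first token. A rough sketch of that check, assuming the marker set is just `>` and `@` and using a hypothetical helper name `find_selected`:

```python
import re

TOKEN = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+')
MARK = re.compile(r'[>@]+')  # assumed marker characters; extend as needed

def find_selected(text):
    """Hypothetical helper: 1-based number of the line that starts with a marker."""
    for number, line in enumerate(text.split('\n'), start=1):
        tokens = TOKEN.findall(line)
        # Only a marker in first position counts; one between the numbers does not.
        if tokens and MARK.fullmatch(tokens[0]):
            return number
    return None  # no line starts with a marker

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
print(find_selected(inp))
```

The line number is relative to the text as passed in, so leading blank lines would need to be stripped first if they should not be counted.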

Answer 3

It is better to do this in several steps.

import re

# strip the whitespace at the start and end, then split into lines
lines = inp.strip().split('\n')
for line in lines:
    # for each line, identify the selection mark, the name, and the mess at the end
    # assuming names can't have numbers in them
    match = re.match(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$', line.strip())
    if match:
        selected_raw, name, numbers_raw = match.groups()
        # now parse the unprocessed bits
        selected = selected_raw is not None
        numbers = re.findall(r'\d+', numbers_raw)
        print(selected, name, numbers)

# output
False Connery ['3', '5', '7', '4']
True R. Moore ['4', '5', '67', '5']
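To get from this to the exact shape asked for in the question (integers rather than strings, plus the number of the selected line), the same steps could be wrapped in a function. This is only a sketch under the same assumption that names contain no digits; the function name `parse` and its return shape are hypothetical:

```python
import re

# Same pattern as above: optional marker, name without digits, trailing non-letters.
LINE = re.compile(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$')

def parse(text):
    """Hypothetical wrapper: return ([(name, digits), ...], selected_line)."""
    records = []
    selected_line = None
    for line in text.strip().split('\n'):
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like "name numbers"
        mark, name, numbers_raw = match.groups()
        digits = [int(n) for n in re.findall(r'\d+', numbers_raw)]
        records.append((name, digits))
        if mark is not None:
            selected_line = len(records)  # 1-based position among parsed lines
    return records, selected_line

sample = '''

Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 [

'''
records, selected_line = parse(sample)
print(records, selected_line)
```

Returning a list of tuples instead of separate `name_01`, `name_02` variables keeps the function usable if the input ever contains more than two lines.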

python

string

text-parsing