How to extract a human name from a string in Python

I have a string that comes from running OCR on an image, and I need a way to extract the people's names from it. The OCR result for the image is:

From: Al Amri, Salim <salim.amri@gmail.com>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <mohd4.king@rihal.om>

Ce: Al hajri, Malik <hajri990@ocaa.co.om>; Omar, Naif <nnnn49@apple.com>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: 40

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: 00
Room: Dhofar. Available on 11/11/2021. Rate: 00

Room: Nizwa. Available on 13/12/2022. Rate: 00

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri

The header contains four names in total, and the desired output is:

names = 'Al Hajri, Malik', 'Omar, Naif', 'Al Amri, Salim', 'Al Harthy, Mohammed' #desired output

But I don't know how to extract the names. I tried RegEx and came up with:

names = re.findall(r'(?i)([A-Z][a-z]+[A-Z][a-z][, ] [A-Z][a-z]+)', string) #regex to find names

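It searches for a capitalized word, then a comma, then another word that starts with a capital letter. It is close to the expected result, but it returns: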

names = ['Amri, Salim', 'Harthi, Mohammed', 'hajri, Malik', 'Omar, Naif', 'Luban, available', 'Mazoon, available'] # actual result

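I thought about maybe using another string to extract the room names and exclude them from the list, but I don't know how to implement that idea. I'm new to RegEx, so any help would be appreciated. Thanks in advance.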

Based on the content of your email, a reasonable approach might be to use:

[:;]\s*(.+?)\s*<

See an online demo.

  • [:;] - a colon or a semicolon;
  • \s* - 0+ (greedy) whitespace characters;
  • (.+?) - first capture group: 1+ (lazy) characters;
  • \s* - 0+ (greedy) whitespace characters;
  • < - a literal "<".

Note that I deliberately used (.+?) to capture the names, since names are notoriously hard to match.
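To see why the lazy quantifier matters, here is a minimal sketch that runs the pattern on the Ce: line from the question twice, once with the lazy (.+?) and once with a greedy (.+). The variable names are only illustrative.

```python
import re

line = "Ce: Al hajri, Malik <hajri990@ocaa.co.om>; Omar, Naif <nnnn49@apple.com>"

# Lazy: the group stops at the first '<' it can reach,
# so each name is captured separately.
lazy = re.findall(r'[:;]\s*(.+?)\s*<', line)

# Greedy: the group runs on to the last '<' on the line,
# so the first e-mail address and the second name end up in one match.
greedy = re.findall(r'[:;]\s*(.+)\s*<', line)

print(lazy)         # two separate names
print(len(greedy))  # a single, over-long match
```

With the greedy variant the two recipients on the line can no longer be told apart, which is the reason for preferring (.+?) here.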


import re
s = """From: Al Amri, Salim <salim.amri@gmail.com>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <mohd4.king@rihal.om>

Ce: Al hajri, Malik <hajri990@ocaa.co.om>; Omar, Naif <nnnn49@apple.com>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: 40

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: 00
Room: Dhofar. Available on 11/11/2021. Rate: 00

Room: Nizwa. Available on 13/12/2022. Rate: 00

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri"""
print(re.findall(r'[:;]\s*(.+?)\s*<', s))

Prints:

['Al Amri, Salim', 'Al Harthi, Mohammed', 'Al hajri, Malik', 'Omar, Naif']
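As a side sketch, the same pattern can be written with re.VERBOSE so that the breakdown given earlier sits right next to the code. The name NAME_BEFORE_EMAIL and the shortened sample text are illustrative and not part of the original code.

```python
import re

# Same pattern as above, spelled out so each piece carries its own comment.
NAME_BEFORE_EMAIL = re.compile(r"""
    [:;]    # a colon after the header label, or a semicolon between recipients
    \s*     # optional whitespace
    (.+?)   # group 1: the name, as few characters as possible
    \s*     # optional whitespace before the address
    <       # the literal '<' that opens the e-mail address
""", re.VERBOSE)

# Three lines taken from the OCR text in the question.
sample = (
    "From: Al Amri, Salim <salim.amri@gmail.com>\n"
    "Ce: Al hajri, Malik <hajri990@ocaa.co.om>; Omar, Naif <nnnn49@apple.com>\n"
    "Room: Luban, available on 26/09/2021. Rate: 40\n"
)

found = NAME_BEFORE_EMAIL.findall(sample)
print(found)
```

The Room: line produces no match because (.+?) cannot cross a line break and no "<" follows the colon on that line, which is what keeps the room names out of the result.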

Notwithstanding the excellent RE approach recommended by @JvdV, you could also achieve this with the following step-by-step approach:

OCR = """From: Al Amri, Salim <salim.amri@gmail.com>

Sent: 25 August 2021 17:20

To: Al Harthi, Mohammed <mohd4.king@rihal.om>

Ce: Al hajri, Malik <hajri990@ocaa.co.om>; Omar, Naif <nnnn49@apple.com>

Subject: Conference Rooms Booking Details

Dear Mohammed,

As per our last discussion these are the available conference rooms available for booking along
with their rates for full day:

Room: Luban, available on 26/09/2021. Rate: 40

Room: Mazoon, available on 04/12/2021 and 13/02/2022. Rate: 00
Room: Dhofar. Available on 11/11/2021. Rate: 00

Room: Nizwa. Available on 13/12/2022. Rate: 00

   

Please let me know which ones you are interested so we go through more details.
Best regards,

Salim Al Amri"""

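names = []
for line in OCR.split('\n'):
    tokens = line.split()
    # Only the header lines carry names; 'Ce:' is presumably an OCR misread of 'Cc:'
    if tokens and tokens[0] in ['From:', 'To:', 'Ce:']:
        # Several recipients on one line are separated by ';'
        parts = line.split(';')
        for i, p in enumerate(parts):
            # Drop the trailing <email> token; the first part also starts with the header label
            start = 1 if i == 0 else 0
            names.append(' '.join(p.split()[start:-1]))
print(names)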
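Both approaches return the names exactly as the OCR spelled them, so 'Al hajri, Malik' keeps its lowercase "h", while the desired output in the question shows 'Al Hajri, Malik'. If that capitalization is wanted, one possible post-processing step is str.title(). This is only a sketch under that assumption, and the list below is copied from the printed output rather than recomputed.

```python
# Names as printed by the approaches above.
extracted = ['Al Amri, Salim', 'Al Harthi, Mohammed', 'Al hajri, Malik', 'Omar, Naif']

# str.title() upper-cases the first letter of every word and lower-cases the rest.
normalized = [name.title() for name in extracted]
print(normalized)
```

Be aware that str.title() also rewrites names with internal capitals (for example, 'McDonald' becomes 'Mcdonald'), so it is not a safe choice for every data set.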