使用正则表达式分隔 text/text 处理
Separating text/text processing using regex
我有一个段落需要用特定的关键字列表分隔。
这是文本(单个字符串):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
所以我想用python根据变量名来分隔这一段。 "Evaluation Note"、"Date"、"ID"、"Contact"、"Name"、"Address"、"Home Phone"、"Additional Information" 和 "Author" 是变量名。我认为使用正则表达式似乎不错,但我在正则表达式方面没有太多经验。
这是我正在尝试做的事情:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+)
ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (? P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019
ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
但没有找到任何模式。
您或许可以即时生成该正则表达式。只要参数的顺序固定就可以了。
这是我的尝试,确实可以。它所针对的实际正则表达式类似于 Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*)
,等等。
import re
test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']
def find(keys, string):
keys = [(key, key.replace(' ', '_')) for key in keys] # spaces aren't valid param names
pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys]) # generate the actual regex
for find in re.findall(pattern, test_str):
for item in find:
yield item.strip(':').strip() # clean up the result
for find in find(keys, test_str):
print(find)
哪个returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
您可以使用 search 获取变量的位置并相应地解析文本。您可以轻松自定义它。
import re
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
这将输出:
Evaluation Note: Suspected abuse by own mother.
Date: 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
希望对您有所帮助
我有一个段落需要用特定的关键字列表分隔。
这是文本(单个字符串):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
所以我想用python根据变量名来分隔这一段。 "Evaluation Note"、"Date"、"ID"、"Contact"、"Name"、"Address"、"Home Phone"、"Additional Information" 和 "Author" 是变量名。我认为使用正则表达式似乎不错,但我在正则表达式方面没有太多经验。
这是我正在尝试做的事情:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+)
ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (? P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019
ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
但没有找到任何模式。
您或许可以即时生成该正则表达式。只要参数的顺序固定就可以了。
这是我的尝试,确实可以。它所针对的实际正则表达式类似于 Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*)
,等等。
import re
test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']
def find(keys, string):
keys = [(key, key.replace(' ', '_')) for key in keys] # spaces aren't valid param names
pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys]) # generate the actual regex
for find in re.findall(pattern, test_str):
for item in find:
yield item.strip(':').strip() # clean up the result
for find in find(keys, test_str):
print(find)
哪个returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
您可以使用 search 获取变量的位置并相应地解析文本。您可以轻松自定义它。
import re
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
这将输出:
Evaluation Note: Suspected abuse by own mother.
Date: 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
希望对您有所帮助