How to extract the age and gender of a person from unprocessed text/data?
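I have a CSV file containing a list of texts (a column with rows), and I want to extract the patient's age from each row. I can't use "is digit" because the text also contains other numbers. How can I do this? Thanks.
EXTRA: I would also like to extract the gender. The patient is sometimes referred to as male/female, sometimes as man/woman, and sometimes as gentleman/lady.
Is there a way to write a findall so that, for example, if the text contains 17-year-old, the number is printed only when it is followed by -year-old?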
re.findall("[\d].", '-year-old')
Example lines from the text:
This 23-year-old white female presents with...
...pleasant gentleman who is 42 years old...
...The patient is a 10-1/2-year-old born with...
...A 79-year-old Filipino woman...
Patient, 37,...
How can I get lists of the ages and genders, i.e.:
Age:
['23','42','79','37'...]
Gender:
['female','male','male','female','male'...]
You can do this easily with regex (regular expressions).
import re
# returns all numbers (runs of one or more digits)
age = re.findall(r"\d+", your_text)
# returns all whole words related to gender
gender = re.findall(r"\b(?:female|male|gentleman|lady|woman|man|girl)\b", your_text)
You can then use a dictionary to map each matched word to the correct gender label:
gender_dict = {"male": ["gentleman", "man", "male"],
               "female": ["female", "woman", "girl", "lady"]}
gender_aux = []
for g in gender:
    if g in gender_dict['male']:
        gender_aux.append('male')
    elif g in gender_dict['female']:
        gender_aux.append('female')
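Since the question asks for one gender per row, the word-to-label mapping above can also be applied line by line. The following is only a sketch under some assumptions: the names GENDER_WORDS, GENDER_RE, and extract_gender are made up for illustration, only the first gender word in each line is used, and lines without any gender word yield None.

```python
import re

# Hypothetical lookup table: surface word -> normalized label.
GENDER_WORDS = {
    "male": "male", "man": "male", "gentleman": "male",
    "female": "female", "woman": "female", "lady": "female", "girl": "female",
}

# \b keeps "man" from matching inside "woman" and "male" inside "female".
GENDER_RE = re.compile(r"\b(" + "|".join(GENDER_WORDS) + r")\b", re.IGNORECASE)

def extract_gender(line):
    """Return 'male' or 'female' for the first gender word in line, else None."""
    match = GENDER_RE.search(line)
    return GENDER_WORDS[match.group(1).lower()] if match else None

lines = [
    "This 23-year-old white female presents with...",
    "...pleasant gentleman who is 42 years old...",
    "...The patient is a 10-1/2-year-old born with...",
    "...A 79-year-old Filipino woman...",
    "Patient, 37,...",
]
genders = [extract_gender(line) for line in lines]
print(genders)
```

Returning None for rows with no gender word keeps the result aligned with the rows of the CSV, which a flat findall over the whole text does not.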
import re
re_list = [
    r'\d+-year-old',
    r'\d+ years? old',
]
matches = []
for r in re_list:
    matches += re.findall(r, 'pleasant gentleman who is 42 years old, This 23-year-old white female presents with')
print(matches)
This prints:
['23-year-old', '42 years old']
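To get only the number rather than the whole phrase, a capture group can be put around the digits. The sketch below is one possible way to do it, not a definitive solution: AGE_PATTERNS and extract_age are hypothetical names, the second pattern assumes that a number right after "Patient," is an age, and a fractional age such as 10-1/2 is reduced to its whole-year part.

```python
import re

# Patterns are tried in order; group 1 always holds the age digits.
AGE_PATTERNS = [
    # "23-year-old", "42 years old", "10-1/2-year-old" (fraction is skipped)
    re.compile(r"\b(\d{1,3})(?:-\d+/\d+)?[- ]years?[- ]old\b", re.IGNORECASE),
    # "Patient, 37," (assumption: a number right after "Patient," is the age)
    re.compile(r"\bPatient,\s*(\d{1,3})\b"),
]

def extract_age(line):
    """Return the age as a string of digits, or None if no pattern matches."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None

lines = [
    "This 23-year-old white female presents with...",
    "...pleasant gentleman who is 42 years old...",
    "...The patient is a 10-1/2-year-old born with...",
    "...A 79-year-old Filipino woman...",
    "Patient, 37,...",
]
ages = [extract_age(line) for line in lines]
print(ages)
```

Because the digits must be followed by a year-old phrase or preceded by "Patient,", other numbers in the text (dosages, dates, and so on) are ignored.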