Fetching name and age from a text file
I have a .txt file from which I have to extract names and ages.
The data in the .txt file looks like this:
Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old.
Output 1: John, Sam, Kenner
Output 2: 47, 29, 36
I am using regular expressions to extract the data. For example, for age I use the following patterns:
re.compile(r'age:\s*\d{1,3}',re.I)
re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)
re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)
re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)
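I then apply another regular expression to the output of these patterns to pull out the number. The problem is that these patterns also return data I do not want. For example: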
This medical record is 10 years old.
From the sentence above I get "10", which I do not want.
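The unwanted match can be reproduced with a minimal sketch like the one below. It uses the fourth pattern exactly as written above. The second-pass `\d+` step (keeping the last number in the matched span) and the helper name `ages_from_years_pattern` are assumptions made for illustration, since the follow-up regex is not shown here.

```python
import re

SAMPLE = """Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs Height: 5 feet 1 inch; weight is 56 kgs.
This medical record is 10 years old."""

# Fourth pattern from the question, unchanged.
years_pattern = re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*', re.I)


def ages_from_years_pattern(text):
    """Return the number found by the pattern on each line, as strings."""
    ages = []
    for line in text.splitlines():
        match = years_pattern.search(line)
        if match:
            # Assumed second pass: keep the last number in the matched span.
            ages.append(re.findall(r'\d+', match.group())[-1])
    return ages


print(ages_from_years_pattern(SAMPLE))  # ['47', '36', '10']
```

The pattern only checks that a number is followed by a unit such as "years", so it cannot tell a person's age from the age of the record.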
I only want to extract people's names and ages. What approach should I take? Any help would be appreciated.
Take a look at the Cloud Data Loss Prevention API. Here is a GitHub repo with examples. This may be what you are looking for.
def inspect_string(project, content_string, info_types,
                   min_likelihood=None, max_findings=None, include_quote=True):
    """Uses the Data Loss Prevention API to analyze strings for protected data.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: A string representing the minimum likelihood threshold
            that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
            'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
        max_findings: The maximum number of findings to report; 0 = no maximum.
        include_quote: Boolean for whether to display a quote of the detected
            information in the results.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library.
    import google.cloud.dlp
    # Instantiate a client.
    dlp = google.cloud.dlp.DlpServiceClient()
    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{'name': info_type} for info_type in info_types]
    # Construct the configuration dictionary. Keys which are None may
    # optionally be omitted entirely.
    inspect_config = {
        'info_types': info_types,
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
        'limits': {'max_findings_per_request': max_findings},
    }
    # Construct the `item`.
    item = {'value': content_string}
    # Convert the project id into a full resource id.
    # (Client releases before 2.0 used dlp.project_path(project) instead.)
    parent = 'projects/{}'.format(project)
    # Call the API. google-cloud-dlp 2.x takes a single request mapping;
    # releases before 2.0 took (parent, inspect_config, item) positionally.
    response = dlp.inspect_content(
        request={'parent': parent, 'inspect_config': inspect_config,
                 'item': item})
    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print('Quote: {}'.format(finding.quote))
            except AttributeError:
                pass
            print('Info type: {}'.format(finding.info_type.name))
            print('Likelihood: {}'.format(finding.likelihood))
    else:
        print('No findings.')
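For the names and ages in the question, a sketch along the following lines might be a starting point. It assumes google-cloud-dlp 2.x (where `inspect_content` takes a single `request` mapping) and the built-in `PERSON_NAME` and `AGE` infoTypes. The helper names `build_inspect_request` and `names_and_ages` are hypothetical, and whether the service ignores "10 years old" for the medical record is not guaranteed, so the findings should still be reviewed.

```python
def build_inspect_request(project, text, info_types=("PERSON_NAME", "AGE")):
    """Assemble the request mapping for DlpServiceClient.inspect_content."""
    return {
        "parent": "projects/{}".format(project),
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
            # Ask the API to return the matched text with each finding.
            "include_quote": True,
        },
        "item": {"value": text},
    }


def names_and_ages(project, text):
    """Call the DLP API and return (names, ages) as lists of quoted strings.

    Needs the google-cloud-dlp package and valid Google Cloud credentials.
    """
    import google.cloud.dlp

    client = google.cloud.dlp.DlpServiceClient()
    response = client.inspect_content(
        request=build_inspect_request(project, text))
    names, ages = [], []
    for finding in response.result.findings:
        if finding.info_type.name == "PERSON_NAME":
            names.append(finding.quote)
        elif finding.info_type.name == "AGE":
            ages.append(finding.quote)
    return names, ages
```

Calling `names_and_ages('my-project', open('records.txt').read())` would then return the two lists separately; the project id and file name here are placeholders.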