Text comparison based on numbers/digits
I need to compare two texts by extracting only the numbers from them:
text_1="The previous low was 27,523, recorded in May 1900. The 1.35 trillion (.5 million ) program could start in October. The number of people who left the country plunged 99.8 percent from a year earlier to 2,750, according to the data from the agency."
text_2="The subsidies, totalling 1.35tn, are expected to form part of a second budget. New plans to allocate .5 billion to a new reimbursement programme."
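However, the comparison also seems to depend on the word that follows each number (e.g. trillion/tn, billion).
Do you know how I can get this information?
I tried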
t_1=[int(s) for s in text_1.split() if s.isdigit()]
t_2=[int(s) for s in text_2.split() if s.isdigit()]
and then comparing the results, but this does not give me all the numbers in the texts.
Expected output:
differences
text_1: {27,523, 1900, 99.8, 2,750}
text_2: {}
common
{1.35,22.5}
Doing it the way you suggested is not impossible, but it is better done with a regular expression:
import re
text_1="The previous low was 27,523, recorded in May 1900. The 1.35 trillion (.5 million ) program could start in October. The number of people who left the country plunged 99.8 percent from a year earlier to 2,750, according to the data from the agency."
print(re.findall(r"\d+[,.\d]\d+", text_1))
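To make the difference from the split()/isdigit() attempt concrete, here is a minimal sketch that runs both approaches on a shortened sample. The `sample` string is an illustration assembled from fragments of the question's texts, not the original data.

```python
import re

# Illustrative sample built from fragments of the question's texts (an assumption).
sample = "The previous low was 27,523, recorded in May 1900. It fell 99.8 percent to 2,750, from 1.35tn."

# str.isdigit() is True only for tokens made up entirely of digits,
# so "27,523," "1900." "99.8" and "1.35tn." are all rejected.
by_split = [tok for tok in sample.split() if tok.isdigit()]

# The pattern scans the raw string, so attached punctuation and suffixes do not matter.
by_regex = re.findall(r"\d+[,.\d]\d+", sample)

print(by_split)  # []
print(by_regex)  # ['27,523', '1900', '99.8', '2,750', '1.35']
```

Note that this pattern needs at least three characters to match, so one- and two-digit numbers such as "7" or "12" are skipped; that is fine for the texts here but worth keeping in mind for other data.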
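If you are not familiar with regular expressions, check a cheatsheet and try the pattern in a tester. Once you have the numbers, getting your expected output is straightforward: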
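# text_2 is the second string from the question
nums_1 = re.findall(r"\d+[,.\d]\d+", text_1)
nums_2 = re.findall(r"\d+[,.\d]\d+", text_2)
# Keep the numbers from text_1 that also occur in text_2
common_nums = []
for num in nums_1:
    if num in nums_2:
        common_nums.append(num)
print(common_nums)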
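The loop above only yields the common numbers, and it ignores the word that follows each number. As a rough sketch of one way to get both the differences and the common values while treating "1.35 trillion" and "1.35tn" as the same thing, the numbers can be collected into sets of (value, unit) pairs. The `UNITS` table, the `extract` helper, the tightened pattern, and the shortened sample texts are all assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical unit table: maps spellings and abbreviations to one canonical name.
UNITS = {
    "trillion": "trillion", "tn": "trillion",
    "billion": "billion", "bn": "billion",
    "million": "million", "mn": "million",
}

# A number (optional thousands separators and decimals), then an optional unit word.
# Longer unit names are tried first so "trillion" is not cut short.
UNIT_ALT = "|".join(sorted(UNITS, key=len, reverse=True))
NUM_UNIT = re.compile(
    r"(\d+(?:,\d{3})*(?:\.\d+)?)\s*(" + UNIT_ALT + r")?\b",
    re.IGNORECASE,
)

def extract(text):
    """Return a set of (value, unit) pairs; unit is None when no unit word follows."""
    found = set()
    for match in NUM_UNIT.finditer(text):
        value = float(match.group(1).replace(",", ""))
        unit = match.group(2)
        found.add((value, UNITS[unit.lower()] if unit else None))
    return found

# Shortened sample texts taken from the question (illustrative only).
sample_1 = "The previous low was 27,523, recorded in May 1900. The 1.35 trillion program could start in October."
sample_2 = "The subsidies, totalling 1.35tn, are expected to form part of a second budget."

set_1, set_2 = extract(sample_1), extract(sample_2)
common = set_1 & set_2   # pairs present in both texts
only_1 = set_1 - set_2   # pairs found only in the first text
only_2 = set_2 - set_1   # pairs found only in the second text

print(common)  # {(1.35, 'trillion')}
print(only_1)
print(only_2)  # set()
```

Using sets makes the "differences" and "common" parts of the expected output a matter of `-` and `&`. Comparing (value, unit) pairs rather than bare strings also means that 1.35 trillion and 1.35 billion would not be reported as common, which may or may not be what you want.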