Text comparison based on numbers/digits

Question

I need to compare the following two texts by extracting only the numbers from them:

text_1="The previous low was 27,523, recorded in May 1900. The 1.35 trillion (.5 million ) program could start in October. The number of people who left the country plunged 99.8 percent from a year earlier to 2,750, according to the data from the agency."

text_2="The subsidies, totalling 1.35tn, are expected to form part of a second budget. New plans to allocate .5 billion to a new reimbursement programme."

However, the word that follows a number also seems to matter (e.g. trillion/tn, billion). Do you know how I could get this information?

I tried

t_1=[int(s) for s in text_1.split() if s.isdigit()]
t_2=[int(s) for s in text_2.split() if s.isdigit()]

and then compared the results, but this does not give me all the numbers in the texts.

Expected output:

differences

text_1: {27,523, 1900, 99.8, 2,750}

text_2: {}

common
    {1.35,22.5}

Answer 1

Doing it the way you suggest is not impossible, but it is better done with a regular expression:

import re

text_1="The previous low was 27,523, recorded in May 1900. The 1.35 trillion (.5 million ) program could start in October. The number of people who left the country plunged 99.8 percent from a year earlier to 2,750, according to the data from the agency."

print(re.findall(r"\d+[,.\d]\d+", text_1))

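If you are not familiar with regular expressions, check a cheatsheet and try the pattern in a tester. Once you have the matches, getting your expected output is straightforward: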

nums_1 = re.findall(r"\d+[,.\d]\d+", text_1)
nums_2 = re.findall(r"\d+[,.\d]\d+", text_2)

# Keep the numbers from text_1 that also appear in text_2
common_nums = []
for num in nums_1:
    if num in nums_2:
        common_nums.append(num)

print(common_nums)
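
Note that the pattern above needs at least three characters starting with a digit, so it skips values such as `.5`, and it ignores the word after the number, which the question also asks about. The following is a minimal sketch of one way to cover both, not a definitive solution. The pattern `NUMBER_RE`, the `UNIT_ALIASES` table and the helper `extract_numbers` are assumptions introduced here for illustration, and the list of magnitude words is only a guess at what the texts may contain.

```python
import re
from decimal import Decimal

text_1 = ("The previous low was 27,523, recorded in May 1900. The 1.35 trillion "
          "(.5 million ) program could start in October. The number of people who "
          "left the country plunged 99.8 percent from a year earlier to 2,750, "
          "according to the data from the agency.")
text_2 = ("The subsidies, totalling 1.35tn, are expected to form part of a second "
          "budget. New plans to allocate .5 billion to a new reimbursement programme.")

# Assumed pattern: a number such as 27,523 / 1900 / 1.35 / .5, optionally
# followed by a magnitude word (the word list is illustrative, not exhaustive).
NUMBER_RE = re.compile(
    r"(?P<num>\d+(?:,\d{3})*(?:\.\d+)?|\.\d+)"
    r"\s*(?P<unit>trillion|tn|billion|bn|million|mn)?\b",
    re.IGNORECASE,
)

# Hypothetical alias table so that "tn" and "trillion" compare as equal.
UNIT_ALIASES = {"tn": "trillion", "bn": "billion", "mn": "million"}


def extract_numbers(text):
    """Map each number found in text to the set of magnitude words following it."""
    found = {}
    for match in NUMBER_RE.finditer(text):
        # Drop thousands separators so "2,750" and "2750" compare as equal.
        value = Decimal(match.group("num").replace(",", ""))
        unit = (match.group("unit") or "").lower()
        found.setdefault(value, set()).add(UNIT_ALIASES.get(unit, unit))
    return found


nums_1 = extract_numbers(text_1)
nums_2 = extract_numbers(text_2)

common = nums_1.keys() & nums_2.keys()   # numbers present in both texts
only_1 = nums_1.keys() - nums_2.keys()   # numbers only in text_1
only_2 = nums_2.keys() - nums_1.keys()   # numbers only in text_2

print("text_1 only:", sorted(only_1))
print("text_2 only:", sorted(only_2))
for value in sorted(common):
    # Show the magnitude words so mismatches such as million vs. billion stand out.
    print("common:", value, nums_1[value], nums_2[value])
```

With the two texts as written here, 1.35 and .5 come out as common, and the units reveal that .5 is followed by "million" in one text and "billion" in the other. Using `Decimal` rather than `float` keeps the comparison exact for values like 99.8.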

python

text-mining