Strip text with custom regex in Python

Question

(2, 43) 0.74670222994
(3, 15) 0.74132892839
(3, 31) 0.671141877647
(4, 19) 0.699490245832
(4, 47) 0.422715095257
(4, 48) 0.433278265941
(4, 0)  0.379862196713
(5, 19) 0.653731227092
(5, 72) 0.756726821729

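The above is a tf-idf matrix that has been written to a file. I only want to read the tf-idf values, such as 0.74132892839, and append them to a list.

Is there a way to do f.read() and then strip out the indices?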

Answer 1

A simple solution using the re.sub() function:

import re

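# specify your actual file name
with open('lines.txt', 'r') as fh:
    # drop each leading "(row, col)" index, then split on whitespace
    # (unlike split('\n'), this leaves no empty string after a trailing newline)
    result = re.sub(r'\([^)]+\)\s*', '', fh.read()).split()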

print(result)

Output:

['0.74670222994', '0.74132892839', '0.671141877647', '0.699490245832', '0.422715095257', '0.433278265941', '0.379862196713', '0.653731227092', '0.756726821729']

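\([^)]+\) - matches a sequence enclosed in parentheses, such as (2, 43); the trailing \s* also removes the whitespace that follows it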
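
If the values are needed as numbers rather than strings, one possible variation is to capture the number that follows each index and convert it with float(). This is only a sketch: the names LINE_RE and extract_tfidf and the inline sample text are illustrative, and the pattern assumes every entry looks like "(row, col) value" as in the question.

```python
import re

# Assumed line shape: "(row, col)" followed by whitespace and a decimal number.
LINE_RE = re.compile(
    r'\(\s*\d+\s*,\s*\d+\s*\)\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)'
)

def extract_tfidf(text):
    """Return the tf-idf values found in text as a list of floats."""
    return [float(m.group(1)) for m in LINE_RE.finditer(text)]

# Illustrative sample; in practice the text would come from fh.read().
sample = "(2, 43) 0.74670222994\n(3, 15) 0.74132892839\n(4, 0)  0.379862196713\n"
values = extract_tfidf(sample)
print(values)
```

Capturing the value instead of deleting the index means blank lines and trailing newlines are skipped, because only lines that match the full pattern contribute to the list.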

Tags: python, regex, strip