Extracting delimited data from .txt file into vector

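I have been looking around for a solution to this problem without much luck, so any advice or help would be greatly appreciated.

I am working in Python 3 in a Jupyter Notebook.

Suppose I have a words.txt file containing the following: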

(50)  PMC3933789  (view)
(36)  PMC4763238  (view)
(35)  PMC2821926  (view)
(26)  PMC3154047  (view)
(25)  PMC3471816  (view)
(25)  PMC4350884  (view)
(24)  PMC2809798  (view)
(23)  PMC2861733  (view)
(22)  PMC4556980  (view)
(22)  PMC4811477  (view)
(19)  PMC3271181  (view)
(19)  PMC3549280  (view)
(19)  PMC4879671  (view)
(18)  PMC2938390  (view)
(18)  PMC3186417  (view)
(18)  PMC3498278  (view)
(18)  PMC3601842  (view)
(16)  PMC3445503  (view)
(16)  PMC3491835  (view)

In Python, I want to read this .txt file, extract the delimited numbers, and assign them to a vector, i.e.:

var = [3933789, 4763238, 2821926...]

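Previously, I used Excel to separate the numbers by hand, append commas, and then copy the resulting values manually, but this gets tedious as the number of values grows. I would like to be able to do this in Python.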

If this is a consistent pattern, you can use a list comprehension that slices each line at fixed indices:

with open('my-file', 'r') as lines:
    # Characters 9-15 of each line hold the seven digits after "PMC"; blank lines are skipped
    numbers = [int(line[9:16]) for line in lines if line.strip()]
print(numbers)

For the example above, this gives:

[3933789, 4763238, 2821926, 3154047, 3471816, 4350884, 2809798, 2861733, 4556980, 4811477, 3271181, 3549280, 4879671, 2938390, 3186417, 3498278, 3601842, 3445503, 3491835]
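Note that the fixed slice relies on the count in parentheses always being two digits wide; a line such as `(100)  PMC...` would shift the ID by one character. A minimal sketch of a variant that splits on whitespace instead, assuming the ID is always the second whitespace-separated field and always starts with `PMC` (the helper name `ids_by_split` and the sample lines are made up for illustration):

```python
def ids_by_split(lines):
    """Return the numeric part of the second field of each non-blank line."""
    ids = []
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if not fields:         # skip blank lines
            continue
        # fields[1] looks like 'PMC3933789'; drop the assumed 'PMC' prefix
        ids.append(int(fields[1][len("PMC"):]))
    return ids


sample = ["(50)  PMC3933789  (view)\n", "\n", "(100)  PMC4763238  (view)\n"]
print(ids_by_split(sample))  # [3933789, 4763238]
```

An open file object can be passed in place of the sample list, since the function only iterates over lines.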

One approach is to use a regular expression inside a list comprehension:

import re
with open('mydata.txt', 'r') as file:  # mydata.txt is the name of the data file
    var = [int(re.search(r'PMC(\d+)', line).group(1)) for line in file]

Explanation

r'PMC(\d+)'                    - regular expression for capturing digits after PMC
re.search(r'PMC(\d+)', line)   - finds and captures digits in a line
re.search(...).group(1)        - returns capture group 1, i.e. the digits
int(...)                       - converts digits from string to number
for line in file               - iterates through the lines of the file
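One caveat with the comprehension above: `re.search` returns `None` for a line with no PMC ID (for example a blank line), so calling `.group(1)` on it raises `AttributeError`. A hedged sketch of a more forgiving version that simply skips such lines; the function name `extract_pmc_ids` and the sample lines are assumptions for illustration, not part of the original answer:

```python
import re

# Compile once: 'PMC' followed by one or more digits, with the digits captured
PMC_ID = re.compile(r'PMC(\d+)')


def extract_pmc_ids(lines):
    """Collect the digits after 'PMC' from every line that contains them."""
    ids = []
    for line in lines:
        match = PMC_ID.search(line)
        if match:  # lines without a PMC ID are skipped instead of raising
            ids.append(int(match.group(1)))
    return ids


sample = ["(50)  PMC3933789  (view)\n", "\n", "(36)  PMC4763238  (view)\n"]
print(extract_pmc_ids(sample))  # [3933789, 4763238]
```

With a real file this would be used as `with open('mydata.txt', 'r') as file: var = extract_pmc_ids(file)`.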