How can I read specific columns from a txt file in Python?

Question

I have a .txt dataset like this:

user_000044 2009-04-24  13:47:07    Spandau Ballet  Through The Barricades

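I need to read the last two columns, with "Spandau Ballet" as one field and "Through The Barricades" as another. How can I do that?

I need to create two arrays, artists = [] and tracks = [], and fill them in a loop, but I can't work out how to pick out those pieces of text within a line.

Can anyone help me?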

Answer 1

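If the columns in the file are separated by tabs, you can use np.loadtxt (a NumPy function):

import numpy as np

artists, tracks = np.loadtxt("myfile.txt", delimiter="\t", dtype=str, usecols=[3, 4], unpack=True)

This returns two NumPy arrays. Optionally, you can convert them to regular Python lists of strings: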

artists = [ str(s) for s in artists ]
tracks = [ str(s) for s in tracks ]
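
To try the call above without a file on disk, here is a minimal sketch. It assumes five tab-separated fields per line; the second row is made-up sample data, and io.StringIO stands in for the real file. Note that np.loadtxt treats "#" as the start of a comment by default, so passing comments=None may be worth considering if artist or track names can contain that character.

```python
import io

import numpy as np

# Hypothetical sample: five tab-separated fields per line (assumed layout).
sample = io.StringIO(
    "user_000044\t2009-04-24\t13:47:07\tSpandau Ballet\tThrough The Barricades\n"
    "user_000044\t2009-04-24\t13:50:12\tBerlin Ballet\tSwan Lake\n"
)

# usecols picks the 4th and 5th fields (zero-based 3 and 4);
# unpack=True returns one array per selected column.
# comments=None disables the default '#' comment handling.
artists, tracks = np.loadtxt(
    sample, delimiter="\t", dtype=str, usecols=[3, 4], unpack=True, comments=None
)

# Convert the NumPy arrays to plain Python lists of str.
artist_list = artists.tolist()
track_list = tracks.tolist()
```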

Answer 2

An option using plain Python with no third-party packages:

with open('dataset.txt', 'r') as f:
    data = f.readlines()

artists = []
tracks = []

for line in data:
    artist, track = line.split(' '*2)[-2::]
    artists.append(artist.strip())
    tracks.append(track.strip())

print(artists)
print(tracks)

Output:

['Spandau Ballet']
['Through The Barricades']

[-2::] takes the last two columns of each line; adjust the slice if you need other columns.
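
Splitting on exactly two spaces depends on the spacing between the last two columns. As a hedged variation, the sketch below splits on either a tab or any run of two or more spaces/tabs, which assumes that single spaces only ever occur inside a field. The helper name last_two_columns is hypothetical and not part of the answer above.

```python
import re

# Column separator: a run of 2+ spaces/tabs, or a single tab (an assumption
# about the file layout; single spaces are treated as part of a field).
SEPARATOR = re.compile(r"[ \t]{2,}|\t")


def last_two_columns(lines):
    """Return (artists, tracks) taken from the last two fields of each line."""
    artists, tracks = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = SEPARATOR.split(line)
        if len(fields) < 2:
            continue  # skip lines without at least two columns
        artist, track = fields[-2:]
        artists.append(artist)
        tracks.append(track)
    return artists, tracks


sample_lines = [
    "user_000044 2009-04-24  13:47:07    Spandau Ballet  Through The Barricades\n",
]
artists, tracks = last_two_columns(sample_lines)
```

With a real file, the same helper could be called as last_two_columns(f) inside a with open(...) as f: block, since iterating a file object yields its lines.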

Answer 3

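You would be better off using the pandas module to load the contents of the .txt into a pandas DataFrame and continuing from there. If you are not familiar with it, a DataFrame is the closest thing Python has to an Excel sheet. pandas handles reading the lines for you, so you don't have to write your own loop.

Assuming your text file has four tab-separated columns, this would look like: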

# IPython for demo:
import pandas as pd

df = pd.read_csv('ballet.txt', sep='\t', header=None, names=['artists', 'tracks'], usecols=[2, 3])
# usecols limits the DataFrame to the 3rd and 4th columns of your .txt (zero-based indices 2 and 3)

Your DataFrame could then look like this:

df
# Out: 
          artists                  tracks
0  Spandau Ballet  Through The Barricades
1   Berlin Ballet               Swan Lake

Access a single column by its name:

df.artists  # or by their index e.g. df.iloc[:, 0]
# Out: 
0    Spandau Ballet
1     Berlin Ballet
Name: artists, dtype: object

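At this point you could still put the data into arrays, but I can't think of a reason why you would really want to once you know the alternative.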
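
If plain lists are still needed, as the question asks, a minimal sketch follows. It assumes the four-column, tab-separated layout described above; the sample rows and the column names user and timestamp are made up for illustration, and io.StringIO stands in for the real file. As a variation on the call above, it names all four columns and selects two of them by name.

```python
import io

import pandas as pd

# Hypothetical four-column, tab-separated sample (assumed layout).
sample = io.StringIO(
    "user_000044\t2009-04-24 13:47:07\tSpandau Ballet\tThrough The Barricades\n"
    "user_000044\t2009-04-24 13:50:12\tBerlin Ballet\tSwan Lake\n"
)

df = pd.read_csv(
    sample,
    sep="\t",
    header=None,  # the file has no header row
    names=["user", "timestamp", "artists", "tracks"],
    usecols=["artists", "tracks"],  # keep only the last two columns
)

# Series.tolist() converts each column to a plain Python list.
artists = df["artists"].tolist()
tracks = df["tracks"].tolist()
```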
