Extracting columns containing a certain name

Question

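I'm trying to use Python to manipulate data in large txt files.

I have a txt file with more than 2000 columns, and about a third of them have a header containing the word 'Net'. I want to extract only those columns and write them to a new txt file. Any suggestions on how I can do that?

I have searched around a bit but haven't been able to find anything that helps. I apologize if a similar question has been asked and answered before.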

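Edit 1: Thank you all! At the time of writing, 3 users have suggested solutions, and they all work well. Honestly, I didn't think people would answer, so I didn't check for a day or two, and I was pleasantly surprised. I'm very impressed.

Edit 2: I've added an image showing what part of the original txt file might look like, in case it helps anyone in the future.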

Answer 1

This can be done with pandas:

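import pandas as pd

# Raw string so that '\s+' (any run of whitespace) is not treated as a string escape
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns)  # check that the columns are parsed correctly
# The substring test is case-sensitive: use "Net" if that is how the headers are written
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
# index=False keeps the row index from being written as an extra first column
df_filtered.to_csv('new_file.txt', index=False)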

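Of course, since we don't know the structure of your text file, you will have to adjust the arguments of read_csv to make this work for your case (see the corresponding documentation).

This loads the entire file into memory and then filters out the unneeded columns. If your file is too large to fit into RAM at once, you can use the usecols parameter to load only specific columns.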
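As a minimal sketch of the usecols approach: read_csv accepts a callable for usecols, which is evaluated against each column name so that non-matching columns are dropped during parsing rather than afterwards. The function name load_matching_columns, the default keyword 'Net', and the whitespace separator are assumptions made for illustration; adjust them to the real file.

```python
import pandas as pd


def load_matching_columns(path, keyword='Net', sep=r'\s+'):
    """Read only the columns whose header contains `keyword` (case-sensitive)."""
    # usecols may be a callable; pandas calls it with each column name and
    # keeps only the columns for which it returns True.
    return pd.read_csv(path, sep=sep, usecols=lambda name: keyword in name)
```

The result could then be written out with something like `load_matching_columns('path_to_file.txt').to_csv('new_file.txt', index=False)`.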

Answer 2

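One way of doing this without installing third-party modules such as numpy or pandas is as follows. Given an input file named "input.csv" with the following contents: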

a,b,c_net,d,e_net

0,0,1,0,1

(remove the blank lines in between; they are only there for formatting this post)

the code below does what you want.

import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# newline='' lets the csv module handle line endings itself (Python 3);
# the with block closes both files when done
with open(input_filename, newline='') as infile, \
        open(output_filename, 'w', newline='') as outfile:
    # Instantiate a CSV reader; check that the delimiter matches your file
    reader = csv.reader(infile, delimiter=',')

    # Get the first row (assuming this row contains the header).
    # In Python 3 this is next(reader); reader.next() exists only in Python 2.
    input_header = next(reader)

    # Select the columns you want to keep by storing their indices
    # (the substring test is case-sensitive)
    columns_to_keep = []
    for i, name in enumerate(input_header):
        if 'net' in name:
            columns_to_keep.append(i)

    # Create a CSV writer to store the columns you want to keep
    writer = csv.writer(outfile, delimiter=',')

    # Construct the header of the output file
    output_header = []
    for column_index in columns_to_keep:
        output_header.append(input_header[column_index])

    # Write the header to the output file
    writer.writerow(output_header)

    # Iterate over the remainder of the input file, construct a row
    # with the columns you want to keep and write this row to the output file
    for row in reader:
        new_row = []
        for column_index in columns_to_keep:
            new_row.append(row[column_index])
        writer.writerow(new_row)

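Note that there is no error handling. At least two cases should be handled. The first is checking that the input file exists (hint: look at the functionality provided by the os and os.path modules). The second is handling empty lines and lines with an inconsistent number of columns.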
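The two checks mentioned above could look roughly like the sketch below. It is one possible policy, not the only one: the function name extract_columns, the choice to skip malformed rows (rather than abort), and returning the skipped line numbers are all assumptions made for illustration.

```python
import csv
import os


def extract_columns(input_filename, output_filename, keyword='net'):
    """Copy columns whose header contains `keyword`; return skipped line numbers."""
    # Check 1: fail early with a clear message if the input file is missing
    if not os.path.isfile(input_filename):
        raise FileNotFoundError('Input file not found: ' + input_filename)

    skipped = []
    with open(input_filename, newline='') as infile, \
            open(output_filename, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        header = next(reader, None)
        if header is None:
            raise ValueError('Input file is empty')

        keep = [i for i, name in enumerate(header) if keyword in name]
        writer.writerow([header[i] for i in keep])

        for row in reader:
            # Check 2: empty lines come back as [], and rows with the wrong
            # number of fields cannot be mapped onto the header reliably,
            # so both are skipped and their line numbers recorded.
            if len(row) != len(header):
                skipped.append(reader.line_num)
                continue
            writer.writerow([row[i] for i in keep])
    return skipped
```

Calling `extract_columns('input.csv', 'output.csv')` returns the list of input line numbers that were skipped, which makes it easy to report or inspect the problem rows afterwards.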

Answer 3

You can use the pandas filter function to select columns based on a regular expression:

data_filtered = data.filter(regex='net')
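One detail worth illustrating: the regular expression is matched against the column labels and is case-sensitive by default, so 'net' and 'Net' select different columns. The small DataFrame below is made-up sample data, used only to show the difference.

```python
import pandas as pd

# Hypothetical sample data standing in for the real file
data = pd.DataFrame({'A_Net': [1, 2], 'B': [3, 4], 'c_net': [5, 6]})

# Case-sensitive regex search on the column labels
only_capitalized = data.filter(regex='Net')

# An inline (?i) flag makes the search case-insensitive
any_case = data.filter(regex='(?i)net')

# like= performs a plain substring match, with no regex involved
substring = data.filter(like='Net')
```

Here only_capitalized and substring contain just the A_Net column, while any_case contains both A_Net and c_net.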

Tags: python, extraction, text-files