Pulling out data from CSV files' specific columns in Python
I need some quick help reading a CSV file with Python and storing it in a 'data-type' file, so that after all the data has been stored in separate files I can use it to plot graphs.
I have searched, but in every case I found the data had a header. My data has no header section, it is tab-delimited, and I only need to store specific columns of it. For example:
12345601 2345678@abcdef 1 2 365 places
In this case, for example, I would only want to store "2345678@abcdef" and "365" in the new python file, so I can use them later to create a graph.
Also, I have more than one csv file in a folder and I need to do this for each of them. The material I found does not cover that; it only mentions:
# open csv file
with open(csv_file, 'rb') as csvfile:
Can anyone point me to an already-answered question or help me solve this?
. . . and storing it in a PY file to use the data to graph after storing all the data in different files . . .
. . . I would want to store only "2345678@abcdef" and "365" in the new python file . . .
Are you sure you want to store the data in a python file? Python files are supposed to contain python code, and they should be executable by the python interpreter. It would be better to store your data in a data-type file (e.g. preprocessed_data.csv).
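Since the end goal is plotting, a data file like that is easy to load back later. A minimal sketch of that step (assuming matplotlib is installed and that preprocessed_data.csv holds "label,value" rows like the ones produced below):

import csv
import matplotlib.pyplot as plt

labels, values = [], []
with open('preprocessed_data.csv', 'r') as f:
    for row in csv.reader(f):
        labels.append(row[0])       # e.g. 2345678@abcdef
        values.append(int(row[1]))  # e.g. 365

plt.bar(labels, values)
plt.show()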
To get a list of files matching a pattern, you can use python's built-in glob library.
Here is an example of how you could read several csv files in a directory and extract the desired columns from each of them:
import glob

# indices of columns you want to preserve
desired_columns = [1, 4]

# change this to the directory that holds your data files
csv_directory = '/path/to/csv/files/*.csv'

# iterate over files holding data
extracted_data = []
for file_name in glob.glob(csv_directory):
    with open(file_name, 'r') as data_file:
        while True:
            line = data_file.readline()

            # stop at the end of the file
            if len(line) == 0:
                break

            # split the line by whitespace
            tokens = line.split()

            # only grab the columns we care about
            desired_data = [tokens[i] for i in desired_columns]
            extracted_data.append(desired_data)
Writing the extracted data to a new file is easy. The following example shows how to save the data to a csv file.
output_string = ''
for row in extracted_data:
    output_string += ','.join(row) + '\n'

with open('./preprocessed_data.csv', 'w') as csv_file:
    csv_file.write(output_string)
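Building the output by string concatenation works, but the csv module can do the same job and handles quoting for you. A small sketch of that variant, reusing the extracted_data list from above:

import csv

# equivalent output using csv.writer instead of manual string building
with open('./preprocessed_data.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(extracted_data)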
Edit:
If you don't want to combine all of the csv files, here is a version that processes them one at a time:
def process_file(input_path, output_path, selected_columns):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        while True:
            line = in_file.readline()
            if len(line) == 0: break
            tokens = line.split()
            extracted_data.append([tokens[i] for i in selected_columns])

    output_string = ''
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)
# whenever you need to process a file:
process_file(
    '/path/to/input.csv',
    '/path/to/processed/output.csv',
    [1, 4])

# if you want to process every file in a directory:
target_directory = '/path/to/my/files/*.csv'
for file in glob.glob(target_directory):
    process_file(file, file + '.out', [1, 4])
Edit 2:
The following example processes every file in a directory and writes the results to similarly-named output files in another directory:
import os
import glob

input_directory = '/path/to/my/files/*.csv'
output_directory = '/path/to/output'

for file in glob.glob(input_directory):
    file_name = os.path.basename(file) + '.out'
    out_file = os.path.join(output_directory, file_name)
    process_file(file, out_file, [1, 4])
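Note that the open() inside process_file will fail if /path/to/output does not exist yet. One way to guard against that (a small addition of my own, run before the loop above) is:

# create the output directory if it is missing (no error if it already exists)
os.makedirs(output_directory, exist_ok=True)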
If you want to add headers to the output, process_file can be modified like this:
def process_file(input_path, output_path, selected_columns, column_headers=[]):
    extracted_data = []
    with open(input_path, 'r') as in_file:
        while True:
            line = in_file.readline()
            if len(line) == 0: break
            tokens = line.split()
            extracted_data.append([tokens[i] for i in selected_columns])

    output_string = ','.join(column_headers) + '\n'
    for row in extracted_data:
        output_string += ','.join(row) + '\n'

    with open(output_path, 'w') as out_file:
        out_file.write(output_string)
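It could then be called like this (the header names here are placeholders for illustration, not taken from your data):

# hypothetical header names purely for illustration
process_file(
    '/path/to/input.csv',
    '/path/to/processed/output.csv',
    [1, 4],
    column_headers=['email', 'days'])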
Here is a different approach that uses namedtuple to help extract selected fields from a csv file and then lets you write them out to a new csv file.
from collections import namedtuple
import csv

# Set up a named tuple to receive the csv data
# p1 to p6 are arbitrary field names associated with the csv file's columns
SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

# Read data from the csv file and create a generator object to hold a reference to the data
# We use a generator object rather than a list to reduce the amount of memory our program will use
# The captured data will only contain the 2nd & 5th columns of the csv file
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))

# Write the data to a new csv file
with open("newdata.csv", "w", newline='') as csvfile:
    cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Use the generator created earlier to access the filtered data and write it out to a new csv file
    for d in datagen:
        cvswriter.writerow(d)
"mydata.csv"中的原始数据:
12345601,2345678@abcdef,1,2,365,places
4567,876@def,0,5,200,noplaces
在"newdata.csv"中输出数据:
2345678@abcdef,365
876@def,200
Edit 1:
For tab-delimited data, make the following changes to the code:
Change
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata.csv", "r"))))
to
datagen = ((d.p2, d.p5) for d in map(SomeData._make, csv.reader(open("mydata2.csv", "r"), delimiter='\t', quotechar='"')))
and
cvswriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
to
cvswriter = csv.writer(csvfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
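Putting those two changes together, a complete tab-delimited version might look like this (a sketch, assuming the tab-separated input file is named "mydata.csv" and has six columns like the sample line in the question):

from collections import namedtuple
import csv

# assumed: "mydata.csv" is tab-separated with six columns per row
SomeData = namedtuple('SomeData', 'p1, p2, p3, p4, p5, p6')

with open("mydata.csv", "r", newline='') as infile, \
        open("newdata.csv", "w", newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t', quotechar='"')
    writer = csv.writer(outfile, delimiter='\t', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    # keep only the 2nd and 5th columns of each row
    for d in map(SomeData._make, reader):
        writer.writerow((d.p2, d.p5))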