Iterate through multiple CSVs checking for an integer value in each file

Question

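I'm new to Python and could use any help I can get. I'm using Python 3.5 (Anaconda) on a Win7 machine.

I'm trying to iterate through multiple CSV files (10k+) in a folder, checking whether each file contains any value above a predefined threshold.

I want to build a dictionary, or a list/tuple (basically whatever is most like an SQL table), that uses a substring of the file name as a unique identifier in a name field, with another column holding the total number of files that have values exceeding the given threshold.

I'm not expecting any of you to do this for me, since it's good practice, but I'd appreciate any suggestions for modules that might make this easier.

I've been able to check one file for a value, but I'm only about 10 minutes into this task, and I'm not sure how I would go about iterating over multiple files, building the table, and so on. Thanks!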

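import os
import numpy as np

path = r'C:\path'  # raw string so the backslash is kept literally
file = 'file.csv'
# genfromtxt opens the file itself, so a separate open() is not needed
my_data = np.genfromtxt(os.path.join(path, file), delimiter=",")
# test the whole array once rather than once per row
if -1 in my_data:
    print("it sure is")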

Answer 1

If all the files are in one folder, you can use glob to step through them all. Then use something like csv to test whether the value is present:

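import csv
import glob

tgt_value_as_string = '-1'  # the value to look for, written as it appears in the file

found = []
for fn in glob.glob(r'c:\path\*.csv'):
    with open(fn, newline='') as f:
        for row in csv.reader(f):
            if tgt_value_as_string in row:
                found.append(fn)
                break  # one hit is enough; move on to the next file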

Something like that...
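
The loop above tests for an exact string match. Since the question asks about values above a threshold, here is a minimal sketch of the same glob + csv idea with a numeric comparison. The names `files_exceeding` and `exceeds` are made up for illustration, and skipping non-numeric cells is an assumption about how the data should be treated.

```python
import csv
import glob
import os


def exceeds(cell, threshold):
    """True if the cell parses as a number greater than threshold."""
    try:
        return float(cell) > threshold
    except ValueError:
        # Assumption: non-numeric cells (headers, blanks) are simply skipped.
        return False


def files_exceeding(folder, threshold):
    """Return the sorted base names of CSV files with any value > threshold."""
    found = []
    for fn in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        with open(fn, newline='') as f:
            # any() stops reading the file at the first hit, like the break above.
            if any(exceeds(cell, threshold)
                   for row in csv.reader(f) for cell in row):
                found.append(os.path.basename(fn))
    return found
```

Calling `files_exceeding(r'c:\path', 100)` would then give the list of file names to count or group further.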

Answer 2

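Since you asked about modules and possible usage, you might consider something like this (pseudocode):

import os
import sqlite3

for root, dirs, files in os.walk(folder):          # using the os module
    for file in files:
        if file == something_you_want_to_parse:    # i.e. *.csv
            with open(os.path.join(root, file)) as f:
                for line in f:
                    if line_data == i_want_to_save_this:
                        insert data into sqlite table   # using the sqlite3 module

https://docs.python.org/3/library/os.html
https://docs.python.org/3.5/library/sqlite3.html

I tend to use an actual SQL database whenever possible.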
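
The outline above could be fleshed out roughly as follows. This is only a sketch under some assumptions: the table name `over_threshold`, its two columns, and the helper names `count_over` and `load_counts` are all invented for illustration, and non-numeric cells are skipped.

```python
import csv
import os
import sqlite3


def count_over(path, threshold):
    """Count the numeric cells in one CSV file that are greater than threshold."""
    total = 0
    with open(path, newline='') as f:
        for row in csv.reader(f):
            for cell in row:
                try:
                    if float(cell) > threshold:
                        total += 1
                except ValueError:
                    pass  # assumption: ignore cells that are not numbers
    return total


def load_counts(top, threshold, conn):
    """Walk `top` and store (relative file path, count) for files with hits."""
    conn.execute('CREATE TABLE IF NOT EXISTS over_threshold '
                 '(file TEXT PRIMARY KEY, n INTEGER)')
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith('.csv'):
                full = os.path.join(root, name)
                n = count_over(full, threshold)
                if n:
                    conn.execute(
                        'INSERT OR REPLACE INTO over_threshold VALUES (?, ?)',
                        (os.path.relpath(full, top), n))
    conn.commit()
```

With `conn = sqlite3.connect(':memory:')` (or a file path for a persistent database), the results can then be queried with ordinary SQL, for example `SELECT file, n FROM over_threshold ORDER BY file`.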

Answer 3

Here is a working pandas solution:

import glob
import os
import pandas as pd

all_files = glob.glob(r'd:/temp/csv/*.csv')

threshold = 100

data = []

for f in all_files:
    data.append([os.path.basename(f),
                (pd.read_csv(f, header=None) > threshold).sum().sum()])

df = pd.DataFrame(data, columns=['file','count'])

print(df)

# optionally save DataFrame to SQL table (`conn` - is a SQLAlchemy connection)
#df.to_sql('table_name', conn)

Output:

    file  count
0  1.csv      2
1  2.csv      3

Test data:

1.csv:

1,2,3,400
10,111,45,67

2.csv:

1,200,300,4
10,222,45,67

Update:

You can parse the first number out of the file name like this:

In [87]: import re

In [88]: f
Out[88]: '/path/to/touchscreen_data_123456_1456789456_178.16.66.3'

In [89]: re.sub(r'.*_\D+_(\d+)_\d+.*', r'\1', f)
Out[89]: '123456'
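
To get from per-file counts to the table the question describes (one row per identifier, with the number of files that exceed the threshold), the parsed identifier can be used as a grouping key. The following is a hedged sketch using only the standard library: `file_id` and `files_over_by_id` are hypothetical helper names, and the input is assumed to be (path, count) pairs such as the rows collected in `data` above.

```python
import re
from collections import Counter

# Same pattern as in the re.sub call above; group 1 is the first number.
ID_PATTERN = re.compile(r'.*_\D+_(\d+)_\d+.*')


def file_id(path):
    """Return the first number in the file name, or None if the name does not fit."""
    m = ID_PATTERN.match(path)
    return m.group(1) if m else None


def files_over_by_id(counts):
    """counts: iterable of (path, n_over) pairs.

    Returns a Counter mapping identifier -> number of files with n_over > 0.
    """
    tally = Counter()
    for path, n_over in counts:
        ident = file_id(path)
        if ident is not None and n_over > 0:
            tally[ident] += 1
    return tally
```

Files whose names do not match the pattern are left out of the tally here; whether they should instead be reported is a design choice that depends on the data.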

numpy

python-3.x

pandas

genfromtxt