How to concatenate a newly added file to a pandas DataFrame?

Question

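I'm trying to write a script that picks up newly added CSV files from a folder and appends them to one big file. Basically, I want every CSV file that gets added to a specific folder to end up stored in a single resulting CSV file. Below is the code that builds the list of files; that is where I pick out the newly added ones: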

import os
import time

def check_dir(fh,start_path='/Users/.../Desktop/files',new_cb=None,changed_cb=None):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):
                fs = os.path.getsize(fp)
                total_size += fs
                if f in fh:
                    if fh[f] == fs:
                        # file unchanged
                        pass
                    else:
                        if changed_cb:
                            changed_cb(fp)
                else:
                    #new file
                    if new_cb:
                        new_cb(fp)
                fh[f] = fs

    return total_size

def new_file(fp):
    print("New File {0}!".format(fp))

def changed_file(fp):
    print("File {0} changed!".format(fp))

if __name__ == '__main__':
    file_history={}
    total = 0

    while(True):
        nt = check_dir(file_history,'/Users/.../Desktop/files',new_file,changed_file)
        if total and nt != total:
            print("Total size changed from {0} to {1}".format(total,nt))
        # remember the latest size on every pass so the next comparison works
        total = nt
        time.sleep(200)
        print("File list:\n{0}".format(file_history))
        print(list(dict.keys(file_history))[-1])

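I really don't know how to create the empty pandas DataFrame that the most recently added file would be appended to periodically (that's why I have the time.sleep in there). In the end, I want one big CSV file that contains all of the files.

Please help :(

P.S. I'm new to Python, so please don't judge if this is super simple.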

Answer 1

I think pandas.concat() is what you are looking for.
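As a minimal sketch of that suggestion, assuming every CSV file has the same columns: read each file into its own DataFrame and hand the list to pandas.concat. The helper name concat_csvs is hypothetical and made up for illustration; it is not part of pandas.

```python
import pandas as pd


def concat_csvs(paths):
    """Read each CSV in `paths` and stack the rows into one DataFrame.

    concat_csvs is a hypothetical helper name used for illustration.
    """
    frames = [pd.read_csv(p) for p in paths]
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame
        return pd.DataFrame()
    # ignore_index=True renumbers the rows instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)
```

Calling to_csv(..., index=False) on the returned DataFrame would then write the combined rows to a single file.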

Answer 2

Do you plan to use pandas to process the data in the CSVs, or only to concatenate the files?

If you only want to append each CSV file to the big file, why not use plain Python I/O, which is faster and simpler? This assumes all of the CSV files use the same format.
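One way to check the same-format assumption before appending is to compare header rows. This is only a sketch using the standard csv module; read_header and same_header are hypothetical names, and matching headers do not guarantee matching delimiters, quoting, or encodings in the rest of the file.

```python
import csv


def read_header(path):
    """Return the first row of a CSV file as a list, or [] if the file is empty."""
    with open(path, newline="") as f:
        return next(csv.reader(f), [])


def same_header(path_a, path_b):
    """Return True when both files start with an identical header row."""
    return read_header(path_a) == read_header(path_b)
```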

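I have updated the new_file method to append to the big CSV using plain I/O. I also added an append_pandas function; it isn't used, but it should help if you have to use pandas for this job. I haven't tested the pandas function, and there are more things to consider, such as the format of the CSV files. See the documentation for more details.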

import os
import time

import pandas  # only needed by append_pandas


def check_dir(fh,start_path='/Users/.../Desktop/files',new_cb=None,changed_cb=None,**kwargs):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):
                fs = os.path.getsize(fp)
                total_size += fs
                if f in fh:
                    if fh[f] == fs:
                        # file unchanged
                        pass
                    else:
                        if changed_cb:
                            changed_cb(fp,**kwargs)
                else:
                    #new file
                    if new_cb:
                        new_cb(fp, **kwargs)
                fh[f] = fs

    return total_size

def is_csv(f):
    # you can add more to check here
    # match on the extension so paths that merely contain "csv" are not picked up
    return f.lower().endswith('.csv')

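def append_csv(s,d,skip_header=1):
    # append the lines of s to d, skipping the first skip_header lines of s
    last = ""
    with open(s,'r') as readcsv:
        with open(d,'a') as appendcsv:
            for line in readcsv:
                if skip_header < 1:
                    appendcsv.write(line)
                    last = line
                else:
                    skip_header -= 1

            # make sure the big file ends with a newline before the next append
            if last and not last.endswith("\n"):
                appendcsv.write("\n")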

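def append_pandas(s,d):
    # i haven't tested this
    # assumes d already exists and both files share the same columns
    new_df = pandas.read_csv(s)
    big_df = pandas.read_csv(d)
    # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
    combined = pandas.concat([big_df, new_df], ignore_index=True)
    combined.to_csv(d, index=False)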

def new_file(fp, **kwargs):
    if is_csv(fp):
        print("Appending {0}!".format(fp))
        bcsv = kwargs.get('append_to_csv','/default/path/to/big.csv')
        skip = kwargs.get('skip_header',1)
        append_csv(fp,bcsv,skip)

def changed_file(fp, **kwargs):
    print("File {0} changed!".format(fp))

if __name__ == '__main__':
    file_history={}
    total = 0

    while(True):
        nt = check_dir(file_history,'/tmp/test/',new_file,changed_file, append_to_csv ='/tmp/big.csv', skip_header = 1)
        if total and nt != total:
            print("Total size changed from {0} to {1}".format(total,nt))
        # remember the latest size on every pass so the next comparison works
        total = nt
        time.sleep(10)
        print("File list:\n{0}".format(file_history))
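If pandas is required, one alternative to re-reading the whole big file for every new file is to write with to_csv in append mode. The sketch below assumes the new file has the same columns, in the same order, as the big file; append_with_pandas is a hypothetical helper name and not something from the code above.

```python
import os

import pandas as pd


def append_with_pandas(src, dest):
    """Append the rows of CSV `src` to CSV `dest`, creating `dest` if needed.

    append_with_pandas is a hypothetical helper name used for illustration.
    """
    df = pd.read_csv(src)
    # Write the header only when dest does not exist yet or is still empty
    write_header = not os.path.exists(dest) or os.path.getsize(dest) == 0
    df.to_csv(dest, mode="a", header=write_header, index=False)
```

It could be called from new_file in place of append_csv. Unlike append_csv with skip_header=1, this version writes the header once, when the big file is first created.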
