Can I group files by date and ID and diff them?

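To be precise, I have a bunch of files in directory 698. Each file name contains a date, a unique ID, and a name. My script starts with these imports:

import pandas as pd
from pandas import Series, DataFrame
from collections import defaultdict
import numpy as np
import csv
import os
import re

The files are named like this: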

20151231_7801_Test_Maps.txt
20151231_7801_Test_Items.txt
20151231_7802_Test_Maps.txt
20151231_7802_Test_Items.txt

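I want to group them by date and identifier, open each file in a group (Maps and Items), and diff certain IDs between the two files. How can I do that?

So far my code looks like this, but I don't know how to iterate over the groups and open each file in each group: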

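groups = defaultdict(list)  # needs: from collections import defaultdict
for filename in os.listdir(r'F:\Desktop'):  # raw string keeps the backslash literal
    date = filename[:8]
    # Fixed offsets assume one exact name layout; in the sample names above
    # the ID sits at filename[9:13].
    identifier = filename[10:14]
    groups[date, identifier].append(filename)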

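My output prints some of the groups correctly, but not all of them. For example:

('20151231', '7801') ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']

Some groups print only one file, even though there are two files for that date and identifier.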
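
One thing worth checking is the fixed slice offsets: for the sample names above, filename[10:14] gives '801_' rather than '7801'. A minimal sketch, assuming every name follows the date_id_rest pattern of the samples, splits on underscores instead of slicing. The helper name group_by_date_and_id is hypothetical, not part of any library.

```python
from collections import defaultdict


def group_by_date_and_id(filenames):
    """Group names shaped like <date>_<id>_<rest> by their (date, id) prefix."""
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split('_', 2)  # at most three pieces: date, id, rest
        if len(parts) < 3:
            continue  # skip names that do not follow the assumed pattern
        date, identifier = parts[0], parts[1]
        groups[date, identifier].append(name)
    return dict(groups)


# The four sample names from the question.
sample = [
    '20151231_7801_Test_Maps.txt',
    '20151231_7801_Test_Items.txt',
    '20151231_7802_Test_Maps.txt',
    '20151231_7802_Test_Items.txt',
]
groups = group_by_date_and_id(sample)
for key, names in sorted(groups.items()):
    print(key, names)
```

Splitting does not depend on how wide the date or ID is, so it keeps working if the real names have a different prefix length, as long as the underscore-separated order stays the same.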

That is not my main concern, though. Once the files are split into groups, I want to load each file in a group into a DataFrame, like this:

for key in groups:
    # file1 and file2 stand for the group's Maps and Items files;
    # choosing them per group is the step I am missing
    maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # IDs that appear in maps but not in items
    set(maps.ID).difference(items.ID)
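
Note that set.difference is one-directional: it reports IDs present in the first set and absent from the second, and nothing about the reverse. A small sketch with made-up IDs (the values below are placeholders, not data from the real files) shows how it compares with symmetric_difference:

```python
# Made-up IDs standing in for maps.ID and items.ID.
maps_ids = {'A1', 'B2', 'C3'}
items_ids = {'B2', 'C3', 'D4'}

only_in_maps = maps_ids.difference(items_ids)              # in maps, not in items
only_in_items = items_ids.difference(maps_ids)             # in items, not in maps
in_exactly_one = maps_ids.symmetric_difference(items_ids)  # union of the two above

print(sorted(only_in_maps))    # ['A1']
print(sorted(only_in_items))   # ['D4']
print(sorted(in_exactly_one))  # ['A1', 'D4']
```

If the goal is to catch IDs missing from either file, the symmetric version is probably the one to use.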

Can someone help me group the files by date and ID and then open them group by group? Thanks!

I got some help from there and came up with this:

import os
from collections import defaultdict

import pandas as pd

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1', 'DiffID2'))

# Raw string keeps the backslashes literal; os.path.join adds the separator below.
pathloc = r'C:\Users\shmathew\Desktop\Sample\abc'
groups = defaultdict(list)
# Sorted so that, for names like the samples, Items comes before Maps in each group.
for filename in sorted(os.listdir(pathloc)):
    date = filename[:8]
    identifier = filename[10:14]
    groups[date, identifier].append(filename)

for key, filenames in groups.items():
    # Assumes exactly two files per group: filenames[0] is Items, filenames[1] is Maps.
    maps = pd.read_csv(os.path.join(pathloc, filenames[1]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(os.path.join(pathloc, filenames[0]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    df = pd.concat([maps, items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    # Keep the index of every ID that occurs exactly once across both files.
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

    ids = df.reindex(idx)
    row = list(filenames)
    row.extend(list(ids['ID']))

    print(row)
    # difference.append(row) had no effect here: it returned a new frame that was
    # discarded (and DataFrame.append was removed in pandas 2.0), so `difference`
    # stays empty.
print(difference)

Output:

['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1    ', '00432xx0TRNY1    ']
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']
Empty DataFrame
Columns: [Filename1, Filename2, DiffID1, DiffID2]
Index: []

Based on 四条's answer, I found a good way to do this.

groups = defaultdict(list)
output = []

# Sorted so the order of files within each group is deterministic.
for filename in sorted(os.listdir(pathloc)):
    date = filename[:8]
    identifier = filename[14:18]  # offsets depend on the actual file-name layout
    groups[date, identifier].append(filename)

for key, fnames in groups.items():
    filedicts = {}
    print(list(fnames))
    # Assumes exactly two files per group: fnames[0] is Items, fnames[1] is Maps.
    maps = pd.read_csv(os.path.join(pathloc, fnames[1]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(os.path.join(pathloc, fnames[0]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # IDs that appear in exactly one of the two files
    diffs = set(maps.ID).symmetric_difference(items.ID)

    filedicts['FileIDKey'] = list(key)
    filedicts['Missing_IDs'] = list(diffs)
    filedicts['FileNames'] = fnames

    output.append(filedicts)

This lets me load that master list of dictionaries into a DataFrame:

new = pd.DataFrame(output)
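
For anyone who would rather not rely on the position of each file inside the group list, here is a rough end-to-end sketch that picks the Maps and Items files by the last part of their names. It assumes names shaped like the samples in the question (date_id_..._Maps.txt and date_id_..._Items.txt) and tab-separated files with an ID column. The function name diff_groups, the dtype=str choice (to keep leading zeros in IDs), and the demo file contents are my own assumptions, not something from the original code.

```python
import os
import tempfile
from collections import defaultdict

import pandas as pd


def diff_groups(pathloc):
    """Return one row per (date, id) group listing IDs found in only one file."""
    groups = defaultdict(dict)
    for filename in sorted(os.listdir(pathloc)):
        stem, extension = os.path.splitext(filename)
        parts = stem.split('_')
        if extension != '.txt' or len(parts) < 3:
            continue  # ignore files outside the assumed naming pattern
        date, identifier, kind = parts[0], parts[1], parts[-1]
        groups[date, identifier][kind] = filename

    output = []
    for key, by_kind in sorted(groups.items()):
        if 'Maps' not in by_kind or 'Items' not in by_kind:
            continue  # incomplete group: nothing to compare
        maps = pd.read_csv(os.path.join(pathloc, by_kind['Maps']),
                           sep='\t', usecols=['ID'], dtype=str)
        items = pd.read_csv(os.path.join(pathloc, by_kind['Items']),
                            sep='\t', usecols=['ID'], dtype=str)
        diffs = set(maps.ID).symmetric_difference(items.ID)
        output.append({
            'FileIDKey': list(key),
            'Missing_IDs': sorted(diffs),
            'FileNames': [by_kind['Maps'], by_kind['Items']],
        })
    return pd.DataFrame(output)


# Demo on throwaway files with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        '20151231_7801_Test_Maps.txt': 'ID\tName\nA1\tx\nB2\ty\n',
        '20151231_7801_Test_Items.txt': 'ID\tName\nB2\ty\nC3\tz\n',
    }
    for name, text in contents.items():
        with open(os.path.join(tmp, name), 'w') as handle:
            handle.write(text)
    result = diff_groups(tmp)

print(result)
```

Groups that lack one of the two files are skipped here rather than raising an IndexError; whether that is the right behaviour depends on what you want to happen with incomplete groups.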