Can I group files by date and ID and do a diff on them?
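I have a bunch of files in a directory, 698 to be exact. Each file name contains a date, a unique ID, and a name, like this:
20151231_7801_Test_Maps.txt
20151231_7801_Test_Items.txt
20151231_7802_Test_Maps.txt
20151231_7802_Test_Items.txt
I want to group them by date and identifier, open each file (Maps and Items), and diff certain IDs in the files. How can I do that?
So far my code looks like this, but I don't know how to iterate over each group and open every file in it:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import csv
import os
import re
from collections import defaultdict

groups = defaultdict(list)
# raw string so the backslash in the Windows path is taken literally
for filename in os.listdir(r'F:\Desktop'):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    # key each file on its (date, identifier) pair
    groups[date, identifier].append(filename)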
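My output prints some of the groups correctly, but not all of them. For example:
('20151231', '7801') ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']
Some groups print only one file, even though there are two files for that date and identifier.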
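One possible cause, judging only from the sample names above: the ID occupies positions 9 to 12, so `filename[10:14]` yields `'801_'` rather than `'7801'`, and any name with a slightly different layout would land under a different key. A minimal sketch that splits on underscores instead of using fixed positions, assuming every name follows a `<date>_<id>_<rest>` pattern (the function name `group_by_date_and_id` is made up for illustration):

```python
from collections import defaultdict

def group_by_date_and_id(filenames):
    """Group names shaped like <date>_<id>_<rest> by their (date, id) prefix."""
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split('_', 2)  # at most three pieces: date, id, rest
        if len(parts) < 3:
            continue  # skip names that do not follow the assumed pattern
        date, identifier, _rest = parts
        groups[date, identifier].append(name)
    return groups

names = [
    '20151231_7801_Test_Maps.txt',
    '20151231_7801_Test_Items.txt',
    '20151231_7802_Test_Maps.txt',
    '20151231_7802_Test_Items.txt',
]
groups = group_by_date_and_id(names)
for key, members in sorted(groups.items()):
    print(key, members)
```

Splitting this way keeps the key independent of how many characters each field happens to have, at the cost of silently skipping names that do not match the pattern.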
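That isn't my main concern, though. Once the files are broken into groups, I'd like to load each file in a group into a DataFrame, like so:
for key in groups:
    # file1 and file2 stand for the two files of the group; how to obtain them is the open question
    maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    # checks IDs between the two files and looks for differences
    set(maps.ID).difference(items.ID)
Can anyone help with grouping the files by date and ID and then opening them group by group? Thanks!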
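For the iteration itself, one option is to store each group as a small dictionary keyed by file kind (Maps or Items), so the loop does not depend on the order in which `os.listdir` returns names, which is arbitrary. The sketch below is only an illustration under assumptions: names follow `<date>_<id>_..._<kind>.txt`, each file is tab-separated with an `ID` column, and the standard `csv` module is used instead of pandas to keep it dependency-free. The helper names `read_ids` and `diff_groups` and the demo IDs are made up.

```python
import csv
import os
import tempfile
from collections import defaultdict

def read_ids(path):
    """Return the set of values in the tab-separated file's 'ID' column."""
    with open(path, newline='') as handle:
        return {row['ID'] for row in csv.DictReader(handle, delimiter='\t')}

def diff_groups(folder):
    """Map each (date, id) key to the IDs found in Maps but not in Items."""
    groups = defaultdict(dict)
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        parts = stem.split('_')
        if ext != '.txt' or len(parts) < 3:
            continue  # ignore files outside the assumed naming pattern
        # Key on date and id; remember each file by its kind (Maps or Items).
        groups[parts[0], parts[1]][parts[-1]] = os.path.join(folder, name)
    result = {}
    for key, files in groups.items():
        if 'Maps' in files and 'Items' in files:
            result[key] = read_ids(files['Maps']) - read_ids(files['Items'])
    return result

# Demo on throwaway files with made-up IDs.
with tempfile.TemporaryDirectory() as folder:
    samples = {
        '20151231_7801_Test_Maps.txt': ['A1', 'A2'],
        '20151231_7801_Test_Items.txt': ['A1', 'A2'],
        '20151231_7802_Test_Maps.txt': ['B1', 'B2'],
        '20151231_7802_Test_Items.txt': ['B1'],
    }
    for name, ids in samples.items():
        with open(os.path.join(folder, name), 'w', newline='') as handle:
            handle.write('ID\n' + '\n'.join(ids) + '\n')
    diffs = diff_groups(folder)

print(diffs)
```

Swapping `read_ids` for a `pd.read_csv(..., usecols=['ID'])` call should work the same way; groups that lack either a Maps or an Items file are skipped rather than raising an error.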
I got some help from there and came up with this:
import os
import pandas as pd
from collections import defaultdict

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1', 'DiffID2'))
# backslashes are doubled so the trailing one does not escape the closing quote
pathloc = 'C:\\Users\\shmathew\\Desktop\\Sample\\abc\\'
groups = defaultdict(list)
for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)
for key, filenames in groups.items():
    # print(" processing following files")
    # print(filenames)
    maps = pd.read_csv(pathloc + filenames[1], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    Items = pd.read_csv(pathloc + filenames[0], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    df = pd.concat([maps, Items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    # keep the index of every ID that occurs exactly once, i.e. in only one of the two files
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    # print("\n\n Difference \n\n")
    ids = df.reindex(idx)
    row = list(filenames)
    row.extend(list(ids['ID']))
    print(row)
    # DataFrame.append returns a new frame instead of modifying `difference`,
    # so the result is discarded here (the method was also removed in pandas 2.0)
    difference.append(row)
print(difference)
Output:
['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1 ', '00432xx0TRNY1 ']
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']
Empty DataFrame
Columns: [Filename1, Filename2, DiffID1, DiffID2]
Index: []
Based on 四条's answer, I found a good approach.
groups = defaultdict(list)
output = []
for filename in os.listdir(pathloc):
    date = filename[:8]
    # the slice has to match where the ID sits in the actual file names
    identifier = filename[14:18]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)
for key, fnames in groups.items():
    filedicts = {}
    print(list(fnames))
    maps = pd.read_csv(pathloc + fnames[1], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(pathloc + fnames[0], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    # IDs that appear in exactly one of the two files
    diffs = set(maps.ID).symmetric_difference(items.ID)
    # one dictionary per group, collected into a list
    filedicts['FileIDKey'] = list(key)
    filedicts['Missing_IDs'] = list(diffs)
    filedicts['FileNames'] = fnames
    output.append(filedicts)
This let me turn the master list of dictionaries into a DataFrame:
new = pd.DataFrame(output)
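A note on the set operation, since the question used `difference` and this version uses `symmetric_difference`: the former reports only IDs missing from the second file, while the latter reports IDs missing from either one. A tiny sketch with made-up IDs:

```python
maps_ids = {'A1', 'A2', 'A3'}
items_ids = {'A2', 'A3', 'A4'}

# IDs present in Maps but absent from Items
only_in_maps = maps_ids.difference(items_ids)
# IDs present in exactly one of the two sets
either_side = maps_ids.symmetric_difference(items_ids)

print(sorted(only_in_maps))  # ['A1']
print(sorted(either_side))   # ['A1', 'A4']
```

Which one fits depends on whether a missing ID on either side counts as a discrepancy; the `Missing_IDs` column above treats both directions the same.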