Pandas groupby is not grouping properly?
My dataset looks like this:
              NUM   80000  80001  80002  80003   80010  80011  80013  80023
CUSTOM_SITES  NAME  CC     DD     EE     FF      GG     HH     JJ     KK
              X     0      0      0      181621  0      0      809    67
              Y     0      0      0      1885    0      0      17     0
a             Z     0      0      0      43      0      0      0      0
a             T     0      0      0      324     0      0      2      0
a             W     0      0      0      336     0      0      8      0
a             F     0      0      0      21      0      0      0      0
a             P     0      0      0      253     0      0      0      0
a             D     0      0      0      163     0      0      4      0
a             C     0      0      0      122     0      0      2      0
a             D     0      0      0      122     0      0      1      0
a             PPPP  0      0      0      61      0      0      0      0
a             NN    0      0      0      440     0      0      0      0
              EE    0      0      0      45530   0      0      166    6
E             RR    0      0      0      1726    0      0      4      0
S             KKKK  0      0      0      2398    0      0      4      0
SI            QQQ   0      0      0      286     0      0      0      0
              AAA   0      0      0      13425   0      0      13     1
              DDD   0      0      0      11566   0      0      11     0
C             WWWW  0      0      0      808     0      0      2      0
C             NNN   0      0      0      50      0      0      0      0
C             GGGG  0      0      0      633     0      0      1      0
Output of df.to_dict():
{'Unnamed: 0': {0: 'CUSTOM_SITES', 1: nan, 2: nan, 3: 'a', 4: 'a', 5: 'a', 6: 'a', 7: 'a', 8: 'a', 9: 'a', 10: 'a', 11: 'a', 12: 'a', 13: nan, 14: 'E', 15: 'S', 16: 'SI', 17: nan, 18: nan, 19: 'C', 20: 'C', 21: 'C'}, 'NUM': {0: 'NAME', 1: 'X', 2: 'Y', 3: 'Z', 4: 'T', 5: 'W', 6: 'F', 7: 'P', 8: 'D', 9: 'C', 10: 'D', 11: 'PPPP', 12: 'NN', 13: 'EE', 14: 'RR', 15: 'KKKK', 16: 'QQQ', 17: 'AAA', 18: 'DDD', 19: 'WWWW', 20: 'NNN', 21: 'GGGG'}, '80000': {0: 'CC', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80001': {0: 'DD', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80002': {0: 'EE', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80003': {0: 'FF', 1: '181621', 2: '1885', 3: '43', 4: '324', 5: '336', 6: '21', 7: '253', 8: '163', 9: '122', 10: '122', 11: '61', 12: '440', 13: '45530', 14: '1726', 15: '2398', 16: '286', 17: '13425', 18: '11566', 19: '808', 20: '50', 21: '633'}, '80010': {0: 'GG', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80011': {0: 'HH', 1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 11: '0', 12: '0', 13: '0', 14: '0', 15: '0', 16: '0', 17: '0', 18: '0', 19: '0', 20: '0', 21: '0'}, '80013': {0: 'JJ', 1: '809', 2: '17', 3: '0', 4: '2', 5: '8', 6: '0', 7: '0', 8: '4', 9: '2', 10: '1', 11: '0', 12: '0', 13: '166', 14: '4', 15: '4', 16: '0', 17: '13', 18: '11', 19: '2', 20: '0', 21: '1'}, '80023': {0: 'KK', 1: '67', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: '0', 10: '0', 
11: '0', 12: '0', 13: '6', 14: '0', 15: '0', 16: '0', 17: '1', 18: '0', 19: '0', 20: '0', 21: '0'}}
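The first step in my code is to drop the first row, then rename the columns of df using the second row, and then group by the 'CUSTOM_SITES' column. Here is the code: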
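import os
import pandas as pd

dirpath = "..."
df = pd.read_table("...")
# row 0 of df is the second line of the file and holds the real column labels
header = df.iloc[0]
df = df[1:]
df = df.rename(columns=header)
df = df.reset_index(drop=True)
# group by site and write the result; this is the step that does not behave as expected
df.groupby("CUSTOM_SITES", sort=False).sum().to_csv(os.path.join(dirpath, 'collapsed_sites_out.txt'), sep='\t', encoding='utf-8', quoting=0, index=True)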
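So the problem is that groupby does not group by the custom sites and gives me only a single column as output. My output should be the collapsed custom sites as rows, with 80000 ... 80023 as columns. Please help!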
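The df.to_dict() output above shows every value stored as a string (for example '181621'), which is a plausible reason the sum does not come out as numbers. The sketch below is only an illustration: the small `raw` frame is a hand-built assumption that mimics three rows of the data, not the real file. It checks whether the FF column is numeric before and after pd.to_numeric.

```python
import pandas as pd

# Hypothetical miniature of the frame returned by read_table: the second header
# row ('CUSTOM_SITES', 'NAME', 'FF') arrives as data, so the numbers stay strings.
raw = pd.DataFrame({
    "Unnamed: 0": ["CUSTOM_SITES", "a", "a", "C"],
    "NUM": ["NAME", "Z", "T", "WWWW"],
    "80003": ["FF", "43", "324", "808"],
})

# Same promotion step as in the question: row 0 becomes the column labels.
header = raw.iloc[0]
df = raw[1:].rename(columns=header).reset_index(drop=True)

is_numeric_before = pd.api.types.is_numeric_dtype(df["FF"])

# Converting explicitly makes the column summable as numbers.
df["FF"] = pd.to_numeric(df["FF"], errors="coerce")
is_numeric_after = pd.api.types.is_numeric_dtype(df["FF"])

collapsed = df.groupby("CUSTOM_SITES", sort=False)["FF"].sum()
print(is_numeric_before, is_numeric_after)
print(collapsed)
```

If the first flag prints False on the real data as well, the grouping itself is probably fine and only the column types need fixing.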
Solution to the above problem: convert the value columns to numbers with pd.to_numeric before grouping.
import os
import pandas as pd

dirpath = "..."
df = pd.read_table("...")
# extract row 0 of df (the second row of the file, with the histo names)
header = df.iloc[0]
# drop that label row, keeping all remaining rows and all columns
df = df[1:]
# rename the columns using the extracted labels
df = df.rename(columns=header)
# the following three steps collapse the rows into the intended sites:
# move the two text columns into the index so only value columns remain
df = df.set_index(['CUSTOM_SITES', 'NAME'])
# the values were read as strings, so convert them to numbers
df = df.apply(pd.to_numeric, errors='coerce')
print(df.head(5))
# sum per site in original order; numeric_only=True leaves the text column NAME out
df = df.reset_index().groupby('CUSTOM_SITES', sort=False).sum(numeric_only=True)
# write the DataFrame to file; keep index=True so the site names are written as row names
df.to_csv(os.path.join(dirpath, 'out.txt'), sep='\t', encoding='utf-8', quoting=0, index=True)
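A possible shortcut, sketched here under assumptions rather than taken from the original post: passing header=1 to pd.read_table makes pandas skip the first line and use the second line as column labels, so the value columns should be parsed as numbers without a separate conversion. The inline `text` below is a hypothetical tab-separated excerpt standing in for the real input file, and `value_cols` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical tab-separated excerpt standing in for the real input file.
text = (
    "\tNUM\t80000\t80003\n"
    "CUSTOM_SITES\tNAME\tCC\tFF\n"
    "\tX\t0\t181621\n"
    "a\tZ\t0\t43\n"
    "a\tT\t0\t324\n"
    "C\tWWWW\t0\t808\n"
    "C\tNNN\t0\t50\n"
)

# header=1 skips the first line and uses the second line as column labels,
# so the value columns are parsed as integers directly.
df = pd.read_table(io.StringIO(text), header=1)

# Sum only the value columns per site, keeping the order of first appearance.
value_cols = ["CC", "FF"]
collapsed = df.groupby("CUSTOM_SITES", sort=False)[value_cols].sum()
print(collapsed)
```

Rows with an empty CUSTOM_SITES field (such as X here) are left out by groupby by default; in pandas 1.1 and later, dropna=False keeps them as their own group if that is wanted.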