KeyError: "None of [['', '']] are in the [columns]" (Pandas Dataframe)

Question

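I am trying to write a function that takes in a dataframe where some columns are of the same type and some are not. An example of the columns is: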

['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 
 't_energy1', 't_energy2']

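I am trying to produce two new dataframes, one with the columns that have no duplicates and one with only the duplicated columns, using the following code: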

# Function that takes in a dataframe and returns new dataframes with all the sub-dataframes

def sub_dataframes(dataframe):

    copy = dataframe.copy()                  # To avoid SettingWithCopyWarning

    # Iterate through all the columns of the df
    for (col_name, col_data) in copy.iteritems():

        temp = str(col_name)
        rest = copy.iloc[:, 1:]
        new_df = [[]]

        # If it's not a duplicate, we just add it to the new df
        if len(temp) < 6:
            new_df[temp] = copy[col_data]

        # If the length of the column name is greater than or equal to 6, we know it's a duplicate
        if len(temp) >= 6:
            stripped = temp.rstrip(temp[2:])

            # Second for-loop to check the next column
            for (col_name2, col_data2) in rest.iteritems():
                temp2 = str(col_name2)
                rest2 = rest.iloc[:, 1:]
                only_dups = [[]]

                if len(temp2) >= 6:
                    stripped2 = temp2.rstrip(temp2[2:])

                    # Compare the two column names (without the integer 0,1, or 2)
                    if stripped[:-1] == stripped2[:-1]:

                        # Create new df of the two columns
                        only_dups[stripped] = col_data
                        only_dups[stripped2] = col_data2

                        # Third for-loop to check the remaining columns
                        for (col_name3, col_data3) in rest2.iteritems():
                            temp3 = str(col_name3)

                            if len(temp3) >= 6:
                                stripped3 = temp3.rstrip(temp3[2:])

                                # Compare the two column names (without the integer 0,1, or 2)
                                if stripped2[:-1] == stripped3[:-1]:
                                    only_dups[stripped3] = col_data3

    print("Original:\n{}\nWithout duplicates:\n{}\nDuplicates:\n{}".format(copy, new_df, only_dups))


sub_dataframes(df)

When I run this code, I get this error:

KeyError: "None of [Int64Index([ 22352, 106534,  23608,   8655,  49670, 101988,   9136, 
141284,\n             28564,  14262,\n            ...\n             76690, 150965, 
143106, 142370,  68004,  33980, 110832,  14491,\n            123511,   6207],\n           
dtype='int64', length=2833)] are in the [columns]"

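I tried looking at other questions on Stack Overflow to see if I could fix the problem, but all I have learned so far is that I can't add columns the way I am doing now, with new_df[temp] = copy[col_data] or only_dups[stripped] = col_data, and I can't seem to figure out how to create the new columns correctly. How can I add new columns based on the variables I have now? Is that even possible, or do I have to rewrite the code so that it doesn't have so many for-loops?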

Edit

An example of the output I want is:

Original:
        id    t_dur0    t_dur1    t_dur2    ... 
0      22352  292720  293760.0  292733.0  
1     106534  213760  181000.0  245973.0 
2      23608  157124  130446.0  152450.0  
3       8655  127896  176351.0  166968.0  
4      49670  210320  226253.0  211880.0  
...      ...     ...       ...       ...

Without duplicates:
        id  
0      22352
1     106534
2      23608
3       8655
4      49670
...      ..

Duplicates: 
      t_dur0  t_dur1    t_dur2   
0     292720  293760.0  292733.0  
1     213760  181000.0  245973.0 
2     157124  130446.0  152450.0  
3     127896  176351.0  166968.0  
4     210320  226253.0  211880.0  
...      ...     ...       ...

Answer 1

IIUC:

def sub_dataframes(dataframe):
  # extract common prefix -> remove trailing digits
  cols = dataframe.columns.str.replace(r'\d*$', '', regex=True) \
                  .to_series().value_counts()

  # split columns
  unq_cols = cols[cols == 1].index
  dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]

  return (dataframe[unq_cols], dataframe[dup_cols])

df1, df2 = sub_dataframes(df)

Output:

>>> df1
       id
0   22352
1  106534
2   23608
3    8655
4   49670

>>> df2
   t_dur0    t_dur1    t_dur2
0  292720  293760.0  292733.0
1  213760  181000.0  245973.0
2  157124  130446.0  152450.0
3  127896  176351.0  166968.0
4  210320  226253.0  211880.0
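
As a side note on the error itself: as far as I can tell, the KeyError comes from copy[col_data]. There, col_data is the column's values (a Series), and indexing a dataframe with a non-boolean Series makes pandas treat every value as a column label, which is why the id values show up in the message. The right-hand side is evaluated first, so the error is raised before the assignment to new_df is even attempted (new_df is a plain list there, so that would fail too). A minimal sketch, assuming a small made-up frame built from the first rows of the question's sample:

```python
import pandas as pd

# Small stand-in for the question's data (first three sample rows only).
df = pd.DataFrame({
    "id": [22352, 106534, 23608],
    "t_dur0": [292720, 213760, 157124],
    "t_dur1": [293760.0, 181000.0, 130446.0],
})

# .items() yields (column label, column values) pairs;
# it replaces .iteritems(), which was removed in pandas 2.0.
col_name, col_data = next(iter(df.items()))  # first column: "id"

# Indexing with the values looks them up as column labels and fails.
try:
    df[col_data]
    raised = False
except KeyError:
    raised = True

# Indexing with the label returns the same data the loop already handed over.
same = df[col_name].equals(col_data)

# To collect columns one by one, start from an empty DataFrame (not a list)
# and assign the Series under its label.
new_df = pd.DataFrame(index=df.index)
new_df[col_name] = col_data
```

So inside the loop, either use copy[col_name] or just assign col_data directly.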

Answer 2

You can remove the digits and check whether the column names become duplicated:

mask = df.columns.str.replace(r'\d+', '', regex=True).duplicated(keep=False)

# duplicated columns
df1 = df.loc[:, mask]

# unique columns
df2 = df.loc[:, ~mask]
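
To make the mask easier to follow, here is a self-contained sketch on a small made-up frame taken from the question's sample rows. One assumption on my side: the pattern is anchored with $ so that only trailing digits are stripped, whereas the pattern above removes digits anywhere in the name. For the column names in the question both give the same result.

```python
import pandas as pd

# Small stand-in for the question's data (first three sample rows only).
df = pd.DataFrame({
    "id": [22352, 106534, 23608],
    "t_dur0": [292720, 213760, 157124],
    "t_dur1": [293760.0, 181000.0, 130446.0],
    "t_dur2": [292733.0, 245973.0, 152450.0],
})

# Strip the trailing digits: t_dur0, t_dur1, t_dur2 all become "t_dur".
prefixes = df.columns.str.replace(r"\d+$", "", regex=True)

# keep=False marks every member of a repeated prefix, not just the later ones.
mask = prefixes.duplicated(keep=False)

dup_df = df.loc[:, mask]      # columns whose prefix occurs more than once
unique_df = df.loc[:, ~mask]  # columns whose prefix occurs exactly once

print(dup_df.columns.tolist())
print(unique_df.columns.tolist())
```

Without keep=False, the first column of each group (t_dur0 here) would not be flagged and would end up in the unique frame.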

python

dataframe

pandas

keyerror