Problems with Dataframe.apply() in combine phase

The problem

I am trying to add new columns to a dataframe using DataFrame.apply(). The number of columns added depends on each row of the original dataframe. The columns created for different rows overlap, and each overlapping column should appear only once in the result.

The apply function seems to work fine on each individual row of the original dataframe, but it throws ValueError: cannot reindex from a duplicate axis during the combine phase. I am not sure how to isolate the axis that is being duplicated, since it is hidden behind .apply().

To complicate things further, the process works on various subsets of the data (n = 23565), but for some reason it fails when I try to apply it to the whole dataframe. I suspect a handful of rows are causing the problem, but I have not been able to pin down exactly which ones.

Any suggestions for isolating the error or clarifying the problem are welcome.
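One way I could imagine narrowing this down (a rough sketch, not part of my actual script; pivot_services is the function from the code sample below) is to run the function row by row and flag any result whose index contains duplicate labels, since that is what the reindex error complains about:

# Sketch: scan per-row results for duplicate index labels
suspects = []
for idx, row in om.iterrows():
    result = pivot_services(row)
    dupes = result.index[result.index.duplicated()]
    if len(dupes) > 0:
        suspects.append((idx, list(dupes)))

print(suspects[:10])  # first few offending (EntityID, Date) rows and their duplicated labels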

Background

The original dataframe om contains columns representing scores, score changes, and the date range over which the scores changed. om is indexed on EntityID and Date, where EntityID is the unique identifier for the client receiving the score. I want to merge in values from another dataframe, services, which contains information about the services provided to clients and is indexed by date.

For each row in om, I want to look up the services that client received between LastScoreDate and ScoreDate, sum the service totals by service type, and attach those sums to the row as new columns (this is what the code sample below does).

Dataframe info
>>> om.info() 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 23565 entries, (4198, Timestamp('2018-09-10 00:00:00')) to (69793, Timestamp('2021-04-15 00:00:00'))
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Childcare                 18770 non-null  float64
 1   Childcare_d               7715 non-null   float64
 2   Education                 22010 non-null  float64
 3   Education_d               9468 non-null   float64
 ..  .....                     ..........      ......
 n   Other Domain Columns      n non-null     float64
 ..  .....                     ..........      ......
 20  Program Collecting Score  23565 non-null  object
 21  LastScoreDate             10423 non-null  datetime64[ns]
 22  ScoreDate                 23565 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(20), object(1)
memory usage: 4.9+ MB

>>> services.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 398966 entries, 2013-04-19 00:00:00 to 2020-07-10 00:00:00
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   EntityID         398966 non-null  int64
 1   Description      398966 non-null  object
 2   Units            398966 non-null  float64
 3   Value            398966 non-null  float64
 4   Unit of Measure  398966 non-null  object
 5   Total            398966 non-null  float64
 6   Program          398966 non-null  object
dtypes: float64(3), int64(1), object(3)
memory usage: 24.4+ MB
Code sample
import pandas as pd

# This function processes a csv and returns a dataframe with service data
services = import_service_data()

# services is passed in as a default parameter, since every function call relies on data from services
def pivot_services(row, services = services):
    print(row.name[0]) # This is the EntityID portion of the row index
    try:
        # Filter services by EntityID matching row index
        client_services = services[services.EntityID == row.name[0]]
        
        # Filter services by date range
        time_frame = client_services[(client_services.index >= row.LastScoreDate) & (client_services.index < row.ScoreDate)]
        
        # Calculate sum service totals by service type
        # This returns a pd.Series
        sums = time_frame.groupby('Description')['Total'].agg(sum)
        
        # Since row is also a pd.Series, they can just be stuck together
        with_totals = pd.concat([row,sums])
        
        # Rename the new series to match the original row name
        with_totals.name = row.name
    
    except IndexError:
        # IndexError is thrown when a client received no services in the date range
        # In this case there is nothing to add to the row, so it just returns the row
        return row
    
    return with_totals
    
# This function processes a csv and returns a dataframe with om data
om = import_final_om()
merged = om.apply(pivot_services, axis = 1)
# Output
Traceback (most recent call last):
  File "C:\CaseWorthy-Documentation\Projects\OM\data_processing.py", line 131, in <module>
    merged = om.apply(pivot_services, axis = 1)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\frame.py", line 7768, in apply
    return op.get_result()
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 185, in get_result
    return self.apply_standard()
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 279, in apply_standard
    return self.wrap_results(results, res_index)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 303, in wrap_results
    return self.wrap_results_for_axis(results, res_index)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 440, in wrap_results_for_axis
    result = self.infer_to_same_shape(results, res_index)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 446, in infer_to_same_shape
    result = self.obj._constructor(data=results)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\frame.py", line 529, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\construction.py", line 287, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\construction.py", line 85, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\construction.py", line 344, in _homogenize
    val = val.reindex(index, copy=False)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\series.py", line 4345, in reindex
    return super().reindex(index=index, **kwargs)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
    return self._reindex_axes(
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\generic.py", line 4832, in _reindex_axes
    obj = obj._reindex_with_indexers(
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\generic.py", line 4877, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\managers.py", line 1301, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\indexes\base.py", line 3476, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

What I'm looking for

Once all the new rows have been created, I want to put them into a dataframe that has all of the same original columns as om, plus the extra service-total columns. The new column names will correspond to the distinct services.Description values, and the row values will be the totals. NaNs will appear wherever a given client did not receive that type of service within the time frame.

I can generate this dataframe from subsets of my data, but when I try to apply the function to the whole om dataframe I get the exception ValueError: cannot reindex from a duplicate axis.

The result I want looks like this:

>>> merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 49 columns):
 #   Column                                       Non-Null Count  Dtype
---  ------                                       --------------  -----
 0   Childcare                                    79 non-null     float64
 1   Childcare_d                                  37 non-null     float64
 2   Education                                    97 non-null     float64
 3   Education_d                                  85 non-null     float64
 ..  .....                                        ..........      ......
 n   Other Domain Columns                         n non-null      float64
 ..  .....                                        ..........      ......
 20  Program Collecting Score                     100 non-null    object
 21  LastScoreDate                                53 non-null     datetime64[ns]
 22  ScoreDate                                    100 non-null    datetime64[ns]
 ########################################################################
 ##                 The columns below are the new ones                 ##
 ##         The final dataset will have over 100 extra columns         ##
 ########################################################################
 23  Additional Child Tax Credit                  1 non-null      float64
 24  Annual Certification                         1 non-null      float64
 25  Annual Certification and Inspection          4 non-null      float64
 26  Case Management                              1 non-null      float64
 ..  .....                                        ..........      ......
 n   Other Service Type Columns                   n non-null      float64
 ..  .....                                        ..........      ......
 47  Utility Payment                              2 non-null      float64
 48  Voucher Issuance                             2 non-null      float64
dtypes: datetime64[ns](2), float64(46), object(1)
memory usage: 39.1+ KB

The problem was that some of the new columns created in pivot_services() had exactly the same names as existing domain score columns. To fix it, I added a line that renames the items in the series before each row is returned.
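As a sanity check (this snippet is not part of the original script), the colliding names can be listed by intersecting the om columns with the distinct service descriptions:

# Sketch: service descriptions that share a name with an existing om column
overlap = sorted(set(om.columns) & set(services['Description'].unique()))
print(overlap)

The updated function is below.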

def pivot_services(row, services = services):
    print(row.name[0]) # This is the EntityID portion of the row index
    try:
        # Filter services by EntityID matching row index
        client_services = services[services.EntityID == row.name[0]]
        
        # Filter services by date range
        time_frame = client_services[(client_services.index >= row.LastScoreDate) & (client_services.index < row.ScoreDate)]
        
        # Calculate sum service totals by service type
        # This returns a pd.Series
        sums = time_frame.groupby('Description')['Total'].agg(sum)
        
        ######################
        # This is the new line
        sums.rename(lambda x: x.replace(' ','_').replace('-','_').replace('/','_')+'_s', inplace = True)

        # Since row is also a pd.Series, they can just be stuck together
        with_totals = pd.concat([row,sums])
        
        # Rename the new series to match the original row name
        with_totals.name = row.name

    except IndexError:
        # IndexError is thrown when a client received no services in the date range
        # In this case there is nothing to add to the row, so it just returns the row
        return row
    
    return with_totals
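
For completeness, the combine-phase failure can be reproduced on a tiny toy frame (hypothetical data, not from the real dataset): when a Series returned from the applied function carries a label that already exists in the row, pandas cannot reindex that Series while assembling the final frame.

import pandas as pd

# Toy repro (hypothetical data): a result Series with a duplicated label breaks
# the DataFrame construction in apply's combine phase.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

def add_cols(row):
    if row.name == 0:
        extra = pd.Series({'a': 99})   # 'a' already exists in the row -> duplicate label
    else:
        extra = pd.Series({'c': 99})   # a genuinely new label
    return pd.concat([row, extra])

df.apply(add_cols, axis=1)
# ValueError: cannot reindex from a duplicate axis
# (newer pandas versions raise a similarly worded error about duplicate labels)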