Problems with DataFrame.apply() in the combine phase
Question
I'm trying to use DataFrame.apply() to add new columns to a dataframe. The number of columns added depends on each row of the original dataframe. There is overlap between the columns created for different rows, and each overlapping column should be represented by a single column in the result.
The apply function seems to work fine on each individual row of the original dataframe, but the combine phase throws ValueError: cannot reindex from a duplicate axis. I'm not sure how to isolate the axis that is being duplicated, since it is hidden behind .apply().
To complicate things further, the process works on subsets of the data, but for some reason it fails when I try to apply it to the entire dataframe (n = 23565). I suspect a handful of rows are causing the problem, but I haven't been able to pinpoint exactly which ones.
Any suggestions for isolating the error or clarifying the question are welcome.
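Since the duplicated axis is hidden behind .apply(), one quick sanity check is whether any services.Description value collides with an existing om column name. A sketch on hypothetical stand-in lists (not the real om/services):

```python
# Hypothetical stand-ins for om.columns and the distinct services.Description values
om_columns = ['Education', 'Childcare', 'ScoreDate', 'LastScoreDate']
service_types = ['Education', 'Case Management', 'Voucher Issuance']

# Any name appearing in both lists becomes a duplicate label
# after pd.concat([row, sums]) in the apply function
overlap = sorted(set(om_columns) & set(service_types))
print(overlap)
```

Against the real dataframes this would be `sorted(set(om.columns) & set(services['Description']))`.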
Background
The original dataframe om contains columns representing scores, score changes, and the date ranges over which the scores changed. om is indexed on EntityID and Date, where EntityID is a unique identifier for the client who received the score. I want to merge in values from another dataframe, services, which contains information about services provided to clients and is indexed by date.
For each row in om, I want to perform the following transformation:
- Filter services to the row's EntityID and to dates between om.LastScoreDate and om.ScoreDate
- Sum services.Total grouped by services.Description
- Append the resulting series to the original row
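On toy data, the first two steps above look like this (illustrative values only; column names follow services.info() below):

```python
import pandas as pd

# Toy services table: indexed by date, like the real services dataframe
services = pd.DataFrame(
    {'EntityID': [1, 1, 1, 2],
     'Description': ['Case Management', 'Case Management', 'Education', 'Education'],
     'Total': [10.0, 5.0, 3.0, 7.0]},
    index=pd.to_datetime(['2020-01-05', '2020-02-01', '2020-04-01', '2020-01-10']))

# Step 1: filter by EntityID and by the row's score-date window
client = services[services.EntityID == 1]
window = client[(client.index >= '2020-01-01') & (client.index < '2020-03-01')]

# Step 2: sum Total by Description -> a pd.Series keyed by service type
sums = window.groupby('Description')['Total'].sum()
print(sums)

# Step 3 would append sums to the om row with pd.concat([row, sums])
```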
Dataframe info
>>> om.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 23565 entries, (4198, Timestamp('2018-09-10 00:00:00')) to (69793, Timestamp('2021-04-15 00:00:00'))
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Childcare 18770 non-null float64
1 Childcare_d 7715 non-null float64
2 Education 22010 non-null float64
3 Education_d 9468 non-null float64
.. ..... .......... ......
n Other Domain Columns n non-null float64
.. ..... .......... ......
20 Program Collecting Score 23565 non-null object
21 LastScoreDate 10423 non-null datetime64[ns]
22 ScoreDate 23565 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(20), object(1)
memory usage: 4.9+ MB
>>> services.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 398966 entries, 2013-04-19 00:00:00 to 2020-07-10 00:00:00
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EntityID 398966 non-null int64
1 Description 398966 non-null object
2 Units 398966 non-null float64
3 Value 398966 non-null float64
4 Unit of Measure 398966 non-null object
5 Total 398966 non-null float64
6 Program 398966 non-null object
dtypes: float64(3), int64(1), object(3)
memory usage: 24.4+ MB
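Another place a duplicate axis can hide is the MultiIndex itself: if any (EntityID, Date) pair repeats, the combined frame cannot be reindexed. A sketch on toy tuples (against the real data the check would be om.index[om.index.duplicated(keep=False)]):

```python
import pandas as pd

# Toy MultiIndex shaped like om's (EntityID, Date) index, with one repeated pair
idx = pd.MultiIndex.from_tuples(
    [(4198, pd.Timestamp('2018-09-10')),
     (4198, pd.Timestamp('2018-09-10')),
     (69793, pd.Timestamp('2021-04-15'))],
    names=['EntityID', 'Date'])

# keep=False marks every member of a duplicated group, not just the repeats
dup_mask = idx.duplicated(keep=False)
print(idx[dup_mask].tolist())
```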
Code example
import pandas as pd

# This function processes a csv and returns a dataframe with service data
services = import_service_data()

# services is passed in as a default parameter, since every function call relies on data from services
def pivot_services(row, services=services):
    print(row.name[0])  # This is the EntityID portion of the row index
    try:
        # Filter services by EntityID matching row index
        client_services = services[services.EntityID == row.name[0]]
        # Filter services by date range
        time_frame = client_services[(client_services.index >= row.LastScoreDate) & (client_services.index < row.ScoreDate)]
        # Calculate sum of service totals by service type
        # This returns a pd.Series
        sums = time_frame.groupby('Description')['Total'].agg('sum')
        # Since row is also a pd.Series, they can just be stuck together
        with_totals = pd.concat([row, sums])
        # Rename the new series to match the original row name
        with_totals.name = row.name
    except IndexError:
        # IndexError is thrown when a client received no services in the date range
        # In this case there is nothing to add to the row, so it just returns the row
        return row
    return with_totals

# This function processes a csv and returns a dataframe with om data
om = import_final_om()
merged = om.apply(pivot_services, axis=1)
# Output
Traceback (most recent call last):
File "C:\CaseWorthy-Documentation\Projects\OM\data_processing.py", line 131, in <module>
merged = om.apply(pivot_services, axis = 1)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\frame.py", line 7768, in apply
return op.get_result()
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 185, in get_result
return self.apply_standard()
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 279, in apply_standard
return self.wrap_results(results, res_index)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 303, in wrap_results
return self.wrap_results_for_axis(results, res_index)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 440, in wrap_results_for_axis
result = self.infer_to_same_shape(results, res_index)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\apply.py", line 446, in infer_to_same_shape
result = self.obj._constructor(data=results)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\frame.py", line 529, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\construction.py", line 287, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\construction.py", line 85, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\construction.py", line 344, in _homogenize
val = val.reindex(index, copy=False)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\series.py", line 4345, in reindex
return super().reindex(index=index, **kwargs)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\generic.py", line 4811, in reindex
return self._reindex_axes(
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\generic.py", line 4832, in _reindex_axes
obj = obj._reindex_with_indexers(
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\generic.py", line 4877, in _reindex_with_indexers
new_data = new_data.reindex_indexer(
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\internals\managers.py", line 1301, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "D:\Anaconda3\envs\om\lib\site-packages\pandas\core\indexes\base.py", line 3476, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
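Since the traceback points at the combine step rather than at a specific row, one way to isolate the offending rows is to build each result series in a plain loop and flag any whose index contains duplicate labels before pandas tries to combine them. A sketch with a toy stand-in for pivot_services:

```python
import pandas as pd

def row_func(row):
    # Stand-in for pivot_services: appends a series whose label
    # collides with an existing column ('Education')
    return pd.concat([row, pd.Series({'Education': 1.0})])

df = pd.DataFrame({'Education': [1.0, 2.0], 'Childcare': [3.0, 4.0]})

# Collect index labels whose per-row result carries duplicate labels
bad = [name for name, row in df.iterrows()
       if row_func(row).index.duplicated().any()]
print(bad)
```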
What I'm looking for
Once all the new rows are created, I want to put them into a dataframe that has all the same original columns as om, plus additional service-total columns. The new column names will refer to the distinct services.Description values, and the row values will be the totals. NaNs appear where a given client received no service of that type within the time period.
I can generate this dataframe from subsets of my data, but when I try to apply the function to the whole om dataframe, I get ValueError: cannot reindex from a duplicate axis.
The result I want looks like this:
>>> merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 49 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Childcare 79 non-null float64
1 Childcare_d 37 non-null float64
2 Education 97 non-null float64
3 Education_d 85 non-null float64
.. ..... .......... ......
n Other Domain Columns n non-null float64
.. ..... .......... ......
20 Program Collecting Score 100 non-null object
21 LastScoreDate 53 non-null datetime64[ns]
22 ScoreDate 100 non-null datetime64[ns]
########################################################################
## The columns below are the new ones ##
## The final dataset will have over 100 extra columns ##
########################################################################
23 Additional Child Tax Credit 1 non-null float64
24 Annual Certification 1 non-null float64
25 Annual Certification and Inspection 4 non-null float64
26 Case Management 1 non-null float64
.. ..... .......... ......
n Other Service Type Columns n non-null float64
.. ..... .......... ......
47 Utility Payment 2 non-null float64
48 Voucher Issuance 2 non-null float64
dtypes: datetime64[ns](2), float64(46), object(1)
memory usage: 39.1+ KB
Answer
The problem was that some of the new columns created in pivot_services() had exactly the same names as existing domain-score columns. To fix it, I added a line that renames the items in the series before each row is returned.
def pivot_services(row, services=services):
    print(row.name[0])  # This is the EntityID portion of the row index
    try:
        # Filter services by EntityID matching row index
        client_services = services[services.EntityID == row.name[0]]
        # Filter services by date range
        time_frame = client_services[(client_services.index >= row.LastScoreDate) & (client_services.index < row.ScoreDate)]
        # Calculate sum of service totals by service type
        # This returns a pd.Series
        sums = time_frame.groupby('Description')['Total'].agg('sum')
        ######################
        # This is the new line
        sums.rename(lambda x: x.replace(' ', '_').replace('-', '_').replace('/', '_') + '_s', inplace=True)
        # Since row is also a pd.Series, they can just be stuck together
        with_totals = pd.concat([row, sums])
        # Rename the new series to match the original row name
        with_totals.name = row.name
    except IndexError:
        # IndexError is thrown when a client received no services in the date range
        # In this case there is nothing to add to the row, so it just returns the row
        return row
    return with_totals