Replacing existing column in dask map_partitions gives SettingWithCopyWarning
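I am using map_partitions to replace the column id2 in a dask dataframe. The values are replaced, but pandas emits a warning.
What is this warning, and how do I apply the .loc suggestion in the example below?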
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'dummy2': [10, 10, 10, 20, 20, 15, 10, 30, 20, 26],
    'id2': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    'balance2': [150, 140, 130, 280, 260, 150, 140, 130, 280, 260]
})
ddf = dd.from_pandas(pdf, npartitions=3)

def func2(df):
    # overwrite id2 in place on the partition passed in by dask
    df['id2'] = df['balance2'] + 1
    return df

ddf = ddf.map_partitions(func2)
ddf.compute()
C:\Users\xxxxxx\AppData\Local\Temp\ipykernel_300768155462.py:2:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame. Try using .loc[row_indexer,col_indexer] =
value instead
See the caveats in the documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['id2'] = df['balance2'] + 1
A quick fix is to copy the dataframe inside the function:
def func2(df):
    df = df.copy()  # will make a copy of the dataframe
    df['id2'] = df['balance2'] + 1
    return df
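However, as far as I understand, copying the dataframe is not necessary: because dask dataframes are lazy, the change does not propagate back to the partitions of the original dask dataframe.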
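Update: there is a relevant question that explains the reason for .copy in pandas. In the snippet below, applying the function modifies the original pandas dataframe, which may be undesirable: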
from pandas import DataFrame

def addcol(df):
    df['a'] = 1
    return df

df = DataFrame()
df1 = addcol(df)
# without .copy, df is also modified, which might be undesirable
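To make the difference concrete, here is a small pandas-only sketch comparing the mutating function with a copying variant. The name `addcol_copy` and the sample column `x` are made up for illustration.

```python
import pandas as pd


def addcol(df):
    # mutates the DataFrame object that was passed in
    df["a"] = 1
    return df


def addcol_copy(df):
    # hypothetical variant: work on a copy so the caller's frame is untouched
    df = df.copy()
    df["a"] = 1
    return df


original = pd.DataFrame({"x": [1, 2, 3]})

copied = addcol_copy(original)
columns_after_copy = list(original.columns)     # original still has only "x"

mutated = addcol(original)
columns_after_inplace = list(original.columns)  # original now has "a" as well
```

In the second call `mutated` is the very same object as `original`, which is why the new column shows up on both.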
In the context of dask, this warning is only a warning, so .copy is not needed:
from dask.dataframe import from_pandas
ddf = from_pandas(df, npartitions=1)
ddf1 = ddf.map_partitions(addcol)
# will show warning, but original ddf is not modified