raise InvalidIndexError(key) strange
I am trying this:
import dask.dataframe as dd
import pandas as pd
salary_df = pd.DataFrame({"Salary":[10000, 50000, 25000, 30000, 7000, 100000]})
salary_category = pd.DataFrame({
    "Hi": [5000, 20000, 25000, 30000, 90000, 120000],
    "Low": [0, 5001, 20001, 25001, 30001, 90001],
    "category": ["Very Poor", "Poor", "Medium", "Rich", "Super Rich", "Ultra Rich"],
})
sal_ddf = dd.from_pandas(salary_df, npartitions=10)
salary_category.index = pd.IntervalIndex.from_arrays(salary_category['Low'],salary_category['Hi'],closed='both')
sal_ddf['Category'] = sal_ddf['Salary'].map_partitions(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'], meta=('Category', 'str'))
print(salary_category)
print(sal_ddf.head())
My output for salary_category is:
                     Hi    Low    category
[0, 5000]          5000      0   Very Poor
[5001, 20000]     20000   5001        Poor
[20001, 25000]    25000  20001      Medium
[25001, 30000]    30000  25001        Rich
[30001, 90000]    90000  30001  Super Rich
[90001, 120000]  120000  90001  Ultra Rich
Doesn't 10000 fall into the Poor category?
But I still get an index error like this:
sal_ddf['Category'] = sal_ddf['Salary'].map_partitions(lambda x : salary_category.iloc[salary_category.index.get_loc(x)]['category'], meta=('Category', 'str'))
File "C:\Python\Python310\lib\site-packages\pandas\core\indexes\interval.py", line 613, in get_loc
raise InvalidIndexError(key)
pandas.errors.InvalidIndexError: 0 10000
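Why the key error?
The function passed to .map_partitions receives a whole partition at once, here a pandas Series, rather than a single value, while IntervalIndex.get_loc expects one scalar key. Handing it the Series is what raises InvalidIndexError (the traceback shows the Series itself as the key). A quick fix is to define a custom function that looks up each value with .apply and to run it on the dataframe with .map_partitions: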
sal_ddf = dd.from_pandas(salary_df, npartitions=10)
salary_category.index = pd.IntervalIndex.from_arrays(salary_category['Low'], salary_category['Hi'], closed='both')

def get_salary(df):
    # Each call receives one partition as a pandas DataFrame.
    df = df.copy()
    # get_loc takes a single scalar, so look up the salaries one at a time.
    df['category'] = df['Salary'].apply(lambda x: salary_category.iloc[salary_category.index.get_loc(x)]['category'])
    return df

sal_ddf = sal_ddf.map_partitions(get_salary)
print(salary_category)
print(sal_ddf.compute())
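As a possible alternative to the per-value .apply above, here is a minimal pandas-only sketch that looks up a whole partition in one call with IntervalIndex.get_indexer. The helper name categorize and the choice to leave unmatched salaries as missing values are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

salary_category = pd.DataFrame({
    "Hi": [5000, 20000, 25000, 30000, 90000, 120000],
    "Low": [0, 5001, 20001, 25001, 30001, 90001],
    "category": ["Very Poor", "Poor", "Medium", "Rich", "Super Rich", "Ultra Rich"],
})
salary_category.index = pd.IntervalIndex.from_arrays(
    salary_category["Low"], salary_category["Hi"], closed="both"
)

salaries = pd.Series([10000, 50000, 25000, 30000, 7000, 100000], name="Salary")

# get_loc accepts one scalar key; a whole Series is rejected,
# which is the error reported in the question.
try:
    salary_category.index.get_loc(salaries)
except InvalidIndexError:
    print("get_loc rejects a whole Series")


def categorize(partition):
    """Hypothetical helper: map one pandas Series of salaries to category labels."""
    # get_indexer resolves every value at once; -1 means "no matching interval".
    positions = salary_category.index.get_indexer(partition)
    labels = salary_category["category"].to_numpy()[positions]
    result = pd.Series(labels, index=partition.index, name="Category", dtype="object")
    # Blank out rows that matched no interval (assumed behaviour for this sketch).
    return result.where(positions != -1)


print(categorize(salaries))
```

Because categorize takes a Series and returns a Series, it should be usable as sal_ddf['Salary'].map_partitions(categorize, meta=('Category', 'object')), although that dask call was not run as part of this sketch.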