Filter Dask DataFrame on categorical column?
Suppose I have a large dataframe of fruit. It has thousands of rows but only about 30 distinct fruit names, so I converted that column to a category:
df['fruit_name'] = df.fruit_name.astype('category')
Now that it is a category, can I no longer filter on it? For example,
df_kiwi = df[df['fruit_name'] == 'kiwi']
raises TypeError("invalid type comparison").
If I try to create a "dummy" dataframe and merge on it instead, I get a ValueError: "You are trying to merge on int8 and category columns..."
df_dummy = pd.DataFrame(data={'fruit_name': 'kiwi'}, index=range(1))
df_dummy['fruit_name'] = df_dummy.fruit_name.astype('category')
df_new = df.merge(df_dummy, how="inner", on="fruit_name")
Have I lost some merging and filtering functionality on categorical columns, or am I just doing it wrong? (I am still very new to dask and pandas.) Thanks!
Here is an example that shows this working:
In [1]: import dask
In [2]: df = dask.datasets.timeseries()
In [3]: df.head()
Out[3]:
                       id      name          x          y
timestamp
2000-01-01 00:00:00   978    Hannah   0.194721   0.518782
2000-01-01 00:00:01   973   Michael  -0.894162  -0.454409
2000-01-01 00:00:02  1043       Bob   0.829046  -0.585921
2000-01-01 00:00:03  1027     Edith  -0.109735   0.563914
2000-01-01 00:00:04   970  Patricia  -0.621248  -0.655324
In [4]: df['name'] = df.name.astype('category')
In [5]: df[df.name == 'Alice'].head()
Out[5]:
                       id   name          x          y
timestamp
2000-01-01 00:00:23   997  Alice  -0.662165  -0.260169
2000-01-01 00:00:58  1012  Alice  -0.840159  -0.036770
2000-01-01 00:01:23   961  Alice   0.831663   0.022570
2000-01-01 00:01:27   987  Alice  -0.874289  -0.358708
2000-01-01 00:02:09   984  Alice   0.445238  -0.658470
I recommend constructing a minimal failing example.