How can I select data from a dask dataframe by a list of indices?
I want to select rows from a dask dataframe based on a list of indices. How can I do that?
Example:
Let's say I have the following dask dataframe.
import pandas as pd
import dask.dataframe as dd

dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dd.from_pandas(pdf, npartitions=2)
In addition, I have a list of indices that I am interested in, e.g.
indices_i_want_to_select = ['x1','x3', 'y6']
From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select.
Edit: dask now supports loc on lists:
ddf_selected = ddf.loc[indices_i_want_to_select]
The following should still work, but is no longer necessary:
import pandas as pd
import dask.dataframe as dd
# Generate the example dataframe (note the mixed str/int index)
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)
# List of indices I want to select
l = ['i1', 4, 5]
# Generate a new dask dataframe containing only the specified indices.
# The filter keeps the columns unchanged, so the input's metadata is reused as meta.
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf._meta)
With dask version '1.2.0', the code above results in an error because of the mixed index types.
In any case, there is the option of using loc.
import pandas as pd
import dask.dataframe as dd
# Generate the example dataframe (the index holds strings only)
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2)
# List of indices I want to select
l = ['i1', '4', '5']
# Generate a new dask dataframe containing only the specified indices
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf._meta)
ddf_selected = ddf.loc[l]
ddf_selected.head()