Python Pandas: cannot do slice indexing
I am trying to work with a pandas MultiIndex DataFrame that looks like this:
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A
I would like to be able to do this:
df.loc[('chr1', slice(3000714, 3001110))]
It fails with the following error:
cannot do slice indexing on with these indexers [1204741] of
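df.index.levels[1].dtype returns dtype('int64'), so it should work with integer slices, right?
Also, any comments on how to do this efficiently would be valuable, since the DataFrame has 12 million rows and I need to run this kind of slice query about 70 million times.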
I think you need to add ,: at the end. This means you are slicing the rows but want all columns:
print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
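To make the fix easy to try, here is a minimal sketch that rebuilds the four sample rows from the question and applies the row slice with the trailing `, :`. The construction of `df` is an assumption for illustration (the question does not show how the frame was created). Note that label slices in `loc` include both endpoints.

```python
import pandas as pd

# Hypothetical reconstruction of the sample data shown in the question.
index = pd.MultiIndex.from_tuples(
    [("chr1", 3000714), ("chr1", 3001065), ("chr1", 3001110), ("chr1", 3001131)],
    names=["chrom", "start"],
)
df = pd.DataFrame(
    {
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    },
    index=index,
)

# The tuple selects rows on both index levels; the trailing ":" selects all columns.
sliced = df.loc[("chr1", slice(3000714, 3001110)), :]
print(sliced)
```

Slicing a MultiIndex by label generally expects the index to be sorted; if the real data is not, calling `df.sort_index()` first is the usual remedy.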
Another solution is to pass axis=0 to loc:
print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
But if you only need 3000714 and 3001110:
print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
Timings:
In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop
In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop
In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop
In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop
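For the efficiency part of the question, one possible direction (a sketch only, not benchmarked here and not part of the timings above) is to precompute a sorted NumPy array of start positions for each chromosome and use np.searchsorted to turn each range query into a positional iloc slice. The helper names build_lookup and query_range are hypothetical, and the approach assumes the frame is sorted by (chrom, start) so that each chromosome occupies a contiguous block of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the sample data shown in the question.
index = pd.MultiIndex.from_tuples(
    [("chr1", 3000714), ("chr1", 3001065), ("chr1", 3001110), ("chr1", 3001131)],
    names=["chrom", "start"],
)
df = pd.DataFrame(
    {
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    },
    index=index,
).sort_index()  # assumption: rows sorted by (chrom, start)


def build_lookup(frame):
    """Map each chromosome to (row offset of its first row, sorted start positions)."""
    chroms = frame.index.get_level_values("chrom").to_numpy()
    starts = frame.index.get_level_values("start").to_numpy()
    lookup = {}
    for chrom in pd.unique(chroms):
        rows = np.flatnonzero(chroms == chrom)
        lookup[chrom] = (rows[0], starts[rows])
    return lookup


def query_range(frame, lookup, chrom, lo, hi):
    """Return rows of `chrom` with lo <= start <= hi (both ends inclusive, like loc)."""
    offset, starts = lookup[chrom]
    first = offset + np.searchsorted(starts, lo, side="left")
    last = offset + np.searchsorted(starts, hi, side="right")
    return frame.iloc[first:last]


lookup = build_lookup(df)
result = query_range(df, lookup, "chr1", 3000714, 3001110)
print(result)
```

The lookup is built once and reused across queries, so each query costs two binary searches plus a positional slice. Whether this actually beats loc on the real 12-million-row frame would need to be measured with %timeit.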