Slicing a portion of data within a certain x range using Python (or a pandas DataFrame)?

Question

I have a question about how to use the append operator in Python 3.x. In my Python code, I am trying to remove data points whose y value is 0. My data looks like this:

x          y
400.01  0.000e0
420.02  0.000e0
450.03  10.000e0
48.04   2.000e0
520.05  0.000e0
570.06  0.000e0
570.23  5.000e0
600.24  0.000e0
620.25  3.600e-1
700.26  8.400e-1
900.31  2.450e0

I want to extract the data within a certain x range. For example, I want the x and y values where x is greater than 520 but less than 1000.

The desired output looks like this:

  x        y
520.05  0.000e0
570.06  0.000e0
570.23  5.000e0
600.24  0.000e0
620.25  3.600e-1
700.26  8.400e-1
900.31  2.450e0

My current code looks like this.

import numpy as np
import os

myfiles = os.listdir('input')

for file in myfiles:
    with open('input/'+file, 'r') as f:
        data = np.loadtxt(f,delimiter='\t') 


        for row in data: ## remove data points where y is zero
            data_filtered_both = data[data[:,1] != 0.000]
            x_array=(data_filtered_both[:,0])
            y_array=(data_filtered_both[:,1])
            y_norm=(y_array/np.max(y_array))
            x_and_y= np.array([list (i) for i in zip(x_array,y_array)])

    precursor_x=[]
    precursor_y=[]
    for precursor in row: ## get data points where x is 
        precursor = x_and_y[:, np.abs(x_and_y[0,:]) > 520 and np.abs(x_and_y[0,:]) <1000]
        precursor_x=np.array(precursor[0])
        precursor_y=np.array(precursor[1])

I get an error message saying:

  File "<ipython-input-45-0506fab0ad9a>", line 4, in <module>
    precursor = x_and_y[:, np.abs(x_and_y[0,:]) > 2260 and np.abs(x_and_y[0,:]) <2290]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

What should I do? Is there a recommended operator I can use?

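P.S. I realize that pandas DataFrames are very useful for handling datasets like this. I am not very familiar with pandas, but I can use it if needed, so I am adding pandas as a tag as well.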

Answer 1

df[(df['x'] > 520) & (df['x'] < 1000) & (df['y'] != 0)]
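
As a self-contained sketch of the expression above, the question's sample values can be placed in a DataFrame and filtered. Building the frame inline and the names mask and result are assumptions for illustration; the question itself reads tab-delimited files.

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({
    "x": [400.01, 420.02, 450.03, 48.04, 520.05, 570.06,
          570.23, 600.24, 620.25, 700.26, 900.31],
    "y": [0.0, 0.0, 10.0, 2.0, 0.0, 0.0, 5.0, 0.0, 0.36, 0.84, 2.45],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (df["x"] > 520) & (df["x"] < 1000) & (df["y"] != 0)
result = df[mask]
print(result)
```

Python's and cannot be used here: it asks for a single truth value of a whole Series or array, which is what raises the ValueError shown in the question.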

Update:

I recall reading somewhere that query() is faster for huge DataFrames, but my simple benchmark shows that df[(df['x'] > 520) & (df['x'] < 1000)] is always faster than query().

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"X":np.random.randint(100,1300,10000),"Y":np.random.randint(0,200,10000)})
df2 = pd.DataFrame({"X":np.random.randint(100,1300,1000000),"Y":np.random.randint(0,200,1000000)})
df3 = pd.DataFrame({"X":np.random.randint(100,1300,100000000),"Y":np.random.randint(0,200,100000000)})

Small DataFrame:

%timeit df1[(df1['X'] > 520) & (df1['X'] < 1000)]
%timeit df1.query('X > 520 & X < 1000')
%timeit df1[df1['X'].between(520,1000)]
#2.46 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#5.05 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#2.45 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Medium DataFrame:

%timeit df2[(df2['X'] > 520) & (df2['X'] < 1000)]
%timeit df2.query('X > 520 & Y < 1000')
%timeit df2[df2['X'].between(520,1000)]
#31.2 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#42.8 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#32.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Large DataFrame:

%timeit df3[(df3['X'] > 520) & (df3['X'] < 1000)]
%timeit df3.query('X > 520 & Y < 1000')
%timeit df3[df3['X'].between(520,1000)]
#4.04 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#6.37 s ± 56.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#3.68 s ± 38.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

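Interestingly, df3[df3['X'].between(520,1000)] is the fastest for the largest DataFrame. The relative gap between the first and second options also narrows as the DataFrame grows (query() is about 2x slower on the small one but roughly 1.4x to 1.6x slower on the larger ones), so perhaps at some point, or in some other cases, query() performs better.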
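
To repeat a comparison like this outside IPython, a rough sketch using the standard-library timeit module might look like the following. The row count (100,000), the seed, and the 20 repetitions are arbitrary assumptions, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df = pd.DataFrame({
    "X": rng.integers(100, 1300, 100_000),
    "Y": rng.integers(0, 200, 100_000),
})

# Three ways of selecting rows with 520 < X < 1000.
# between() is inclusive by default, so for integer data the
# bounds 521..999 select the same rows as the strict comparison.
candidates = {
    "boolean mask": lambda: df[(df["X"] > 520) & (df["X"] < 1000)],
    "query": lambda: df.query("X > 520 & X < 1000"),
    "between": lambda: df[df["X"].between(521, 999)],
}

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=20) / 20  # mean time per call
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

All three candidates here select the same rows, so any timing difference reflects the method rather than the size of the result.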

Answer 2

You can use between with boolean indexing:

df = df[df['x'].between(520,1000)]
print (df)
         x     y
4   520.05  0.00
5   570.06  0.00
6   570.23  5.00
7   600.24  0.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45

...and to also remove the 0 values from the y column:

df = df[df['x'].between(520,1000) & (df['y'] != 0)]
print (df)
         x     y
6   570.23  5.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45
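
Note that between keeps both endpoints by default, whereas the question asks for x strictly greater than 520 and strictly less than 1000. With the sample data the result is the same, but a sketch of the strict version could look like this, assuming pandas 1.3 or later (where inclusive accepts "both", "neither", "left" or "right"). The three-row frame is made up to hit the boundaries.

```python
import pandas as pd

# Hypothetical values that sit exactly on the range boundaries.
df = pd.DataFrame({"x": [520.0, 570.23, 1000.0], "y": [1.0, 5.0, 2.0]})

closed = df[df["x"].between(520, 1000)]                       # endpoints kept
strict = df[df["x"].between(520, 1000, inclusive="neither")]  # endpoints dropped

print(closed["x"].tolist())
print(strict["x"].tolist())
```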

Or use query, as suggested in the comments:

df = df.query("x>520 & x<1000")
print (df)
         x     y
4   520.05  0.00
5   570.06  0.00
6   570.23  5.00
7   600.24  0.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45

If you also need to filter out the 0 values in the y column:

df = df.query("x>520 & x<1000 & y != 0")
print (df)
         x     y
6   570.23  5.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45
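
If the bounds live in Python variables rather than in the query string, query can refer to them with an @ prefix. A small sketch, where the names lower and upper and the four-row frame are assumptions for illustration:

```python
import pandas as pd

# A few rows taken from the question's sample data.
df = pd.DataFrame({
    "x": [450.03, 520.05, 570.23, 900.31],
    "y": [10.0, 0.0, 5.0, 2.45],
})

lower, upper = 520, 1000  # hypothetical bounds held in local variables

# @name looks the value up in the calling scope.
result = df.query("x > @lower and x < @upper and y != 0")
print(result)
```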

Answer 3

df=pd.DataFrame({"X":np.random.randint(100,1000,10),"Y":np.random.randn(10)})

df2=df.query("X>5 & X<1000")


print(df2)

     X         Y
0  188 -0.923096
1  953  1.327985
2  190 -0.970169
3  975  0.819512
4  900 -0.782465
5  180  0.357470
6  874  1.746500
7  369  0.078113
8  287  1.642208
9  739  2.238841

Answer 4

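I agree with the earlier answers that use pandas, since it is simpler. If you would rather not use it, I suggest splitting the logic into two parts: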

row = np.array([[1, 0], [2, 0], [3, 7], [4, 8], [5, 9]])
print(row)

array([[1, 0], [2, 0], [3, 7], [4, 8], [5, 9]])

x_and_y = []
for x, y in row: ## remove data points where y is zero
    if y > 0:
        x_and_y.append((x, y))
print(x_and_y)

[(3, 7), (4, 8), (5, 9)]

precursor_x = []
precursor_y = []
for x, y in x_and_y: ## get data points where x is
    if x > 3 and x < 9:
        precursor_x.append(x)
        precursor_y.append(y)
print(precursor_x, precursor_y)

[4, 5] [8, 9]

This puts all the x values into precursor_x and all the y values into precursor_y. If you like, you can zip them together:

np.array(list(zip(precursor_x, precursor_y)))

array([[4, 8], [5, 9]])
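
The same two-step logic can also be written without explicit loops, using NumPy boolean masks. This is only a sketch on the toy row array from this answer; the names nonzero and in_range are illustrative.

```python
import numpy as np

row = np.array([[1, 0], [2, 0], [3, 7], [4, 8], [5, 9]])
x, y = row[:, 0], row[:, 1]

# Step 1: mask of points whose y value is not zero.
nonzero = y != 0
# Step 2: mask of points whose x lies strictly inside the range.
# & works element-wise; Python's `and` would raise the ValueError
# from the question because it needs a single True/False value.
in_range = (x > 3) & (x < 9)

precursor = row[nonzero & in_range]
precursor_x = precursor[:, 0]
precursor_y = precursor[:, 1]
print(precursor)
```

Indexing rows with a one-dimensional boolean mask keeps each x paired with its y, so no zipping is needed afterwards.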

date-range

dataframe

python-3.x

pandas