Iteratively remove unsorted rows on condition until defined dataframe size reached

Question

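How can I reduce the size of a dataframe one row at a time, based on a condition (partnersCount == 1, with selectionWeighting taken in order from lowest to highest), until the dataframe reaches a specified size?

Pseudocode:

iterate while the size of df is larger than we expect // if any single partnerIDs exist (partnersCount == 1) // if partnersCount == 1 && selectionWeighting < 0, remove them

Ideally, I'd like to continue the above so that:

if df is still too big, // find multiples of like partnerID groupings that have the lowest selectionWeighting and iteratively remove those groupings/pairs until the required size is reached. The difficulty here is that rows with the same partnerID may have different selectionWeightings.

The current approach works, but it is hard to believe it is the best way to achieve this. The data/df size is always < 2k rows. Perhaps someone can recommend an alternative approach. Note: for reasons I won't go into, applying a sort and removing rows from the bottom/top up is not an option.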

import pandas as pd

# some pretend data
df = pd.DataFrame({'ID': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                   'selectionWeighting': [1,1,.45,.45,.3,.3,.2,.2,.2,.2,.1,.1,0,0,0,1,1,.45,.3,.3],
                   'partnerID': [1,1,4,4,3,9,2,2,11,2,1,1,0,7,0,1,1,4,3,1],
                   'partnersCount': [6,6,3,3,2,1,3,3,1,3,6,6,2,1,2,6,6,3,2,1]})
df.reset_index(drop=True, inplace=True)
df = df.sort_values(['selectionWeighting'], ascending=[False])

# init 
# in reality, these are variable and coming from elsewhere. 
targetDfSize = 18 # target size changes every time.
currentDfSize = df.shape[0]
difference = max(0,currentDfSize - targetDfSize)

if difference:
    for i in range(difference):
        canRemove = df[(df['partnersCount']==1) & (df['selectionWeighting']!=1) ] # get those prioritised for removal
        #display(canRemove)
        df = df[df['partnersCount']>1] # clean up before we put the remove rows back
        #display(df)
        if canRemove.shape[0]>0: 
            # so there are some single partners we can remove
            
            print(df)
            canRemove = canRemove.iloc[:-1] # drop 1, remove last row 

            # pd.concat stands in for DataFrame.append, which was removed in pandas 2.0
            df = pd.concat([df, canRemove], ignore_index=True) # append the remainder back, then go check if we're still too big
            print(df.shape[0])
print(df)

Answer 1

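I think your example contains a number of expensive operations that may not be needed.

If I understand your example correctly, you first remove every canRemove row from df with df = df[df['partnersCount']>1], and then append all but one of those rows back again (the df.append step). Why not simply remove one row at a time?

Second, if you know how many rows you want to remove, why loop at all?

I would suggest implementing it like this: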

## select all rows such that partnersCount is 1 and selectionWeighting is not 1
can_remove = df.query("partnersCount == 1 and selectionWeighting != 1")

## select only the last n rows you can remove
## (tail(0) is empty, so nothing is removed when difference is 0)
to_remove = can_remove.tail(difference)

## new dataframe without those rows, dropped by index label
## (the column dtypes are left unchanged)
df = df.drop(index=to_remove.index)
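
Wrapped into a function, the idea above might look like the following sketch. It is an illustration, not the answerer's exact code: the name `trim_single_partners` is hypothetical, the index labels are assumed to be unique, and the candidates are picked with `nsmallest` rather than by position, so the result does not depend on `df` having been sorted beforehand and the surviving rows keep their original order.

```python
import pandas as pd


def trim_single_partners(df: pd.DataFrame, target_size: int) -> pd.DataFrame:
    """Drop single-partner rows, lowest selectionWeighting first, until the
    frame has at most target_size rows or no candidates are left."""
    difference = max(0, len(df) - target_size)
    if difference == 0:
        return df

    # Rows eligible for removal: one partner and a weighting other than 1.
    can_remove = df.query("partnersCount == 1 and selectionWeighting != 1")

    # Pick up to `difference` candidates with the smallest weighting.
    # Only the candidates are ordered here; df itself is never re-sorted.
    to_remove = can_remove.nsmallest(difference, "selectionWeighting")

    # Drop by index label (assumes the index labels are unique).
    return df.drop(index=to_remove.index)


# The pretend data from the question, left unsorted.
df = pd.DataFrame({
    "ID": list(range(1, 21)),
    "selectionWeighting": [1, 1, .45, .45, .3, .3, .2, .2, .2, .2,
                           .1, .1, 0, 0, 0, 1, 1, .45, .3, .3],
    "partnerID": [1, 1, 4, 4, 3, 9, 2, 2, 11, 2,
                  1, 1, 0, 7, 0, 1, 1, 4, 3, 1],
    "partnersCount": [6, 6, 3, 3, 2, 1, 3, 3, 1, 3,
                      6, 6, 2, 1, 2, 6, 6, 3, 2, 1],
})

trimmed = trim_single_partners(df, target_size=18)
print(trimmed.shape[0])
```

If fewer candidates exist than rows to remove, the function removes what it can and returns a frame that is still larger than the target.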

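For large data, I strongly recommend using query to speed up your code, and avoiding Python for loops over a Pandas dataframe. Pandas has many optimized functions that you can take advantage of.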
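
As a small illustration of this loop-free style, the sketch below selects the removable rows once with `query` and once with an equivalent boolean mask. The four-row frame is made-up sample data, and no timing difference is claimed here: on frames of a couple of thousand rows either form is likely to be fast enough.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "selectionWeighting": [1.0, 0.3, 0.2, 0.0],
    "partnersCount": [1, 1, 3, 1],
})

# Expression-string form.
via_query = df.query("partnersCount == 1 and selectionWeighting != 1")

# Equivalent boolean-mask form; both evaluate the whole column at once.
mask = (df["partnersCount"] == 1) & (df["selectionWeighting"] != 1)
via_mask = df[mask]

print(via_query.equals(via_mask))
```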

I hope this answers your question.

Edited to better answer the OP's question and to remove a useless query.
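
The second stage described in the question (removing whole partnerID groupings when the frame is still too big) is not covered by the snippet above. The sketch below shows one possible reading of it, under assumptions that are not confirmed by the question: a "grouping" is taken to be the rows that share both a partnerID and a selectionWeighting, groupings are removed whole from the lowest weighting upwards, and removal stops before a grouping that would take the frame below the target. The name `trim_partner_groups` is hypothetical, and the index labels are assumed to be unique.

```python
import pandas as pd


def trim_partner_groups(df: pd.DataFrame, target_size: int) -> pd.DataFrame:
    """Remove whole (selectionWeighting, partnerID) groupings, lowest
    weighting first, without ever going below target_size rows."""
    excess = max(0, len(df) - target_size)
    if excess == 0:
        return df

    # Only rows that actually have partners take part in this stage.
    multi = df[df["partnersCount"] > 1]

    # groupby sorts its keys, so groupings come out lowest weighting first.
    grouped = multi.groupby(["selectionWeighting", "partnerID"])
    running_total = grouped.size().cumsum()

    # Number of leading groupings that fit entirely within the excess.
    n_groups = int((running_total <= excess).sum())

    # ngroup() numbers the groupings in the same sorted order, from 0.
    group_number = grouped.ngroup()
    to_drop = group_number.index[group_number < n_groups]

    # Drop by index label (assumes the index labels are unique).
    return df.drop(index=to_drop)


# The pretend data from the question.
df = pd.DataFrame({
    "ID": list(range(1, 21)),
    "selectionWeighting": [1, 1, .45, .45, .3, .3, .2, .2, .2, .2,
                           .1, .1, 0, 0, 0, 1, 1, .45, .3, .3],
    "partnerID": [1, 1, 4, 4, 3, 9, 2, 2, 11, 2,
                  1, 1, 0, 7, 0, 1, 1, 4, 3, 1],
    "partnersCount": [6, 6, 3, 3, 2, 1, 3, 3, 1, 3,
                      6, 6, 2, 1, 2, 6, 6, 3, 2, 1],
})

result = trim_partner_groups(df, target_size=16)
print(result.shape[0])
```

Because groupings are removed whole, the result can stay above the target when the next grouping would overshoot it. Whether to stop there or to split that grouping is a design choice the question leaves open.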
