Matching to a specific year column in pandas

Question

I am trying to take the "Given" value and match it to a "Year" column in the same row, using the following dataframe:

import pandas as pd

data = {
    'Given' : [0.45, 0.39, 0.99, 0.58, None],
    'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
   
df = pd.DataFrame(data)

print(df)

Output:

   Given  Year 1  Year 2  Year 3  Year 4  Year 5
0   0.45    0.25    0.39    0.43    0.65    0.74
1   0.39    0.15    0.27    0.58    0.83    0.87
2   0.99    0.30    0.55    0.78    0.95    0.99
3   0.58    0.23    0.30    0.64    0.73    0.92
4    NaN    0.25    0.40    0.69    0.85    0.95

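However, there are some caveats to the matching process. I am trying to match the "Given" value to the closest "Year" column, and then count the years from that match to the first "Year" that is above 70%. So row 0 would match "Year 3", and in that same row it takes two years to reach "Year 5", which is the first value in the row above 70%.

For any "Given" value that is already above 70%, we can just output "full", and for any "Given" value that is missing, we can just output the first year above 70%. The output would look like this: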

   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0   0.45    0.25    0.39    0.43    0.65    0.74      2
1   0.39    0.15    0.27    0.58    0.83    0.87      2
2   0.99    0.30    0.55    0.78    0.95    0.99   full
3   0.58    0.23    0.30    0.64    0.73    0.92      1
4    NaN    0.25    0.40    0.69    0.85    0.95      4
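
To make the rule concrete, here is row 0 worked through by hand. This is only a small sketch of the intended calculation; the variable names (given, years, closest, first_70) are made up for illustration and are not part of the dataframe.

```python
import numpy as np

# Row 0 of the example frame, written out by hand for illustration
given = 0.45
years = np.array([0.25, 0.39, 0.43, 0.65, 0.74])  # Year 1 .. Year 5

# 0-based position of the year closest to "Given"
closest = int(np.abs(years - given).argmin())
# 0-based position of the first year above 70%
first_70 = int(np.argmax(years > 0.7))

# Closest year (1-based), first year above 70% (1-based), years between them
print(closest + 1, first_70 + 1, first_70 - closest)  # 3 5 2
```

So the closest match is "Year 3", the first value above 70% is in "Year 5", and the expected output for the row is 2.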

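I have spent an extremely long time cleaning this data, so at the moment I can't think of any approach beyond using .abs() to start the matching process. All help is appreciated.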

Answer 1

Here you go!

import numpy as np

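def output(df):
    results = []
    for _, row in df.iterrows():
        values = row.to_list()
        given = values[0]
        compare = np.array(values[1:])
        # Position of the first year above 70% (argmax returns the first True)
        first_70 = np.argmax(compare > 0.7)

        # Missing "Given": report the first year above 70% (1-based)
        if np.isnan(given):
            results.append(first_70 + 1)
            continue

        # "Given" is already above 70%
        if given > 0.7:
            results.append('full')
            continue

        # Position of the year closest to "Given"
        diff = np.abs(compare - given)
        closest_year = diff.argmin()

        # Years from the closest year to the first year above 70%
        results.append(first_70 - closest_year)

    return results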

df['output'] = output(df)
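
One caveat with np.argmax on a boolean array: it returns 0 when no element is True, so a row in which no year exceeds 70% would be treated as if "Year 1" did. The sample data has no such row, so whether this matters depends on the real data. A minimal sketch of a guard, using a hypothetical helper name first_above that does not appear in the answer:

```python
import numpy as np

def first_above(values, threshold=0.7):
    """Return the 0-based position of the first value above threshold, or None."""
    mask = np.asarray(values, dtype=float) > threshold
    # argmax alone cannot distinguish "first element is True" from "nothing is True"
    return int(mask.argmax()) if mask.any() else None

print(first_above([0.25, 0.39, 0.43, 0.65, 0.74]))  # 4
print(first_above([0.10, 0.20, 0.30]))              # None
```

How a None result should be reported in the output column is a decision for the data owner; the question does not say.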

Answer 2

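A vectorized pandas approach:

Use .T with reset_index(drop=True) on the column names so that both dataframes share the same column labels and can be subtracted from each other in a vectorized way. pd.concat() with * creates a dataframe that repeats the first column, so you can take the absolute difference between the dataframes without looping over the columns.
Use idxmax and idxmin to identify the column number that meets your criteria.

Use np.select to apply your conditions.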

import pandas as pd
import numpy as np
# identify 70% columns
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)
# identify column number of lowest absolute difference to Given
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T 
 - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
  .T.reset_index(drop=True).T)).abs().idxmin(axis=1) 
# Generate an output series
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
                          [output, 'full', pct_70],np.nan)
df
Out[1]: 
   Given  Year 1  Year 2  Year 3  Year 4  Year 5 Output
0   0.45    0.25    0.39    0.43    0.65    0.74    2.0
1   0.39    0.15    0.27    0.58    0.83    0.87    2.0
2   0.99    0.30    0.55    0.78    0.95    0.99   full
3   0.58    0.23    0.30    0.64    0.73    0.92    1.0
4    NaN    0.25    0.40    0.69    0.85    0.95      4
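
Note that the np.select call above mixes numbers with the string 'full', which is why the Output column shows 2.0 next to 4. Depending on the NumPy version, that mix may be coerced to strings or rejected with a dtype error, and idxmin on the all-NaN row may behave differently across pandas versions. As a hedged alternative, the same logic can be sketched with plain NumPy broadcasting. The variable names (given, years, first_70, nearest, out) are illustrative, and the sketch assumes every year column contains "Year" in its name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Given':  [0.45, 0.39, 0.99, 0.58, None],
    'Year 1': [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2': [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3': [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4': [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5': [0.74, 0.87, 0.99, 0.92, 0.95],
})

given = df['Given'].to_numpy(dtype=float)
years = df.filter(like='Year').to_numpy(dtype=float)

# 0-based position of the first year above 70% in each row
first_70 = (years > 0.7).argmax(axis=1)

# Absolute difference to "Given", broadcast across the year columns.
# Rows with a missing "Given" are all NaN, so fill them with 0 to keep
# argmin well defined; their result is overwritten below anyway.
diff = np.abs(years - given[:, None])
nearest = np.nan_to_num(diff, nan=0.0).argmin(axis=1)

# Missing "Given": first year above 70% (1-based); otherwise the gap in years
out = np.where(np.isnan(given), first_70 + 1, first_70 - nearest).astype(object)
# "Given" already above 70% (NaN compares as False here)
out[given > 0.7] = 'full'

df['Output'] = out
print(df)
```

Keeping the column as object dtype lets the integers stay integers alongside 'full', at the cost of a column that is not purely numeric.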

python

numpy

dataframe

pandas