Setting boolean values in pandas dataframe (by date) based on column header membership in other dataframe (by date)

Question

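I have two pandas DataFrames (X and Y) and I am trying to populate a third (Z) with booleans based on the relationship between the axes of X and the columns/contents of Y. I have only managed to do this with nested loops; the code works on my toy example but is too slow for my actual dataset.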

import numpy as np
import pandas as pd

# define X, Y and Z
idx=pd.date_range('2016-1-31',periods=3,freq='M')
codes = list('ABCD')
X = np.random.randn(3,4)
X = pd.DataFrame(X,columns=codes,index=idx)

Y = [['A','A','B'],['C','B','C'],['','C','D']]
Y = pd.DataFrame(Y,columns=idx)

Z = pd.DataFrame(columns=X.columns, index=X.index)

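As you can see, in this example the index of X matches the columns of Y. In my real data, Y's columns are a subset of X's index.

Z's axes match X's. I want to set an element of Z to True if its column header appears in the column of Y whose header equals that element's index label. My working code is below: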

for r in Y:
    for c in Z:
        Z.loc[r,c] = c in Y[r].values

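The code is pretty concise, but it takes a long time to run on the larger dataset. I'm hoping there is a vectorized way to achieve the same thing much faster.

Any help would be appreciated.

Thanks!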

Answer 1

You can use stack to reshape Y so that its values (the codes) become the column labels and its column labels (the dates) become the index. Then test for non-NaN cells with notnull:

print (Y.replace({'':np.nan})
        .stack()
        .reset_index(0)
        .set_index(0, append=True)
        .squeeze()
        .unstack()
        .rename_axis(None, axis=1)
        .notnull())

                A      B     C      D
2016-01-31   True  False  True  False
2016-02-29   True   True  True  False
2016-03-31  False   True  True   True

Another solution with pivot:

print (Y.replace({'':np.nan})
        .stack()
        .reset_index(name='a')
        .pivot(index='level_1', columns='a', values='level_0')
        .rename_axis(None, axis=1)
        .rename_axis(None)        
        .notnull())

                A      B     C      D
2016-01-31   True  False  True  False
2016-02-29   True   True  True  False
2016-03-31  False   True  True   True
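
As a possible alternative to the two solutions above (not part of the original answer), the same table can be sketched with melt and pd.crosstab. The names `dates`, `long` and the column labels `date` and `code` are made up for illustration. One assumed advantage is that crosstab counts repeated codes within a date, whereas pivot and unstack raise an error on duplicate (date, code) pairs.

```python
import pandas as pd

# Same toy data as in the question, with explicit dates.
dates = pd.to_datetime(['2016-01-31', '2016-02-29', '2016-03-31'])
codes = list('ABCD')
Y = pd.DataFrame([['A', 'A', 'B'], ['C', 'B', 'C'], ['', 'C', 'D']],
                 columns=dates)

# Long format: one row per cell of Y, holding its date and its code.
long = Y.melt(var_name='date', value_name='code')
# Drop the empty-string placeholders.
long = long[long['code'] != '']

# Count (date, code) pairs, then turn the counts into booleans.
Z = pd.crosstab(long['date'], long['code']).gt(0)
# Align to the full list of codes; X.index could be passed as index instead.
Z = Z.reindex(index=dates, columns=codes, fill_value=False)
print(Z)
```

Passing fill_value=False to reindex keeps the result as booleans, so no separate fillna step is needed in this sketch.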

Edit by comment:

Use reindex if the index values are unique, then fillna with False:

import pandas as pd
import numpy as np

# define X, Y and Z
idx=pd.date_range('2016-1-31',periods=5,freq='M')
codes = list('ABCD')
X = np.random.randn(5,4)
X = pd.DataFrame(X,columns=codes,index=idx)

Y = [['A','A','B'],['C','B','C'],['','C','D']]
Y = pd.DataFrame(Y,columns=idx[:3])
Z = pd.DataFrame(columns=X.columns, index=X.index)

print (X)
                   A         B         C         D
2016-01-31  0.810348 -0.737780 -0.523869 -0.585772
2016-02-29 -1.126655 -0.494999 -1.388351  0.460340
2016-03-31 -1.578155  0.950643 -1.699921  1.149540
2016-04-30 -2.320711  1.263740 -1.401714  0.090788
2016-05-31  1.218036  0.565395  0.172278  0.288698

print (Y)
  2016-01-31 2016-02-29 2016-03-31
0          A          A          B
1          C          B          C
2                     C          D

print (Z)
              A    B    C    D
2016-01-31  NaN  NaN  NaN  NaN
2016-02-29  NaN  NaN  NaN  NaN
2016-03-31  NaN  NaN  NaN  NaN
2016-04-30  NaN  NaN  NaN  NaN
2016-05-31  NaN  NaN  NaN  NaN

# wrap the chain in parentheses so it can span several lines
Y1 = (Y.replace({'':np.nan})
       .stack()
       .reset_index(name='a')
       .pivot(index='level_1', columns='a', values='level_0')
       .rename_axis(None, axis=1)
       .rename_axis(None)
       .notnull())
print (Y1)
                A      B     C      D
2016-01-31   True  False  True  False
2016-02-29   True   True  True  False
2016-03-31  False   True  True   True

print (Y1.reindex(X.index).fillna(False))
                A      B      C      D
2016-01-31   True  False   True  False
2016-02-29   True   True   True  False
2016-03-31  False   True   True   True
2016-04-30  False  False  False  False
2016-05-31  False  False  False  False
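
To gain some confidence that the vectorized result agrees with the nested loop from the question, a check along these lines could be used. This is only a sketch: the names `Z_loop` and `Z_vec` are made up here, the explicit dropna() is added on the assumption that some pandas versions keep NaN rows after stack, and reindex with fill_value=False stands in for reindex followed by fillna(False).

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2016-01-31', '2016-02-29', '2016-03-31',
                      '2016-04-30', '2016-05-31'])
codes = list('ABCD')
Y = pd.DataFrame([['A', 'A', 'B'], ['C', 'B', 'C'], ['', 'C', 'D']],
                 columns=idx[:3])

# Reference: the nested loop from the question; dates absent from Y stay False.
Z_loop = pd.DataFrame(False, index=idx, columns=codes)
for r in Y:
    for c in Z_loop:
        Z_loop.loc[r, c] = c in Y[r].to_numpy()

# Vectorized: the pivot chain from the answer, then align to the full index.
Y1 = (Y.replace({'': np.nan})
       .stack()
       .dropna()  # assumption: guards against versions where stack keeps NaN
       .reset_index(name='a')
       .pivot(index='level_1', columns='a', values='level_0')
       .rename_axis(None, axis=1)
       .rename_axis(None)
       .notnull())
Z_vec = Y1.reindex(index=idx, columns=codes, fill_value=False)

# Compare the two results cell by cell.
print((Z_vec.to_numpy() == Z_loop.to_numpy()).all())
```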
