How to rearrange rows in Pandas based on string-matching conditions on an unnamed column?

We have a pandas DataFrame as follows.

   Unnamed:0               T1    T2    T3   ...  T120
 0  cheetah Running         x1    x2    x1   ...   x3
 1  Running Jaguar          x1    x10   x3   ...   x7
 2  Cougar Running          x1    x2    x1   ...   x3
 3  Bengal Tiger Running    x5    x2    x4   ...   x17
 4  Sleeping Bali Tiger     x55   x61   x11  ...   x31
 5  Javan Leopard Sleeping  x42   x67   x17  ...   x34
 6  Leopard Running         x2    x5    x2   ...   x3
 7  Bengal Tiger Running    x5    x2    x4   ...   x17
..      ...                ...   ...   ...  ...   ...
199 Florida Panther Eating  x71   x80   x101 ...   x94
200 Running Eastern Cougar  x5    x1     x2  ...   x3
201 Congo Lion Sleeping     x57   x61    x14  ...  x38

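We are trying to reorganize this DataFrame as shown below. In the DataFrame above, the first column is unnamed. We check that unnamed column for known common behaviors such as "Running", "Sleeping", etc., and try to rearrange the DataFrame as follows.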

        Unnamed:0               T1    T2    T3   ...  T120
     0  cheetah Running         x1    x2    x1   ...   x3
     1  Running Jaguar          x1    x10   x3   ...   x7
     2  Cougar Running          x1    x2    x1   ...   x3
     3  Bengal Tiger Running    x5    x2    x4   ...   x17
     4  Running Eastern Cougar  x5    x1     x2  ...   x3
     5  Bengal Tiger Running    x5    x2    x4   ...   x17
     6  Leopard Running         x2    x5    x2   ...   x3
     4  Sleeping Bali Tiger     x55   x61   x11  ...   x31
     5  Javan Leopard Sleeping  x42   x67   x17  ...   x34
     6  Congo Lion Sleeping     x57   x61    x14  ...  x38  
     ..      ...                ...   ...   ...  ...   ...
    201 Florida Panther Eating  x71   x80   x101 ...   x94 
    

I tried the following approach, but I had to add a name to that column first. I also tried df[df.columns.str.contains('^Unnamed')], but that did not work.

import pandas as pd

df = pd.read_csv('a_behav_cat.csv')

df_new = pd.DataFrame()
df_new = df_new.append(df[df["name"].str.contains("Running")])
df_new = df_new.append(df[df["name"].str.contains("Sleeping")])
print(df_new.to_string())

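Question 1: I think there should be a better, more Pythonic way to do this. Thanks for considering it. The code above also checks for an exact, case-sensitive match, which is not ideal because the dataset may contain plain "running" and plain "sleeping" :) etc. I tried the .lower() function, but it did not help.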

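Purpose: The goal is to determine how many distinct x categories occur for a single observation. Here T1, T2, T3, ... T120 are the observations. We need to determine how many times each value occurs per observation and behavior. For example, for T1 and "Running" there are 3 'x1', 3 'x5' and 1 'x2'.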

To do this, we first thought of rearranging the DataFrame as described above.

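However, we are not sure whether this rearrangement is needed to achieve the goal. The output also looks multidimensional: for T1 and "Running", we need to store how many x1, x3, x5 there are. The same then has to be done for the other behaviors, such as "Eating", "Sleeping", etc.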

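Question 2: What is the best way to achieve this? Is there a data structure suited to this purpose? Is there a better way to achieve the above without rearranging the DataFrame?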

If you want to test, here is a sample csv.

,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
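
A minimal sketch for loading this sample, assuming it is pasted into a string (the `CSV_TEXT` name is hypothetical) rather than read from a_behav_cat.csv. pandas labels a blank header cell `Unnamed: 0`, with a space after the colon, so the column can be addressed by that label or by position through `df.columns[0]` without renaming it. The `case=False` filter and `pd.concat` are one possible replacement for the `append` calls in the attempt above, since `DataFrame.append` was removed in pandas 2.0.

```python
import io

import pandas as pd

# Assumption: the sample CSV from the question, inlined for a self-contained run.
CSV_TEXT = """,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# The blank header cell becomes the label 'Unnamed: 0'; take it by position.
first_col = df.columns[0]
names = df[first_col]

# case=False makes the substring test case-insensitive.
running = df[names.str.contains("running", case=False)]
sleeping = df[names.str.contains("sleeping", case=False)]

# Stack the two filtered frames; rows matching neither pattern are dropped.
df_new = pd.concat([running, sleeping])
print(df_new.to_string())
```

Note that `df.columns.str.contains('^Unnamed')` produces one boolean per column, so it is suited to selecting columns (for example with `df.loc[:, mask]`), not to filtering rows.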

IIUC, you can use a dictionary mapping to assign the right category to each row:

# Your list of patterns
MAPPING = {'S': ['sleep', 'sleeping'],
           'R': ['run', 'running'],
           'E': ['eat', 'eating']}

# Reverse the mapping (swap keys and values)
rev = {v: k for k, l in MAPPING.items() for v in l}

# Create the regex pattern
pat = fr"\b({'|'.join(rev)})\b"

# Extract from data
df['CAT'] = df['Unnamed: 0'].str.lower().str.extract(pat, expand=False).map(rev)

Output:

>>> df
                Unnamed: 0   T1   T2    T3   T4 CAT
0          cheetah Running   x1   x2    x1   x3   R
1           Running Jaguar   x1  x10    x3   x7   R
2           Cougar Running   x1   x2    x1   x3   R
3     Bengal Tiger Running   x5   x2    x4  x17   R
4      Sleeping Bali Tiger  x55  x61   x11  x31   S
5   Javan Leopard Sleeping  x42  x67   x17  x34   S
6          Leopard Running   x2   x5    x2   x3   R
7     Bengal Tiger Running   x5   x2    x4  x17   R
8   Florida Panther Eating  x71  x80  x101  x94   E
9   Running Eastern Cougar   x5   x1    x2   x3   R
10     Congo Lion Sleeping  x57  x61   x14  x38   S
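
Building on the `CAT` column, here is a hedged sketch of the two remaining steps from the question: grouping the rows by behavior with a stable sort, and counting each x value per behavior and observation without rearranging anything. The sample CSV is inlined so the block runs on its own; `CSV_TEXT`, `RANK`, `df_sorted`, `long_df` and `counts` are illustrative names, not part of the answer above.

```python
import io

import pandas as pd

# Assumption: the sample CSV from the question, inlined for a self-contained run.
CSV_TEXT = """,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Same category extraction as in the answer.
MAPPING = {'S': ['sleep', 'sleeping'],
           'R': ['run', 'running'],
           'E': ['eat', 'eating']}
rev = {v: k for k, l in MAPPING.items() for v in l}
pat = fr"\b({'|'.join(rev)})\b"
df['CAT'] = df['Unnamed: 0'].str.lower().str.extract(pat, expand=False).map(rev)

# Rearrangement: sort by a chosen behavior order (an assumption here).
# mergesort is stable, so rows keep their original order within a behavior.
RANK = {'R': 0, 'S': 1, 'E': 2}
df_sorted = (df.sort_values('CAT', key=lambda s: s.map(RANK), kind='mergesort')
               .reset_index(drop=True))

# Counting: reshape to one row per (animal, observation) pair, then count
# how often each value occurs per behavior and observation.
long_df = df.melt(id_vars=['Unnamed: 0', 'CAT'], var_name='obs', value_name='value')
counts = long_df.groupby(['CAT', 'obs', 'value']).size()

# Value counts for "Running" in observation T1.
print(counts.loc[('R', 'T1')])
```

The result is a Series with a three-level index (behavior, observation, value), which is one way to hold the "multidimensional" output the question describes. If a wide table is easier to read, `counts.unstack('value', fill_value=0)` gives one column per x value.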