How to rearrange rows in Pandas based on string-matching conditions on an unnamed column?
We have a pandas DataFrame like the one below.
Unnamed:0 T1 T2 T3 ... T120
0 cheetah Running x1 x2 x1 ... x3
1 Running Jaguar x1 x10 x3 ... x7
2 Cougar Running x1 x2 x1 ... x3
3 Bengal Tiger Running x5 x2 x4 ... x17
4 Sleeping Bali Tiger x55 x61 x11 ... x31
5 Javan Leopard Sleeping x42 x67 x17 ... x34
6 Leopard Running x2 x5 x2 ... x3
7 Bengal Tiger Running x5 x2 x4 ... x17
.. ... ... ... ... ... ...
199 Florida Panther Eating x71 x80 x101 ... x94
200 Running Eastern Cougar x5 x1 x2 ... x3
201 Congo Lion Sleeping x57 x61 x14 ... x38
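We are trying to restructure this DataFrame as shown below. In the DataFrame above, the first column is unnamed. We look for known common behaviours such as "Running", "Sleeping", etc. in that unnamed column and try to rearrange the DataFrame as follows.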
Unnamed:0 T1 T2 T3 ... T120
0 cheetah Running x1 x2 x1 ... x3
1 Running Jaguar x1 x10 x3 ... x7
2 Cougar Running x1 x2 x1 ... x3
3 Bengal Tiger Running x5 x2 x4 ... x17
4 Running Eastern Cougar x5 x1 x2 ... x3
5 Bengal Tiger Running x5 x2 x4 ... x17
6 Leopard Running x2 x5 x2 ... x3
4 Sleeping Bali Tiger x55 x61 x11 ... x31
5 Javan Leopard Sleeping x42 x67 x17 ... x34
6 Congo Lion Sleeping x57 x61 x14 ... x38
.. ... ... ... ... ... ...
201 Florida Panther Eating x71 x80 x101 ... x94
I tried the approach below, but only after adding a name to that column. I also tried df[df.columns.str.contains('^Unnamed')], but it did not work.
import pandas as pd
df = pd.read_csv('a_behav_cat.csv')
df_new = pd.DataFrame()
df_new = df_new.append(df[df["name"].str.contains("Running")])
df_new = df_new.append(df[df["name"].str.contains("Sleeping")])
print(df_new.to_string())
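Question 1:
I think there should be a better, more Pythonic way to do this. I would appreciate your thoughts on it. My approach also matches the string exactly, including case, which is not ideal because the dataset may contain plain "running" or plain "sleeping", etc. I tried the .lower() function, but it did not work.
Purpose:
The purpose is to determine how many distinct x categories there are for each observation. Here T1, T2, T3, ... T120 are the observations. We need to determine how many values are shared within each observation, i.e. for T1 and "Running" there are 3 'x1', 3 'x5' and 1 'x2'.
To do this, we first thought of rearranging the DataFrame as described above.
However, we are not sure whether this rearrangement is needed to achieve the goal. Moreover, the output appears to be multi-dimensional: for T1 and "Running" we need to store how many x1, x3, x5 there are, and the same has to be done for the other behaviours, such as "Eating", "Sleeping", etc.
Question 2:
What is the best way to achieve this? Is there a data structure suited to this purpose? Is there a better way to do this without rearranging the DataFrame?
If you want to test, here is a sample CSV.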
,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
IIUC, you can use a dictionary mapping to assign the right category to each row:
# Your list of patterns
MAPPING = {'S': ['sleep', 'sleeping'],
           'R': ['run', 'running'],
           'E': ['eat', 'eating']}
# Reverse the mapping (swap keys and values)
rev = {v: k for k, l in MAPPING.items() for v in l}
# Create the regex pattern
pat = fr"\b({'|'.join(rev)})\b"
# Extract from data
df['CAT'] = df['Unnamed: 0'].str.lower().str.extract(pat, expand=False).map(rev)
Output:
>>> df
Unnamed: 0 T1 T2 T3 T4 CAT
0 cheetah Running x1 x2 x1 x3 R
1 Running Jaguar x1 x10 x3 x7 R
2 Cougar Running x1 x2 x1 x3 R
3 Bengal Tiger Running x5 x2 x4 x17 R
4 Sleeping Bali Tiger x55 x61 x11 x31 S
5 Javan Leopard Sleeping x42 x67 x17 x34 S
6 Leopard Running x2 x5 x2 x3 R
7 Bengal Tiger Running x5 x2 x4 x17 R
8 Florida Panther Eating x71 x80 x101 x94 E
9 Running Eastern Cougar x5 x1 x2 x3 R
10 Congo Lion Sleeping x57 x61 x14 x38 S
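Once the CAT column exists, both the rearrangement and the per-behaviour counts from the question can be derived from it. The following is only a sketch built on the sample CSV: the names `order`, `obs_cols`, `counts` and `table` are made up for illustration, the R, S, E ordering is an assumption, and so is the rule that every observation column starts with "T".

```python
import io
import pandas as pd

# Sample CSV from the question; the empty first header field is read
# by pandas as the column "Unnamed: 0".
CSV = """,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38
"""
df = pd.read_csv(io.StringIO(CSV))

# Same category extraction as above.
MAPPING = {'S': ['sleep', 'sleeping'],
           'R': ['run', 'running'],
           'E': ['eat', 'eating']}
rev = {v: k for k, l in MAPPING.items() for v in l}
pat = fr"\b({'|'.join(rev)})\b"
df['CAT'] = df['Unnamed: 0'].str.lower().str.extract(pat, expand=False).map(rev)

# Rearranged view: group rows by behaviour in an assumed order (R, S, E).
# A stable sort keeps the original row order inside each group.
order = ['R', 'S', 'E']
df_sorted = (
    df.assign(CAT=pd.Categorical(df['CAT'], categories=order, ordered=True))
      .sort_values('CAT', kind='mergesort')
      .reset_index(drop=True)
)

# Counts per behaviour, per observation column, per value.
# Assumption: observation columns are exactly those starting with "T".
obs_cols = [c for c in df.columns if c.startswith('T')]
counts = (
    df.melt(id_vars='CAT', value_vars=obs_cols,
            var_name='obs', value_name='value')
      .groupby(['CAT', 'obs', 'value'])
      .size()
)

# Wide layout: one row per (behaviour, observation), one column per value.
table = counts.unstack('value', fill_value=0)

# For "Running" in T1: x1 -> 3, x2 -> 1, x5 -> 3
print(counts.loc[('R', 'T1')])
```

The counting step works on the unsorted `df`, so the rearrangement is not required for the stated purpose; `df_sorted` is there only if the grouped view itself is wanted. A Series with a (CAT, obs, value) MultiIndex, or its unstacked form `table`, is one possible answer to the "multi-dimensional" storage concern.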