Dynamically Splitting a DataFrame Column and Storing the Result as a New Column

Question

I am trying to split a column and store the part after the last "_" as a new column.

import pandas as pd
import numpy as np
names = ['John', 'Jane', 'Brian', 'Suzan', 'John']
expertise = ['primary_chat', 'follow_email', 'repeat_chat', 'primary_video_chat', 'tech_chat']

data = list(zip(names, expertise))
df = pd.DataFrame(data, columns=['Name', 'Communication'])
df

Output

    Name       Communication
0   John        primary_chat
1   Jane        follow_email
2  Brian         repeat_chat
3  Suzan  primary_video_chat
4   John           tech_chat

When I add a new column by splitting the existing one:

df['Platform'] = df['Communication'].str.split('_', expand=True)[1]
df

Output

    Name       Communication Platform
0   John        primary_chat     chat
1   Jane        follow_email    email
2  Brian         repeat_chat     chat
3  Suzan  primary_video_chat    video
4   John           tech_chat     chat

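The problem is that [1] always takes the second piece of the split. That is fine when the value contains only one "_", because the second piece is the one we need. But when a value contains two "_" characters, as in row 3 (Suzan), [1] returns "video" instead of "chat"; that row needs index [2].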

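We could count the "_" characters dynamically and use that value. However, although the code below outputs the correct counts, I get an error when I use it as the index inside [].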

df['Communication'].str.count('_')

0    1
1    1
2    1
3    2
4    1
Name: Communication, dtype: int64

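This gives me the correct number of "_" characters. But when I try to use it in the earlier line of code, where I call split() and create the new column, I get an error: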

df['Platform'] = df['Communication'].str.split('_', expand=True)[df['Communication'].str.count('_')]

This line fails with an error.

Maybe I should try apply() with a lambda, but I would like to know whether there is a way to make this approach work.
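
For reference, here is a minimal sketch of how the count-based idea could be made to work. As far as I can tell, [] on a DataFrame treats the values of a Series as column labels, so the attempt above selects whole columns rather than one element per row. The sketch instead pairs each row number with that row's own position using NumPy integer indexing. The names parts, last_pos and rows are illustrative only and do not come from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Brian', 'Suzan', 'John'],
    'Communication': ['primary_chat', 'follow_email', 'repeat_chat',
                      'primary_video_chat', 'tech_chat'],
})

# expand=True yields a DataFrame with integer column labels 0, 1, 2;
# rows with fewer pieces are padded with missing values.
parts = df['Communication'].str.split('_', expand=True)

# The number of "_" in a row equals the position of that row's last piece.
last_pos = df['Communication'].str.count('_').to_numpy()

# Pick one element per row: (row 0, last_pos[0]), (row 1, last_pos[1]), ...
rows = np.arange(len(df))
df['Platform'] = parts.to_numpy()[rows, last_pos]
print(df)
```

This keeps the original idea intact, but it does more work than needed when only the last piece is wanted.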

Answer 1

You can use a regular expression that captures the run of characters other than _ at the end of the string (the end is denoted by $):

df['Platform'] = df['Communication'].str.extract(r'([^_]+)$', expand=False)
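
A small self-contained sketch of the same pattern, run on a few sample strings, may make its behavior easier to see. The Series s and the value without any "_" are assumptions added for illustration; expand=False is used so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample values, including one with two "_" and one with none.
s = pd.Series(['primary_chat', 'follow_email', 'primary_video_chat', 'chat'])

# [^_]+ matches a run of non-underscore characters; $ anchors it to the end
# of the string, so only the final piece is captured.
platform = s.str.extract(r'([^_]+)$', expand=False)
print(platform.tolist())
```

Because the pattern is anchored at the end of the string, the number of "_" characters in the value does not matter.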

Answer 2

You can use str.rsplit and limit the number of splits to 1:

df['Platform'] = df['Communication'].str.rsplit('_', n=1).str[1]
print(df)

# Output
    Name       Communication Platform
0   John        primary_chat     chat
1   Jane        follow_email    email
2  Brian         repeat_chat     chat
3  Suzan  primary_video_chat     chat
4   John           tech_chat     chat
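
One edge case may be worth checking: a value that contains no "_" at all. The sketch below, on assumed sample data, compares .str[1] with .str[-1] after the same rsplit call, and also shows the apply()/lambda variant mentioned in the question. The variable names are illustrative only.

```python
import pandas as pd

# Assumed sample data; the last value has no "_" to split on.
s = pd.Series(['primary_video_chat', 'tech_chat', 'chat'])

# Splitting from the right at most once gives lists of one or two strings.
pieces = s.str.rsplit('_', n=1)

by_one = pieces.str[1]    # missing (NaN) when the list has a single element
by_last = pieces.str[-1]  # the final piece, whatever the list length

# Equivalent per-element version using plain Python str.rsplit.
via_apply = s.apply(lambda value: value.rsplit('_', 1)[-1])

print(by_one.tolist())
print(by_last.tolist())
print(via_apply.tolist())
```

If every value is guaranteed to contain at least one "_", as in the question's data, .str[1] and .str[-1] give the same result.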
