Pandas

Question

I have a DataFrame from a source where the names are repeated back to back, with no delimiter.

Example:

In [1] 
import pandas as pd

data = {"Names": ["JakeJake", "ThomasThomas", "HarryHarry"],
        "Scores": [70, 81, 23]}
df = pd.DataFrame(data)

Out [1]

    Names       Scores
0   JakeJake        70
1   ThomasThomas    81
2   HarryHarry      23

I would like a way to keep only the first half of each value in the 'Names' column. My initial thought was to do the following:

In [2]
df["N"] = df["Names"].str.len()//2
df["X"] = df["Names"].str[:df["N"]]

But this gives the output

Out [2]

    Names           Scores  N    X
0   JakeJake            70  4  NaN
1   ThomasThomas        81  6  NaN
2   HarryHarry          23  5  NaN

The desired output would be

Out [2]

    Names           Scores  N       X
0   JakeJake            70  4    Jake
1   ThomasThomas        81  6  Thomas
2   HarryHarry          23  5   Harry

I'm sure the answer is simple, but I can't get my head around it. Cheers

Answer 1

You can use apply on the Names column and take just the part of each string that you need.

>>> df.assign(x=df['Names'].apply(lambda x: x[:len(x)//2]))

          Names  Scores       x
0      JakeJake      70    Jake
1  ThomasThomas      81  Thomas
2    HarryHarry      23   Harry

Answer 2

Use a regex to extract the name and str.len to get its length:

df["X"] = df.Names.str.extract(r"^(.+)\1$", expand=False)
df["N"] = df.X.str.len()

where the regex looks for any full match that is repeated twice (\1 refers to the first capture group in the regex).

>>> df
          Names  Scores       X  N
0      JakeJake      70    Jake  4
1  ThomasThomas      81  Thomas  6
2    HarryHarry      23   Harry  5
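As a rough illustration of how the backreference behaves, the sketch below runs the same pattern on a small Series. The value "JakeJohn" is a made-up example (it is not in the question's data), added to show that a string which is not an exact doubling yields a missing value instead of a name.

```python
import pandas as pd

# "JakeJohn" is a hypothetical extra value, not part of the original data.
names = pd.Series(["JakeJake", "ThomasThomas", "JakeJohn"])

# (.+) captures one or more characters, \1 requires that same text again,
# and ^...$ forces the two copies to span the whole string.
first_half = names.str.extract(r"^(.+)\1$", expand=False)

print(first_half.tolist())
```

Because non-matching rows come back as missing, this approach also works as a check that every name really is duplicated.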

Answer 3

Using a regex to split on camel case, we can set a rule to split before any uppercase letter that immediately follows a lowercase letter

n = df['Names'].str.split(r'(?<=[a-z])(?=[A-Z])', expand=True)[0]
df['N'], df['X'] = n, n.str.len()

print(df)

          Names  Scores       N  X
0      JakeJake      70    Jake  4
1  ThomasThomas      81  Thomas  6
2    HarryHarry      23   Harry  5
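One thing to keep in mind with the camel-case split is that it relies on each name containing a single capital letter. The sketch below compares it with plain halving on a made-up name, "McDonaldMcDonald" (an assumption for illustration, not from the question), which has a capital in the middle.

```python
import pandas as pd

# "McDonaldMcDonald" is a hypothetical value used to show the edge case.
names = pd.Series(["HarryHarry", "McDonaldMcDonald"])

# Split at every lowercase-to-uppercase boundary and keep the first piece.
first_piece = names.str.split(r"(?<=[a-z])(?=[A-Z])", expand=True)[0]

# Halving by length does not depend on capitalisation.
halved = names.map(lambda s: s[: len(s) // 2])

print(first_piece.tolist())  # ['Harry', 'Mc']
print(halved.tolist())       # ['Harry', 'McDonald']
```

If the names may contain internal capitals, the length-based approaches in the other answers are probably the safer choice.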

Answer 4

You can use .map() on the Names column, like this:

df['X'] = df['Names'].map(lambda x: x[:len(x)//2])

Result:

print(df)

          Names  Scores       X
0      JakeJake      70    Jake
1  ThomasThomas      81  Thomas
2    HarryHarry      23   Harry
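If the N column from the question should be kept and used as the per-row length, one possible sketch is to pair each name with its own N in a list comprehension. This assumes the `.str[...]` accessor expects one slice shared by all rows, which would explain why passing a whole column as the bound produced NaN in the question.

```python
import pandas as pd

df = pd.DataFrame({"Names": ["JakeJake", "ThomasThomas", "HarryHarry"],
                   "Scores": [70, 81, 23]})

# Half the length of each name, as in the question.
df["N"] = df["Names"].str.len() // 2

# Slice each name with its own row's N instead of one shared bound.
df["X"] = [name[:n] for name, n in zip(df["Names"], df["N"])]

print(df)
```

This keeps N as a real column, so it also works when N comes from somewhere other than the string length.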

Pandas - Keep the first n characters where n is defined in the column N

python

split

delimiter

dataframe