Pandas: assigning a random string to each group as a new column
We have a dataframe like this:
Out[90]:
customer_id created_at
0 11492288 2017-03-15 10:20:18.280437
1 8953727 2017-03-16 12:51:00.145629
2 11492288 2017-03-15 10:20:18.284974
3 11473213 2017-03-09 14:15:22.712369
4 9526296 2017-03-14 18:56:04.665410
5 9526296 2017-03-14 18:56:04.662082
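I want to create a new column here that assigns one random 8-character string to each customer_id group, so every row with the same customer_id gets the same string.
For example, the output would look like: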
Out[90]:
customer_id created_at code
0 11492288 2017-03-15 10:20:18.280437 nKAILfyV
1 8953727 2017-03-16 12:51:00.145629 785Vsw0b
2 11492288 2017-03-15 10:20:18.284974 nKAILfyV
3 11473213 2017-03-09 14:15:22.712369 dk6JXq3u
4 9526296 2017-03-14 18:56:04.665410 1WESdAsD
5 9526296 2017-03-14 18:56:04.662082 1WESdAsD
I'm used to R and dplyr, where this transformation is very easy to write. I'm looking for something similar in Pandas:
library(dplyr)
library(stringi)
df %>%
group_by(customer_id) %>%
mutate(code = stri_rand_strings(1, 8))
I can figure out the random-string part myself. I'm just curious how Pandas groupby works in this case.
Thanks!
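import random
from string import ascii_letters, digits

chars = ascii_letters + digits
# Called once per group; the returned string is broadcast to every row in that group.
choose = lambda x, k=8: ''.join(random.choices(chars, k=k))
# Select a single column so transform returns a Series rather than a DataFrame.
df.assign(code=df.groupby('customer_id')['customer_id'].transform(choose))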
customer_id created_at code
0 11492288 2017-03-15 10:20:18.280437 S5HtmbeN
1 8953727 2017-03-16 12:51:00.145629 MMfFFn8U
2 11492288 2017-03-15 10:20:18.284974 S5HtmbeN
3 11473213 2017-03-09 14:15:22.712369 4VsKmDZ5
4 9526296 2017-03-14 18:56:04.665410 VhQfu2Rf
5 9526296 2017-03-14 18:56:04.662082 VhQfu2Rf
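For anyone who wants to run this end to end, here is a minimal self-contained sketch of the transform approach. The sample dataframe is rebuilt by hand from the question (with the timestamps shortened), and the names random_code, result and codes_per_customer are only illustrative. It assumes Python 3.6+ for random.choices.

```python
import random
from string import ascii_letters, digits

import pandas as pd

# Sample data shaped like the question's dataframe (timestamps shortened).
df = pd.DataFrame({
    "customer_id": [11492288, 8953727, 11492288, 11473213, 9526296, 9526296],
    "created_at": pd.to_datetime([
        "2017-03-15 10:20:18", "2017-03-16 12:51:00", "2017-03-15 10:20:18",
        "2017-03-09 14:15:22", "2017-03-14 18:56:04", "2017-03-14 18:56:04",
    ]),
})

chars = ascii_letters + digits


def random_code(group, k=8):
    # Runs once per group; the scalar it returns is broadcast to all rows of the group.
    return "".join(random.choices(chars, k=k))


# Transform a single column so the function is applied once per customer_id.
result = df.assign(code=df.groupby("customer_id")["customer_id"].transform(random_code))

# Sanity check: every customer_id should carry exactly one distinct code.
codes_per_customer = result.groupby("customer_id")["code"].nunique()
print(result)
```

Because the strings are random, the actual codes differ on every run unless you call random.seed first.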
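Inspired by @Wen's use of pd.util.testing.rands_array (note that pd.util.testing was deprecated in pandas 1.0 and is no longer available in pandas 2.x, so this needs an older pandas):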
f, u = pd.factorize(df.customer_id.values)
df.assign(code=pd.util.testing.rands_array(8, u.size)[f])
customer_id created_at code
0 11492288 2017-03-15 10:20:18.280437 tSuQbTBm
1 8953727 2017-03-16 12:51:00.145629 qmCl6NEX
2 11492288 2017-03-15 10:20:18.284974 tSuQbTBm
3 11473213 2017-03-09 14:15:22.712369 Wsa3lNxh
4 9526296 2017-03-14 18:56:04.665410 jBfXS2Nk
5 9526296 2017-03-14 18:56:04.662082 jBfXS2Nk
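The factorize trick works because pd.factorize returns an integer label per row plus the array of unique values, so generating one string per unique value and indexing by the labels repeats each string across its group. Below is a rough sketch of the same idea that does not depend on pd.util.testing; it swaps rands_array for random.choices, which is my substitution rather than part of the original answer, and the sample dataframe is rebuilt by hand.

```python
import random
from string import ascii_letters, digits

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"customer_id": [11492288, 8953727, 11492288, 11473213, 9526296, 9526296]}
)

# f: one integer label per row, numbered in order of first appearance.
# u: the unique customer_id values, in that same order.
f, u = pd.factorize(df["customer_id"])

chars = ascii_letters + digits
# One random 8-character string per unique customer_id.
codes = np.array(["".join(random.choices(chars, k=8)) for _ in range(len(u))])

# Indexing the codes array with f repeats each code for every row of its group.
result = df.assign(code=codes[f])
print(result)
```

Nothing here guarantees that two different groups receive different strings. With 62 possible characters in 8 positions a collision is very unlikely, but if uniqueness matters you would need to check for it explicitly.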
In pandas, the equivalent of R's mutate is transform:
df['code']=df.groupby('customer_id').transform(lambda x:pd.util.testing.rands_array(8,1))
df
Out[314]:
customer_id created_at code
0 11492288 2017-03-15 L6Odf65d
1 8953727 2017-03-16 fwLpgLnt
2 11492288 2017-03-15 L6Odf65d
3 11473213 2017-03-09 AuSUPnJ9
4 9526296 2017-03-14 U1AiLyx0
5 9526296 2017-03-14 U1AiLyx0
Edit (from cᴏʟᴅsᴘᴇᴇᴅ): df.groupby('customer_id').customer_id.transform(lambda x: pd.util.testing.rands_array(8, 1))
There is also an improvement you can make to your R code:
Match = data.frame(A = unique(df$customer_id), B = replicate(length(unique(df$customer_id)), stri_rand_strings(1, 8)))
df$Code = Match$B[match(df$customer_id, Match$A)]
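The same lookup-table idea from the R snippet can be written in pandas with Series.map. This is only a sketch of one possible translation, not something from the answers above: the dictionary name lookup is made up for illustration, and the sample dataframe is rebuilt by hand from the question.

```python
import random
from string import ascii_letters, digits

import pandas as pd

df = pd.DataFrame(
    {"customer_id": [11492288, 8953727, 11492288, 11473213, 9526296, 9526296]}
)

chars = ascii_letters + digits

# Lookup table with one random code per unique customer_id
# (this plays the role of the Match data frame in the R code).
lookup = {
    cid: "".join(random.choices(chars, k=8))
    for cid in df["customer_id"].unique()
}

# map plays the role of match(): each row looks up its customer_id in the table.
df["code"] = df["customer_id"].map(lookup)
print(df)
```

Keeping the lookup table around is handy if the same codes have to be applied to another dataframe later.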