Split words in a DataFrame column by space into rows while duplicating the other columns (Python, pandas)
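I have a df consisting of 5 columns, one of which is a comments column where people leave comments.
What I want to do is split that comments column by space into multiple rows, while duplicating the other columns: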
df:
r_id | start | comments |
---|---|---|
1 | 2021-01-01 | i am the text that needs splitting by space to rows |
2 | 2021-01-02 | hello hello |
Desired result:
r_id | start | comments |
---|---|---|
1 | 2021-01-01 | i |
1 | 2021-01-01 | am |
1 | 2021-01-01 | the |
1 | 2021-01-01 | text |
2 | 2021-01-02 | hello |
2 | 2021-01-02 | hello |
I have tried everything from str.split() to regular expressions, without success.
-- The code is:
df = df.apply(lambda x: x.str.lower() if x.dtype == "object" else x)
(df
.assign(comments=df['comments'].str.split())
.explode('comments')
)
print(df)
df['comments'] = df['comments'].str.replace('ă','a')
df['comments'] = df['comments'].str.replace('â','a')
df['comments'] = df['comments'].str.replace('î','i')
df['comments'] = df['comments'].str.replace('ș','s')
df['comments'] = df['comments'].str.replace('ț','t')
df.replace('[^a-zA-Z0-9]', ' ',regex=True)
df.dropna(inplace=True)
print(df)
But it does not split the comments.
df2 = (df
.assign(comments=df['comments'].str.split())
.explode('comments')
)
Output:
r_id start comments
0 1 2021-01-01 i
0 1 2021-01-01 am
0 1 2021-01-01 the
0 1 2021-01-01 text
0 1 2021-01-01 that
0 1 2021-01-01 needs
0 1 2021-01-01 splitting
0 1 2021-01-01 by
0 1 2021-01-01 space
0 1 2021-01-01 to
0 1 2021-01-01 rows
1 2 2021-01-02 hello
1 2 2021-01-02 hello
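One way to read the difference between the two snippets above: `assign(...).explode(...)` returns a new DataFrame and leaves `df` untouched, so the chain in the question computes the split rows and then discards them before `print(df)`. The same applies to `df.replace(...)` when its result is not assigned. Below is a minimal sketch of the whole flow, assuming the sample data from the question; the names `accent_table` and `cleaned`, and the use of `str.translate` in place of the five separate `str.replace` calls, are my own choices for illustration, not something from the original post.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "r_id": [1, 2],
    "start": ["2021-01-01", "2021-01-02"],
    "comments": [
        "i am the text that needs splitting by space to rows",
        "hello hello",
    ],
})

# Hypothetical helper table: maps the Romanian letters from the question
# to their unaccented ASCII counterparts in a single pass.
accent_table = str.maketrans("ăâîșț", "aaist")

# Clean only the comments column and keep the result.
cleaned = (
    df["comments"]
    .str.lower()
    .str.translate(accent_table)
    .str.replace(r"[^a-zA-Z0-9]", " ", regex=True)
)

# Split on whitespace, then explode the lists into one row per word.
# The result must be assigned; df itself is not modified.
df2 = (
    df
    .assign(comments=cleaned.str.split())
    .explode("comments")
    .reset_index(drop=True)  # optional: replace the repeated 0/1 index
)

print(df.shape)   # original frame still has one row per comment
print(df2.shape)  # one row per word
print(df2)
```

The `reset_index(drop=True)` call is optional; without it the exploded rows keep the index of the row they came from, which is why the output above shows 0 and 1 repeated.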