How to explode a list of strings while keeping the other columns?
Consider this simple example
import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
'col2' : ['A','B','C'],
'paragraph': ['sentence one. sentence two',
'sentence three. and sentence four',
'crazy sentence!! and the final one.']})
df
Out[11]:
col1 col2 paragraph
0 1 A sentence one. sentence two
1 2 B sentence three. and sentence four
2 3 C crazy sentence!! and the final one.
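I want to split the paragraph into sentences (preferably using spacy), but I need to keep the information in the other columns.
I know how to explode the column and (naively) split on '.':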
df.paragraph.str.split('.').explode()
Out[10]:
0 sentence one
0 sentence two
1 sentence three
1 and sentence four
2 crazy sentence!! and the final one
2
Name: paragraph, dtype: object
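But this loses the information in col1 and col2 (which should be kept and repeated in the sentence-level dataframe), and it does not correctly split the sentence containing exclamation marks.
Using spaCy and nlp(paragraph).sents would still lose the two columns.
What can I do?
Thanks!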
Do it in two steps:
df['paragraph'] = df['paragraph'].str.split('.')
df.explode('paragraph')
Output:
col1 col2 paragraph
0 1 A sentence one
0 1 A sentence two
1 2 B sentence three
1 2 B and sentence four
2 3 C crazy sentence!! and the final one
2 3 C
To split on '.' or '!':
df['paragraph'] = df['paragraph'].str.split('[.!]+')
df.explode('paragraph')
col1 col2 paragraph
0 1 A sentence one
0 1 A sentence two
1 2 B sentence three
1 2 B and sentence four
2 3 C crazy sentence
2 3 C and the final one
2 3 C
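The output above still contains an empty row and fragments with leading spaces, because the split keeps whatever follows the last delimiter. A minimal sketch of one way to clean that up, assuming you want a new dataframe rather than overwriting df['paragraph'] (the name sentences is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': ['A', 'B', 'C'],
    'paragraph': ['sentence one. sentence two',
                  'sentence three. and sentence four',
                  'crazy sentence!! and the final one.'],
})

# Split on runs of '.' or '!' (a multi-character pattern is treated as a regex),
# then explode so each fragment gets its own row; col1 and col2 are repeated.
sentences = (
    df.assign(paragraph=df['paragraph'].str.split('[.!]+'))
      .explode('paragraph')
)

# Trim surrounding whitespace and drop the empty fragments left after a
# trailing delimiter, then renumber the rows.
sentences['paragraph'] = sentences['paragraph'].str.strip()
sentences = sentences[sentences['paragraph'] != ''].reset_index(drop=True)

print(sentences)
```

Using assign leaves the original df untouched, so the full paragraph text is still available if it is needed later.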
Define your exploded column as a separate temporary dataframe and join it with the main dataframe.
import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
'col2' : ['A','B','C'],
'paragraph': ['sentence one. sentence two',
'sentence three. and sentence four',
'crazy sentence!! and the final one.']})
alfa = pd.DataFrame(df.paragraph.str.split('.').explode())
alfa.rename(columns={"paragraph":"paragraphSplit"},inplace=True)
alfa.join(df)
paragraphSplit | col1 | col2 | paragraph |
---|---|---|---|
sentence one | 1 | A | sentence one. sentence two |
sentence two | 1 | A | sentence one. sentence two |
sentence three | 2 | B | sentence three. and sentence four |
and sentence four | 2 | B | sentence three. and sentence four |
crazy sentence!! and the final one | 3 | C | crazy sentence!! and the final one. |
 | 3 | C | crazy sentence!! and the final one. |
Hope this is what you are looking for.
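For comparison, here is a sketch of what appears to be an equivalent result without the temporary dataframe: the split lists go into a new column (named paragraphSplit to match the answer above) and only that column is exploded. The variable name result is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': ['A', 'B', 'C'],
    'paragraph': ['sentence one. sentence two',
                  'sentence three. and sentence four',
                  'crazy sentence!! and the final one.'],
})

# Store the split lists in a new column so the original 'paragraph' text is kept,
# then explode only that column; every other column is repeated for each fragment.
result = (
    df.assign(paragraphSplit=df['paragraph'].str.split('.'))
      .explode('paragraphSplit')
)

print(result)
```

As with the join, the original index (0, 0, 1, 1, 2, 2) is preserved, which makes it easy to trace each sentence back to its source row.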
If you prefer to have SpaCy split the text into sentences, use
import spacy
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
def split_in_sentences(text):
    return [sent.text.strip() for sent in nlp(text).sents]
df['paragraph'] = df['paragraph'].apply(split_in_sentences)
>>> df.explode('paragraph')
col1 col2 paragraph
0 1 A sentence one.
0 1 A sentence two
1 2 B sentence three.
1 2 B and sentence four
2 3 C crazy sentence!!
2 3 C and the final one.
See the related SO post.
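If SpaCy is not available, a rough approximation for simple text like this example is to split on the whitespace that follows sentence-ending punctuation. This is a plain regex swapped in for the sentencizer, not SpaCy's algorithm, and it will mis-split abbreviations such as "e.g. this". The names SENTENCE_BOUNDARY and sentences are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': ['A', 'B', 'C'],
    'paragraph': ['sentence one. sentence two',
                  'sentence three. and sentence four',
                  'crazy sentence!! and the final one.'],
})

# Split at whitespace that comes right after '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to the sentence it ends.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

sentences = (
    df.assign(paragraph=df['paragraph'].map(SENTENCE_BOUNDARY.split))
      .explode('paragraph')
)

print(sentences)
```

On this sample data the fragments match the SpaCy output shown above, but for real prose the sentencizer is likely to be the more reliable choice.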