How to explode a list of strings while keeping the other columns?

Question

Consider this simple example:

import pandas as pd

df = pd.DataFrame({'col1' : [1,2,3],
                   'col2' : ['A','B','C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

df
Out[11]: 
   col1 col2                            paragraph
0     1    A           sentence one. sentence two
1     2    B    sentence three. and sentence four
2     3    C  crazy sentence!! and the final one.

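I would like to split each paragraph into sentences (ideally using spaCy), but I need to keep the information in the other columns.

I know how to explode the column and (naively) split on '.':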

df.paragraph.str.split('.').explode()
Out[10]: 
0                          sentence one
0                          sentence two
1                        sentence three
1                     and sentence four
2    crazy sentence!! and the final one
2                                      
Name: paragraph, dtype: object

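But this loses the information in col1 and col2 (which should be kept and repeated in the sentence-level dataframe), and it does not correctly split the sentence that ends in exclamation marks.

Using spaCy with nlp(paragraph).sents still loses the two columns.

What can I do? Thanks!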

Answer 1

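Do it in two steps: split the strings into lists, then explode the list column. DataFrame.explode repeats the values of the other columns for every list element.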

df['paragraph'] = df['paragraph'].str.split('.')
df.explode('paragraph')

Output:

   col1 col2                           paragraph
0     1    A                        sentence one
0     1    A                        sentence two
1     2    B                      sentence three
1     2    B                   and sentence four
2     3    C  crazy sentence!! and the final one
2     3    C

To split on both . and ! instead (starting again from the original df, since the column now holds lists):

df['paragraph'] = df['paragraph'].str.split('[.!]+')
df.explode('paragraph')

   col1 col2           paragraph
0     1    A        sentence one
0     1    A        sentence two
1     2    B      sentence three
1     2    B   and sentence four
2     3    C      crazy sentence
2     3    C   and the final one
2     3    C
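
The outputs above still contain leading spaces and an empty last row for paragraphs that end in punctuation. A minimal sketch of one way to tidy that up, assuming you want to leave the original df untouched (the name `sentences` is only illustrative and not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

# Split on runs of '.' or '!' (a multi-character pattern is treated as a regex),
# then explode; assign() returns a copy, so df itself is not modified.
sentences = (
    df.assign(paragraph=df['paragraph'].str.split(r'[.!]+'))
      .explode('paragraph')
)

# Trim surrounding whitespace, then drop the empty strings
# left behind by trailing punctuation.
sentences['paragraph'] = sentences['paragraph'].str.strip()
sentences = sentences[sentences['paragraph'] != ''].reset_index(drop=True)

print(sentences)
```

Dropping reset_index(drop=True) keeps the original row labels (0, 0, 1, 1, 2, 2), which can be handy if you need to map sentences back to their source paragraph.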

Answer 2

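Build the exploded column as a separate, temporary dataframe and join it back to the main dataframe. The join works because explode keeps the original index.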

import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
               'col2' : ['A','B','C'],
               'paragraph': ['sentence one. sentence two',
                             'sentence three. and sentence four',
                             'crazy sentence!! and the final one.']})
alfa = pd.DataFrame(df.paragraph.str.split('.').explode())
alfa.rename(columns={"paragraph":"paragraphSplit"},inplace=True)
alfa.join(df)

paragraphSplit	col1	col2	paragraph
sentence one	1	A	sentence one. sentence two
sentence two	1	A	sentence one. sentence two
sentence three	2	B	sentence three. and sentence four
and sentence four	2	B	sentence three. and sentence four
crazy sentence!! and the final one	3	C	crazy sentence!! and the final one.
	3	C	crazy sentence!! and the final one.

Hope this is what you are looking for.
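
The same idea can probably be written without the intermediate dataframe or the in-place rename. This is only a sketch: the names `split` and `result` are illustrative, and it relies on explode repeating the original index labels so that the index-based join lines up.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

# explode() repeats the original index label for every piece,
# so the resulting Series has index 0, 0, 1, 1, 2, 2.
split = df['paragraph'].str.split('.').explode().rename('paragraphSplit')

# join() aligns on the index; each row of df is repeated once per
# matching piece, and the original paragraph column is kept alongside.
result = df.join(split)

print(result)
```

Here the original columns come first and paragraphSplit last; as in the answer above, the naive split on '.' still leaves leading spaces and an empty final piece.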

Answer 3

If you prefer to let spaCy split the text into sentences, use

import spacy
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')

def split_in_sentences(text):
    return [sent.text.strip() for sent in nlp(text).sents]

df['paragraph'] = df['paragraph'].apply(split_in_sentences)
df.explode('paragraph')
   col1 col2           paragraph
0     1    A       sentence one.
0     1    A        sentence two
1     2    B     sentence three.
1     2    B   and sentence four
2     3    C    crazy sentence!!
2     3    C  and the final one.

See the related SO post.
