How to explode a list of strings while keeping the other columns?

Consider this simple example:

import pandas as pd

df = pd.DataFrame({'col1' : [1,2,3],
                   'col2' : ['A','B','C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

df
Out[11]: 
   col1 col2                            paragraph
0     1    A           sentence one. sentence two
1     2    B    sentence three. and sentence four
2     3    C  crazy sentence!! and the final one.

I would like to split each paragraph into sentences (ideally using spaCy), but I need to keep the information in the other columns.

I know how to (naively) split on . and explode the column:
df.paragraph.str.split('.').explode()
Out[10]: 
0                          sentence one
0                          sentence two
1                        sentence three
1                     and sentence four
2    crazy sentence!! and the final one
2                                      
Name: paragraph, dtype: object

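But this loses the information in col1 and col2 (which should be kept and repeated in the sentence-level dataframe), and it does not correctly split sentences that end with exclamation marks.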

Using spaCy and nlp(paragraph).sents would still lose the two columns.

What can I do? Thanks!

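Do it in two steps: first split the strings into lists, then explode the dataframe (not just the column) so the other columns are repeated: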

df['paragraph'] = df['paragraph'].str.split('.')
df.explode('paragraph')

Output:

   col1 col2                           paragraph
0     1    A                        sentence one
0     1    A                        sentence two
1     2    B                      sentence three
1     2    B                   and sentence four
2     3    C  crazy sentence!! and the final one
2     3    C                                    

To split on either . or !, use a regular expression instead:

df['paragraph'] = df['paragraph'].str.split('[.!]+')
df.explode('paragraph')
   col1 col2           paragraph
0     1    A        sentence one
0     1    A        sentence two
1     2    B      sentence three
1     2    B   and sentence four
2     3    C      crazy sentence
2     3    C   and the final one
2     3    C                    
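
Note that the outputs above still contain an empty last row (from the trailing period) and leading spaces on some sentences. A minimal sketch of one way to clean that up, assuming you want the empty strings dropped and the index renumbered (the variable name out is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

# Split on runs of . or ! and explode without modifying df itself
out = (
    df.assign(paragraph=df['paragraph'].str.split(r'[.!]+'))
      .explode('paragraph')
      .reset_index(drop=True)
)

# Trim surrounding whitespace, then drop the empty strings left by trailing punctuation
out['paragraph'] = out['paragraph'].str.strip()
out = out[out['paragraph'] != ''].reset_index(drop=True)

print(out)
```

Using assign here leaves the original df untouched, so the split can be re-run with a different pattern.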

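Define your exploded column as a separate temporary dataframe and join it back to the main dataframe. The exploded rows keep the original index, so the join lines them up with the right rows.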

import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
               'col2' : ['A','B','C'],
               'paragraph': ['sentence one. sentence two',
                             'sentence three. and sentence four',
                             'crazy sentence!! and the final one.']})
alfa = pd.DataFrame(df.paragraph.str.split('.').explode())
alfa.rename(columns={"paragraph":"paragraphSplit"},inplace=True)
alfa.join(df)
                       paragraphSplit  col1 col2                            paragraph
0                        sentence one     1    A           sentence one. sentence two
0                        sentence two     1    A           sentence one. sentence two
1                      sentence three     2    B    sentence three. and sentence four
1                   and sentence four     2    B    sentence three. and sentence four
2  crazy sentence!! and the final one     3    C  crazy sentence!! and the final one.
2                                         3    C  crazy sentence!! and the final one.

Hope this is what you are looking for.

If you prefer to have spaCy split the text into sentences, use

import spacy
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')

def split_in_sentences(text):
    return [sent.text.strip() for sent in nlp(text).sents]

df['paragraph'] = df['paragraph'].apply(split_in_sentences)
df.explode('paragraph')
   col1 col2           paragraph
0     1    A       sentence one.
0     1    A        sentence two
1     2    B     sentence three.
1     2    B   and sentence four
2     3    C    crazy sentence!!
2     3    C  and the final one.

See this SO post.
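
The answers above all follow the same pattern: turn each paragraph into a list of sentences, then explode the whole dataframe. A rough sketch of that pattern as a reusable function follows. Both explode_sentences and regex_splitter are hypothetical names introduced here, and the regex splitter is only a stand-in that keeps the punctuation; a spaCy-based function such as split_in_sentences could presumably be passed in its place.

```python
import re
import pandas as pd

def explode_sentences(df, column, splitter):
    """Hypothetical helper: return a new frame with one row per sentence.

    splitter is any function mapping a string to a list of strings.
    """
    return (
        df.assign(**{column: df[column].map(splitter)})
          .explode(column)
          .reset_index(drop=True)
    )

def regex_splitter(text):
    # Stand-in splitter (an assumption, not spaCy): split on whitespace that
    # follows ., ! or ?, so the punctuation stays attached to its sentence.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

sentences = explode_sentences(df, 'paragraph', regex_splitter)
print(sentences)
```

Because the helper builds a new frame with assign, df keeps its original paragraphs, which is handy when comparing several splitters on the same data.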