Break paragraphs into sentences in Python and link each sentence back to an ID

I have two lists: one holds the IDs, and the other holds the comments for each ID.

list_responseid = ['id1', 'id2', 'id3', 'id4'] 

list_paragraph = [['I like working and helping them reach their goals.'],
 ['The communication is broken.',
  'Information that should have come to me is found out later.'],
 ['Try to promote from within.'],
 ['I would relax the required hours to be available outside.',
  'We work a late night each week.']]

ResponseID 'id1' corresponds to the paragraph ('I like working and helping them reach their goals.'), and so on.

I can flatten the paragraphs into a single list of sentences with:

import itertools

list_sentence = list(itertools.chain(*list_paragraph))

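What is the syntax to get the final result, i.e. a DataFrame (or list) with one entry per sentence, where each sentence carries the ID that is currently linked to its paragraph? The final result would look like this (I will convert the list to a pandas DataFrame at the end).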

id1 'I like working with students and helping them reach their goals.'
id2 'The communication from top to bottom is broken.'
id2 'Information that should have come to me is found out later and in some cases students know more about what is going on than we do!'
id3 'Try to promote from within.'
id4 'I would relax the required 10 hours to be available outside of 8 to 5 back to 9 to 5 like it used to be.'
id4 'We work a late night each week and rarely do students take advantage of those extended hours.'

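Thanks.

If you do this often, it would be clearer, and possibly more efficient depending on the size of the lists, to write a dedicated function with two regular nested loops. But if you need a quick one-liner, this does the job: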

id_sentence_tuples = [(list_responseid[id_list_idx], sentence) for id_list_idx in range(len(list_responseid)) for sentence in list_paragraph[id_list_idx]]

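id_sentence_tuples will be a list of tuples in which each element is a pair like (paragraph_id, sentence), matching your expected result. I also suggest checking that both lists have the same length before doing this, so that a mismatch fails with a meaningful error.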

if len(list_responseid) != len(list_paragraph):
    # the exception has to be raised, otherwise it is created and silently discarded
    raise IndexError('Lists must have same cardinality')
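
The dedicated function with two plain nested loops mentioned in this answer might look like the following minimal sketch. The name `link_sentences_to_ids` is made up for illustration, and raising `ValueError` for mismatched lengths is an assumption about what a caller would prefer, not something the question requires.

```python
def link_sentences_to_ids(ids, paragraphs):
    """Return (id, sentence) pairs; paragraphs[i] is the list of sentences for ids[i]."""
    if len(ids) != len(paragraphs):
        # Assumed error type; the one-liner version above uses IndexError instead.
        raise ValueError('Lists must have same cardinality')
    pairs = []
    for response_id, sentences in zip(ids, paragraphs):  # outer loop: one paragraph per ID
        for sentence in sentences:                        # inner loop: each sentence in it
            pairs.append((response_id, sentence))
    return pairs


list_responseid = ['id1', 'id2', 'id3', 'id4']
list_paragraph = [['I like working and helping them reach their goals.'],
                  ['The communication is broken.',
                   'Information that should have come to me is found out later.'],
                  ['Try to promote from within.'],
                  ['I would relax the required hours to be available outside.',
                   'We work a late night each week.']]

id_sentence_tuples = link_sentences_to_ids(list_responseid, list_paragraph)
```

The resulting list of tuples can then be handed to pandas, for example `pd.DataFrame(id_sentence_tuples, columns=['ID', 'Sentence'])`, where the column names are only a suggestion.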

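I had a DataFrame with IDs and reviews (col = ['ID','Review']). If you can combine your lists into a DataFrame, you can use my approach. I used nltk to break the reviews into sentences and then linked each sentence back to its ID in a loop. Here is the code you can use.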

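## Breaking feedback into sentences
import nltk
import pandas as pd

# sent_tokenize needs NLTK's Punkt tokenizer data, fetched once with nltk.download(...)
# df is assumed to already exist with the columns ['ID', 'Review']
rows = []
count = 0
for index, row in df.iterrows():
    feedback = row['Review']
    sent_text = nltk.sent_tokenize(feedback) # this gives us a list of sentences
    for sentence in sent_text:
        # collect plain dicts; DataFrame.append was removed in pandas 2.0
        rows.append({'ID': row['ID'], 'Count': count, 'Sentence': sentence})
        count = count + 1
# build the frame once, after the loop, instead of growing it row by row
df_sentences = pd.DataFrame(rows)
print(df_sentences)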
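
If each review is stored as a single string and NLTK is not installed, the same ID-linking loop can be sketched with a naive regular-expression splitter in place of `nltk.sent_tokenize`. This is a swapped-in technique, not what the answer above uses: it splits after `.`, `!` or `?` followed by whitespace, so abbreviations such as "e.g." would be split incorrectly. The names `split_sentences` and `reviews` are hypothetical, and the sample strings are taken from the question.

```python
import re


def split_sentences(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]  # drop the empty string produced by empty input


# Hypothetical input: one review string per ID.
reviews = {
    'id2': 'The communication is broken. '
           'Information that should have come to me is found out later.',
    'id3': 'Try to promote from within.',
}

rows = []
for review_id, review in reviews.items():
    for sentence in split_sentences(review):
        # 'Count' is a running sentence index across all reviews, as in the loop above
        rows.append({'ID': review_id, 'Count': len(rows), 'Sentence': sentence})
```

As in the NLTK version, `rows` is a list of dicts, so `pd.DataFrame(rows)` would turn it into a DataFrame with one sentence per row.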