Is there an easy way to split a large string in a Pandas DataFrame into chunks with an equal number of words?
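I have a dataset of 1000 rows, each containing an author and a large body of text belonging to that author. What I ultimately want to achieve is to break each text row into multiple rows containing the same number of words, like this: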
Author    text
Jack      "This is a sentence that contains eight words"
John      "This is also a sentence containing eight words"
So if I wanted to do this in chunks of 4 words, the result would be:
Author    text
Jack      "This is a sentence"
Jack      "that contains eight words"
John      "This is also a"
John      "sentence containing eight words"
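I can already do this by number of characters using textwrapper, but ideally I would like to do it by number of words.
Any help that could lead to this would be greatly appreciated.
Thanks!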
Assuming you are using pandas >= 0.25 (which supports df.explode), you can use the following approach:
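def split_by_equal_number_of_words(df, num_of_words, separator=" "):
    """
    1. Split each text entry into a list of words on 'separator'
    2. Explode to one row per word, remembering the original row of each word
    3. Number the words within each original row, group every 'num_of_words'
       consecutive words, and aggregate by joining with the 'separator' provided

    :param df: DataFrame with 'author' and 'text' columns
    :param num_of_words: number of words per chunk
    :param separator: string used to split and re-join the words
    :return: DataFrame with one row per chunk ('author', 'text')
    """
    # Work on a copy so the caller's DataFrame is not modified
    words = df.assign(text=df["text"].str.split(separator)).explode("text")
    words = words.rename_axis("row").reset_index()
    # Count words per original row so that a chunk never spans two rows
    words["chunk"] = words.groupby("row").cumcount() // num_of_words
    chunks = words.groupby(["row", "chunk", "author"], sort=False)["text"].agg(separator.join)
    return chunks.reset_index()[["author", "text"]]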
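For comparison, here is a minimal sketch of the same idea that chunks each row's words in plain Python before exploding. The helper name `chunk_words` is hypothetical, and the lowercase column names `author` and `text` are assumed to match the code above; the sample data is taken from the question.

```python
import pandas as pd


def chunk_words(text, num_of_words, separator=" "):
    """Split one string into pieces of at most num_of_words words (hypothetical helper)."""
    words = text.split(separator)
    return [
        separator.join(words[i:i + num_of_words])
        for i in range(0, len(words), num_of_words)
    ]


# Sample data from the question; the lowercase column names are an assumption.
df = pd.DataFrame({
    "author": ["Jack", "John"],
    "text": [
        "This is a sentence that contains eight words",
        "This is also a sentence containing eight words",
    ],
})

# Build a list of chunks per row, then explode to one row per chunk (pandas >= 0.25).
result = (
    df.assign(text=df["text"].apply(lambda t: chunk_words(t, 4)))
      .explode("text")
      .reset_index(drop=True)
)
print(result)
```

Because the chunking happens per row, a chunk can never mix words from two different rows; if a text's word count is not a multiple of the chunk size, its last chunk is simply shorter.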