Group nltk.FreqDist output by first word (python)

I am a hobbyist with basic Python coding skills, working with a dataframe that contains a column like the one below. The goal is to group the output of nltk.FreqDist by the first word.

What I currently have

t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)

# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

Sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like  replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1

My output has more than 10,000 lines.

My expected output

I want to group the output by the first word and extract it as a dataframe.

What I have tried from other solutions

I have tried adapting the solutions given here and here, but without satisfactory results.

Any help/guidance is appreciated.

Try the following (documentation in the code):

import itertools

import nltk

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter from a string to its number of occurrences
input_frequencies = nltk.FreqDist(t_words)

# Taking inputs only if they appear more than 3 times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m), where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function to each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this with a better implementation from nltk
    return value.split(' ')[0]

# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# Note that groupby only groups *consecutive* elements with equal keys,
# so the input must be sorted by the grouping key first.
first_word_to_inputs = itertools.groupby(
    # Take the strings from the above dictionary, sorted by the first word
    sorted(frequent_inputs.keys(), key=first_word),
    # And key by the first word
    first_word)

# If you would also like to keep the count of each word, we can map from
# the first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = itertools.groupby(
    # Pairs of strings and counts, sorted by the first word of the string
    sorted(frequent_inputs.items(), key=lambda pair: first_word(pair[0])),
    # Extract the string from the pair, and then take the first word
    lambda pair: first_word(pair[0])
)
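Since a groupby iterator can only be consumed once, you will usually want to materialize it, e.g. into a dict. Here is a minimal sketch with made-up sample data standing in for frequent_inputs (the strings and counts are hypothetical):

```python
import itertools

def first_word(value):
    return value.split(' ')[0]

# Hypothetical sample data standing in for frequent_inputs
frequent_inputs = {
    "click go": 4,
    "click refund": 5,
    "select order": 6,
}

# Sort by the grouping key, group, then materialize each group as a list
grouped = {
    key: list(pairs)
    for key, pairs in itertools.groupby(
        sorted(frequent_inputs.items(), key=lambda p: first_word(p[0])),
        lambda p: first_word(p[0]),
    )
}
print(grouped)
# {'click': [('click go', 4), ('click refund', 5)], 'select': [('select order', 6)]}
```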

I managed to do it as shown below. There may be an easier implementation, but for now this gives me the expected result.

import pandas as pd

temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())

# Removing empty rows (.copy() avoids a SettingWithCopyWarning on later assignments)
mask = temp["word"] != ""
dfNew = temp[mask].copy()

# Splitting off the first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
# New column with the sentences minus the first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
# Subsetting the required columns
dfNew = dfNew[['first_word', 'rest_words']]
# Grouping by first word
dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
# Transpose
dfNew.T

Sample output
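For reference, the pipeline above can be exercised end to end with toy data; the frequencies here are hypothetical, standing in for nltk.FreqDist(t_words).items():

```python
import pandas as pd

# Hypothetical frequencies standing in for nltk.FreqDist(t_words).items()
data_analysis = {
    "click go": 1,
    "click refund": 2,
    "select order": 1,
    "  ": 3,  # whitespace-only entry that becomes empty after strip()
}

temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())

# Drop empty rows; copy to allow safe column assignment
mask = temp["word"] != ""
dfNew = temp[mask].copy()

# First word becomes the group key; the remainder is kept as 'rest_words'
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
dfNew = dfNew[['first_word', 'rest_words']]
dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
print(dfNew)
```

This prints one row per first word, with 'rest_words' holding the list of remainders (e.g. ['go', 'refund'] for 'click').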