按第一个词 (python) 输出的组 nltk.FreqDist
Group nltk.FreqDist output by first word (python)
我是一名业余爱好者,具有 python 的基本编码技能,我正在处理一个包含如下列的数据框。目的是将 nltk.FreqDist 的输出按第一个词
分组
我目前有什么
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
我的输出中有 10000 多行。
我的预期输出
我想按第一个词对输出进行分组并将其提取为数据帧
我在其他解决方案中尝试过的
我已尝试调整给出的解决方案 and here,但没有令人满意的结果。
任何help/guidance赞赏。
尝试以下操作(文档在代码中):
# I assume the input, t_words is a list of strings (Each containing multiple words)
t_words = ...
# This creates a counter from a string to it's occurrences
input_frequencies = nltk.FreqDist(t_words)
# Taking inputs only if they appear 3 or more times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m) where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
input_str: count
for input_str, count in input_frequencies.items()
if count > 3
}
# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
# You can replace this by a better implementation from nltk
return value.split(' ')[0]
# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
first_word_to_inputs = itertools.groupby(
# Take the strings from the above dictionary
frequent_inputs.keys(),
# And key by the first word
first_word)
# If you would also want to keep the count of each word, we can map from
# first word to a list of (string, count) pairs:
first_word_to_inpus_and_counts = itertools.groupby(
# Pairs of words and count
frequent_inputs.items(),
# Extract the string from the pair, and then take the first word
lambda pair: first_word(pair[0])
)
我设法做到了,如下所示。可能有一个更容易的实现。但就目前而言,这给了我预期的效果。
temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())
#Removing emtpy rows
filter = temp["word"] != ""
dfNew = temp[filter]
#Splitting first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
#New column with setences split without first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
#Subsetting required columns
dfNew = dfNew[['first_word','rest_words']]
# Grouping by first word
dfNew= dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
#Transpose
dfNew.T
示例输出
我是一名业余爱好者,具有 python 的基本编码技能,我正在处理一个包含如下列的数据框。目的是将 nltk.FreqDist 的输出按第一个词
分组我目前有什么
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
我的输出中有 10000 多行。
我的预期输出
我想按第一个词对输出进行分组并将其提取为数据帧
我在其他解决方案中尝试过的
我已尝试调整给出的解决方案
任何help/guidance赞赏。
尝试以下操作(文档在代码中):
# I assume the input, t_words is a list of strings (Each containing multiple words)
t_words = ...
# This creates a counter from a string to it's occurrences
input_frequencies = nltk.FreqDist(t_words)
# Taking inputs only if they appear 3 or more times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m) where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
input_str: count
for input_str, count in input_frequencies.items()
if count > 3
}
# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
# You can replace this by a better implementation from nltk
return value.split(' ')[0]
# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
first_word_to_inputs = itertools.groupby(
# Take the strings from the above dictionary
frequent_inputs.keys(),
# And key by the first word
first_word)
# If you would also want to keep the count of each word, we can map from
# first word to a list of (string, count) pairs:
first_word_to_inpus_and_counts = itertools.groupby(
# Pairs of words and count
frequent_inputs.items(),
# Extract the string from the pair, and then take the first word
lambda pair: first_word(pair[0])
)
我设法做到了,如下所示。可能有一个更容易的实现。但就目前而言,这给了我预期的效果。
temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())
#Removing emtpy rows
filter = temp["word"] != ""
dfNew = temp[filter]
#Splitting first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
#New column with setences split without first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
#Subsetting required columns
dfNew = dfNew[['first_word','rest_words']]
# Grouping by first word
dfNew= dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
#Transpose
dfNew.T
示例输出