How do I split a list of phrases into words so I can use Counter on them?

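My data consists of conversation threads from a web forum. I wrote a function that cleans the data by removing stop words, punctuation, and so on. Then I wrote a loop that cleans every post in my CSV file and puts the results in a list. Then I did a word count. My problem is that the list contains unicode phrases rather than individual words. How can I break the phrases apart so that they become individual words I can count? Here is my code: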

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post):
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

clean_Post_Text = post_to_words(fiance_forum["Post_Text"][0])
clean_Post_Text_split = clean_Post_Text.lower().split()
num_Post_Text = fiance_forum["Post_Text"].size
clean_posts_list = [] 

for i in range(0, num_Post_Text):
    clean_posts_list.append( post_to_words( fiance_forum["Post_Text"][i]))

from collections import Counter

counts = Counter(clean_posts_list)
print(counts)

My output looks like this: u'please follow instructions notice move receiver': 1. I would like it to look like this:

please: 1

follow: 1

instructions: 1

And so on. Thank you very much!

You're almost there; you just need to split the string into words:

>>> from collections import Counter
>>> Counter('please follow instructions notice move receiver'.split())
Counter({'follow': 1,
         'instructions': 1,
         'move': 1,
         'notice': 1,
         'please': 1,
         'receiver': 1})
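
Since the question has a list of cleaned phrases rather than a single string, the same idea can be applied to every element and the results combined into one counter. The following is a minimal sketch, assuming `clean_posts_list` holds one space-separated string per post; the sample data here is made up for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for clean_posts_list: one cleaned string per post.
clean_posts_list = [
    "please follow instructions notice move receiver",
    "please move receiver",
]

# Split every phrase and chain the resulting word lists into one stream of words.
counts = Counter(chain.from_iterable(post.split() for post in clean_posts_list))

# Print one "word: count" pair per line, most frequent first.
for word, count in counts.most_common():
    print(f"{word}: {count}")
```

`most_common()` returns the pairs sorted by descending count, which gives roughly the "word: count" layout asked for in the question.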

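You already have a list of words inside the function, so you don't need to split anything. Drop the `" ".join(meaningful_words)` call, create a single Counter, and update it with the result of each call to post_to_words. You are also doing more work than necessary: all you need to do is iterate over fiance_forum["Post_Text"] and pass each element to the function. The stop-word set also only needs to be built once, not on every call: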

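import re
from collections import Counter

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post, st):
    # Strip HTML, keep letters only, and lazily yield the words not in st.
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    return (w for w in words if w not in st)


cn = Counter()
# Build the stop-word set once and reuse it for every post.
st = set(stopwords.words("english"))
for post in fiance_forum["Post_Text"]:
    cn.update(post_to_words(post, st))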

Because the counting happens as you go, this also avoids building one large list of words first.
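
To try the counting-as-you-go pattern without the forum data, here is a small self-contained sketch. It makes two assumptions that are not in the original: a tiny hand-written stop-word set replaces NLTK's English list, and a short list of made-up strings stands in for fiance_forum["Post_Text"]. The HTML-stripping step is left out so that only the standard library is needed.

```python
import re
from collections import Counter


def post_to_words(raw_post, stop_words):
    """Lazily yield lowercase alphabetic words from raw_post that are not stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_post)
    return (w for w in letters_only.lower().split() if w not in stop_words)


# Assumption: a tiny hand-written stop-word set instead of NLTK's list.
stop_words = {"the", "and", "to"}
# Assumption: made-up posts standing in for fiance_forum["Post_Text"].
posts = ["Please follow the instructions!", "Move the receiver, please."]

counts = Counter()
for post in posts:
    # update() consumes the generator and adds each word to the running totals.
    counts.update(post_to_words(post, stop_words))

print(counts.most_common(1))  # [('please', 2)]
```

Note that `Counter.update` expects an iterable of items to count. Passing it a plain string would count individual characters, which is why the function hands back words rather than a joined string.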