Tokenizing tweets in Python

enneg3clear.txt is a file of tweets, one tweet per line, with punctuation and stopwords already removed.

import re, string
import sys

#this code tokenizes
input_file = 'enneg3clear.txt'

with open(input_file) as f:
    lines = f.readlines()

results = []
texts = []

for line in lines:
    texts = ([word for word in line.lower().split()])
    results.append(texts)
print results
[['\xef\xbb\xbfmy', 'good', 'sis', 'kelly', 'bouta', 'compete', 'with', 'adele', 'that', 'over', 'weinvm'], ['going', 'miss', 'japppaaannnnn'], ['its', 'so', 'hard', 'get', 'out', 'bed', 'morning', 'vote5sos', 'kca'], ['police', 'fatally', 'shoot', 'homeless', 'man', 'losangeles', 'gtgt'], ['my', 'trumpet', 'has', 'been', 'idle', 'days', 'now'], ['mercenaries', 'was', 'game', 'i', 'lent', 'friend', 'never', 'saw', 'again'], ['yeah', 'i', 'miss', 'you', 'all', 'so', 'much', 'already'], ['acabou', 'talitaaraujonomaisvoce'], ['im', 'at', 'strain', 'station', 'waiting', 'train', 'arrive', 'sigh', 'im', 'sooo', 'tired']]


#remove words that appear only once
all_tokens = sum(results, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
print tokens_once
set(['all', 'already', 'tired', 'sigh', 'over', 'hard', 'sooo', 'yeah', 'strain', '\xef\xbb\xbfmy', 'japppaaannnnn', 'adele', 'at', 'homeless', 'trumpet', 'its', 'out', 'sis', 'again', 'police', 'vote5sos', 'gtgt', 'saw', 'that', 'idle', 'been', 'mercenaries', 'waiting', 'station', 'you', 'has', 'was', 'friend', 'losangeles', 'kca', 'get', 'never', 'much', 'game', 'train', 'lent', 'now', 'with', 'bouta', 'man', 'shoot', 'going', 'talitaaraujonomaisvoce', 'fatally', 'days', 'bed', 'morning', 'weinvm', 'good', 'compete', 'acabou', 'kelly', 'arrive', 'my'])

results = [[word for word in results if word not in tokens_once]]

print (results)

File "atokenize.py", line 25, in <module>
    results = [[word for word in results if word not in tokens_once]]
TypeError: unhashable type: 'list'

So the error is on the second-to-last line. Any idea how to fix this?

Your results is a list of lists, so you have to flatten it first.

So simply add

results = [j for i in results for j in i]

above the line

results = [[word for word in results if word not in tokens_once]]
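Put together, a minimal runnable sketch of the flatten fix (Python 3, with a few inline sample tweets standing in for enneg3clear.txt; note the final filter uses a single pair of brackets so the result stays a flat list):

```python
# sample tokenized tweets: a list of lists, as produced by the loop above
results = [['going', 'miss', 'japppaaannnnn'],
           ['yeah', 'i', 'miss', 'you'],
           ['i', 'am', 'tired']]

all_tokens = sum(results, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

results = [j for i in results for j in i]   # flatten: list of lists -> flat list
results = [word for word in results if word not in tokens_once]
print(results)  # ['miss', 'i', 'miss', 'i']
```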

Another solution:

Change append to extend:

for line in lines:
    texts = ([word for word in line.lower().split()])
    results.extend(texts) # or results += texts
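With extend, results is already a flat list of tokens, so the later filter works without any extra flatten step. A minimal Python 3 sketch, with inline sample lines standing in for the file contents:

```python
# sample lines standing in for the file contents
lines = ['going miss japppaaannnnn\n',
         'yeah i miss you\n',
         'i am tired\n']

results = []
for line in lines:
    texts = [word for word in line.lower().split()]
    results.extend(texts)          # extend keeps results flat

all_tokens = results               # already flat, no sum() needed
tokens_once = set(w for w in set(all_tokens) if all_tokens.count(w) == 1)
results = [word for word in results if word not in tokens_once]
print(results)  # ['miss', 'i', 'miss', 'i']
```

Note that extend discards the per-tweet grouping; stick with the flatten approach above if you still need one list per tweet.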