Text Classification using Decision Trees in Python

I am new to Python and machine learning. My implementation is based on the IEEE research paper http://ieeexplore.ieee.org/document/7320414/ (Bug report, feature request, or simply praise? On automatically classifying app reviews).

I want to classify text. The text consists of user reviews from the Google Play Store or the Apple App Store. The categories used in the research are Bug, Feature, User Experience and Rating. Given this, I am trying to implement a decision tree using the sklearn package in Python. I came across the sample 'IRIS' dataset that ships with sklearn, which builds a tree model from features and their values mapped to a target. In that example the data is numeric.
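
For reference, the IRIS example fits the tree on purely numeric data. A minimal sketch of that pattern, using sklearn's bundled dataset, looks like this:

from sklearn import tree
from sklearn.datasets import load_iris

# the bundled IRIS data: a numeric 2D feature matrix and integer class targets
iris = load_iris()

clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

# predict expects the same 2D shape: here, a single row of four numeric features
print(clf.predict(iris.data[:1]))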

I am trying to classify text rather than numeric data. Some examples:

  1. I liked the upgrade to pdfs very much. However, they are not displaying anymore. Fix it and it will be perfect [BUG]
  2. I just wish it would notify me when I drop below a certain amount [FEATURE]
  3. This app has been very helpful for my business [RATING]
  4. Easy to find and purchase songs in iTunes [UserExperience]

Given more user reviews like these in each of these categories, I want to create a classifier that can be trained on this data and predict the target for any given user review.

So far I have preprocessed the text and created the training data as a list of tuples containing the preprocessed data and its target.

My preprocessing:

  1. Tokenize multi-line reviews into individual sentences
  2. Tokenize each sentence into words
  3. Remove stop words from the tokenized sentences
  4. Lemmatize the words in the tokenized sentences

(['i', 'liked', 'much', 'upgrade', 'pdfs', 'however', 'displaying', 'anymore', 'fix', 'perfect'], "BUG")

Here is what I have so far:

import json
from sklearn import tree
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, RegexpTokenizer

# define a tokenizer to tokenize sentences and also remove punctuation
tokenizer = RegexpTokenizer(r'\w+')

# this list stores all the training data along with its label
tagged_tokenized_comments_corpus = []


# Method: to add data to training set
# Parameter: Tuple in the format (Data, Label)
def tag_tokenized_comments_corpus(*tuple_data):
    tagged_tokenized_comments_corpus.append(tuple_data)


# step 1: Load all the stop words from the nltk package
stop_words = stopwords.words("english")
stop_words.remove('not')

# creating a temporary list to copy the existing stop words
temp_stop_words = stop_words

for word in temp_stop_words:
if "n't" in word:
    stop_words.remove(word)

# load the data set
files = ["Bug.txt", "Feature.txt", "Rating.txt", "UserExperience.txt"]

d = {"Bug": 0, "Feature": 1, "Rating": 2, "UserExperience": 3}

for file in files:
    input_file = open(file, "r")
    file_text = input_file.read()
    json_content = json.loads(file_text)

    # step 3: Tokenize multi sentence into single sentences from the user comments
    comments_corpus = []
    for i in range(len(json_content)):
        comments = json_content[i]['comment']
        if len(sent_tokenize(comments)) > 1:
            for comment in sent_tokenize(comments):
                comments_corpus.append(comment)
        else:
            comments_corpus.append(comments)

    # step 4: Tokenize each sentence, remove stop words and lemmatize the comments corpus
    lemmatizer = WordNetLemmatizer()
    tokenized_comments_corpus = []
    for i in range(len(comments_corpus)):
        words = tokenizer.tokenize(comments_corpus[i])
        tokenized_sentence = []
        for w in words:
            if w not in stop_words:
                tokenized_sentence.append(lemmatizer.lemmatize(w.lower()))
        if tokenized_sentence:
            tokenized_comments_corpus.append(tokenized_sentence)
            tag_tokenized_comments_corpus(tokenized_sentence, d[input_file.name.split(".")[0]])

# step 5: Create a dictionary of words from the tokenized comments corpus
unique_words = []
for sentence in tagged_tokenized_comments_corpus:
    for word in sentence[0]:
        unique_words.append(word)
unique_words = set(unique_words)

dictionary = {}
i = 0
for dict_word in unique_words:
    dictionary.update({i: dict_word})
    i = i + 1


train_target = []
train_data = []
for sentence in tagged_tokenized_comments_corpus:
    train_target.append(sentence[0])
    train_data.append(sentence[1])

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

test_data = "Beautiful Keep it up.. this far is the most usable app editor.. 
it makes my photos more beautiful and alive.."

test_words = tokenizer.tokenize(test_data)
test_tokenized_sentence = []
for test_word in test_words:
    if test_word not in stop_words:
        test_tokenized_sentence.append(lemmatizer.lemmatize(test_word.lower()))

#predict using the classifier
print("predicting the labels: ")
print(clf.predict(test_tokenized_sentence))

However, this does not seem to work, because it throws an error at run time when we train the algorithm. I was thinking that if I could map the words in the tuples to the dictionary and convert the text into numeric form, I could train the algorithm on that. But I am not sure whether this is feasible.

Can anyone suggest how I can fix this code? Or is there a better way to implement this decision tree?

Traceback (most recent call last):
  File "C:/Users/venka/Documents/GitHub/RE-18/Test.py", line 87, in <module>
    clf.fit(train_data, train_target)
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 116, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 0.  0.  0. ...,  3.  3.  3.].
Reshape your data either using array.reshape(-1, 1) if your data has a 
single feature or array.reshape(1, -1) if it contains a single sample.

Decision trees can only work if your feature vectors are all the same length. Personally, I have no clue how effective decision trees are for text analysis like this, but if you want to try it, the approach I would suggest is a "one-hot", "bag of words" style vector.

Essentially, you keep track of how many times words appear in your example and put them into a vector that represents the whole corpus. Say, for instance, that once you have removed all the stop words, the set for the entire corpus is:

{"Apple", "Banana", "Cherry", "Date", "Eggplant"}

You represent this with a vector the same size as the corpus, with each value recording whether or not that word appears. In our example, that is a vector of length 5, where the first element is associated with "Apple", the second with "Banana", and so on. You might get something like:

bag("Apple Banana Date")
#: [1, 1, 0, 1, 0]
bag("Cherry")
#: [0, 0, 1, 0, 0]
bag("Date Eggplant Banana Banana")
#: [0, 1, 0, 1, 1]
# For this case, I have no clue if Banana having the value 2 would improve results.
# It might. It might not. Something you'd need to test.
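
In Python, a bag() like the one above might look something like this; the fixed five-word vocabulary here is only for illustration:

# hypothetical helper: a fixed vocabulary with one slot per word
vocabulary = ["apple", "banana", "cherry", "date", "eggplant"]

def bag(text):
    # presence/absence vector over the fixed vocabulary ("one-hot" style)
    tokens = text.lower().split()
    return [1 if word in tokens else 0 for word in vocabulary]
    # count variant, if you want to test whether counts help:
    # return [tokens.count(word) for word in vocabulary]

print(bag("Apple Banana Date"))            # [1, 1, 0, 1, 0]
print(bag("Cherry"))                       # [0, 0, 1, 0, 0]
print(bag("Date Eggplant Banana Banana"))  # [0, 1, 0, 1, 1]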

This way, you have vectors of the same size regardless of the input, and the decision tree knows where to look for certain outputs. Say "Banana" corresponds strongly to bug reports; in that case, the decision tree will know that a 1 in the second element means a bug report is more likely.
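
Once every review has been turned into a fixed-length vector like that, training and prediction with sklearn's DecisionTreeClassifier could look roughly like this; the vectors and integer labels below are made up purely for illustration:

from sklearn import tree

# toy bag-of-words vectors over the five-word vocabulary above, with made-up integer labels
train_data = [
    [1, 1, 0, 1, 0],   # bag("Apple Banana Date")
    [0, 0, 1, 0, 0],   # bag("Cherry")
    [0, 1, 0, 1, 1],   # bag("Date Eggplant Banana Banana")
]
train_target = [0, 1, 0]   # e.g. 0 = Bug, 1 = Feature (labels are illustrative)

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# a new review is vectorized the same way before prediction; note the 2D input shape
print(clf.predict([[0, 1, 1, 0, 0]]))   # bag("Banana Cherry")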

Of course, your corpus may contain thousands of words. In that case a decision tree is probably not the best tool for the job, unless you first spend some time trimming down your features.
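
If you do want to go down this road, sklearn's CountVectorizer can build these vectors for you and cap the vocabulary size. The sketch below assumes the raw review strings and their labels are already in two lists; raw_reviews and labels are hypothetical stand-ins for your own data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree

# hypothetical inputs: raw review strings and their integer labels
raw_reviews = ["I liked the upgrade to pdfs very much, however they are not displaying anymore",
               "Easy to find and purchase songs in iTunes"]
labels = [0, 3]

# binary=True gives the one-hot style; max_features keeps only the most frequent words
vectorizer = CountVectorizer(stop_words="english", binary=True, max_features=1000)
X = vectorizer.fit_transform(raw_reviews)

clf = tree.DecisionTreeClassifier()
clf.fit(X, labels)

# new reviews must be transformed with the same fitted vectorizer
print(clf.predict(vectorizer.transform(["just wish it would notify me when I drop below a certain amount"])))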