How to print clusters of SVM in Python

Question

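I want to classify the rows of a column using the SVM clustering method. I can find plenty of material online that generates plots or prints the prediction accuracy, but I can't find a way to print my clusters. The example below should explain better what I am trying to do:

I have a dataframe that I am using as the training dataset: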

import pandas as pd
train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
        'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
        'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
        'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
        'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
        }

df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
print (df)

I want to predict whether a row of text is talking about 'Animal', 'Thing' or 'Miscellenous'. The test data I want to pass is:

test_data = {'Serial': [1,2,3,4,5],
        'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
        'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
        }

df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])

The expected result is an additional column 'Classification' created in the test dataframe, with the values ['Animal', 'Miscellenous', 'Animal', 'Animal', 'Miscellenous'].

Answer 1

Here is a solution to your problem:

# import tfidf-vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import support vector classifier
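from sklearn.svm import SVC
import pandas as pd
# display() is available by default in Jupyter/IPython; import it explicitly so the script also runs elsewhere
from IPython.display import display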

train_data = {'Serial': [1,2,3,4,5,6,7,8,9,10],
        'Text': ['Dog is a faithful animal','cat are not reliable','Tortoise can live a long life',
        'camel stores water in its hump','horse are used as means of transport','pen is a powerful weapon',
        'stop when the signal is red','oxygen is a life gas','chocolates are bad for health','lets grab a cup of coffee'],
        'classification':['Animal','Animal','Animal','Animal','Animal','Thing','Thing','Miscellenous','Thing','Thing']
        }

train_df = pd.DataFrame(train_data, columns = ['Serial', 'Text', 'classification'])
display(train_df)


test_data = {'Serial': [1,2,3,4,5],
        'Text': ['Is this your dog?','Lets talk about the problem','You have a cat eye',
        'Donot forget to take the camel ride when u goto dessert','Plants give us O2']
        }

test_df = pd.DataFrame(test_data, columns = ['Serial', 'Text'])
display(test_df)


# Load training data (text) from the dataframe and form to a list containing all the entries
training_data = train_df['Text'].tolist()

# Load training labels from the dataframe and form to a list as well
training_labels = train_df['classification'].tolist()

# Load testing data from the dataframe and form a list
testing_data = test_df['Text'].tolist()

# Get a tfidf vectorizer to process the text into vectors
vectorizer = TfidfVectorizer()

# Fit the tfidf-vectorizer to training data and transform the training text into vectors
X_train = vectorizer.fit_transform(training_data)

# Transform the testing text into vectors
X_test = vectorizer.transform(testing_data)

# Get the SVC classifier
clf = SVC()

# Train the SVC with the training data (data points and labels)
clf.fit(X_train, training_labels)

# Predict the test samples
print(clf.predict(X_test))

# Add classification results to test dataframe
test_df['Classification'] = clf.predict(X_test)

# Display test dataframe
display(test_df)

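As an explanation of the approach:

You have training data that you want to use to train an SVM, and then predict labels for the test data.

This means you need to extract the training data and the label for each data point (so for each phrase you need to know whether it is an animal, a thing, etc.), and then set up and train a support vector machine. Here I used the scikit-learn implementation.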

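Furthermore, you cannot train an SVM on raw text alone, because it needs numerical values. This means you need to convert the text data into numbers. This is "feature extraction from text", and a common approach for it is term frequency-inverse document frequency (TF-IDF).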
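To make the text-to-numbers step concrete, here is a minimal sketch of what TfidfVectorizer produces. The three phrases are taken from the question's data; the variable names are only illustrative. With the default settings, tokens are lowercased and single-character tokens such as "a" are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three phrases borrowed from the question's training and test data
phrases = ['Dog is a faithful animal', 'cat are not reliable', 'Is this your dog?']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(phrases)  # sparse matrix with one row per phrase

# One column per distinct token learned from the phrases
print(sorted(vectorizer.vocabulary_))
print(X.shape)
print(X.toarray().round(2))
```

Each row is the numeric vector for one phrase, which is the form an SVM can be trained on.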

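Now you can use the vector representation of each phrase, together with its label, to train the support vector machine, and then use it to classify the test data :)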
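Note that SVC is a supervised classifier, so the "clusters" here are simply the groups of rows that share a predicted label. If the goal is to print those groups, one possible sketch (assuming the same data as in the question; the groupby loop is my own addition, not part of the code above) is:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

train_texts = ['Dog is a faithful animal', 'cat are not reliable', 'Tortoise can live a long life',
               'camel stores water in its hump', 'horse are used as means of transport',
               'pen is a powerful weapon', 'stop when the signal is red', 'oxygen is a life gas',
               'chocolates are bad for health', 'lets grab a cup of coffee']
train_labels = ['Animal', 'Animal', 'Animal', 'Animal', 'Animal',
                'Thing', 'Thing', 'Miscellenous', 'Thing', 'Thing']
test_texts = ['Is this your dog?', 'Lets talk about the problem', 'You have a cat eye',
              'Donot forget to take the camel ride when u goto dessert', 'Plants give us O2']

# Vectorize the training text and train the classifier on it
vectorizer = TfidfVectorizer()
clf = SVC()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# Attach the predicted label to each test row
test_df = pd.DataFrame({'Text': test_texts})
test_df['Classification'] = clf.predict(vectorizer.transform(test_texts))

# Print the test rows grouped by their predicted class
for label, group in test_df.groupby('Classification'):
    print(label, group['Text'].tolist())
```

With only ten training phrases the predictions may not match the expected labels from the question, so treat the output as illustrative.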

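In short, the steps are:

Extract data points and labels from the training set
Extract data points from the test set
Set up the SVM classifier
Set up the TF-IDF vectorizer and fit it to the training data
Transform the training data and test data with the TF-IDF vectorizer
Train the SVM classifier
Classify the test data with the trained classifier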
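As an optional variation, the vectorizing and training steps above can be bundled into a single object with scikit-learn's make_pipeline. This is a sketch of an alternative arrangement rather than the code used in this answer, and the shortened data lists are only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A shortened subset of the question's training data, for illustration only
train_texts = ['Dog is a faithful animal', 'cat are not reliable',
               'pen is a powerful weapon', 'oxygen is a life gas']
train_labels = ['Animal', 'Animal', 'Thing', 'Miscellenous']
test_texts = ['Is this your dog?', 'You have a cat eye']

# The pipeline fits the vectorizer and the classifier in one call,
# and applies the same fitted vectorizer again when predicting
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(predictions)
```

The benefit of this design is that the vectorizer can never be accidentally refitted on the test data, since the pipeline only calls transform on it during predict.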

Hope this helps!
