NLP

Question

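I want to train a model with an sklearn classifier to classify data entries (yes, no) using a text feature (content), a numerical feature (population) and a categorical feature (location).

The model below uses only the text data to classify each entry. The text is converted into a sparse matrix with TF-IDF before it is passed to the classifier.

Is there a way to add/use the other features as well? They are not in sparse matrix format, so I am not sure how to combine them with the sparse text matrix.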


    #import libraries
    import string, re, nltk
    import pandas as pd
    from pandas import Series, DataFrame
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB

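    # read data and remove empty lines and rows without a target
    # (the chained calls are wrapped in parentheses so the expression parses)
    dataset = (
        pd.read_csv('sample_data.txt',
                    sep='\t',
                    names=['content', 'location', 'population', 'target'])
        .dropna(how='all')
        .dropna(subset=['target'])
    )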

    # drop the first row and reset the index
    # (drop=True avoids adding the old index as an extra column)
    df = dataset[1:].reset_index(drop=True)

    #add an extra column which is the length of text
    df['length'] = df['content'].apply(len)

    #create a dataframe that contains only two columns the text and the target class
    df_cont = df.copy()
    df_cont = df_cont.drop(
        ['location','population','length'],axis = 1)

    # function that takes in a string of text, removes URLs, punctuation, digits and stopwords,
    # and returns a list of stemmed words

    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # built once, not once per word

    def text_process(mess):
        # lower case for string
        mess = mess.lower()

        # remove URLs
        nourl = re.sub(r'http\S+', ' ', mess)

        # remove punctuation characters
        nopunc = [char for char in nourl if char not in string.punctuation]

        # join the characters again to form the string and remove digits
        nopunc = ''.join([i for i in nopunc if not i.isdigit()])

        # remove stopwords and stem the remaining words
        return [ps.stem(word) for word in nopunc.split() if word not in stop_words]

    #split the data in train and test set and train/test the model

    cont_train, cont_test, target_train, target_test = train_test_split(df_cont['content'],df_cont['target'],test_size = 0.2,shuffle = True, random_state = 1)


    pipeline = Pipeline([('bag_of_words',CountVectorizer(analyzer=text_process)),
                         ('tfidf',TfidfTransformer()),
                         ('classifier',MultinomialNB())])

    pipeline.fit(cont_train,target_train)
    predictions = pipeline.predict(cont_test)

    # classification_report expects the true labels first, then the predictions
    print(classification_report(target_test, predictions))

The model is expected to return the following: accuracy, precision, recall and F1 score.

Answer 1

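It seems we can't directly encode text as a feature, so you might need to normalize it. You could pick one of the text rows and treat it as your standard. Use TF-IDF to calculate a match score between the standard text and the text of each row, then encode that score as a feature. I realize this is a very roundabout way of encoding, but depending on the text you choose as your standard, it might work.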
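A minimal sketch of this idea, under some assumptions: the answer does not say how the "match score" is computed, so cosine similarity between TF-IDF vectors is used here as one plausible choice, the first row is picked arbitrarily as the standard, and the sample texts are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sample texts standing in for the 'content' column.
texts = [
    'cheap flights to london',
    'rain expected again today',
    'book cheap hotel deals',
]

# One TF-IDF row per text (sparse matrix).
tfidf = TfidfVectorizer().fit_transform(texts)

# Assumption: the first text serves as the "standard" text.
reference = tfidf[0]

# One similarity score per row; this single number becomes the text feature.
similarity = cosine_similarity(tfidf, reference).ravel()
```

The resulting `similarity` array has one value per row and can be placed next to the numeric and categorical features. Note that it compresses the whole text into a single number, so most of the text information is lost.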

Answer 2

You can convert the sparse matrix to a numpy array with the toarray method.

You will get one vector per text entry, which you can then combine with the other features.
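As a rough sketch of what this could look like (the sample texts and population values are invented for illustration, and `np.hstack` is just one way to do the combining):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data standing in for the 'content' and 'population' columns.
texts = [
    'cheap flights to london',
    'rain expected again today',
    'book cheap hotel deals',
]
population = np.array([8900000, 550000, 8900000])

tfidf = TfidfVectorizer().fit_transform(texts)  # sparse, shape (3, n_terms)
dense_text = tfidf.toarray()                    # dense numpy array, same shape

# Append the numeric feature as one extra column on the right.
combined = np.hstack([dense_text, population.reshape(-1, 1)])
```

Keep in mind that a dense array stores every zero explicitly, so this can use a lot of memory when the vocabulary is large.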

Answer 3

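I think you need to use one-hot encoding for the 'location' feature. The one-hot vectors for the given data would be:

London - 100

Manchester - 010

Edinburgh - 001

The vector length is the number of cities you have. Note that each bit here is a feature. Categorical data is usually converted to one-hot vectors before it is fed to a machine learning algorithm.

Once this is done, you can concatenate each row into a single one-dimensional array and feed that to the classifier.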
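The encode-then-concatenate steps above can be sketched with scikit-learn's `ColumnTransformer`, which applies one transformer per column and joins the results side by side. This is only one possible setup, not something the answer prescribes: the toy data is invented, and `LogisticRegression` is swapped in for the question's `MultinomialNB` because `StandardScaler` can produce negative values, which `MultinomialNB` does not accept.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up) using the question's column names.
toy = pd.DataFrame({
    'content': ['cheap flights to london', 'rain expected again today',
                'book cheap hotel deals', 'weather warning for the north'],
    'location': ['London', 'Manchester', 'London', 'Edinburgh'],
    'population': [8900000, 550000, 8900000, 500000],
    'target': ['yes', 'no', 'yes', 'no'],
})

preprocess = ColumnTransformer([
    # The text column is selected by a plain string so the vectorizer gets 1-D input.
    ('text', TfidfVectorizer(), 'content'),
    # Categorical and numeric columns are selected by lists, giving 2-D input.
    ('loc', OneHotEncoder(handle_unknown='ignore'), ['location']),
    ('pop', StandardScaler(), ['population']),
])

model = Pipeline([('features', preprocess),
                  ('classifier', LogisticRegression())])

X = toy[['content', 'location', 'population']]
model.fit(X, toy['target'])
predictions = model.predict(X)  # predicting on the training rows, for illustration only
```

In a real run you would fit on the training split and predict on the test split, as in the question's code. `handle_unknown='ignore'` lets the encoder cope with a city in the test data that never appeared during training.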

Answer 4

    import numpy as np
    from scipy.sparse import hstack

    x_tfidf = hstack((x_tfidf, np.array(df['additonal_feature'])[:, None]))
    x_tfidf = x_tfidf.tocsr()

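The code above simply appends your additional column to your TF-IDF matrix. If your TF-IDF matrix has dimensions M x N, this step adds one more column of dimensions M x 1 and produces an M x (N+1) matrix. So in the end, your model will treat the additional column as one more NLP token.

Assuming `hstack` is `scipy.sparse.hstack`, the first line returns a sparse matrix in COO format rather than CSR. That is why I added the second line, which converts it back to compressed sparse row format.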
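For a self-contained version of the same idea, here is a small sketch. The data frame contents are made up, and 'population' is used as the extra column on the assumption that it plays the role of the additional feature.

```python
import numpy as np
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up data using the question's column names.
df = pd.DataFrame({
    'content': ['cheap flights to london', 'rain expected again today',
                'book cheap hotel deals'],
    'population': [8900000, 550000, 8900000],
})

x_tfidf = TfidfVectorizer().fit_transform(df['content'])  # shape (M, N)

# Reshape the extra feature into a column vector of shape (M, 1).
extra = np.array(df['population'])[:, None]

# Stack side by side, then convert to CSR for efficient row slicing.
x_all = hstack((x_tfidf, extra)).tocsr()                  # shape (M, N + 1)
```

Because the result stays sparse, this approach avoids the memory cost of converting the whole TF-IDF matrix to a dense array.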

NLP - How to add more features?

python

machine-learning

tf-idf

scikit-learn