How to assign labels/score to data using machine learning

Question

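I have a data frame made up of many rows that contain tweets. I would like to classify them using a machine learning technique (supervised or unsupervised). Since the dataset is unlabeled, I thought I would select some rows (50%) to label manually (+1 pos, -1 neg, 0 neutral), and then use machine learning to assign labels to the remaining rows. To do this, I did the following:

Original dataset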

Date                   ID        Tweet                         
01/20/2020           4141    The cat is on the table               
01/20/2020           4142    The sky is blue                       
01/20/2020           53      What a wonderful day                  
...
05/12/2020           532     In this extraordinary circumstance we are together   
05/13/2020           12      It was a very bad decision            
05/22/2020           565     I know you are the best

Split the dataset into 50% train and 50% test. I manually labeled 50% of the data as follows:

Date                   ID        Tweet                          PosNegNeu
 01/20/2020           4141    The cat is on the table               0
 01/20/2020           4142    The weather is bad today              -1
 01/20/2020           53      What a wonderful day                  1
 ...
 05/12/2020           532     In this extraordinary circumstance we are together   1
 05/13/2020           12      It was a very bad decision            -1
 05/22/2020           565     I know you are the best               1

Then I extracted the word frequencies (after removing stop words):

               Frequency
 bad               2
 circumstance      1
 best              1
 day               1
 today             1
 wonderful         1

.....

I would like to try to assign labels to the other data based on:

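the words in the frequency table, for example "if a tweet contains bad, then assign -1; if a tweet contains wonderful, assign 1" (i.e. I should create a list of strings and a rule);
sentence similarity (e.g. using Levenshtein distance).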

I know there are several ways to do this, some of them better, but I am having trouble classifying the data and assigning labels to it, and I cannot do it manually.

My expected output, e.g. with the following test dataset

Date                   ID        Tweet                                   
06/12/2020           43       My cat 'Sylvester' is on the table            
07/02/2020           75       Laura's pen is black                                                
07/02/2020           763      It is such a wonderful day                                    
...
11/06/2020           1415    No matter what you need to do                  
05/15/2020           64      I disagree with you: I think it is a very bad decision           
12/27/2020           565     I know you can improve

should be something like this

Date                   ID        Tweet                                   PosNegNeu
06/12/2020           43       My cat 'Sylvester' is on the table            0
07/02/2020           75       Laura's pen is black                          0                       
07/02/2020           763      It is such a wonderful day                    1                
...
11/06/2020           1415    No matter what you need to do                  0  
05/15/2020           64      I disagree with you: I think it is a very bad decision  -1          
12/27/2020           565     I know you can improve                         0

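A better approach would probably be to consider n-grams rather than single words, or to build a corpus/vocabulary to assign a score and then a sentiment. Any advice would be greatly appreciated, as this is my first exercise in machine learning. I think k-means clustering could also be applied, to try to group similar sentences together. If you could give me a complete example (with my data would be great, but other data would be fine too), I would really appreciate it.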

Answer 1

IIUC, you have part of the data labeled and need to label the rest. I would recommend reading about semi-supervised machine learning.

Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data)

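Sklearn provides quite an extensive set of algorithms to help with this; see its sklearn.semi_supervised module.

If you need more insight into this topic, I would highly recommend checking out this article as well.

Here is an example with the iris dataset: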

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation

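# Initialize the model and load the data
label_prop_model = LabelPropagation()
iris = datasets.load_iris()

# Randomly mark about 30% of the samples as unlabeled (-1 means "unlabeled")
rng = np.random.RandomState(42)
random_unlabeled_points = rng.rand(len(iris.target)) < 0.3
labels = np.copy(iris.target)
labels[random_unlabeled_points] = -1

# Propagate labels over the remaining unlabeled data
label_prop_model.fit(iris.data, labels)

# Labels assigned to every sample, including the previously unlabeled ones
predicted_labels = label_prop_model.transduction_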
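
The same idea can be applied to tweets. Below is a minimal sketch, not a definitive implementation, that uses the labeled and unlabeled tweets from the question. The TF-IDF features, the unigram/bigram range, the choice of SelfTrainingClassifier wrapped around LogisticRegression, and the 0/1/2 recoding are all assumptions made for illustration. Note that scikit-learn's semi-supervised estimators reserve -1 for "unlabeled", so the question's -1 (negative) has to be recoded before fitting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Hand-labeled tweets from the question (+1 pos, -1 neg, 0 neutral)
labeled_tweets = [
    "The cat is on the table",
    "The weather is bad today",
    "What a wonderful day",
    "In this extraordinary circumstance we are together",
    "It was a very bad decision",
    "I know you are the best",
]
sentiments = [0, -1, 1, 1, -1, 1]

# Tweets from the question that still need a label
unlabeled_tweets = [
    "My cat 'Sylvester' is on the table",
    "Laura's pen is black",
    "It is such a wonderful day",
    "No matter what you need to do",
    "I disagree with you: I think it is a very bad decision",
    "I know you can improve",
]

# scikit-learn uses -1 to mean "unlabeled", so recode sentiments as 0/1/2
to_class = {-1: 0, 0: 1, 1: 2}
to_sentiment = {cls: s for s, cls in to_class.items()}

all_tweets = labeled_tweets + unlabeled_tweets
y = np.array([to_class[s] for s in sentiments] + [-1] * len(unlabeled_tweets))

# Unigram + bigram TF-IDF features; dense is fine for a dataset this small
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(all_tweets).toarray()

# Self-training: fit on the labeled rows, then iteratively add confident
# predictions on unlabeled rows to the training set
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y)

# Predict the unlabeled rows and map back to the -1/0/+1 coding
predicted = [int(to_sentiment[int(c)]) for c in model.predict(X[len(labeled_tweets):])]
for tweet, label in zip(unlabeled_tweets, predicted):
    print(label, tweet)
```

With only six labeled tweets the predictions should not be expected to be reliable; the sketch only shows how the pieces fit together.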

Answer 2

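I would suggest analyzing the polarity of the sentences or tweets in this context. This can be done with the textblob library, which can be installed with pip install -U textblob. Once the polarity of the text has been found, it can be stored as a separate column in the dataframe. The sentence polarity can then be used for further analysis.

Initial code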

from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

Intermediate result

    Date     ...                                  sentiment
0  1/1/2020  ...                                 (0.0, 0.0)
1  2/1/2020  ...                                 (0.0, 0.0)
2  3/2/2020  ...                                 (0.0, 0.1)
3  4/2/2020  ...  (-0.6999999999999998, 0.6666666666666666)
4  5/2/2020  ...                                 (0.5, 0.6)

[5 rows x 4 columns]

From the sentiment column (in the output above), we can see that each value has two parts: polarity and subjectivity.

Polarity is a float value within the range [-1.0 to 1.0] where 0 indicates neutral, +1 indicates a very positive sentiment and -1 represents a very negative sentiment.

Subjectivity is a float value within the range [0.0 to 1.0] where 0.0 is very objective and 1.0 is very subjective. A subjective sentence expresses personal feelings, views, beliefs, opinions, allegations, desires, suspicions, and speculations, whereas objective sentences are factual.

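Note that each value in the sentiment column is a tuple, so we can split it into two columns with df1 = pd.DataFrame(df['sentiment'].tolist(), index=df.index). We can then create a new dataframe and append the split columns to it, as shown: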

# copy the original dataframe so that it is not modified in place
df_new = df.copy()
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
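
As a quick check of the splitting step, here is a small sketch that runs without textblob installed. The Sentiment namedtuple is a stand-in I am assuming for the object that TextBlob(...).sentiment returns, the scores are rounded values taken from the intermediate result above, and using join is just one possible way to attach the new columns.

```python
from collections import namedtuple

import pandas as pd

# Stand-in for the namedtuple returned by TextBlob(...).sentiment
# (an assumption, so the sketch does not depend on textblob)
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

df = pd.DataFrame({
    "Tweet": ["the sky is blue", "the weather is bad", "i love apples"],
    "sentiment": [
        Sentiment(0.0, 0.1),
        Sentiment(-0.7, 0.667),
        Sentiment(0.5, 0.6),
    ],
})

# A list of namedtuples becomes a frame whose columns are the field names
df1 = pd.DataFrame(df["sentiment"].tolist(), index=df.index)

# join() returns a new frame and leaves df itself untouched
df_new = df.join(df1)
print(df_new[["Tweet", "polarity", "subjectivity"]])
```

Because df1 keeps the index of df, the polarity and subjectivity values line up with the right tweets even if df has a non-default index.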

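Finally, based on the sentence polarity found earlier, we can add a label to the dataframe that indicates whether the tweet is positive, negative, or neutral.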

import numpy as np
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)
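
To get labels in the question's +1/-1/0 coding rather than strings, one option is sketched below. The neutral_band tolerance is a hypothetical parameter (0.1 is an arbitrary illustrative value, not something TextBlob prescribes), and the PosNegNeu column name is borrowed from the question. The polarity values are the ones from the example data.

```python
import numpy as np
import pandas as pd

# Polarity values copied from the example output for the five dummy tweets
df_new = pd.DataFrame({
    "Tweet": ["the weather is sunny", "tom likes harry", "the sky is blue",
              "the weather is bad", "i love apples"],
    "polarity": [0.0, 0.0, 0.0, -0.7, 0.5],
})

# Hypothetical tolerance: scores closer to zero than this count as neutral
neutral_band = 0.1

# np.sign maps each polarity to -1, 0 or +1; np.where forces everything
# inside the neutral band to 0
df_new["PosNegNeu"] = np.where(
    df_new["polarity"].abs() < neutral_band,
    0,
    np.sign(df_new["polarity"]),
).astype(int)
print(df_new)
```

Setting neutral_band to 0 reproduces the strict rule above, where only a polarity of exactly 0 counts as neutral.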

In the end, the result looks like this:

Final result

       Date  ID                 Tweet  ... polarity  subjectivity     label
0  1/1/2020   1  the weather is sunny  ...      0.0      0.000000   neutral
1  2/1/2020   2       tom likes harry  ...      0.0      0.000000   neutral
2  3/2/2020   3       the sky is blue  ...      0.0      0.100000   neutral
3  4/2/2020   4    the weather is bad  ...     -0.7      0.666667  negative
4  5/2/2020   5         i love apples  ...      0.5      0.600000  positive

[5 rows x 7 columns]

Data

import pandas as pd

# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
    "ID":[1,2,3,4,5],
    "Tweet":["the weather is sunny",
             "tom likes harry", "the sky is blue",
             "the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)

Full code

# create some dummy data
import pandas as pd
import numpy as np

# create a dictionary
data = {"Date":["1/1/2020","2/1/2020","3/2/2020","4/2/2020","5/2/2020"],
        "ID":[1,2,3,4,5],
        "Tweet":["the weather is sunny",
                 "tom likes harry", "the sky is blue",
                 "the weather is bad","i love apples"]}
# convert data to dataframe
df = pd.DataFrame(data)

from textblob import TextBlob
df['sentiment'] = df['Tweet'].apply(lambda Tweet: TextBlob(Tweet).sentiment)
print(df)

# split the sentiment column into two
df1=pd.DataFrame(df['sentiment'].tolist(), index= df.index)

# append the split columns to a copy of the original dataframe
df_new = df.copy()
df_new['polarity'] = df1['polarity'].astype(float)
df_new['subjectivity'] = df1['subjectivity'].astype(float)
print(df_new)

# add label to dataframe based on condition
conditionList = [
    df_new['polarity'] == 0,
    df_new['polarity'] > 0,
    df_new['polarity'] < 0]
choiceList = ['neutral', 'positive', 'negative']
df_new['label'] = np.select(conditionList, choiceList, default='no_label')
print(df_new)

python

machine-learning

sentiment-analysis

pandas