Assistance Creating Bag of Words Model
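Disclaimer: this is part of a homework assignment.
I have a set of tweets, and I need to create a classifier that tries to predict their sentiment. I plan to do this by building a bag-of-words model and applying an SVM with a radial kernel to the data.
Here is the raw data, to give you an idea: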
> original_tweets
# A tibble: 2,385 x 3
tweet_id sentiment text
<int> <chr> <chr>
1 1 positive @TylerSkewes: It is almost 2014. Where are the self-driving cars so we don't have to worry about a DD tonight. Forreal tho
2 2 positive @WIRED: BMW builds a self-driving car -- that drifts I love this technology. Drive me to work baby!
3 3 positive Google better hurry up with that driverless car. Watching grandma do an 8 point turn to get in a parking spot is horrific.
4 4 positive I just waved thank you to this lady that let me merge on the highway and she gave me the finger. Need my self driving car.
5 5 positive I might be the only person who starts #cheering in their car when they see a @google car :) #happiness #feelslikeChristmas
6 6 positive I want the driverless car, and BAD. Seriously I would be happy if tomorrow morning there were no drivers behind the wheel.
7 7 positive I'm over here writing a 2000 word essay while *****s at Google are on driverless cars making ground breaking shit. Damn. _
8 8 positive Is it crazy to think that self driving cars will be the biggest innovation of the last few decades?
9 9 positive Its very nice!RT @cdixon: It's awesome that Google is investing in futuristic stuff like AR glasses and self-driving cars.
10 10 positive Look closely you will see the reflection of a google car !!!! Screen shot from google maps !!!!!
# ... with 2,375 more rows
>
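I edited some of the terms slightly because they contained URLs, but you get the idea.
I have put the data into tidy format and calculated the TF-IDF score for each term. For my feature space, I selected the top 1000 terms with the highest IDF scores.
Here is a sample of my data: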
> feature_space
# A tibble: 3,000 x 7
tweet_id sentiment word n tf idf tf_idf
<int> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 1 positive forreal 1 0.0435 7.78 0.338
2 2 positive drifts 1 0.0476 7.78 0.370
3 2 positive rprjtelkg6 1 0.0476 7.78 0.370
4 5 positive cheering 1 0.0455 7.78 0.353
5 5 positive feelslikechristmas 1 0.0455 7.78 0.353
6 7 positive 2000 1 0.0476 7.78 0.370
7 7 positive *****s 1 0.0476 7.78 0.370
8 8 positive decades 1 0.0417 7.78 0.324
9 8 positive vltlymug89 1 0.0417 7.78 0.324
10 9 positive ar 1 0.0476 7.78 0.370
# ... with 2,990 more rows
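As a sanity check on the numbers above, here is a small sketch of the arithmetic behind the first row. It assumes the common definition idf = ln(number of tweets / number of tweets containing the term); that definition is an assumption, but it reproduces the printed 7.78 if "forreal" occurs in exactly one of the 2,385 tweets.

```r
# Assumption: idf = ln(N / df), with N = number of tweets and
# df = number of tweets that contain the term.
n_tweets <- 2385          # rows in original_tweets
tweets_with_term <- 1     # assumed: "forreal" occurs in a single tweet
idf <- log(n_tweets / tweets_with_term)

tf <- 0.0435              # tf for "forreal" as printed in feature_space
tf_idf <- tf * idf

round(idf, 2)             # compare with the idf column
round(tf_idf, 3)          # compare with the tf_idf column
```

Under this definition every term that appears in only one tweet gets the same maximal IDF, which would explain why all ten sample rows show 7.78.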
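I want to use these TF-IDF scores to build a bag-of-words model for a sentiment classifier. For this model, I know I need to set up my data frame so that each tweet is a row and the TF-IDF weight of each word in my feature space is a column.
I am having trouble figuring out how best to reshape a tibble or data frame to get the data into this format. I have tried various combinations of mutate() and join(), but the result is never what I want.
How can I quickly add 3000 or more columns to a data frame or tibble based on a set of feature words, and fill this sparse data structure with their TF-IDF values? I don't necessarily need a direct code answer, but a step in the right direction on how to do this in R would help me a lot.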
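One possible direction is a long-to-wide reshape with tidyr's pivot_wider(). The sketch below uses small made-up stand-ins (feature_space_demo, all_tweets_demo) with the same column names as the real data; the values and the five-tweet size are hypothetical. Tweets that contain no feature word are missing from the long table, so they are added back with a join and their gaps are filled with 0.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for feature_space (the real one also has n, tf, idf)
feature_space_demo <- data.frame(
  tweet_id  = c(1L, 2L, 2L, 5L),
  sentiment = "positive",
  word      = c("forreal", "drifts", "rprjtelkg6", "cheering"),
  tf_idf    = c(0.338, 0.370, 0.370, 0.353),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for original_tweets without the text column
all_tweets_demo <- data.frame(tweet_id = 1:5, sentiment = "positive",
                              stringsAsFactors = FALSE)

# One row per tweet, one column per word; id_cols keeps other columns
# (such as n, tf, idf in the real data) from splitting the rows
wide_demo <- pivot_wider(
  feature_space_demo,
  id_cols     = c(tweet_id, sentiment),
  names_from  = word,
  values_from = tf_idf,
  values_fill = list(tf_idf = 0)
)

# Add back tweets with no feature words and set their weights to 0
bag_demo <- all_tweets_demo %>%
  left_join(wide_demo, by = c("tweet_id", "sentiment")) %>%
  mutate(across(-c(tweet_id, sentiment), ~ replace_na(.x, 0)))
```

With the real data, the same calls would be applied to feature_space and to original_tweets minus its text column.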
Update: I now have an empty tibble for my bag of words, and I just need to fill in the non-zero TF-IDF values from my data. Here it is:
> bag_of_words
# A tibble: 2,385 x 3,002
tweet_id sentiment forreal drifts rprjtelkg6 cheering feelslikechristmas `2000` *****s decades vltlymug89 ar closely reflection zg7hvvfgpn
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
5 5 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
6 6 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
7 7 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
8 8 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
9 9 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
10 10 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
# ... with 2,375 more rows, and 2,987 more variables
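OK, I think I have a solution. I am definitely curious how to do this without for loops, but I am still not very comfortable with the apply() style of coding.
Here is what I came up with: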
library(dplyr)

# create bag of words model
# get tweet_id and sentiment
bag_of_words <- original_tweets %>%
  select(-text)

# get the distinct words from the feature space
feature_words <- unique(feature_space$word)

# generate empty (zero-filled) columns, one per feature word
for (w in feature_words) {
  bag_of_words[[w]] <- 0
}

# fill in columns with values from feature space, one feature_space row at a time
for (i in seq_len(nrow(feature_space))) {
  word  <- feature_space$word[i]
  # look up the row by tweet_id rather than assuming tweet_id equals the row number
  tweet <- match(feature_space$tweet_id[i], bag_of_words$tweet_id)
  score <- feature_space$tf_idf[i]
  bag_of_words[tweet, word] <- score
}
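The fill loop above can also be written without a for loop and without apply(), using base R matrix indexing: a two-column matrix of (row, column) positions assigns every score in one step. This is only a sketch on made-up stand-ins (tweets_demo and feature_space_demo are hypothetical), not a drop-in replacement.

```r
# Hypothetical stand-ins for original_tweets (minus text) and feature_space
tweets_demo <- data.frame(tweet_id = c(1L, 2L, 3L), sentiment = "positive",
                          stringsAsFactors = FALSE)
feature_space_demo <- data.frame(
  tweet_id = c(1L, 2L, 2L),
  word     = c("forreal", "drifts", "rprjtelkg6"),
  tf_idf   = c(0.338, 0.370, 0.370),
  stringsAsFactors = FALSE
)

feature_words <- unique(feature_space_demo$word)

# Zero-filled numeric matrix: rows are tweets, columns are feature words
weights <- matrix(0, nrow = nrow(tweets_demo), ncol = length(feature_words),
                  dimnames = list(NULL, feature_words))

# Each row of idx is one (row, column) cell to fill
idx <- cbind(match(feature_space_demo$tweet_id, tweets_demo$tweet_id),
             match(feature_space_demo$word, feature_words))
weights[idx] <- feature_space_demo$tf_idf

# Attach tweet_id and sentiment back onto the weights
bag_demo <- cbind(tweets_demo, as.data.frame(weights))
```

Working on a plain numeric matrix and converting at the end avoids repeated single-cell assignment into a tibble, which is the slow part of the loop version.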
Checking the output, it looks good:
> bag_of_words
# A tibble: 2,385 x 3,002
tweet_id sentiment forreal drifts rprjtelkg6 cheering feelslikechristmas `2000` *****s decades vltlymug89 ar closely reflection zg7hvvfgpn
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 positive 0.338 0 0 0 0 0 0 0 0 0 0 0 0
2 2 positive 0 0.370 0.370 0 0 0 0 0 0 0 0 0 0
3 3 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
5 5 positive 0 0 0 0.353 0.353 0 0 0 0 0 0 0 0
6 6 positive 0 0 0 0 0 0 0 0 0 0 0 0 0
7 7 positive 0 0 0 0 0 0.370 0.370 0 0 0 0 0 0
8 8 positive 0 0 0 0 0 0 0 0.324 0.324 0 0 0 0
9 9 positive 0 0 0 0 0 0 0 0 0 0.370 0 0 0
10 10 positive 0 0 0 0 0 0 0 0 0 0 0.370 0.370 0.370
# ... with 2,375 more rows, and 2,987 more variables
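In retrospect, I may have made this harder on myself than I needed to, but I would definitely love to see any more efficient ways of accomplishing this from seasoned R vets. Cheers.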
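Since almost every cell in the table is 0, another option worth considering is to skip the wide data frame and store only the non-zero weights in a sparse matrix from the Matrix package. The sketch below uses made-up stand-ins (feature_space_demo and tweet_ids are hypothetical); whether a given SVM implementation accepts this class directly is something to check in its documentation.

```r
library(Matrix)

# Hypothetical stand-in for feature_space
feature_space_demo <- data.frame(
  tweet_id = c(1L, 2L, 2L),
  word     = c("forreal", "drifts", "rprjtelkg6"),
  tf_idf   = c(0.338, 0.370, 0.370),
  stringsAsFactors = FALSE
)
tweet_ids     <- 1:3   # hypothetical: all tweet ids, including tweets with no feature word
feature_words <- unique(feature_space_demo$word)

# Only the listed (i, j, x) triplets are stored; every other cell is an implicit 0
bag_sparse <- sparseMatrix(
  i = match(feature_space_demo$tweet_id, tweet_ids),
  j = match(feature_space_demo$word, feature_words),
  x = feature_space_demo$tf_idf,
  dims = c(length(tweet_ids), length(feature_words)),
  dimnames = list(as.character(tweet_ids), feature_words)
)
```

The sentiment labels would be kept in a separate vector in the same tweet order, since a sparse numeric matrix cannot hold a character column.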