Predict a numeric variable from a text variable using word embeddings in R
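I have a text variable containing movie reviews and another variable containing ratings. I would like to try to predict the ratings from the text reviews.
Here is some example data: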
library(tibble) # provides tibble()

movie_reviews <- c("I really loved the movie plot", "This movie really sucked", "I really found this movie thought provoking", "ahh what a boring movie", "A wonderful movie, with a wonderful end", "Great action movie: Very thrilling", "Worst movie ever, it never stopped being cheesy", "Enjoying, feelgood movie for the entire family", "I will definitely watch this movie again")
movie_ratings <- c(8, 2, 6, 3, 9, 8.5, 3.5, 9.5, 7.5)
movie_df <- tibble(movie_reviews, movie_ratings)
Thanks.
You can use the text package for this:
library(text)

# Create word embedding representations of your text
help(textEmbed) # open the documentation for the available options
reviews_embeddings <- textEmbed(movie_df,
                                model = "bert-base-uncased", # select the model you want from Hugging Face
                                layers = 11:12)              # select which layers you want to use

# Train a ridge regression model that predicts the ratings from the embeddings
# (depending on the installed version of text, the embeddings may instead be
# stored under reviews_embeddings$texts$movie_reviews)
reviews_rating_model <- textTrain(reviews_embeddings$movie_reviews,
                                  movie_df$movie_ratings)

# See the results
reviews_rating_model
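For intuition about what "training the embeddings to the numeric variable using ridge regression" means, here is a minimal base R sketch. It is not the text package's implementation. The embedding matrix is simulated (9 rows, 5 columns), the penalty `lambda = 1` is fixed rather than tuned, and the helpers `ridge_fit` and `ridge_predict` are hypothetical names made up for this illustration. It fits a ridge model on all rows but one, predicts the held-out row, and then correlates the held-out predictions with the observed values.

```r
# Simulated stand-in for sentence embeddings: 9 texts x 5 dimensions.
# These numbers are made up; real BERT embeddings have far more dimensions.
set.seed(1)
n <- 9
p <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- as.numeric(X %*% c(1, -1, 0.5, 0, 0) + rnorm(n, sd = 0.3))

# Closed-form ridge fit; the intercept is handled by centring and is not penalised.
ridge_fit <- function(X, y, lambda) {
  x_means <- colMeans(X)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)),
                crossprod(Xc, y - mean(y)))
  list(beta = beta, x_means = x_means, y_mean = mean(y))
}

# Predict for new rows using the centring values stored at fit time.
ridge_predict <- function(fit, X_new) {
  as.numeric(sweep(X_new, 2, fit$x_means) %*% fit$beta + fit$y_mean)
}

# Leave-one-out cross-validation: each row is predicted by a model that never saw it.
loo_pred <- vapply(seq_len(n), function(i) {
  fit <- ridge_fit(X[-i, , drop = FALSE], y[-i], lambda = 1)
  ridge_predict(fit, X[i, , drop = FALSE])
}, numeric(1))

# One-sided test: are held-out predictions positively correlated with y?
cv_test <- cor.test(loo_pred, y, alternative = "greater")
print(cv_test)
```

Evaluating on held-out predictions matters here because with only nine reviews and hundreds of embedding dimensions, an in-sample fit would look nearly perfect regardless of whether the model generalises.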
Printing reviews_rating_model gives these results:
$results
Pearson's product-moment correlation
data: predy_y$predictions and predy_y$y
t = 5.621, df = 7, p-value = 0.0003991
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
0.6785761 1.0000000
sample estimates:
cor
0.9047823
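To read this output: it is a one-sided Pearson correlation test between the predicted and the observed ratings. As a rough check, the reported t statistic, p-value and lower confidence bound can be recomputed from the correlation and the degrees of freedom using the standard formulas behind `cor.test`. The sketch below uses base R only, and the two input numbers are copied from the output above.

```r
r  <- 0.9047823  # sample correlation copied from the output
df <- 7          # degrees of freedom copied from the output
n  <- df + 2     # number of reviews (9)

# t statistic for a Pearson correlation
t_stat <- r * sqrt(df / (1 - r^2))

# One-sided p-value, matching "true correlation is greater than 0"
p_value <- pt(t_stat, df, lower.tail = FALSE)

# Lower bound of the one-sided 95% interval via the Fisher z transformation
ci_low <- tanh(atanh(r) - qnorm(0.95) / sqrt(n - 3))

print(c(t = t_stat, p = p_value, ci_low = ci_low))
```

The upper bound of 1 in the output follows from the test being one-sided. With nine observations the interval is wide, so the correlation of about 0.90 should be treated as a rough estimate.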