Arguments should have same length error in R

Question

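I am trying to create a key-value store in which the key is an entity and the value is that entity's average sentiment score across news articles.

I have a data frame of news articles and a list called organization1 of entities that a classifier identified in those articles. The first entry of the organization1 list contains the entities identified in the article in the first row of the news_us data frame. I am trying to iterate over the organization1 list and build a key-value store where the key is an entity name from organization1 and the value is the sentiment score of the news descriptions that mention that entity.

I can get the sentiment score for an entity from a single article, but I want to add the scores together and average them.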

library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization", 
    NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC", 
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN", 
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC", 
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat", 
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR", 
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.", 
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.", 
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.", 
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.", 
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?", 
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org", 
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L, 
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L, 
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA, 
6L), class = "data.frame")

setNames(lapply(news_us$description, get_sentiment), unlist(organization1))

#$US
#[1] 0

#$Bath
#[1] -0.4

#$`Animal Crossing`
#[1] -0.1

#$`World Health Organization`
#[1] 1.1

#$Microsoft
#[1] -0.6

#$Facebook
#[1] -1.9

tapply(sapply(news_us$description, get_sentiment), unlist(organization1), mean) #this line throws the error

Answer 1

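Your problem appears to be caused by the use of unlist. Avoid it here, because it drops NULL values and flattens list entries that hold more than one value. Your organization1 list has 7 entries (two of them NULL and one of length 2). To match the news_us data.frame you should have 6 entries, so something is out of sync there.

Let's assume the first 6 entries in organization1 are correct; I would bind them to your data.frame to avoid further sync errors: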
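A minimal base-R sketch of what unlist does to the list from the question. The variable names n_entries, n_flat and per_entry are illustrative only and do not appear in the original code.

```r
# Same list as in the question.
organization1 <- list(NULL, "US", "Bath", "Animal Crossing",
                      "World Health Organization",
                      NULL, c("Microsoft", "Facebook"))

n_entries <- length(organization1)          # number of list entries
n_flat    <- length(unlist(organization1))  # entities left after unlist()
per_entry <- lengths(organization1)         # entities found in each entry

print(n_entries)
print(n_flat)
print(per_entry)
```

After flattening, the positions no longer say which article an entity came from: the NULL entries vanish and the last entry contributes two names.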

news_us$organization1 = organization1[1:6]

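Next, you need to run the sentiment analysis on each row of the data.frame and bind the result to the organization1 value(s). The code below may not be the most elegant way to do this, but I think it does what you need: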

results = do.call("rbind", apply(news_us, 1, function(item){
    if(!is.null(item$organization1[[1]])) cbind(item$organization1, get_sentiment(item$description))
}))

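This code drops every row in which no organization1 value was detected. Where more than one organization1 was detected, it should also duplicate the sentiment score. The result will look like this (which I believe is what you are after):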

     [,1]                        [,2]  
[1,] "US"                        "-0.4"
[2,] "Bath"                      "-0.1"
[3,] "Animal Crossing"           "1.1" 
[4,] "World Health Organization" "-0.6"

You can then collapse this to an average score per organization using by, aggregate, or similar.

[Edit: examples of by and aggregate]

# results is a character matrix, so select the columns by position, not with $
by(as.numeric(results[, 2]), results[, 1], mean)

aggregate(as.numeric(results[, 2]), list(results[, 1]), mean)
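As a possible base-R alternative, the sketch below makes the same assumption that only the first 6 entries of organization1 are valid. To stay self-contained it copies the sentiment scores from the output shown in the question instead of calling get_sentiment again; the names scores, orgs, expanded and avg are illustrative and not part of the original code.

```r
# Scores copied from the question's output, one per news_us row (an assumption
# made to avoid depending on syuzhet here).
scores <- c(0, -0.4, -0.1, 1.1, -0.6, -1.9)

# Assumed alignment: the first 6 entries of organization1, one per row.
orgs <- list(NULL, "US", "Bath", "Animal Crossing",
             "World Health Organization", NULL)

# Repeat each score once per entity found in that row (zero times for NULL),
# so the scores and the flattened entity names have the same length.
expanded <- rep(scores, lengths(orgs))

# Mean score per entity; the result is named by entity.
avg <- tapply(expanded, unlist(orgs), mean)
print(avg)
```

Because rep uses lengths(orgs) as the repeat counts, both arguments to tapply always have the same length, which is the condition the original call violated. In this small sample each entity occurs once, so each mean equals the single score; with the full data, repeated entities would be averaged.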
