R Text Mining - the most frequent word in string across entire data frame
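I am trying to get to grips with text mining and determining word frequencies. I have only just started learning R and its packages, and I recently came across tm (after looking at it for a while, I think it may solve my problem).
My question is: how do I determine the two most frequently used words in the strings across an entire column?
I have the following example: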
df <- structure(list(Location = c("Chicago", "Chicago", "Chicago",
"LA", "LA", "LA", "LA", "LA", "LA", "Texas", "Texas", "Texas",
"Texas", "Texas"), Code = c(4450L, 4450L, 4450L, 4450L, 4450L,
4450L, 4450L, 4450L, 4450L, 4410L, 4410L, 4410L, 4410L, 4410L
), Description = c("LABOR - CROSSOVER BOARD BRACKET", "LABOR - CROWN DOOR GASKET",
"LABOR - CROWN DOOR GASKET - APPLY 4' NEW GASKET AND RE-APPLY",
"LABOR - CUSHIONING DEVICE - END OF CAR CUSTOMER SUPPLIED MATERIAL",
"LABOR - DOOR EDGE", "LABOR - DOOR GASKET, CROWN CORNER", "LABOR - DOOR LOCK POCKET STG",
"LABOR - DOOR LOCK RECEPTICALS STG", "LABOR - DOOR LOCK STG",
"BOLT, HT, UNDER 5/8\"\" DIA & 6\"\" - SIDE POST", "BOLT, HT, UNDER 5/8\"\" DIA & 6\"\" - TRAINLINE TROLLEY",
"BOLT,HT,5/8 IN.DIA.OR LESS UNDER 6\"\" LONG - BRAKE STEP", "BOLT,HT,5/8 IN.DIA.OR LESS UNDER 6\"\" LONG - CROSSOVER BOARD",
"BOLT,HT,5/8 IN.DIA.OR LESS UNDER 6\"\" LONG - CROSSOVER BOARD BRACKET"
), `Desired Description Based on frequency` = c("Labor - CROWN DOOR GASKET",
"Labor - CROWN DOOR GASKET", "Labor - CROWN DOOR GASKET", "Labor - DOOR LOCK",
"Labor - DOOR LOCK", "Labor - DOOR LOCK", "Labor - DOOR LOCK",
"Labor - DOOR LOCK", "Labor - DOOR LOCK", "Bolt - HT", "Bolt - HT",
"Bolt - HT", "Bolt - HT", "Bolt - HT")), .Names = c("Location",
"Code", "Description", "Desired Description Based on frequency"
), row.names = c(NA, -14L), class = "data.frame")
In the end, I would like to add this column:
Desired Description Based on frequency
Labor - CROWN DOOR GASKET
Labor - CROWN DOOR GASKET
Labor - CROWN DOOR GASKET
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Labor - DOOR LOCK
Bolt - HT
Bolt - HT
Bolt - HT
Bolt - HT
Bolt - HT
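Basically, I want to evaluate all rows with code 4450 or 4410, look at all the descriptions in the table, find which words are most common, and add the result as a column. Another condition would be based on location. Can someone help me with a simple example?
Thank you very much
I don't think there is a one-size-fits-all solution to your problem. (For a start, there is no exact rule for which words, or how many of them, appear in the descriptions.) However, here are two quick and dirty approaches that may help as a starting point: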
library(tm)
# Keep capital letters only; every other character becomes a space
txts <- gsub("[^A-Z]", " ", df$Description)
# One group per Location/Code combination
groups <- paste(df$Location, df$Code)
# 1: the three most frequent terms in each group
opts <- list(tolower = FALSE, removePunctuation = TRUE, wordLengths = c(2, Inf))
lst <- split(txts, groups)
res <- sapply(lst, function(x) {
  freq <- termFreq(paste(x, collapse = " "), opts) / length(x)
  paste(names(freq[rank(-freq, ties.method = "first") <= 3]), collapse = " - ")
})
rep(res, lengths(lst))
# 2: take the first three words of each description and keep those whose
# count is at least half the number of rows in the group
lst <- lapply(strsplit(txts, "\\s+"), function(x) head(x, 3))
lst <- split(lst, groups)
n <- lengths(lst)
lst <- mapply("/", lapply(lst, function(x) sort(table(unlist(x)), decreasing = TRUE)), n, SIMPLIFY = FALSE)
rep(sapply(lst, function(x) paste(names(x)[x >= .5], collapse = " - ")), n)
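If you would rather not depend on tm, the same idea can be sketched in base R. This is only a rough illustration under my own assumptions, not a definitive implementation: `common_words` and the `min_share` cutoff are hypothetical names, the data frame is a six-row subset of the example data, and the rule "keep every word that occurs in at least half of a group's descriptions" is one arbitrary choice among many.

```r
# Six rows taken from the example data, enough to show two groups.
df <- data.frame(
  Location = c("Chicago", "Chicago", "Chicago", "LA", "LA", "LA"),
  Code = c(4450L, 4450L, 4450L, 4450L, 4450L, 4450L),
  Description = c(
    "LABOR - CROSSOVER BOARD BRACKET",
    "LABOR - CROWN DOOR GASKET",
    "LABOR - CROWN DOOR GASKET - APPLY 4' NEW GASKET AND RE-APPLY",
    "LABOR - DOOR LOCK POCKET STG",
    "LABOR - DOOR LOCK RECEPTICALS STG",
    "LABOR - DOOR LOCK STG"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: words found in at least `min_share` of the descriptions,
# returned in order of first appearance.
common_words <- function(descriptions, min_share = 0.5) {
  # Replace every run of non-capital-letter characters with one space.
  cleaned <- trimws(gsub("[^A-Z]+", " ", descriptions))
  # Tokenise and count each word at most once per description.
  words <- lapply(strsplit(cleaned, "\\s+"), function(w) unique(w[nzchar(w)]))
  all_words <- unlist(words)
  vocab <- unique(all_words)
  # Share of descriptions in which each vocabulary word occurs.
  share <- tabulate(match(all_words, vocab), nbins = length(vocab)) /
    length(descriptions)
  paste(vocab[share >= min_share], collapse = " ")
}

# One label per Location/Code group, then mapped back onto every row.
groups <- paste(df$Location, df$Code)
labels <- vapply(split(df$Description, groups), common_words, character(1))
df$Common <- unname(labels[groups])

print(unique(df[c("Location", "Code", "Common")]))
# Chicago 4450 -> "LABOR CROWN DOOR GASKET"
# LA      4450 -> "LABOR DOOR LOCK STG"
```

Because the labels are looked up by group name, the new column lines up with the rows even if the data frame is not sorted by group. Note that "STG" survives in the LA group because it occurs in all three sample rows, so you would still need some manual rule or stop-word list to arrive at exactly the labels you listed.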