Calculating top trigrams

I have a test file of article titles (test$title) and their total social shares (test$total_shares). I can find the most commonly used trigrams using, say:

library(tau)
trigrams = textcnt(test$title, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
head(trigrams, 20)

However, what I would like to do is rank the top trigrams by average shares rather than by number of occurrences.

I can find the average shares of any specific trigram using grepl, e.g.

library(dplyr)  # filter() here is assumed to be dplyr's
HowTo <- filter(test, grepl('how to create', title, ignore.case = TRUE))

Then use:

summary(HowTo)

to see the average shares for headlines containing that trigram.
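
For a single trigram, the same figure can also be read off directly by averaging total_shares over the matching rows. A minimal base R sketch, assuming a small made-up data frame (toy, with invented values) in place of the real file:

```r
# Made-up stand-in for the real data (values are illustrative only)
toy <- data.frame(
  title = c("How to create a course",
            "How To Create better quizzes",
            "Something else entirely"),
  total_shares = c(100, 300, 50),
  stringsAsFactors = FALSE
)

# Logical index of the headlines that contain the trigram, ignoring case
hit <- grepl("how to create", toy$title, ignore.case = TRUE)

# Average shares over the matching headlines only
avg_shares <- mean(toy$total_shares[hit])
avg_shares
```

This still has to be repeated for every trigram of interest, which is the time-consuming part.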

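But this is a time-consuming process. What I would like to do is calculate the top trigrams in the dataset by average shares. Thanks for any help.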

Here is an example dataset: https://d380wq8lfryn3c.cloudfront.net/wp-content/uploads/2017/06/16175029/test4.csv

I tend to remove non-ASCII characters from the titles using

test$title <- sapply(test$title,function(row) iconv(row, from = "UTF-8", to = "ASCII", sub=""))

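Right, this was a bit tricky. I broke it down into manageable chunks and then strung them together, which means I may have missed some shortcuts, but at least it seems to work.

Oh, and I forgot to say: if you use textcnt() on the whole vector as you did, trigrams will be created that consist of the end of one headline and the beginning of the next. I assumed this to be undesirable and found a way to circumvent it.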
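
To illustrate the boundary problem without depending on tau, here is a rough base R sketch. word_trigrams() is a hypothetical helper written for this example, not part of tau, and pooling the headlines into one string is only my assumption about how the cross-headline trigrams arise:

```r
# Hypothetical helper: word trigrams of ONE string, in base R
word_trigrams <- function(s) {
  w <- strsplit(s, " +")[[1]]
  w <- w[nzchar(w)]                      # drop empty tokens
  if (length(w) < 3) return(character(0))
  i <- seq_len(length(w) - 2)
  paste(w[i], w[i + 1], w[i + 2])
}

titles <- c("one two three", "four five six")

# Pooling the headlines into one string creates trigrams across the boundary
pooled <- word_trigrams(paste(titles, collapse = " "))

# Processing each headline on its own keeps every trigram inside a headline
per_title <- unlist(lapply(titles, word_trigrams))

# Trigrams that span two headlines
setdiff(pooled, per_title)
```

Only the per-headline version ties every trigram to exactly one headline, and therefore to one share count.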

library(tau)
library(magrittr)

test0 <- read.csv(paste0("https://d380wq8lfryn3c.cloudfront.net/",
                  "wp-content/uploads/2017/06/16175029/test4.csv"),
                  header=TRUE, stringsAsFactors=FALSE)

test0[7467,] #problematic line

test <- test0
# test <- head(test0, 20)
test$title <- iconv(test$title, from="UTF-8", to="ASCII", sub=" ")
test$title <- test$title %>% 
  tolower %>% 
  gsub("[,/]", " ", .) %>%    #replace , and / with space
  gsub("[^a-z ]", "", .) %>%  #keep only letters and spaces
  gsub(" +", " ", .) %>%      #shrink multiple spaces to one
  gsub("^ ", "", .) %>%       #remove leading spaces
  gsub(" $", "", .)           #remove trailing spaces

test[7467,] #problematic line resolved

# compute the trigrams of each headline separately, so that none spans
# the end of one headline and the start of the next
trigrams <- lapply(test$title, 
  function(s) names(textcnt(s, n=3, method="string")))

# one row per trigram occurrence; each headline's share count is repeated
# once per trigram (reading the counts back from the names of the combined
# vector does not work: c() appends an index to the name of every element
# that comes from a vector longer than one)
trigrams.df <- data.frame(trigrams=unlist(trigrams),
  shares=rep(test$total_shares, lengths(trigrams)))

# aggregate shares by trigram. The share counts of identical trigrams
# are summarized using some function (sum, mean, median etc.)
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df, sum)

# more than one statistic can be calculated
trigrams_share <- aggregate(shares ~ trigrams, data=trigrams.df,
  FUN=function(x) c(mean=mean(x), sum=sum(x), nhead=length(x)))
trigrams_share <- do.call(data.frame, trigrams_share)
trigrams_share[[1]] <- as.character(trigrams_share[[1]])

# top five trigrams by average number of shares,
# of those that were found in three or more headlines
trigrams_share <- trigrams_share[order(
  trigrams_share[["shares.mean"]], decreasing=TRUE), ]
head(trigrams_share[trigrams_share[["shares.nhead"]] >= 3, ], 5)
# columns: trigrams, shares.mean, shares.sum, shares.nhead
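
If you want to check the logic on something small first, the same idea can be sketched in base R on a toy data frame. Everything here is made up for illustration: the toy headlines and share counts, and word_trigrams(), which is a hypothetical stand-in for names(textcnt(s, n=3, method="string")):

```r
# Hypothetical helper: word trigrams of one cleaned headline
word_trigrams <- function(s) {
  w <- strsplit(s, " +")[[1]]
  w <- w[nzchar(w)]
  if (length(w) < 3) return(character(0))
  i <- seq_len(length(w) - 2)
  paste(w[i], w[i + 1], w[i + 2])
}

# Made-up headlines and share counts
toy <- data.frame(
  title = c("top tips for teachers", "top tips for students",
            "my top tips for"),
  total_shares = c(10, 30, 20),
  stringsAsFactors = FALSE
)

# Trigrams per headline, as a list with one element per headline
tri <- lapply(toy$title, word_trigrams)

# One row per trigram occurrence, carrying that headline's share count
tri.df <- data.frame(
  trigram = unlist(tri),
  shares = rep(toy$total_shares, lengths(tri)),
  stringsAsFactors = FALSE
)

# Mean shares per trigram, highest first
avg <- aggregate(shares ~ trigram, data = tri.df, FUN = mean)
avg <- avg[order(avg$shares, decreasing = TRUE), ]
avg
```

Here "top tips for" occurs in all three headlines, so its mean is (10 + 30 + 20) / 3 = 20, while "tips for students" occurs once and keeps its 30.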

In case the link breaks

# dput(head(test0, 20)):

test <- structure(list(
title = c("Top 3 Myths About BYOD In The Classroom - eLearning Industry", 
"The Emotional Weight of Being Graded, for Better or Worse", 
"Online learning startup Coursera raises M at an 0M valuation",
"LinkedIn doubles down on education with LinkedIn Learning, updates desktop site",
"Create Your eLearning Resume - eLearning Industry", 
"The Disruption of Digital Learning: Ten Things We Have Learned", 
"'Top universities to offer full degrees online in five years' - BBC News", 
"Schools will teach 'soft skills' from 2017, but assessing them presents a challenge",
"Top 5 Lead-Generating Ideas for Your Content Marketing", 
"'Top universities to offer full degrees online in five years' - BBC News",
"The long-distance learners of Aleppo - BBC News", 
"eLearning Solutions for Business", 
"6 Top eLearning Course Reviewer Tools And Selection Criteria - eLearning Industry",
"eLearning Elevated", 
"When Teachers and Technology Let Students Be Masters of Their Own Learning", 
"Aviation Technical English online elearning course", 
"How the Pioneers of the MOOC Got It Wrong", 
"Study challenges cost and price myths of online education", 
"10 Easy Ways to Integrate Technology in Your Classroom", 
"7 e-learning trends for educational institutions in 2017"
), total_shares = c(13646L, 12120L, 8328L, 5945L, 5853L, 5108L, 
4944L, 3570L, 3104L, 2841L, 2463L, 2227L, 2218L, 2210L, 2200L, 
2117L, 2039L, 1876L, 1861L, 1779L)), .Names = c("title", "total_shares"
), row.names = c(NA, 20L), class = "data.frame")