Assign an ID based on keywords present in Tweets
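I extracted tweets by supplying 44 different keywords, and the output is a file of 400,000 tweets. Every tweet in the output file contains one of those keywords. How can I create a separate ID column that holds the keyword present in each tweet?
For example, the tweet is: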
Andhra Pradesh is the highest state with crimes against women
Here the keyword is "crimes against women".
I want to create a column that assigns the keyword "crimes against women" to the tweet; a kind of ID column, to be precise.
#input column 1
Tweet<-("Andhra Pradesh is the highest state with crimes against women")
#expected output column 2 beside the Tweet column
Keyword<-("crimes against women")
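Edit: I don't want to extract any part of the tweet. I just want to assign, in a new column, the keyword each tweet contains, so that I can separate the tweets by keyword.
We can use stringr, which is very handy for string manipulation, and simply call str_extract, i.e.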
library(stringr)
str_extract(Tweet, Keyword)
#[1] "crimes against women"
For multiple keywords and multiple strings, you would need to use sapply, i.e.
Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
"another string with something else")
sapply(Tweet, function(i)str_extract(i, paste(Keyword, collapse = '|')))
# Andhra Pradesh is the highest state with crimes against women another string with something else
# "crimes against women" "something"
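Since the edit to the question asks for assignment rather than extraction, one alternative is to flag, for each keyword, which tweets contain it. The following is only a minimal sketch under that reading: the name hits is hypothetical, and fixed() is used on the assumption that the keywords should be matched as literal strings rather than as regular expressions.

```r
library(stringr)

Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
           "another string with something else")

# One logical column per keyword: TRUE where the tweet contains that keyword.
# fixed() makes str_detect() treat the keyword as a literal string.
hits <- vapply(Keyword,
               function(k) str_detect(Tweet, fixed(k)),
               logical(length(Tweet)))
hits
```

Each column of hits can then be used to subset the tweets for one keyword, e.g. Tweet[hits[, "something"]].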
You can do this analysis with the stringr package; however, I don't think you need sapply.
Consider the following keyword list and table of tweets:
keyword_list <- c("crimes against women", "downloading tweets", "r analysis")
tweets <- data.frame(
tweet = c("Andhra Pradesh is the highest state with crimes against women",
"I am downloading tweets",
"I love r analysis",
"downloading tweets helps with my r analysis")
)
First, combine the keywords into a single regular expression that matches any of them.
keyword_pattern <- paste0(
"(",
paste0(keyword_list, collapse = "|"),
")"
)
keyword_pattern
#> [1] "(crimes against women|downloading tweets|r analysis)"
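A pattern built this way matches substrings and is case-sensitive, so a short keyword such as "r analysis" can also match inside "Her analysis". If that matters for your data, a stricter variant might look like the sketch below. It is an illustration rather than part of the original answer: the names quoted, strict_pattern and examples are hypothetical, and the \b anchors assume every keyword starts and ends with a letter or digit.

```r
library(stringr)

keyword_list <- c("crimes against women", "downloading tweets", "r analysis")

# \Q...\E quotes each keyword so any regex metacharacters are taken literally
quoted <- paste0("\\Q", keyword_list, "\\E", collapse = "|")

# \b requires a word boundary on both sides; ignore_case relaxes letter case
strict_pattern <- regex(paste0("\\b(", quoted, ")\\b"), ignore_case = TRUE)

examples <- c("Her analysis was thorough", "I love R Analysis")

# Plain alternation: substring match, case-sensitive
loose <- str_extract(examples, paste0(keyword_list, collapse = "|"))
# Stricter variant: whole words only, any letter case
strict <- str_extract(examples, strict_pattern)
```

Here loose picks up "r analysis" from inside "Her analysis" and misses the capitalised tweet, while strict does the opposite.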
Finally, we can add a column to the data frame that extracts the keyword from each tweet.
library(stringr)
tweets$keyword <- str_extract(tweets$tweet, keyword_pattern)
tweets
#>                                                           tweet              keyword
#> 1 Andhra Pradesh is the highest state with crimes against women crimes against women
#> 2                                       I am downloading tweets   downloading tweets
#> 3                                             I love r analysis           r analysis
#> 4                   downloading tweets helps with my r analysis   downloading tweets
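As the last example shows, you need to decide what to do when a tweet contains more than one keyword. In this case, the keyword returned is simply the first one found in the tweet. However, you can also use str_extract_all to return all the keywords found in a tweet.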
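As a rough sketch of the str_extract_all route, the list it returns can be collapsed into one string per tweet so that it still fits in a single column. The column name keywords and the "; " separator are arbitrary choices made for this illustration.

```r
library(stringr)

keyword_list <- c("crimes against women", "downloading tweets", "r analysis")
tweets <- data.frame(
  tweet = c("Andhra Pradesh is the highest state with crimes against women",
            "I am downloading tweets",
            "I love r analysis",
            "downloading tweets helps with my r analysis"),
  stringsAsFactors = FALSE
)
keyword_pattern <- paste0("(", paste0(keyword_list, collapse = "|"), ")")

# str_extract_all() returns a list with one character vector per tweet
all_matches <- str_extract_all(tweets$tweet, keyword_pattern)

# Collapse each vector into one string; unique() drops repeated keywords
tweets$keywords <- vapply(all_matches,
                          function(m) paste(unique(m), collapse = "; "),
                          character(1))
tweets$keywords
```

A tweet with no matching keyword ends up with an empty string in this column, so you may want to convert those to NA afterwards.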