R: subset data according to a list
I am working with a Twitter data set, but I have not figured out how to subset my data according to a list of hashtags.
df:
rowID Hashtags
1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 onlarkonusurakpartiyapar
5 anfal,halabja,kurdistan,kobani
6 onlarkonusurakpartiyapar
7 kurdistan
The hashtags I want to match are stored in a character vector:
hashtag_list:
"onlarkonusurakpartiyapar" "kurdistan"
I tried this code, but it does not work for me:
new_df=df[df$Hashtags %in% hashtag_list,]
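It only returns the subset for the "onlarkonusurakpartiyapar" hashtag.
I know it looks simple, but I could not figure it out even after looking through the posts on this site.
Thanks for your help.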
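Here is a way to adapt your approach: split each string on "," so that every hashtag is treated separately, and count a row as a match if any of its hashtags appears in your list.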
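To make the difference concrete, here is a minimal sketch using two strings copied from rows 2 and 4 of the question's data. The variable names x, parts, whole_match and any_match are illustrative only and do not appear in the original post.

```r
# Two example strings taken from rows 2 and 4 of the question's data.
x <- c("onlarkonusurakpartiyapar,halkinbasbakanitokatta",
       "onlarkonusurakpartiyapar")
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# %in% compares each WHOLE string with the list, so the combined
# string in the first element is not recognised.
whole_match <- x %in% hashtag_list

# Splitting on "," gives one character vector of hashtags per element.
parts <- strsplit(x, split = ",", fixed = TRUE)

# An element matches when any of its individual hashtags is in the list.
any_match <- vapply(parts, function(p) any(p %in% hashtag_list), logical(1))

print(whole_match)
print(any_match)
```

The first element is rejected by the whole-string comparison but accepted once it has been split into individual hashtags.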
Your data
# Sample data from the question, plus an eighth row that should not match
df <- data.frame(
  rowID=1:8,
  Hashtags=c(
    "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
    "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
    "kurdish,mahabad,justiceforfarinaz,kurdistan",
    "onlarkonusurakpartiyapar",
    "anfal,halabja,kurdistan,kobani",
    "onlarkonusurakpartiyapar",
    "kurdistan",
    "this,willnot,befound"
  ),
  stringsAsFactors=FALSE  # keep Hashtags as character so strsplit() works
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")
Solution
find_ht <- function(hashtags, hashtag_list){
  # Split each string on "," and return TRUE where any piece is in hashtag_list
  sapply(strsplit(hashtags, split=","), function(x) any(x %in% hashtag_list))
}
Usage
find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
which returns...
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
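If the real data contain stray spaces around the commas or mixed case, the exact comparison above would miss those rows. The sample data show neither, so this is purely an assumption. A hedged sketch of a variant follows; find_ht_clean and the messy vector are hypothetical names and made-up values, not part of the original answer.

```r
# Hypothetical variant of find_ht that trims whitespace and lower-cases
# hashtags before matching. Assumes matching should ignore case.
find_ht_clean <- function(hashtags, hashtag_list) {
  wanted <- tolower(trimws(hashtag_list))
  vapply(
    strsplit(hashtags, split = ",", fixed = TRUE),
    function(x) any(tolower(trimws(x)) %in% wanted),
    logical(1)
  )
}

# Made-up messy input, for illustration only.
messy <- c("Kurdistan, kobani", "anfal ,halabja", " onlarkonusurakpartiyapar")
res <- find_ht_clean(messy, c("onlarkonusurakpartiyapar", "kurdistan"))
print(res)
```

vapply is used here instead of sapply only so that the result is always a logical vector, even for zero-length input.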
Edit
To perform the subset, you just...
sub.index <- find_ht(hashtags=df$Hashtags, hashtag_list=hashtag_list)
df[sub.index,]
which returns
rowID Hashtags
1 1 ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar
2 2 onlarkonusurakpartiyapar,halkinbasbakanitokatta
3 3 kurdish,mahabad,justiceforfarinaz,kurdistan
4 4 onlarkonusurakpartiyapar
5 5 anfal,halabja,kurdistan,kobani
6 6 onlarkonusurakpartiyapar
7 7 kurdistan
Or, if you want the indices, use which(sub.index).
To subset rowID specifically, use df[sub.index,"rowID"].
In this case, both return [1] 1 2 3 4 5 6 7
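As an alternative to splitting, the same rows could probably be selected with a single regular expression. This is a different technique from the find_ht approach, sketched here under the assumption that the hashtags contain no regex metacharacters (otherwise they would need escaping). The names pattern, sub.index2 and matched are illustrative.

```r
# Data and list as in the answer, repeated so the block is self-contained.
df <- data.frame(
  rowID = 1:8,
  Hashtags = c(
    "ogretmenemayistamujdehazirandaatama,onlarkonusurakpartiyapar",
    "onlarkonusurakpartiyapar,halkinbasbakanitokatta",
    "kurdish,mahabad,justiceforfarinaz,kurdistan",
    "onlarkonusurakpartiyapar",
    "anfal,halabja,kurdistan,kobani",
    "onlarkonusurakpartiyapar",
    "kurdistan",
    "this,willnot,befound"
  ),
  stringsAsFactors = FALSE
)
hashtag_list <- c("onlarkonusurakpartiyapar", "kurdistan")

# Match a listed hashtag only when it is delimited by the start of the
# string, a comma, or the end of the string, so substrings do not count.
pattern <- paste0("(^|,)(", paste(hashtag_list, collapse = "|"), ")(,|$)")

sub.index2 <- grepl(pattern, df$Hashtags)
matched <- df[sub.index2, ]
print(matched$rowID)
```

The delimiters in the pattern matter: without them, a hashtag that merely contains a listed one as a substring would also be selected.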