How to extract multi-word units from a column using R?

Question

My data looks like this:

topic	measure
climate change	reduce emission
pandemic	vaccination
charity	call for donations

Now I would like to extract all multi-word units (MWUs) in each column, i.e.:

topic_mwu <- c("climate change")

measure_mwu <- c("reduce emission", "call for donations")

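Does R have a way to extract these MWUs automatically? Basically, I only need to identify the entries that contain at least one space, so I was thinking of a regex hack.

Thanks a lot for your help!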

Answer 1

The following code should work:

# example data (a character matrix standing in for your data frame)
dt <- matrix(c("reduce emission", "call for donations", "pandemic", "climate change", "donations", "charity"), ncol = 2)

# flatten it into a character vector (this drops the column structure)
dt <- as.vector(dt)

# if the table is very big, you can use unique() to remove duplicates
dt <- unique(dt)

# get the MWUs: keep entries that split into more than one piece on spaces
dt[lengths(strsplit(dt, split = " ")) > 1]

Is this what you were looking for?
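
If you need the result per column, as in the question, a regex test with grepl() is another option. This is only a minimal sketch, assuming the data is a data frame with two character columns named topic and measure; the names df, extract_mwu and mwu_by_column are made up for illustration:

```r
# Example data reconstructed from the question
# (assumption: a data frame with two character columns)
df <- data.frame(
  topic   = c("climate change", "pandemic", "charity"),
  measure = c("reduce emission", "vaccination", "call for donations"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the unique entries of a character vector
# that still contain whitespace after trimming leading/trailing spaces
extract_mwu <- function(x) {
  x <- trimws(x)
  unique(x[grepl("\\s", x)])
}

# One vector per column, as asked for in the question
topic_mwu   <- extract_mwu(df$topic)
measure_mwu <- extract_mwu(df$measure)

# Or handle every column at once; the result is a list named after the columns
mwu_by_column <- lapply(df, extract_mwu)
```

Trimming first means an entry such as "charity " with a trailing space is not counted as a multi-word unit. If your columns are factors, convert them with as.character() before calling the helper.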

regex

r

text-mining