Text mining in R, reading every row for a yes/no answer

Question

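I have been trying to figure out how to use R to extract certain terms, e.g. Latino, from a CSV file created from PubMed with the RISmed package. I want to create a new variable "Latino" that reads the entire row and records yes or no in the new variable depending on whether the word is mentioned.

How can I do this, and which package do you recommend?

Here is a sample of my code: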

library(RISmed)
library(dplyr) # tibble and other functions

RCT_topic <- 'randomized clinical trial'
RCT_query <- EUtilsSummary(RCT_topic, mindate=2016, maxdate=2017, retmax=100)
summary(RCT_query)
RCT_records <- EUtilsGet(RCT_query)

RCT_data <- data_frame('PMID'=PMID(RCT_records),
                       'Title'=ArticleTitle(RCT_records),
                       'Abstract'=AbstractText(RCT_records),
                       'YearPublished'=YearPubmed(RCT_records),
                       'Month.Published'=MonthPubmed(RCT_records),
                       'Country'= Country(RCT_records),
                       'Grant' =GrantID(RCT_records),
                       'Acronym' =Acronym(RCT_records),
                       'Agency' =Agency(RCT_records),
                       'Mesh'=Mesh(RCT_records))

Answer 1

Here is one solution:

library(stringr)

RCT_data %>% str_detect("Latino")

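This returns which columns contain "Latino"; you can then apply the same command to that column to find the rows. For example, on the Abstract column: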

RCT_data %>% mutate(new_variable = ifelse(Abstract %>% str_detect("Latino"), "yes", "no"))

This adds a new column named new_variable that contains yes for rows whose abstract contains "Latino" and no for rows whose abstract does not.
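A minimal, self-contained sketch of the same idea, in case it helps to try it without querying PubMed. The toy tibble and the column name Latino are hypothetical stand-ins for RCT_data; the sketch also assumes you want case-insensitive matching and want a missing abstract counted as no, which the original command does not do.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for RCT_data
toy <- tibble(
  PMID = c("1", "2", "3"),
  Abstract = c("A trial among Latino adults.", "Chronic pain outcomes.", NA)
)

result <- toy %>%
  mutate(Latino = if_else(
    # regex(..., ignore_case = TRUE) matches "Latino", "latino", "LATINO", ...
    str_detect(Abstract, regex("latino", ignore_case = TRUE)),
    "yes", "no",
    missing = "no"  # assumption: a missing abstract is treated as "no"
  ))

print(result)
```

Without the missing argument, rows with an NA abstract would get NA rather than yes or no, because str_detect returns NA for missing input.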

Answer 2

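Why not use grepl to add a column indicating whether the search term was found in the Abstract column of your search results? grepl returns a logical vector that is TRUE where your pattern is found and FALSE otherwise.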

# There are no mentions of "Latino" or "latino" in your df. 
RCT_data$Latino <- grepl("Latino|latino",RCT_data$Abstract)

# There are several mentions of the word "pain":
RCT_data$Pain <- grepl("pain",RCT_data$Abstract)
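If you need yes/no strings rather than TRUE/FALSE, and want to look at more than one column of each row, something along these lines might work. This is only a sketch in base R: the toy data frame and the choice of Title and Abstract as the columns to search are assumptions, not part of your data.

```r
# Hypothetical toy data standing in for RCT_data
toy <- data.frame(
  Title = c("Diabetes care in Latino communities", "Pain management", "Sleep study"),
  Abstract = c("No ethnicity terms here.", "Outcomes among latino patients.", NA),
  stringsAsFactors = FALSE
)

# Assumed set of text columns to search in each row
text_cols <- c("Title", "Abstract")

# Paste the chosen columns together row by row.
# Note: paste() turns NA into the literal text "NA", which is harmless
# for this pattern but could matter for patterns that match "NA".
row_text <- do.call(paste, c(toy[text_cols], sep = " "))

# ignore.case = TRUE replaces the "Latino|latino" alternation
toy$Latino <- ifelse(grepl("latino", row_text, ignore.case = TRUE), "yes", "no")

print(toy)
```

Searching the pasted text keeps the logic to a single grepl call; list-columns such as Mesh would need to be flattened to character first.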
