Select phrases found in dictionary and return dataframe of doc_id and phrase
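I have a dictionary file of medical phrases and a corpus of raw texts. I'm trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are n-grams of 1 to 5 words. In the end, I would like the selected phrases in a dataframe with two columns: doc_id, phrase
I've been trying to use the quanteda package to do this but haven't been successful. Below is some code to reproduce my latest attempt. I'd appreciate any suggestions you have. I've tried a variety of methods but keep getting back only single-word matches.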
version R version 3.6.2 (2019-12-12)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
Packages:
dbplyr 1.4.2
quanteda 1.5.2
library(quanteda)
library(dplyr)
raw <- data.frame("doc_id" = c("1", "2", "3"),
"text" = c("diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."))
term = c("diffuse intrinsic pontine glioma", "brain tumors", "brain", "pontine glioma", "mri", "medical imaging", "radiology", "anatomy", "physiological processes", "radiation therapy", "radiotherapy", "cancer treatment", "malignant cells")
medTerms = list(term = term)
dict <- dictionary(medTerms)
corp <- raw %>% group_by(doc_id) %>% summarise(text = paste(text, collapse=" "))
corp <- corpus(corp, text_field = "text")
dfm <- dfm(corp,
tolower = TRUE, stem = FALSE, remove_punct = TRUE,
remove = stopwords("english"))
dfm <- dfm_select(dfm, pattern = phrase(dict))
What I would ultimately like returned looks like the following:
doc_id term
1 diffuse intrinsic pontine glioma
1 pontine glioma
1 brain tumors
1 brain
2 mri
2 medical imaging
2 radiology
2 anatomy
2 physiological processes
3 radiation therapy
3 radiotherapy
3 cancer treatment
3 malignant cells
If you want to match multi-word patterns from a dictionary, you can do so by constructing your dfm from n-grams.
library(quanteda)
library(dplyr)
library(tidyr)
raw$text <- as.character(raw$text) # you forgot to use stringsAsFactors = FALSE while constructing the data.frame, so I convert your factor to character before continuing
corp <- corpus(raw, text_field = "text")
dfm <- tokens(corp) %>%
tokens_ngrams(1:5) %>% # This is the new way of creating ngram dfms. 1:5 means to construct all from unigram to 5-grams
dfm(tolower = TRUE,
stem = FALSE,
remove_punct = TRUE) %>% # I wouldn't remove stopwords for this matching task
dfm_select(pattern = dict)
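To make it clearer what the n-gram step contributes, here is a rough base R sketch of the idea. It is not quanteda's implementation; `make_ngrams` is a hypothetical helper written only for this illustration, and the four-token input is a made-up fragment of the first document.

```r
# Hypothetical helper (not part of quanteda): build all n-grams of
# length 1..max_n from a token vector, joined with "_".
make_ngrams <- function(tokens, max_n = 5) {
  out <- character(0)
  for (n in seq_len(min(max_n, length(tokens)))) {
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("treat", "brain", "tumors", "found")
terms <- c("brain tumors", "brain", "pontine glioma")

ngrams <- make_ngrams(tokens)

# Compare the dictionary phrases in the same underscore-joined form
matched <- intersect(ngrams, gsub(" ", "_", terms, fixed = TRUE))
print(matched)
```

Four tokens already produce ten n-grams, so the number of features grows quickly with document length.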
Now we just need to convert the dfm to a data.frame and reshape it into long format:
convert(dfm, "data.frame") %>%
pivot_longer(-document, names_to = "term") %>%
filter(value > 0)
#> # A tibble: 13 x 3
#> document term value
#> <chr> <chr> <dbl>
#> 1 1 brain 2
#> 2 1 pontine_glioma 1
#> 3 1 brain_tumors 1
#> 4 1 diffuse_intrinsic_pontine_glioma 1
#> 5 2 mri 1
#> 6 2 radiology 1
#> 7 2 anatomy 1
#> 8 2 medical_imaging 1
#> 9 2 physiological_processes 1
#> 10 3 radiotherapy 1
#> 11 3 radiation_therapy 1
#> 12 3 cancer_treatment 1
#> 13 3 malignant_cells 1
You can drop the value column, but it might be of interest later on.
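If you want exactly the two-column doc_id / term layout from the question, something along these lines should work. This is a sketch in base R; `long` is a hand-typed stand-in for the first few rows of the tibble above, not the real result object.

```r
# Stand-in for the long-format result shown above (assumption: same
# column names as the tibble, only three rows typed in by hand).
long <- data.frame(
  document = c("1", "1", "2"),
  term = c("brain", "pontine_glioma", "medical_imaging"),
  value = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# Drop the count, rename the id column, and turn the "_" that joins
# n-gram features back into spaces.
result <- data.frame(
  doc_id = long$document,
  term = gsub("_", " ", long$term, fixed = TRUE),
  stringsAsFactors = FALSE
)
print(result)
```

Note that this would also change any dictionary term that legitimately contains an underscore, so check your term list first.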
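You could form all n-grams of length 1 to 5 and then select them all out. But for large texts, this would be very inefficient. Here is a more direct approach. I've reproduced the whole problem here with a few modifications (e.g. stringsAsFactors = FALSE, and skipping some unnecessary steps).
Granted, this does not double-count terms the way your expected example does, but I would suggest that you probably did not want that. Why count "brain" if it occurred within "brain tumors"? You're better off counting "brain tumors" when it occurs as that phrase, and counting "brain" only when it occurs without "tumors". The code below does that.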
library(quanteda)
## Package version: 2.0.1
raw <- data.frame(
"doc_id" = c("1", "2", "3"),
"text" = c(
"diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."
),
stringsAsFactors = FALSE
)
dict <- dictionary(list(
term = c(
"diffuse intrinsic pontine glioma",
"brain tumors", "brain", "pontine glioma", "mri", "medical imaging",
"radiology", "anatomy", "physiological processes", "radiation therapy",
"radiotherapy", "cancer treatment", "malignant cells"
)
))
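Here is the key to the answer: first use the dictionary to select the tokens, then concatenate them, then reshape so that each dictionary match becomes its own new "document". The final step creates the data.frame that you want.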
toks <- corpus(raw) %>%
tokens() %>%
tokens_select(dict) %>% # select just dictionary values
tokens_compound(dict, concatenator = " ") %>% # turn phrase into single "tokens"
tokens_segment(pattern = "*") # make one token per "document"
# make into data.frame
data.frame(
doc_id = docid(toks), term = as.character(toks),
stringsAsFactors = FALSE
)
## doc_id term
## 1 1 diffuse intrinsic pontine glioma
## 2 1 brain tumors
## 3 1 brain
## 4 2 mri
## 5 2 medical imaging
## 6 2 radiology
## 7 2 anatomy
## 8 2 physiological processes
## 9 3 radiation therapy
## 10 3 radiotherapy
## 11 3 cancer treatment
## 12 3 malignant cells
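The counting rule described above (the longest phrase wins and its words are not counted again) can be sketched in plain base R. This is only an illustration of the rule, not how quanteda does it internally; `match_longest` is a hypothetical helper, and the tokens are the first document typed out by hand.

```r
# Hypothetical helper: greedy longest-match-first lookup over tokens.
match_longest <- function(tokens, terms, max_n = 5) {
  found <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    matched <- FALSE
    # Try the longest candidate first so "brain tumors" wins over "brain"
    for (n in seq(min(max_n, length(tokens) - i + 1), 1)) {
      candidate <- paste(tokens[i:(i + n - 1)], collapse = " ")
      if (candidate %in% terms) {
        found <- c(found, candidate)
        i <- i + n  # skip the tokens consumed by this match
        matched <- TRUE
        break
      }
    }
    if (!matched) i <- i + 1
  }
  found
}

terms <- c(
  "diffuse intrinsic pontine glioma", "brain tumors", "brain",
  "pontine glioma"
)
tokens <- c(
  "diffuse", "intrinsic", "pontine", "glioma", "are", "highly",
  "aggressive", "and", "difficult", "to", "treat", "brain", "tumors",
  "found", "at", "the", "base", "of", "the", "brain"
)

hits <- match_longest(tokens, terms)
print(hits)
```

For this document the sketch returns the same three terms as the quanteda output above: "pontine glioma" is absorbed by the longer phrase, and "brain" is reported only for its standalone occurrence at the end of the sentence.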