Identify words within a phrase and code as 0 or 1

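I am working with utterances, i.e. statements spoken by children. For each utterance, if one or more words in it match a predefined list of 'core' words (probably 300 words), I want to enter '1' in 'Core' (if none match, enter '0' in 'Core').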

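Similarly, if one or more words in the utterance match a different predefined list of 'fringe' words (again probably 300 words, distinct from the core words), I want to enter '1' in 'Fringe' (if none match, enter '0' in 'Fringe').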

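Basically, right now I only have the utterances, and from them I need to determine whether any word matches one of the core words and whether any word matches one of the fringe words. Here is a snippet of my data.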

  Core Fringe        Utterance
1   NA     NA            small
2   NA     NA            small
3   NA     NA  where's his bed
4   NA     NA  there's his bed
5   NA     NA  there's his bed
6   NA     NA is that a pillow

Thanks in advance. I searched the archives but had a hard time finding a solution that fits my situation.

The dput() code is:

    structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))

Here is a quick approach that may solve your problem (though I'm sure there are more elegant solutions)...

df <- structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))

## Define object with all the core terms:

CorePatterns <- c("his", "music", "turn")

## Define value of `df$Core` as `1` if `df$Utterance` 
## contains one of the patterns in `CorePatterns`, 
## otherwise, define it as `0`:

df$Core <- ifelse(grepl(paste(CorePatterns, collapse = "|"), 
                        df$Utterance), 
                  1, 0)

df

                              Utterance Core Fringe
1                                 small    0     NA
2                                 small    0     NA
3                       where's his bed    1     NA
4                       there's his bed    1     NA
5                       there's his bed    1     NA
6                      is that a pillow    0     NA
7              what is that on his head    1     NA
8         hey he has his arm stuck here    1     NA
9                      there there's it    0     NA
10      now you're gonna go night_night    0     NA
11 and that's the thing you can turn on    1     NA
12           yeah where's the music+box    1     NA

You can do the same for the Fringe data.
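As a rough sketch of what "doing the same" for Fringe might look like, the matching step can be wrapped in a small helper and applied to both columns. The helper name `flag_any` and the contents of `FringePatterns` are assumptions made for illustration; only `CorePatterns` comes from the answer above, and the data frame is a cut-down copy of the question's data.

```r
# Toy data: a few utterances taken from the question
df <- data.frame(
  Utterance = c("small", "where's his bed", "what is that on his head",
                "now you're gonna go night_night"),
  stringsAsFactors = FALSE
)

CorePatterns   <- c("his", "music", "turn")  # from the answer above
FringePatterns <- c("head", "night")         # hypothetical fringe list, illustration only

# Hypothetical helper: 1 if the utterance contains any of the patterns, else 0
flag_any <- function(utterance, patterns) {
  as.integer(grepl(paste(patterns, collapse = "|"), utterance))
}

df$Core   <- flag_any(df$Utterance, CorePatterns)
df$Fringe <- flag_any(df$Utterance, FringePatterns)

df
```

With real lists of around 300 words each, the same helper should work unchanged as long as the words contain no regex metacharacters.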

A tidyverse option could be:

library(dplyr)
library(stringr)

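# Example word lists; replace these with the full core and fringe lists
coreWords <- c('small', 'bed')
fringeWords <- c('head', 'night')

# str_detect() returns TRUE/FALSE; the unary + coerces the result to 1/0
df %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + str_detect(Utterance, str_c(fringeWords, collapse = '|')))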

#                               Utterance Core Fringe
# 1                                 small    1      0
# 2                                 small    1      0
# 3                       where's his bed    1      0
# 4                       there's his bed    1      0
# 5                       there's his bed    1      0
# 6                      is that a pillow    0      0
# 7              what is that on his head    0      1
# 8         hey he has his arm stuck here    0      0
# 9                      there there's it    0      0
# 10      now you're gonna go night_night    0      1
# 11 and that's the thing you can turn on    0      0
# 12           yeah where's the music+box    0      0
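One caveat with both approaches: the patterns are matched as substrings, so a core word such as "his" would also be found inside "this". If whole-word matching is what is wanted, one possible sketch is to wrap the alternation in `\b` word boundaries. The utterances and the `coreWords` list below are made up for illustration.

```r
utterances <- c("this is small", "where's his bed", "yeah where's the music+box")
coreWords  <- c("his", "music")  # hypothetical list, illustration only

# Substring matching: "his" also matches inside "this"
substring_hit <- as.integer(grepl(paste(coreWords, collapse = "|"), utterances))

# Whole-word matching: \b requires a word boundary on both sides of the match
pattern  <- paste0("\\b(", paste(coreWords, collapse = "|"), ")\\b")
word_hit <- as.integer(grepl(pattern, utterances, perl = TRUE))

data.frame(utterances, substring_hit, word_hit)
```

Note that the underscore counts as a word character, so with `\b` a word like "night" would no longer match "night_night"; whether that is desirable depends on how such compound tokens should be coded.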