Identify words within a phrase and code as 0 or 1

Question

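I am working with utterances, that is, statements spoken by children. For each utterance, if one or more words in it match a predefined list of 'core' words (possibly around 300 words), I want to enter "1" in 'Core' (and if none match, enter "0" in 'Core').

Likewise, if one or more words in the utterance match a different predefined list of 'fringe' words (possibly 300 fringe words, distinct from the core words), I want to enter "1" in 'Fringe' (and "0" in 'Fringe' if none match).

Basically, right now I only have the utterances, and from them I need to determine whether any word matches one of the core words and whether any word matches one of the fringe words. Here is a snippet of my data.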

  Core Fringe        Utterance
1   NA     NA            small
2   NA     NA            small
3   NA     NA  where's his bed
4   NA     NA  there's his bed
5   NA     NA  there's his bed
6   NA     NA is that a pillow

Thanks in advance. I searched the archives but had trouble finding a solution that fits my situation.

The dput() output is:

    structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))

Answer 1

Here is a quick approach that may solve your problem (though I'm sure there are more elegant solutions)...

df <- structure(list(Utterance = c("small", "small", "where's his bed", "there's his bed", "there's his bed", "is that a pillow", "what is that on his head", "hey he has his arm stuck here", "there there's it", "now you're gonna go night_night", "and that's the thing you can turn on", "yeah where's the music+box"), Core = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Fringe = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -12L))

## Define object with all the core terms:

CorePatterns <- c("his", "music", "turn")

## Define value of `df$Core` as `1` if `df$Utterance` 
## contains one of the patterns in `CorePatterns`, 
## otherwise, define it as `0`:

df$Core <- ifelse(grepl(paste(CorePatterns, collapse = "|"), 
                        df$Utterance), 
                  1, 0)

df

                              Utterance Core Fringe
1                                 small    0     NA
2                                 small    0     NA
3                       where's his bed    1     NA
4                       there's his bed    1     NA
5                       there's his bed    1     NA
6                      is that a pillow    0     NA
7              what is that on his head    1     NA
8         hey he has his arm stuck here    1     NA
9                      there there's it    0     NA
10      now you're gonna go night_night    0     NA
11 and that's the thing you can turn on    1     NA
12           yeah where's the music+box    1     NA

You can do the same for the Fringe data.
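
One caveat with the pattern above: `grepl()` matches substrings, so a core word such as "his" would also flag an utterance containing "this". The following is a minimal sketch of a whole-word variant applied to both columns. The helper name `flag_words`, the sample utterances, and both word lists are made up for illustration and are not the asker's real lists.

```r
# Illustrative data and word lists (assumptions, not the asker's real data)
df <- data.frame(
  Utterance = c("small", "where's his bed", "is this a pillow",
                "yeah where's the music+box"),
  stringsAsFactors = FALSE
)
core_words   <- c("his", "music", "turn")
fringe_words <- c("bed", "pillow")

# Hypothetical helper: returns 1 if any listed word occurs as a whole word,
# otherwise 0. \\b marks a word boundary, so "his" does not match inside "this".
flag_words <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  as.integer(grepl(pattern, text, ignore.case = TRUE, perl = TRUE))
}

df$Core   <- flag_words(df$Utterance, core_words)
df$Fringe <- flag_words(df$Utterance, fringe_words)
df
```

Note that `\\b` treats the underscore as a word character, so "night" would not match "night_night" under this variant. Whether that is desirable depends on how the transcripts are meant to be read.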

Answer 2

A tidyverse option could be:

library(dplyr)
library(stringr)

coreWords <- c('small', 'bed')
fringeWords <- c('head', 'night')

df %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + str_detect(Utterance, str_c(fringeWords, collapse = '|')))

#                               Utterance Core Fringe
# 1                                 small    1      0
# 2                                 small    1      0
# 3                       where's his bed    1      0
# 4                       there's his bed    1      0
# 5                       there's his bed    1      0
# 6                      is that a pillow    0      0
# 7              what is that on his head    0      1
# 8         hey he has his arm stuck here    0      0
# 9                      there there's it    0      0
# 10      now you're gonna go night_night    0      1
# 11 and that's the thing you can turn on    0      0
# 12           yeah where's the music+box    0      0
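
With lists of around 300 words, a single alternation pattern can become long, and any word containing a regex metacharacter would need escaping. As an alternative sketch in base R (not part of either answer), each utterance can be split into tokens and compared against the lists with `%in%`. The tokenizing rule, the helper name `has_any`, and the word lists below are assumptions chosen for illustration; in particular, "_" and "+" are treated as separators.

```r
# Illustrative utterances and word lists (assumptions for this sketch)
utterances <- c("small", "there's his bed",
                "now you're gonna go night_night",
                "yeah where's the music+box")
core_words   <- c("small", "bed")
fringe_words <- c("night", "box")

# Split on anything that is not a lowercase letter or an apostrophe,
# so "night_night" and "music+box" are each broken into two tokens
tokens <- strsplit(tolower(utterances), "[^a-z']+")

# Hypothetical helper: 1 if any token is in the word list, otherwise 0
has_any <- function(tok, words) as.integer(any(tok %in% words))

result <- data.frame(
  Utterance = utterances,
  Core      = vapply(tokens, has_any, integer(1), words = core_words),
  Fringe    = vapply(tokens, has_any, integer(1), words = fringe_words)
)
result
```

Because the comparison is exact, contractions such as "there's" stay a single token and only match if the list contains that exact form.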
