Subset a string within a data frame based on the value of another column

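I'm struggling to subset strings in a data frame column. I'm working with linguistic data. In my data frame, the first column contains verb stems and the second contains full sentences of several words, one of which is a conjugated verb. I'd like to create a third column that holds only the conjugated verb (so all other words are dropped), namely the one that has the same verb stem as column 1 in the same row. I can't simply use a list of all verb stems for this, because some sentences contain two verbs and I only want the verb whose stem matches column 1 of that row.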

This is what my data looks like now:

   Verb_stem       Full_sentence 
1. copt            to coptu to 
2. puns            punse kanchina 
3. khag            basana na lo khagunse nan

This is the output I want:

   Verb_stem       Full_sentence              Conjugated_verb
1. copt            to coptu to                coptu
2. puns            punse kanchina             punse
3. khag            basana na lo khagunse nan  khagunse

After some research, I tried the following:

Df$Conjugated_verb <- lapply(strsplit(Df$Full_sentence, " "), grep, pattern = Df$Verb_stem, value = TRUE)

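The problem I'm facing now is that this code seems to look only for the verb stem from the first row in every sentence, instead of switching to a new verb stem for each row. This is the output I get: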

   Verb_stem       Full_sentence               Conjugated_verb 
1. copt            to coptu to                 coptu
2. puns            punse kanchina              character(0)
3. khag            basana na lo khagunse nan   character(0)

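I have tried many things and have been searching for a solution for days, but I really don't know how to do this. If anyone has an idea, I'd be grateful. Thanks in advance!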

You can use mapply() to operate on Verb_stem and Full_sentence in pairs:

# Split each sentence on whitespace and keep the word(s) containing that row's stem
within(df, {
  Conjugated_verb <- mapply(\(x, y) { z <- strsplit(y, "\\s+")[[1]] ; z[grepl(x, z)] },
                            Verb_stem, Full_sentence)
})

# Or capture the word containing the stem with a regular expression
within(df, {
  Conjugated_verb <- mapply(\(x, y) sub(sprintf(".*(\\w*%s\\w*).*", x), "\\1", y),
                            Verb_stem, Full_sentence)
})

Output:

#   Verb_stem             Full_sentence Conjugated_verb
# 1      copt               to coptu to           coptu
# 2      puns            punse kanchina           punse
# 3      khag basana na lo khagunse nan        khagunse
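As for why the attempt in the question only ever used the first stem: as far as I know, grep() works with a single pattern, so when it is given the whole Verb_stem column it uses only the first element. Pairing each stem with its own sentence avoids that. Below is a minimal sketch of the pairing on the sample data from the question. The helper name pick_word is hypothetical (not part of any package), and returning the first hit, or NA when nothing matches, is an assumption about what is wanted in those cases.

```r
# Sample data from the question
stems <- c("copt", "puns", "khag")
sentences <- c("to coptu to", "punse kanchina", "basana na lo khagunse nan")

# pick_word() is a hypothetical helper: split one sentence into words and
# keep the words that contain the given stem (matched literally, fixed = TRUE)
pick_word <- function(stem, sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  hits <- words[grepl(stem, words, fixed = TRUE)]
  # Assumption: return NA when nothing matches and the first hit otherwise,
  # so that every row yields exactly one string
  if (length(hits) == 0) NA_character_ else hits[1]
}

# mapply() walks both vectors in parallel: stem 1 with sentence 1, and so on
conjugated <- mapply(pick_word, stems, sentences, USE.NAMES = FALSE)
print(conjugated)
```

Because each call returns exactly one string, the result is a plain character vector that can be assigned directly as a new column.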

We can use str_extract, which is vectorized over both the string and the pattern:

library(dplyr)
library(stringr)
df1 %>%
    # Match the row's stem followed by any run of non-whitespace characters
    mutate(Conjugated = str_extract(Full_sentence, str_c(Verb_stem, "\\S*")))

Output:

   Verb_stem             Full_sentence Conjugated
1.      copt               to coptu to      coptu
2.      puns            punse kanchina      punse
3.      khag basana na lo khagunse nan   khagunse
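If the extra packages are not wanted, roughly the same extraction can be sketched in base R. regexpr() takes one pattern at a time, so the sketch pairs each pattern with its sentence through mapply(). The helper name extract_one is hypothetical, and the leading \\b word-boundary anchor is an addition that assumes the stem always starts the word; drop it if stems can occur mid-word.

```r
# Sample data, same structure as df1 in this answer
df1 <- data.frame(
  Verb_stem = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan")
)

# extract_one() is a hypothetical helper: find the first word that starts
# with the stem and return it, or NA when the stem does not occur
extract_one <- function(stem, sentence) {
  # Assumption: the stem begins the word, hence the \\b anchor
  m <- regexpr(paste0("\\b", stem, "\\S*"), sentence, perl = TRUE)
  hit <- regmatches(sentence, m)  # character(0) when there is no match
  if (length(hit) == 0) NA_character_ else hit
}

# Pair each stem with the sentence from the same row
df1$Conjugated <- mapply(extract_one, df1$Verb_stem, df1$Full_sentence,
                         USE.NAMES = FALSE)
print(df1)
```

Note that the stem is interpolated into a regular expression here, so a stem containing regex metacharacters would need escaping first.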

Data:

df1 <- structure(list(Verb_stem = c("copt", "puns", "khag"), 
Full_sentence = c("to coptu to", 
"punse kanchina", "basana na lo khagunse nan")), 
class = "data.frame", row.names = c("1.", 
"2.", "3."))