Create a column value based on a matching regular expression

Question

I have the following string in a column named "sentences" of a df:

I like an apple

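I want to create a second column, called Type, whose value is determined by the string that matches. I would like to use the regular expression \bapple\b, match it against the sentence and, if it matches, put the value Fruit_apple in the Type column.

In the long run, I want to do this with several other strings and types.

Is there a simple way to do this with a function?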

Dataset (survey_1):

structure(list(slider_8.response = c(1L, 1L, 3L, 7L, 7L, 7L, 
1L, 3L, 2L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 7L, 
7L, 7L, 1L, 1L, 7L, 6L, 6L, 1L, 1L, 7L, 1L, 7L, 7L, 1L, 7L, 7L, 
7L, 7L, 7L, 6L, 7L, 7L, 7L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 7L, 2L
), Sentences = c("He might could do it.", "I ever see the film.", 
"I may manage to come visit soon.", "She’ll never be forgotten.", 
"They might find something special.", "It might not be a good buy.", 
"Maybe my pain will went away.", "Stephen maybe should fix your bicycle.", 
"It used to didnʼt matter if you walked in late.", "He’d could climb the stairs.", 
"Only Graeme would might notice that.", "I used to cycle a lot. ", 
"Your dad belongs to disagree with this. ", "We can were pleased to see her.", 
"He may should take us to the city.", "I could never forgot his deep voice.", 
"I should can turn this thing over to Ann.", "They must knew who they really are.", 
"We used to runs down three flights.", "I don’t care what he may be up to. ", 
"That’s something I ain’t know about.", "That must be quite a skill.", 
"We must be able to invite Jim.", "She used to play with a trolley.", 
"He is done gone. ", "You might can check this before making a decision.", 
"It would have a positive effect on the team. ", "Ruth can maybe look for it later.", 
"You should tag along at the dance.", "They’re finna leave town.", 
"A poem should looks like that.", "I can tell you didn’t do your homework. ", 
"I can driving now.", "They should be able to put a blanket over it.", 
"We could scarcely see each other.", "I might says I was never good at maths.", 
"The next dance will be a quickstep. ", "I might be able to find myself a seat in this place.", 
"Andrew thinks we shouldn’t do it.", "Jack could give a hand.", 
"She’ll be able to come to the event.", "She’d maybe keep the car the way it is.", 
"Sarah used to be able to agree with this proposal.", "I’d like to see your lights working. ", 
"I’d be able to get a little bit more sleep.", "John may has a second name.", 
"You must can apply for this job.", "I maybe could wait till the 8 o’clock train.", 
"She used to could go if she finished early.", "That would meaned something else, eh?", 
"You’ll can enjoy your holiday.", "We liketa drowned that day. ", 
"I must say it’s a nice feeling.", "I eaten my lunch."), construct = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA)), row.names = c(NA, 54L), class = "data.frame")

type_list:

list("DM_will_can"=c("ll can","will can"), "DM_would_could"=c("d could","would could"),
                  "DM_might_can"="might can","DM_might_could"="might could","DM_used_to_could"="used to could",
                  "DM_should_can"="should can","DM_would_might"=c("d might", "would might"),"DM_may_should"="may should",
                  "DM_must_can"="must can", "SP_will_be_able"=c("ll be able","will be able"),
                  "SP_would_be_able"=c("d be able","would be able"),"SP_might_be_able"="might be able",
                  "SP_maybe_could"="maybe could","SP_used_to_be_able"="used to be able","SP_should_be_able"=
                    "should be able","SP_would_maybe"=c("d maybe", "would maybe"), "SP_maybe_should"="maybe should",
                  "SP_must_be_able"="must be able", "Filler_will_a"="quickstep","Filler_will_b"="forgotten",
                  "Filler_would_a"="lights working","Filler_would_b"="positive effect","Filler_can_a"="homework",
                  "Filler_can_b"="Ruth","Filler_could_a"="scarcely","Filler_could_b"="Jack", "Filler_may_a"="may be up to",
                  "Filler_may_b"="visit soon", "Filler_might_a"="good buy","Filler_might_be"="something special",
                  "Filler_should_a"="tag along","Filler_should_b"="Andrew","Filler_used_to_a"="trolley",
                  "Filler_used_to_b"="cycle a lot","Filler_must_a"="quite a skill","Filler_must_b"="nice feeling",
                  "Dist_gram_will_went"="will went","Dist_gram_meaned"="meaned","Dist_gram_can_were"="can were",
                  "Dist_gram_forgot"="never forgot", "Dist_gram_may_has"="may has", 
                  "Dist_gram_might_says"="might says","Dist_gram_used_to_runs"="used to runs",
                  "Dist_gram_should_looks"="should looks","Dist_gram_must_knew"="must knew","Dist_dial_liketa"="liketa",
                  "Dist_dial_belongs"="belongs to disagree","Dist_dial_finna"="finna","Dist_dial_used_to_didnt"="used to didn't matter",
                  "Dist_dial_eaten"="I eaten", "Dist_dial_can_driving"="can driving","Dist_dial_aint_know"="That's something",
                  "Dist_dial_ever_see"="ever see the film","Dist_dial_done_gone"="done gone")

Answer 1

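I would reach for a Python dictionary to do this, but since we are talking about R, I have more or less translated that approach. There is probably a more idiomatic way to do this in R than two for loops, but this should work: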

# Define data
df <- data.frame(
    id = c(1:5),
    sentences = c("I like apples", "I like dogs", "I have cats", "Dogs are cute", "I like fish")
)

#   id     sentences
# 1  1 I like apples
# 2  2   I like dogs
# 3  3   I have cats
# 4  4 Dogs are cute
# 5  5   I like fish

type_list <- list(
    "fruit" = c("apples", "oranges"),
    "animals" = c("dogs", "cats")
)

types <- names(type_list)

df$type <- NA
df$item <- NA

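# Try every pattern of every type; a later match overwrites an earlier one
for (type in types) {
    for (item in type_list[[type]]) {
        # Row indices of the sentences that contain this pattern
        matches <- grep(item, df$sentences, ignore.case = TRUE)
        df[matches, "type"] <- type
        df[matches, "item"] <- item
    }
}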


# Output:
#   id     sentences    type   item
# 1  1 I like apples   fruit apples
# 2  2   I like dogs animals   dogs
# 3  3   I have cats animals   cats
# 4  4 Dogs are cute animals   dogs
# 5  5   I like fish    <NA>   <NA>
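
Since the question asks for a function and for whole-word matching with \b, the loop above can be wrapped as in the following minimal sketch. The name add_type and its column and whole_word arguments are made up for illustration, and the sketch assumes the patterns contain no regex metacharacters of their own:

```r
# Hypothetical helper: label each sentence with the type and item it matches.
add_type <- function(df, type_list, column = "sentences", whole_word = TRUE) {
  df$type <- NA_character_
  df$item <- NA_character_
  for (type in names(type_list)) {
    for (item in type_list[[type]]) {
      # \\b anchors the pattern at word boundaries, so "apple" does not
      # match inside "pineapples" (assumes `item` is plain text)
      pattern <- if (whole_word) paste0("\\b", item, "\\b") else item
      matches <- grepl(pattern, df[[column]], ignore.case = TRUE)
      # A later match overwrites an earlier one, as in the loop above
      df$type[matches] <- type
      df$item[matches] <- item
    }
  }
  df
}

example <- data.frame(
  sentences = c("I like an apple", "I like pineapples", "Dogs are cute"),
  stringsAsFactors = FALSE
)
result <- add_type(example, list(Fruit_apple = "apple", animals = c("dogs", "cats")))
```

With whole_word = TRUE, the second sentence stays NA because "apple" only occurs inside a longer word. Passing whole_word = FALSE gives the same substring matching as the loop.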

Edit

Added after the data was posted. If I read in your data and name it df, and name your list of types type_list, then the following works:


types <- names(type_list)

df$type <- NA
df$item <- NA

# Same loop as before, now reading the Sentences column
for (type in types) {
    for (item in type_list[[type]]) {
        matches <- grep(item, df$Sentences, ignore.case = TRUE)
        df[matches, "type"] <- type
        df[matches, "item"] <- item
    }
}

This is exactly the same as my earlier code, except that Sentences has a capital S in your data frame.
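
One caveat, going by the data as posted: some sentences appear to use typographic apostrophes ("didnʼt", "That’s"), while the patterns "used to didn't matter" and "That's something" in type_list use the plain ASCII apostrophe. Those two patterns would then match nothing and their rows would stay NA. A minimal sketch of one way around this is to normalise the apostrophes first. The helper name normalize_apostrophes is hypothetical, and the two characters handled are an assumption about what the data contains:

```r
# Hypothetical helper: replace typographic apostrophes with the ASCII one.
normalize_apostrophes <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)  # right single quotation mark
  gsub("\u02BC", "'", x, fixed = TRUE)       # modifier letter apostrophe
}

sentences <- c("It used to didn\u02BCt matter if you walked in late.",
               "That\u2019s something I ain\u2019t know about.")
clean <- normalize_apostrophes(sentences)

# Both ASCII patterns from type_list can now be found
found <- c(grepl("used to didn't matter", clean[1], fixed = TRUE),
           grepl("That's something", clean[2], fixed = TRUE))
```

In the code above this would amount to running df$Sentences <- normalize_apostrophes(df$Sentences) before the loop.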
