Subset a string within a dataframe based on value of another column
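I am struggling with subsetting strings in a data frame column. I am working with linguistic data. In my data frame, the first column contains a verb stem and the second column contains a full sentence of several words, one of which is a conjugated verb. I want to create a third column that holds only the conjugated verb (so the other words are removed), meaning the word that contains the same verb stem as column 1 of the same row. I cannot simply use a list of all verb stems for this, because some sentences contain two verbs and I only want the verb whose stem matches column 1 of that row.
This is what my data looks like now: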
Verb_stem Full_sentence
1. copt to coptu to
2. puns punse kanchina
3. khag basana na lo khagunse nan
This is the output I would like:
Verb_stem Full_sentence Conjugated verb
1. copt to coptu to coptu
2. puns punse kanchina punse
3. khag basana na lo khagunse nan khagunse
After some research, I tried the following code:
Df$Conjugated_verb <- lapply(strsplit(Df$Full_sentence, " "), grep, pattern = Df$Verb_stem, value = TRUE)
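The problem I am facing now is that the code seems to look for the verb stem from the first row in every sentence, instead of switching to a new verb stem on each row. This is the output I get: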
Verb_stem Full_sentence Conjugated_verb
1. copt to coptu to coptu
2. puns punse kanchina character(0)
3. khag basana na lo khagunse nan character(0)
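I have tried many things and have been looking for a solution for several days, but I really don't know how to do this. If anyone has an idea, I would be very grateful. Thanks in advance!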
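A minimal sketch of why the attempt above behaves that way, assuming the toy data from the question: `grep()` is not vectorised over `pattern`, so when it receives a vector of stems it uses only the first one (with a warning, in the R versions I am aware of). The names `first_only` and `paired` are illustrative only.

```r
# Toy data mirroring the question.
Df <- data.frame(
  Verb_stem     = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

words <- strsplit(Df$Full_sentence, " ")

# Every sentence is searched for the first stem only ("copt"), so rows 2 and 3
# come back empty. The warning is suppressed here just to keep the demo quiet.
first_only <- suppressWarnings(
  lapply(words, grep, pattern = Df$Verb_stem, value = TRUE)
)

# Iterating over stems and word vectors in parallel pairs each stem with its
# own sentence.
paired <- Map(function(stem, w) grep(stem, w, value = TRUE), Df$Verb_stem, words)
```

So the fix is to walk over both columns together, which is what the answers below do.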
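You can use mapply() to operate on Verb_stem and Full_sentence pairwise.
within(df, {
  # split each sentence on whitespace, then keep the words containing that row's stem
  Conjugated_verb <- mapply(\(x, y) { z <- strsplit(y, "\\s+")[[1]] ; z[grepl(x, z)] },
                            Verb_stem, Full_sentence)
})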
Or
within(df, {
  # build a per-row regex around the stem and keep only the captured word
  Conjugated_verb <- mapply(\(x, y) sub(sprintf(".*(\\w*%s\\w*).*", x), "\\1", y),
                            Verb_stem, Full_sentence)
})
Output:
# Verb_stem Full_sentence Conjugated_verb
# 1 copt to coptu to coptu
# 2 puns punse kanchina punse
# 3 khag basana na lo khagunse nan khagunse
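One thing to keep in mind with the split-and-filter version: `mapply()` only simplifies to a character vector when every row yields exactly one word. If a row has no match or several, the column becomes a list. A minimal sketch of one way to guard against that, assuming you want the first matching word or `NA`; the helper `first_match` is hypothetical, not part of the answer above.

```r
df <- data.frame(
  Verb_stem     = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first word of `sentence` that contains `stem`,
# or NA if no word does. fixed = TRUE treats the stem as a literal string
# rather than a regular expression.
first_match <- function(stem, sentence) {
  words <- strsplit(sentence, "\\s+")[[1]]
  hits  <- words[grepl(stem, words, fixed = TRUE)]
  if (length(hits) == 0L) NA_character_ else hits[1L]
}

# Each call returns exactly one string, so mapply() simplifies to a
# character vector; unname() drops the names taken from Verb_stem.
df$Conjugated_verb <- unname(mapply(first_match, df$Verb_stem, df$Full_sentence))
```

Whether "first match" is the right rule for sentences where the stem occurs twice depends on your data, so treat that choice as an assumption.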
We can use the vectorized str_extract():
library(dplyr)
library(stringr)
df1 %>%
  # pattern per row: the stem followed by any run of non-space characters
  mutate(Conjugated = str_extract(Full_sentence, str_c(Verb_stem, "\\S*")))
Output:
Verb_stem Full_sentence Conjugated
1. copt to coptu to coptu
2. puns punse kanchina punse
3. khag basana na lo khagunse nan khagunse
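If you would rather not depend on stringr, the same "stem followed by non-space characters" idea can be sketched in base R. This is only an approximation of the approach above under that assumption; the helper `extract_first` is hypothetical.

```r
df1 <- data.frame(
  Verb_stem     = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first occurrence of the stem plus any following
# non-space characters, or NA when the stem does not occur.
extract_first <- function(stem, sentence) {
  m <- regmatches(sentence, regexpr(paste0(stem, "\\S*"), sentence))
  if (length(m) == 0L) NA_character_ else m
}

# regexpr() takes a single pattern, so the row-wise pairing is done by mapply().
df1$Conjugated <- unname(mapply(extract_first, df1$Verb_stem, df1$Full_sentence))
```

Note that this pattern starts matching at the stem, so anything attached in front of the stem within the same word is dropped, unlike the split-and-filter approach, which returns the whole word.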
Data:
df1 <- structure(list(Verb_stem = c("copt", "puns", "khag"),
Full_sentence = c("to coptu to",
"punse kanchina", "basana na lo khagunse nan")),
class = "data.frame", row.names = c("1.",
"2.", "3."))