stri_replace_all_regex won't accept patterns and replacements from an imported file
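I have an AppleScript that finds and replaces about a hundred terms using regular expressions. I want to port this find-and-replace routine to R. So in Script Editor I saved the AppleScript as a text file and imported it into R with readLines(). The dput() of that import is shown below as punct.out. When I build my own data frame of patterns and replacements from raw vectors rather than from the import (see punct below) and then run the find-and-replace on a test vector (see test below), it works fine. But when I run the same command with the imported data frame, it does not work: it returns NA.
So somehow the imported text is not being interpreted as regular expressions or as character vectors... I can't figure it out.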
#structure of my imported patterns and replacements
punct.out<-structure(list(replace = c(NA, NA, "good-bye[a-z]+|good-bye",
"good bye[a-z]+|good bye", "good-", "ill at ease", "ill-", "-like",
" well,", "- well,", ", well,", "as well", ".,", ".... well",
"... well", ". Well,", ": well,", "well-", "well,", "well,",
"well,", "Well,", "- okay,", ", okay,", "okay,", " okay,", ".... okay",
"... okay", ". Okay,", ": okay,", "OK", "'okay,", "okay,", "Okay,",
"Okay", ", too", "too /", "too,", "too.", "too?", "too:", "(No)(. )([0- 9]+)",
"( [A-Z])(.)( )", "www.", "ain't", "let's", "won't", "can't",
"n't", "cannot", "'d", "'ll", "'m", "'ve", "'re", "!", "?", ";",
"", ",", "--", "-", "-", "é", "è", "à", "ç", "&", "%", "per cent",
"_", "Que.", "Ont.", "Nfld.", "Alta.", "Man.", "Sask.", "St.",
"Ste.", "i.e.", "Mr.", "Ms.", "Mrs.", "Prof.", ".com", "a. m.",
"p. m.", "a.m.", "p.m.", "Jan.", "Feb.", "Mar.", "Apr.", "Jun.",
"Jul.", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.", "gen.", "Dr.",
"e. coli", "(.)([A-Z])(.)", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])",
"([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])", "([A-Z])(.)([A-Z])",
"([0-9])(.)([0-9])", "()(S)", "([a-z]+)(')", "(')([a-z]+)", "bull ' s eye",
"no man ' s land", "pandora ' s box", "....", "...", ".", ",",
":", "", "", "", "", NA, NA), with = c("character(0)", "character(0)",
"goodbye", "goodbye", "good x", "ill at xease", "ill x", " xlike",
" xwell", " xwell", " xwell", "as xwell", " ", " xwell", " xwell",
". xWell", ": xwell", "well x", "xwell", " xwell", "xwell", "xWell",
" xokay", " xokay", " xokay", " xokay", " xokay", " xokay", ". xOkay",
": xokay", "okay", "xokay", "xokay", "xOkay", "xOkay", " xtoo",
"xtoo /", "xtoo", "xtoo.", "xtoo.", "xtoo", "#\\3", "\\1\\3",
"www", "am not", "let us", "will not", "can not", " not", "can not",
" would", " will", " am", " have", " are", ".", ".", "", "",
"", " ", " ", " ", "e", "e", "a", "c", "and", "percent", "percent",
" ", "Que", "Ont", "Nfld", "Alta", "Man", "Sask", "St", "Ste",
"ie", "Mr", "Ms", "Mrs", "Prof", "com", "am", "pm", " am", " pm",
"Jan", "Feb", "Mar", "Apr", "Jun", "Jul", "Aug", "Sept", "Oct",
"Nov", "Dec", "gen", "Dr", "e coli", "\\1\\2 ", "\\1\\3",
"\\1\\3", "\\1\\3", "\\1\\3", "\\1\\3", "\\1dot\\3",
"\\1 \\2", "\\1 \\2", "\\1 \\2", "bull's eye", "no man's land",
"pandora's box", "", "", " . ", " ,", "", " ", " ", " ", " ",
"character(0)", "character(0)")), .Names = c("replace", "with"
), row.names = c(NA, -127L), class = "data.frame")
#library
library(stringi)
#test string
test<-c('Sept.','Mr.' ,'Oct.', 'ill at ease', 'as well', 'Dr.', 'OK'
, 'well,', '.com')
#data frame of patterns and replacements
punct<-data.frame(replace=c('ill at ease', 'Sept.', 'Mr.', 'Oct.',
'as well', 'Dr.', 'OK', 'well,', '.com'), with=c('ill at xease', 'Sept',
'Mr', 'Oct', 'as xwell', 'Dr', 'okay', 'xwell', 'com'))
#This works
stri_replace_all_regex(test, punct$replace, punct$with, vectorize_all=F)
#But this doesn't
stri_replace_all_regex(test, punct.out$replace, punct.out$with,
vectorize_all=F)
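Second question:
I solved the problem above based on the comments below. However, a few of the regular expressions still cause specific problems. Specifically, I don't know how to escape the backslashes so that the replacement inserts the first and second captured groups, i.e. \1, \2, etc.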
#Define data
punct.out<-structure(list(replace = c("(\\.)([A-Z])(\\.)", "([A-Z])(\\.)([A-Z])",
"([0-9])(\\.)([0-9])", "([a-z]+)(')", "(') ([a-z]+)"), with = c("\\1\\2 ",
"\\1\\3", "\\1dot\\3", "\\1 \\2", "\\1 \\2")), .Names = c("replace",
"with"), row.names = c(104L, 105L, 110L, 112L, 113L), class = "data.frame")
#Test string of characters that the above regex's are supposed to match
test<-c('.B.', 'B.B', '1.1','premier\'s')
#This sort of works but I clearly haven't figured out how to properly escape
#the backslashes to capture the references
stri_replace_all_regex(test,punct.out$replace, punct.out$with,
vectorize_all=F)
#Based on the help for stri_replace I also tried using $ to capture the
#references.
punct.out$with<-gsub('\\\\', '$', punct.out$with)
#And it did work.
stri_replace_all_regex(test,punct.out$replace, punct.out$with, vectorize_all=F)
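punct.out contains missing observations. That is why you get NAs in the output. You should first use, for example, na.omit. Moreover, when you perform regex matching, some characters (such as .) should be escaped, i.e. preceded by a backslash. Also note that there are some empty strings in the first column - they should be removed as well.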
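The cleanup described above could look roughly like the sketch below. The small `rules` data frame is a hypothetical stand-in for punct.out (it is not the imported data), and the sketch assumes that every dot in the patterns is meant literally, which may not hold for every row of the real table. The `$1`-style references follow the form described in the stri_replace help page.

```r
library(stringi)

# Hypothetical miniature of the imported table, with the same kinds of
# problem rows: an NA pattern, an empty pattern, and unescaped dots.
rules <- data.frame(
  replace = c(NA, "Sept.", "", "Dr.", "([0-9])(.)([0-9])"),
  with    = c("character(0)", "Sept", " ", "Dr", "\\1dot\\3"),
  stringsAsFactors = FALSE
)

# Keep only rows whose pattern is neither missing nor empty.
ok <- !is.na(rules$replace) & nzchar(rules$replace)
rules <- rules[ok, ]

# Assumption: every "." in a pattern is a literal dot, so escape them all.
rules$replace <- stri_replace_all_fixed(rules$replace, ".", "\\.")

# Rewrite \1, \2, ... in the replacements as $1, $2, ...
rules$with <- gsub("\\\\([0-9])", "$\\1", rules$with)

test <- c("Sept.", "Dry", "Dr.", "1.1")
out <- stri_replace_all_regex(test, rules$replace, rules$with,
                              vectorize_all = FALSE)
print(out)
# "Sept"  "Dry"   "Dr"    "1dot1"
```

Without the escaping step, the pattern "Dr." would also match "Dry", because an unescaped dot matches any character.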