How to use regex function properly (regex not functioning properly)
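This is an exercise I am working on, and I was given the following instructions:
Come up with a strategy that splits on punctuation marks or spaces, but keeps intact words like "I've" or "wasn't" that have a punctuation mark in the middle, between two letters. (Or when the punctuation mark is at the beginning, as in "'em", or when there is a dollar sign at the beginning.) Apply your strategy to trump.words as defined below so that you display only those words with punctuation marks and/or dollar signs. With this strategy, the answer for the exercise should be 102 [not necessarily unique, but total] words.
The lines of code/input I tried: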
trump.lines = readLines("http://www.stat.cmu.edu/~pfreeman/trump.txt")
my.pattern=("([a-z]|[A-Z]){0,}([[:punct:]]|$){1,}([[:alnum:]]{1,})")
exp=regexpr(my.pattern,trump.lines,useBytes=TRUE)
regmatches(trump.lines,exp)
Output:
[1] "would've" "carefully-crafted" "Administration's"
[4] "nation's" ",000" "border-crosser"
[7] "I've" "African-American" "0"
[10] "" "0" "America's"
[13] "Let's" "Clinton's" "nation's"
[16] "Clinton's" "won't" "\"extremely"
[19] "America's" "we're" "don't"
[22] "there's" "African-American" "it's"
[25] "America's" "won't" "It's"
[28] "I'm" "nearly-one" "China's"
[31] "it's" "China's" "we'll"
[34] "Middle-income" "highest-taxed" ""
[37] "that's" "We're" "ten-point"
[40] "I'm" "I'll" "I'm"
[43] "he'd" "there's" "It's"
[46] "can't" "don't" "\"I"
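One problem I noticed with my code is that a word occurring six times in the original txt file appears only 3 times in my output, and I don't understand how that is possible. Any help or a general push in the right direction would be greatly appreciated.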
Is this what you need?
grep("['$-]", unlist(strsplit(gsub(" -{1,2}", "", trump.lines), " ")), value = TRUE)
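Explanation:
Here we perform four operations in one go:
gsub(" -{1,2}", "", trump.lines)
removes stand-alone single or double dashes
strsplit(gsub(" -{1,2}", "", trump.lines), " ")
splits the result of the previous operation into 'words' wherever there is a space
unlist(strsplit(gsub(" -{1,2}", "", trump.lines), " "))
unlists the result of the first two operations
grep("['$-]", unlist(strsplit(gsub(" -{1,2}", "", trump.lines), " ")), value = TRUE)
and, finally, keeps those 'words' that contain at least one member of the character class ['$-], that is, ' or $ or - (the 'words' preceded by " and/or \ already happen to contain one of these three characters, so those two need not be listed explicitly)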
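The four steps can also be run one at a time to see what each contributes. The sketch below uses two made-up lines in place of trump.lines; sample.lines and the intermediate variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical sample lines standing in for trump.lines
sample.lines <- c("I've said it -- we won't stop",
                  "a carefully-crafted plan costs $100")

# Step 1: remove stand-alone single or double dashes (and the space before them)
no.dashes <- gsub(" -{1,2}", "", sample.lines)

# Step 2: split each line into 'words' at single spaces (returns a list)
word.list <- strsplit(no.dashes, " ")

# Step 3: flatten the list into one character vector
words <- unlist(word.list)

# Step 4: keep only the 'words' containing ', $ or -
result <- grep("['$-]", words, value = TRUE)

print(result)
```

On this sample the stand-alone "--" is dropped in step 1, so only the contractions, the hyphenated word and the dollar amount survive step 4.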
Hope this helps.
Output:
[1] "would've" "would've" "would've" "carefully-crafted" "Administration's"
[6] "America's" "That's" "nation's" "President's" "border-crosser"
[11] "years-old," "class'" "I've" "Sarah's" "wasn't"
[16] "African-American" "African-American" ",000" "that's" "0"
[21] "0" "We're" "" "forty-three" "0"
[26] "America's" "Let's" "Let's" "pre-Hillary," "Clinton's"
[31] "America's" "nation's" "it's" "Clinton's" "laid-off"
[36] "they're" "won't" "can't" "America's" "America's"
[41] "It's" "It's" "It's" "It's" "nation-"
[46] "we're" "don't" "there's" "African-American" "other-"
[51] "it's" "America's" "catch-and-release" "won't" "It's"
[56] "I'm" "nearly-one" "China's" "husband's" "it's"
[61] "China's" "we'll" "don't" "Middle-income" "highest-taxed"
[66] "job-killers" "" "that's" "she's" "that's"
[71] "she's" "We're" "ten-point" "I'm" "I'll"
[76] "they've" "I'm" "I'm" "he'd" "It's"
[81] "there's" "'em" "It's" "don't" "can't"
[86] "wouldn't" "doesn't" "don't" "Don't" "don't"
[91] "It's" "three-word" "\"I'm" "\"I'm"
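On the question of why only some occurrences showed up in the original attempt: regexpr() reports only the first match in each element of its input, whereas gregexpr() reports all of them, so a line containing the same word twice contributes it only once. A minimal sketch of the difference, using a made-up sample line in place of trump.lines (the sample text and the simplified pattern are assumptions for illustration, not taken from the exercise):

```r
# Hypothetical sample line standing in for one element of trump.lines
x <- "America's workers and America's families"

# Simplified pattern: letters, an apostrophe, then letters
pattern <- "[[:alpha:]]+'[[:alpha:]]+"

# regexpr() finds only the first match in each element
first.only <- regmatches(x, regexpr(pattern, x))

# gregexpr() finds every match; regmatches() then returns a list,
# so unlist() flattens it into one character vector
all.matches <- unlist(regmatches(x, gregexpr(pattern, x)))

print(first.only)
print(all.matches)
```

Here first.only holds a single "America's" while all.matches holds both, which is the same kind of undercount described in the question.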