How to use regex function properly (regex not functioning properly)

Question

This is an exercise I am working on, and I was given the following instructions:

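Come up with a strategy that splits on punctuation marks or spaces, but that keeps intact words like "I've" or "wasn't", which have a punctuation mark in the middle, between two letters. (Or when the punctuation mark is at the beginning, as in "'em", or when there is a dollar sign at the beginning.) Apply your strategy to trump.words as defined below so that you display only those words with punctuation marks and/or dollar signs. With this strategy, the answer to the exercise should be 102 [not necessarily unique, but total] words.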

The lines of code/input I tried:

trump.lines = readLines("http://www.stat.cmu.edu/~pfreeman/trump.txt")
my.pattern=("([a-z]|[A-Z]){0,}([[:punct:]]|$){1,}([[:alnum:]]{1,})")
exp=regexpr(my.pattern,trump.lines,useBytes=TRUE)
regmatches(trump.lines,exp)

Output:

 [1] "would've"          "carefully-crafted" "Administration's" 
 [4] "nation's"          ",000"              "border-crosser"   
 [7] "I've"              "African-American"  "0"             
[10] ""               "0"              "America's"        
[13] "Let's"             "Clinton's"         "nation's"         
[16] "Clinton's"         "won't"             "\"extremely"      
[19] "America's"         "we're"             "don't"            
[22] "there's"           "African-American"  "it's"             
[25] "America's"         "won't"             "It's"             
[28] "I'm"               "nearly-one"        "China's"          
[31] "it's"              "China's"           "we'll"            
[34] "Middle-income"     "highest-taxed"     ""               
[37] "that's"            "We're"             "ten-point"        
[40] "I'm"               "I'll"              "I'm"              
[43] "he'd"              "there's"           "It's"             
[46] "can't"             "don't"             "\"I"

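One issue I found with my code is that the original txt file contains six dollar signs ($) while my output shows only 3 of them, and I don't understand how this is possible. Any help or a general push in the right direction would be greatly appreciated.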

Answer 1

Is this what you need?

grep("['$-]", unlist(strsplit(gsub(" -{1,2}", "", trump.lines), " ")), value = T)

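Explanation: here we perform four operations in one go:

gsub(" -{1,2}", "", trump.lines) removes standalone single or double dashes
strsplit(gsub(" -{1,2}", "", trump.lines), " ") splits the result on spaces
unlist(strsplit(gsub(" -{1,2}", "", trump.lines), " ")) unlists the result of the previous two operations
grep("['$-]", unlist(strsplit(gsub(" -{1,2}", "", trump.lines), " ")), value = T) finally keeps those 'words' that contain at least one member of the character class ' or $ or - (the 'words' with a leading " already contain one of the three characters in the class, so " does not need to be listed explicitly)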
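To see what each of the four steps contributes, the one-liner can be unrolled on a small made-up vector. This is only a sketch: `sample.lines` and the intermediate names (`no.dashes`, `pieces`, `words`, `hits`) are hypothetical stand-ins, not text or variables from the actual exercise.

```r
# Hypothetical sample standing in for trump.lines (not taken from the real file)
sample.lines <- c("I've said it -- we won't pay $20 for 'em",
                  "a carefully-crafted plan - nothing more")

# Step 1: remove standalone dashes, i.e. a space followed by one or two hyphens
no.dashes <- gsub(" -{1,2}", "", sample.lines)

# Step 2: split every line on single spaces (returns a list, one entry per line)
pieces <- strsplit(no.dashes, " ")

# Step 3: flatten the list into one character vector of 'words'
words <- unlist(pieces)

# Step 4: keep only the words containing ' or $ or -
# (inside [...] the $ is a literal dollar sign, and a trailing - is a literal hyphen)
hits <- grep("['$-]", words, value = TRUE)
print(hits)
```

On this sample, `hits` should contain "I've", "won't", "$20", "'em" and "carefully-crafted"; the hyphen inside "carefully-crafted" survives step 1 because it is not preceded by a space.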

Hope this helps.

Output:

 [1] "would've"          "would've"          "would've"          "carefully-crafted" "Administration's" 
 [6] "America's"         "That's"            "nation's"          "President's"       "border-crosser"   
[11] "years-old,"        "class'"            "I've"              "Sarah's"           "wasn't"           
[16] "African-American"  "African-American"  ",000"            "that's"            "0"             
[21] "0"              "We're"             ""               "forty-three"       "0"             
[26] "America's"         "Let's"             "Let's"             "pre-Hillary,"      "Clinton's"        
[31] "America's"         "nation's"          "it's"              "Clinton's"         "laid-off"         
[36] "they're"           "won't"             "can't"             "America's"         "America's"        
[41] "It's"              "It's"              "It's"              "It's"              "nation-"          
[46] "we're"             "don't"             "there's"           "African-American"  "other-"           
[51] "it's"              "America's"         "catch-and-release" "won't"             "It's"             
[56] "I'm"               "nearly-one"        "China's"           "husband's"         "it's"             
[61] "China's"           "we'll"             "don't"             "Middle-income"     "highest-taxed"    
[66] "job-killers"       ""                "that's"            "she's"             "that's"           
[71] "she's"             "We're"             "ten-point"         "I'm"               "I'll"             
[76] "they've"           "I'm"               "I'm"               "he'd"              "It's"             
[81] "there's"           "'em"               "It's"              "don't"             "can't"            
[86] "wouldn't"          "doesn't"           "don't"             "Don't"             "don't"            
[91] "It's"              "three-word"        "\"I'm"             "\"I'm"
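
As for why the attempt in the question returns fewer dollar amounts than the file contains, a likely contributor is that `regexpr` reports at most the first match in each element of its input, so a line with several qualifying words yields only one of them, while `gregexpr` reports every match. A minimal sketch, assuming a made-up line and a simplified pattern (`one.line` and `pat` are illustrative, not the asker's data or pattern):

```r
# Hypothetical line with four qualifying 'words'
one.line <- "It's $20 now and won't be $30 later"

# Simplified illustrative pattern: optional letters, then one of ' $ -, then alphanumerics
pat <- "[[:alpha:]]*['$-][[:alnum:]]+"

# regexpr: only the first match per element
first.only <- regmatches(one.line, regexpr(pat, one.line))

# gregexpr: all matches per element (regmatches returns a list, hence [[1]])
all.matches <- regmatches(one.line, gregexpr(pat, one.line))[[1]]

print(first.only)
print(all.matches)
```

Here `first.only` holds just "It's", whereas `all.matches` also picks up "$20", "won't" and "$30". Swapping `regexpr` for `gregexpr` and calling `unlist` on the result would be one way to adapt the original attempt.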
