Stata Regex for 'standalone' numbers in string

Question

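I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of digits that is not attached to a letter or to another character (other than white space). For example, if the string contains t370 or 6-test, I want those to remain. Only numbers that stand on their own, separated from everything else by spaces, should be removed.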

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

I would like to end up with:

ID     string
1      7-test
2      67-tty
3      j3782 3hty

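I have tried different regex statements to find numbers wrapped in word boundaries: regexr(string, "\b[0-9]+\b", ""). I have also tried adding the white space manually, " [0-9]+", but that only replaces the pattern when it occurs in the middle of the string, not at the start. If this is easier to do without regex, that is fine; I was just trying to become more familiar with it.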

Answer 1

Following the loop suggestion in the comments, you could do something like this:

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

gen N_words = wordcount(string) // # words in each string
qui sum N_words 
global max_words = r(max)  // max # words in all strings

split string, gen(part) parse(" ") // split string at space (p.s. space is the default)

gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}

drop part* N_words

which gives the following result:

. list

     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+

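Note that I have assumed you want to keep every word that contains at least one letter. You may need to adjust the regexm pattern for your specific use case.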
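
If you would rather stay with a single regex, the same "keep only words that contain a letter" rule can probably be written without the loop. This is only a minimal sketch, and it assumes Stata 14 or later, where the Unicode regex functions (ustrregexra) and lookarounds are available; string3 is just a name I made up for the new variable. The pattern matches a whole space-delimited token made up entirely of non-letter characters, so digits glued to a letter or to a hyphenated word are left alone.

```stata
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

* Assumption: Stata 14+ (ustrregexra uses the ICU regex engine).
* (?<!\S) : token starts at the beginning of the string or after white space
* [^A-Za-z\s]+ : one or more characters that are neither letters nor white space
* (?!\S)  : token ends at the end of the string or before white space
gen string3 = ustrregexra(string, "(?<!\S)[^A-Za-z\s]+(?!\S)", "")

* collapse the leftover runs of spaces and trim both ends
replace string3 = strtrim(stritrim(string3))

list
```

Unlike regexr, which replaces only the first match, ustrregexra replaces every match in the string. If you want to remove only pure digit tokens, you could swap the middle part for [0-9]+, but then the lone "-" in the first row would be left behind.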
