How do you extract only the uppercase portions of a string in Stata?

Here is a sample of the data:

part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"

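Each string either has an uppercase part and a lowercase part, or is entirely uppercase. I have been trying to use regular expressions to extract only the uppercase part of each string, without success. The best I have managed is to identify strings that begin and end with a run of uppercase characters: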

generate title = regexs(0) if regexm(part1, "^[A-Z][A-Z][A-Z].*[A-Z][A-Z][A-Z]$")

I also tried the following, which I took from another question on the forum:

generate title = regexs(0) if(regexm(part1, "\b[A-Z]{2,}\b"))

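It is supposed to find words containing at least two consecutive uppercase letters, but it only returns missing values for me. I am using Stata version 13.1 for Mac.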

As @stribizhev pointed out, negation may be one way to go:

clear
set more off

input ///
str70 myvar
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end

gen title = trim(regexs(2)) if regexm(myvar, "([,.]*)([^a-z]*$)")

list title

The result is

. list title

     +-----------------------------------------------+
     |                                         title |
     |-----------------------------------------------|
  1. |                           TEST MODEL SEADROME |
  2. |                            L.B. MAYER HONORED |
  3. |                                  A TOWN MOVES |
  4. |                      U.S. SAVINGS BONDS RALLY |
  5. |             N.D. NOSES OUT S.M.U. BY 27 TO 20 |
     |-----------------------------------------------|
  6. |                          BURN 2,300 SQUEALERS |
  7. |                                               |
  8. | N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                               |
 10. |                     PA. IT'S HIGHER EDUCATION |
     |-----------------------------------------------|
 11. |                               806 DECORATIONS |
 12. |                                               |
 13. |                    F.D.R. ASKS VICTORY EFFORT |
     +-----------------------------------------------+

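I think this is close to what you want, but it is not perfect. Unless the strings have some regular structure, it is hard to imagine a straightforward way to clean them. For example, compare the input and output for observations 6 and 10: the mixed-case "Pa." is dropped, but the all-caps "PA." is kept as part of the title.

If you have a database of titles, you could compare and match against it after the initial cleanup. See, for example, ssc describe strgroup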

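The implication of the question seems to be that you want a regular expression specification that extracts all instances. However reasonable that may be, it is not how regular expressions work in Stata. You need to loop over the instances. This uses moss (ssc install moss), which has that as its main purpose. (The allusion to gathering moss is a typically feeble piece of wordplay by the second program author, if he is reading this.)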

clear 
input str100 part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end 
compress 

moss part1, match("([A-Z]+)") regex 
egen wanted = concat(_match*), p(" ")
l wanted

     +--------------------------------------------------+
     |                                           wanted |
     |--------------------------------------------------|
  1. |                          C M TEST MODEL SEADROME |
  2. |                                L B MAYER HONORED |
  3. |                                     A TOWN MOVES |
  4. |                          U S SAVINGS BONDS RALLY |
  5. |                        N D NOSES OUT S M U BY TO |
     |--------------------------------------------------|
  6. |                               P P BURN SQUEALERS |
  7. |                                        O B I T N |
  8. | S S N Y DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                          R D D R |
 10. |                       P PA IT S HIGHER EDUCATION |
     |--------------------------------------------------|
 11. |                                      DECORATIONS |
 12. |                                        S H M F S |
 13. |                        F D R ASKS VICTORY EFFORT |
     +--------------------------------------------------+

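I assume you want spaces between the results; otherwise they are hard to read. You did not say what should happen to punctuation between uppercase letters; if you want to keep it, modify the regular expression accordingly.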
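For readers who cannot install moss, the loop it automates can be approximated with the built-in regexm(), regexs() and regexr() functions. This is only a minimal sketch, not how moss itself is implemented: the variable names work and wanted2 are made up for illustration, and the three-observation dataset is a cut-down copy of the example data.

```stata
clear
input str60 part1
"Cambridge, Maryland TEST MODEL SEADROME"
"A TOWN MOVES"
"806 DECORATIONS"
end

* work is a scratch copy that is consumed match by match (hypothetical name)
generate work = part1
generate wanted2 = ""

* keep going while any observation still contains an uppercase letter
count if regexm(work, "[A-Z]")
local left = r(N)
while `left' > 0 {
    * append the first run of capitals still present in work
    replace wanted2 = wanted2 + " " + regexs(0) if regexm(work, "[A-Z]+")
    * regexr() replaces only the first match, so the next pass sees the next run
    replace work = regexr(work, "[A-Z]+", "")
    count if regexm(work, "[A-Z]")
    local left = r(N)
}

replace wanted2 = trim(wanted2)
drop work
list wanted2
```

Each pass handles one match per observation, so the loop runs as many times as the largest number of uppercase runs found in any single string.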

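I can't think of a rule that parses this kind of data cleanly in a single command. Usually the best strategy is to target the easy cases first and then move on to harder ones, until diminishing returns make further effort unattractive.

When using regular expressions, always watch for unintended matches, especially when there are many observations. I use listsome (from SSC) for this kind of work.

It looks as though part1 often begins with a city name followed by a state name or abbreviation. Here is code that handles the easy cases and the city/state cases: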

clear
input str70 part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end

* take care of the easy cases where there are no lowercase letters
gen title = part1 if !regexm(part1,"[a-z]")

* this type of string work is easier if text is aligned to the left
leftalign   // (from SSC)

* target cases of City, State at the start of part1.
* with complex patterns, it's easy to miss unintended matches when
* lots of obs are involved, so use -listsome- (from SSC) to track changes
gen title0 = title
replace title = trim(regexs(3)) if regexm(part1,"^([A-Z][a-z ]*)+, ([A-Z][a-z]*\.?)+([^a-z]+$)")
listsome if title != title0

list part1 title
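
Before writing the next, harder rule, it helps to see what is still unhandled. The following is a small illustrative check rather than part of the original answer: it applies only the easy rule to a three-observation subset of the data and then counts and lists the observations that still have no title.

```stata
clear
input str60 part1
"A TOWN MOVES"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Risk Death in Daring Race"
end

* easy rule from above: strings with no lowercase letters at all
generate title = part1 if !regexm(part1, "[a-z]")

* how many observations are still unhandled, and what do they look like?
count if missing(title)
list part1 if missing(title), noobs
```

Repeating this check after each new rule shows whether the rule was worth adding and when the remaining cases are too few or too irregular to justify another one.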