Stata flag when word found, not strpos

Question

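I have some data with strings, and I want to flag when a word is found. A word is defined as being at the start of the string, at the end of the string, or separated by spaces. strpos will find the string wherever it is present, but I am looking for something similar to subinword. Is there a way in Stata to use the functionality of subinword without having to replace anything, and instead flag the word?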

clear 
input id str50 strings
1 "the thin th man"
2  "this old then"
3 "th to moon"
4 "moon blank th"
end

gen th_pos = 0
replace th_pos = 1 if strpos(strings, "th") > 0

The code above flags every observation, because they all contain "th", but my desired output is:

ID      strings          th_sub
1   "the thin th man"      1
2   "this old then"        0
3   "th to moon"           1
4   "moon blank th"        1

Answer 1

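One small trick is that "th" as a word will be preceded and followed by a space, unless it occurs at the start or end of the string. The exceptions are really no challenge, as

gen wanted = strpos(" " + strings + " ", " th ") > 0

works around them. Otherwise, there is a rich set of regular expression functions to play with.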

Your example of code that does not do what you want condenses to one line,

gen th_pos = strpos(strings, "th") > 0

A more direct answer is that you don't have to replace anything. You just get Stata to tell you what would happen if you did:

gen WANTED = strings != subinword(strings, "th", "", .)

If removing a word (if present) changes the string, then the word must have been present.

Answer 2

Regular expressions are useful for this kind of exercise. A word boundary, denoted by \b, lets you search for whole words, as in "\bword\b".

gen wanted = ustrregexm(strings, "\bth\b")
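
As a quick check, the approaches from both answers can be run side by side on the question's data. This is only a sketch: the variable names flag_pad, flag_sub and flag_re are made up for illustration, and ustrregexm() assumes Stata 14 or later.

```stata
* Recreate the example data from the question
clear
input id str50 strings
1 "the thin th man"
2 "this old then"
3 "th to moon"
4 "moon blank th"
end

* Pad with spaces so a word at the start or end looks like an interior word
gen byte flag_pad = strpos(" " + strings + " ", " th ") > 0

* The string changes only if the word "th" was there to be removed
gen byte flag_sub = strings != subinword(strings, "th", "", .)

* Unicode regular expression with word boundaries
gen byte flag_re = ustrregexm(strings, "\bth\b")

* On this data all three flags should agree
assert flag_pad == flag_sub & flag_sub == flag_re
list id strings flag_pad flag_sub flag_re, noobs
```

On other data the flags need not agree: \b also treats punctuation as a boundary, so a string such as "th, moon" should be flagged by the regular expression but not by the two space-based approaches.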


string

stata