RegEx for matching trigrams
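I am trying to get all three-word groups from a string, which may contain several sentences, without crossing sentence boundaries. I have it working for words made up of standard letters only: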
preg_match_all("/(?=(\b(\w+)(?:\s+(\w+)\b|$)(?:\s+(\w+)\b|$)))/",$utext,$matches);
print_r($matches[1]);
But it falls down wherever there are apostrophes or hyphens. So, with this sample text:
The quick brown fox's feet jumped over the lazy dog. The rain falls head-first in the plain.
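I want this list:
- The quick brown
- quick brown fox's
- brown fox's feet
- fox's feet jumped
- feet jumped over
- jumped over the
- over the lazy
- the lazy dog
- The rain falls
- rain falls head-first
- falls head-first in
- head-first in the
- in the plain
I have tried using [\w'-] in place of each \w above, but this produces some odd cases: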
Array ( [0] => The quick brown [1] => quick brown fox's [2] => brown fox's feet [3] => fox's feet jumped [4] => 's feet jumped [5] => s feet jumped [6] => feet jumped over [7] => jumped over the [8] => over the lazy [9] => the lazy dog [10] => The rain falls [11] => rain falls head-first [12] => falls head-first in [13] => head-first in the [14] => -first in the [15] => first in the [16] => in the plain )
What am I missing? Thanks
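Just change \w to [^\s.] (not a space or a dot) and remove the word boundaries. The \b assertions are what cause the odd cases: an apostrophe or hyphen is not a word character, so \b also matches on either side of it, and a match can start in the middle of fox's or head-first. The other change is a lookbehind at the start of the regex that alternates "beginning of line OR space":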
$text = "The quick brown fox's feet jumped over the lazy dog. The rain falls head-first in the plain.";
preg_match_all("/(?=((?<=^|\s)[^\s.]+(?:\s+[^\s.]+|$)(?:\s+[^\s.]+|$)))/",$text,$matches);
print_r($matches[1]);
Output:
Array
(
[0] => The quick brown
[1] => quick brown fox's
[2] => brown fox's feet
[3] => fox's feet jumped
[4] => feet jumped over
[5] => jumped over the
[6] => over the lazy
[7] => the lazy dog
[8] => The rain falls
[9] => rain falls head-first
[10] => falls head-first in
[11] => head-first in the
[12] => in the plain
)
Regex explanation:
(?= # lookahead
( # start group 1
(?<=^|\s) # lookbehind, make sure we have beginning of line or space before
[^\s.]+ # 1 or more non space, non dot
(?: # non capture group
\s+ # 1 or more spaces
[^\s.]+ # 1 or more non space, non dot
| # OR
$ # end of line
) # end group
(?: # non capture group
\s+ # 1 or more spaces
[^\s.]+ # 1 or more non space, non dot
| # OR
$ # end of line
) # end group
) # end group 1
) # end lookahead
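The commented breakdown above can also be run as is, because PCRE's x modifier ignores whitespace and # comments inside a pattern. The following is a minimal sketch of that idea, not part of the original answer: the function name trigrams() and the ~ delimiter are assumptions chosen for illustration, and the pattern itself is the one explained above.

```php
<?php
// Hypothetical helper (the name is an assumption): returns every
// three-word group that stays inside a single sentence.
function trigrams(string $text): array
{
    // Same pattern as above, written with the x flag so that whitespace
    // and # comments inside the pattern are ignored.
    $pattern = <<<'REGEX'
~(?=                       # lookahead, so matches may overlap
    (                      # start group 1
        (?<=^|\s)          # preceded by start of string or whitespace
        [^\s.]+            # first word: no whitespace, no dot
        (?:\s+[^\s.]+|$)   # second word, or end of string
        (?:\s+[^\s.]+|$)   # third word, or end of string
    )                      # end group 1
)~x
REGEX;
    preg_match_all($pattern, $text, $matches);
    return $matches[1];    // group 1 holds the captured trigrams
}

$text = "The quick brown fox's feet jumped over the lazy dog. The rain falls head-first in the plain.";
print_r(trigrams($text));
```

For the sample text this should return the same 13 entries as the output listed above. The nowdoc string avoids having to escape backslashes or quotes in the pattern.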
Edit according to the comments:
$text = "The quick brown fox's feet jumped over the lazy dog. The rain falls head-first in the plain. 'This is a quote,' I say, and that's that.";
preg_match_all("/(?=((?<=^|\s|')(?:(?<=[a-zA-Z])'(?=[a-zA-Z])|[^\s.,'])+(?:\s+(?:(?<=[a-zA-Z])'(?=[a-zA-Z])|[^\s.,'])+|$){2}))/",$text,$matches);
print_r($matches[1]);
Output:
Array
(
[0] => The quick brown
[1] => quick brown fox's
[2] => brown fox's feet
[3] => fox's feet jumped
[4] => s feet jumped
[5] => feet jumped over
[6] => jumped over the
[7] => over the lazy
[8] => the lazy dog
[9] => The rain falls
[10] => rain falls head-first
[11] => falls head-first in
[12] => head-first in the
[13] => in the plain
[14] => This is a
[15] => is a quote
[16] => and that's that
)
Regex explanation:
(?= # lookahead
( # start group 1
(?<=^|\s|') # lookbehind, make sure we have beginning of line or space or quote before
(?: # start non capture group
(?<=[a-zA-Z]) # lookbehind, make sure we have a letter before
' # a single quote
(?=[a-zA-Z]) # lookahead, make sure we have a letter after
| # OR
[^\s.,'] # not a space or dot or comma or single quote
)+ # group may appear 1 or more times
(?: # non capture group
\s+ # 1 or more spaces
(?: # non capture group
(?<=[a-zA-Z]) # lookbehind, make sure we have a letter before
' # a single quote
(?=[a-zA-Z]) # lookahead, make sure we have a letter after
| # OR
[^\s.,'] # not a space or dot or comma or single quote
)+ # group may appear 1 or more times
| # OR
$ # end of line
){2} # end group, must appear twice
) # end group 1
) # end lookahead
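The output of the edited pattern still contains "s feet jumped", because the leading lookbehind accepts any single quote, including the apostrophe inside fox's. One possible tweak, offered as a sketch rather than as the answer's own method, is to accept a quote only when it follows whitespace or the start of the string. This relies on PCRE allowing the top-level alternatives of a lookbehind to have different fixed lengths. The function name trigramsQuoted() and the ~ delimiter are assumptions for illustration.

```php
<?php
// Hypothetical variant of the edited pattern; only the first lookbehind
// differs from the pattern explained above.
function trigramsQuoted(string $text): array
{
    $pattern = <<<'REGEX'
~(?=(
    (?<=^|\s|^'|\s')                                       # word start, optionally after an opening quote
    (?:(?<=[a-zA-Z])'(?=[a-zA-Z])|[^\s.,'])+               # first word; an apostrophe only between letters
    (?:\s+(?:(?<=[a-zA-Z])'(?=[a-zA-Z])|[^\s.,'])+|$){2}   # two more words, or end of string
))~x
REGEX;
    preg_match_all($pattern, $text, $matches);
    return $matches[1];    // group 1 holds the captured trigrams
}

$text = "The quick brown fox's feet jumped over the lazy dog. The rain falls head-first in the plain. 'This is a quote,' I say, and that's that.";
print_r(trigramsQuoted($text));
```

With the sample text, this should drop the "s feet jumped" entry while keeping "This is a" and "and that's that". Other quoting styles, such as double or typographic quotes, are not covered by this sketch.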