Using Regex to select multiple sentence patterns - issue with grouping?

Question

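I'm having trouble extracting exact matches of patterns from a data frame using a regex in R.

I have 11 sentence patterns, and I would like to use a single regex to select only the records in my data frame that match these patterns exactly (I have managed it with multiple regexes, but that is really cumbersome). Any help would be appreciated.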

These are my sentences:

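A change to headings 0101 through 0106 from any other chapter.
A change to subheadings 0712.20 through 0712.39 from any other chapter.
A change to heading 0903 from any other chapter.
A change to subheading 1806.20 from any other heading.
A change to subheading 1207.99 from any other chapter.
A change to heading 4302 from any other heading.
A change to subheading 4105.10 from heading 4102 or any other chapter.
A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.
A change to subheading 4106.21 from subheading 4103.10 or any other chapter.
A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.
A change to tariff item 7304.41.30 from subheading 7304.49 or any other chapter.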

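This is the regex I have right now. It selects both exact and partial matches (this is where I'm stuck), so I end up with records from my data frame that I don't want alongside these sentences (I know it's messy; it's just an example).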

^A change to (?:headings|heading|subheadings|subheading|tariff item) (?:\d+\S\d+\S\d+|\d+\S\d+) (?:through \d+\S\d+ from any other chapter.|from any other chapter.|from any other heading.|)|from heading \d+\S\d+ or any other chapter.|from (?:heading|subheading|subheadings) \d+\S\d+|, subheading \d+\S\d+ or any other chapter| or any other chapter.| or \d+\S\d+

This is as far as I got with a regex that exactly matches all 11 sentences. I could not work out how to continue grouping it cleanly after this point:

^A change to (?:tariff item|headings|heading|subheading|subheadings) (?:\d+\S\d+|\d+\S\d+\S\d+|\d+\S\d+) (?:from|through)

Answer 1

You can use

rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

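See the regex demo. Note that \s+ matches one or more whitespace characters, so the pattern still matches when the number and kind of whitespace characters between words is not fixed. Inside an R string literal every backslash must be doubled (\\s+, \\d, \\.); the breakdown below shows the pattern with single backslashes, as the regex engine sees it.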
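As a small illustration of the whitespace point, the sketch below tests a shortened prefix of the pattern against a few made-up strings (the strings and the name rx_prefix are hypothetical, not from the question). perl = TRUE is passed only to keep the engine choice explicit.

```r
# Shortened prefix of the answer's pattern; backslashes are doubled for R.
rx_prefix <- "A\\s+change\\s+to\\s+heading"

# Made-up inputs: single spaces, mixed whitespace, and a missing space.
x <- c(
  "A change to heading 0903",
  "A  change\tto heading 0903",
  "A change toheading 0903"
)

# \s+ accepts any run of whitespace but requires at least one character.
ws_hits <- grepl(rx_prefix, x, perl = TRUE)
print(ws_hits)
```

The first two strings match and the third does not, because \s+ needs at least one whitespace character between "to" and "heading".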

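Details

A\s+change\s+to\s+ - the A change to substring, followed by 1+ whitespace characters
(?:(?:sub)?headings?|tariff\s+item) - one of subheading, subheadings, heading, headings, or tariff item
\s+\d[0-9.]* - 1+ whitespace characters, a digit, then 0 or more digits or . characters
(?:\s+through\s+\d[0-9.]*)? - an optional sequence of:
- \s+ - 1+ whitespace characters
- through - the literal word through
- \s+ - 1+ whitespace characters
- \d[0-9.]* - a digit, then 0 or more digits or . characters
\s+from - 1+ whitespace characters and from
(?:(?:,?\s+(?:sub)?headings?\s+\d[0-9.]*)+(?:\s+or\s+\d[0-9.]*)*\s+or)? - an optional sequence of:
- (?:,?\s+(?:sub)?headings?\s+\d[0-9.]*)+ - 1 or more repetitions of:
  - ,? - an optional comma
  - \s+ - 1+ whitespace characters
  - (?:sub)?headings? - an optional sub, then heading, then an optional s
  - \s+ - 1+ whitespace characters
  - \d[0-9.]* - a digit, then 0 or more digits or . characters
- (?:\s+or\s+\d[0-9.]*)* - 0 or more repetitions of:
  - \s+ - 1+ whitespace characters
  - or\s+\d[0-9.]* - or, 1+ whitespace characters, a digit, then 0 or more digits or . characters
- \s+or - 1+ whitespace characters and or
\s+any\s+other\s+(?:heading|chapter)\. - 1+ whitespace characters, then any other heading. or any other chapter. (with a literal dot)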
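The breakdown above can also be mirrored in code by assembling the pattern from named pieces. This is only a sketch of one way to keep a long regex readable; the variable names (num, level, target, through_part, sources, rx_built) are hypothetical and not part of the original answer.

```r
# Building blocks, named after the parts described in the breakdown.
num          <- "\\d[0-9.]*"                       # a digit, then digits or dots
level        <- "(?:sub)?headings?"                # heading(s) or subheading(s)
target       <- paste0("(?:", level, "|tariff\\s+item)")
through_part <- paste0("(?:\\s+through\\s+", num, ")?")
sources      <- paste0(
  "(?:(?:,?\\s+", level, "\\s+", num, ")+",        # one or more listed sources
  "(?:\\s+or\\s+", num, ")*",                      # extra numbers joined by "or"
  "\\s+or)?"                                       # the "or" before "any other"
)

# Concatenate the pieces in the order they appear in the sentence.
rx_built <- paste0(
  "A\\s+change\\s+to\\s+", target, "\\s+", num, through_part,
  "\\s+from", sources,
  "\\s+any\\s+other\\s+(?:heading|chapter)\\."
)

# Check the assembled pattern against one of the sentences from the question.
s <- "A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter."
built_ok <- grepl(rx_built, s, perl = TRUE)
print(built_ok)
```

The assembled string is character-for-character the same pattern as the single-line rx, so either form can be passed to the matching functions.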

This online R demo returns all 11 matches:

text <- "A change to headings 0101 through 0106 from any other chapter.
A change to subheadings 0712.20 through 0712.39 from any other chapter.
A change to heading 0903 from any other chapter.
A change to subheading 1806.20 from any other heading.
A change to subheading 1207.99 from any other chapter.
A change to heading 4302 from any other heading.
A change to subheading 4105.10 from heading 4102 or any other chapter.
A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.
A change to subheading 4106.21 from subheading 4103.10 or any other chapter.
A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.
A change to tariff item 7304.41.30 from subheading 7304.49 or any other chapter."
rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."
regmatches(text, gregexpr(rx, text))
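Since the question is about keeping only data frame records that match in full, here is a hedged sketch of how the same pattern might be applied row by row. The data frame df, its column names, and the two non-matching rows are invented for illustration; wrapping the pattern in ^(?:...)$ is one way to reject rows that merely contain a matching sentence.

```r
rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

# Hypothetical data: rows 1-2 are full matches, row 3 has trailing text,
# row 4 does not follow the pattern at all.
df <- data.frame(
  id = 1:4,
  rule = c(
    "A change to heading 0903 from any other chapter.",
    "A change to subheading 4105.10 from heading 4102 or any other chapter.",
    "A change to heading 0903 from any other chapter. Additional text follows.",
    "No change in tariff classification is required."
  ),
  stringsAsFactors = FALSE
)

# Unanchored: TRUE wherever the sentence occurs anywhere in the string.
partial <- grepl(rx, df$rule, perl = TRUE)

# Anchored: TRUE only when the whole string is exactly one such sentence.
full <- grepl(paste0("^(?:", rx, ")$"), df$rule, perl = TRUE)

exact_rows <- df[full, ]
print(exact_rows)
```

With these rows, the unanchored test also accepts row 3, while the anchored test keeps only rows 1 and 2.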
