简化解析的正则表达式
Simplify parsed regex
我必须简化解析为特定数据类型的自定义正则表达式。 “简化”是指以下(强调我的):
Given the rules:
- lowercase letters match themselves, eg.:
a
matches a
and nothing else
- parens enclosing only letters match their full sequence, eg.:
(abc)
matches abc
and nothing else
- square brackets enclosing only letters match every letters inside, eg.:
[abc]
matches a
and b
and c
and nothing else
The following are all valid:
(a[bc])
matches ab
and ac
and nothing else
[a(bc)]
matches a
and bc
and nothing else
(a(bc))
is the same as (abc)
and matches abc
and nothing else
[a[bc]]
is the same as [abc]
and matches a
and b
and c
and nothing else
Regexes can be simplified. For example [a[[bb]b[[b]]](c)(d)]
is
really just the same as [abcd]
which matches a
, b
, c
and d
.
我在 Haskell 中使用 attoparsec
和以下目标数据类型实现了一个简单的解析器组合器:
data Regex
= Symbol Char
| Concat [Regex] -- ()
| Union [Regex] -- []
deriving (Eq)
但是,我真的很难处理简化部分。我尝试通过展开它们的组合来减少 Concat
s 和 Union
s,nubbing and concatMapping 无济于事。我认为我定义的数据类型可能不是最合适的,但我有 运行 的想法(深夜在这里)。你能帮我看看正确的方向吗?谢谢!
simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs
对于初学者,您缺少一些简单的改进。 simplify (Concat [x]) = x
对于 Union 也是如此:包装的正则表达式不需要专门是一个符号。
然后您需要开始查看包含其他 Concat
的 Concat
,对于 Union
也是如此。当然,您首先会简化包装列表的元素,但在将结果塞回包装器之前,您会使用相同的包装器提取任何元素。类似于:
simplify (Concat xs) =
case concatMap liftConcats (map simplify xs) of
[x] -> x
xs -> Concat xs
where liftConcats :: Regex -> [Regex]
liftConcats r = _exerciseForTheReader
然后你可以为 Union 做类似的事情,同时加入 nub
。
我必须简化解析为特定数据类型的自定义正则表达式。 “简化”是指以下(强调我的):
Given the rules:
- lowercase letters match themselves, eg.:
a
matchesa
and nothing else- parens enclosing only letters match their full sequence, eg.:
(abc)
matchesabc
and nothing else- square brackets enclosing only letters match every letters inside, eg.:
[abc]
matchesa
andb
andc
and nothing elseThe following are all valid:
(a[bc])
matchesab
andac
and nothing else[a(bc)]
matchesa
andbc
and nothing else(a(bc))
is the same as(abc)
and matchesabc
and nothing else[a[bc]]
is the same as[abc]
and matchesa
andb
andc
and nothing elseRegexes can be simplified. For example
[a[[bb]b[[b]]](c)(d)]
is really just the same as[abcd]
which matchesa
,b
,c
andd
.
我在 Haskell 中使用 attoparsec
和以下目标数据类型实现了一个简单的解析器组合器:
data Regex
= Symbol Char
| Concat [Regex] -- ()
| Union [Regex] -- []
deriving (Eq)
但是,我真的很难处理简化部分。我尝试通过展开它们的组合来减少 Concat
s 和 Union
s,nubbing and concatMapping 无济于事。我认为我定义的数据类型可能不是最合适的,但我有 运行 的想法(深夜在这里)。你能帮我看看正确的方向吗?谢谢!
simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs
对于初学者,您缺少一些简单的改进。 simplify (Concat [x]) = x
对于 Union 也是如此:包装的正则表达式不需要专门是一个符号。
然后您需要开始查看包含其他 Concat
的 Concat
,对于 Union
也是如此。当然,您首先会简化包装列表的元素,但在将结果塞回包装器之前,您会使用相同的包装器提取任何元素。类似于:
simplify (Concat xs) =
case concatMap liftConcats (map simplify xs) of
[x] -> x
xs -> Concat xs
where liftConcats :: Regex -> [Regex]
liftConcats r = _exerciseForTheReader
然后你可以为 Union 做类似的事情,同时加入 nub
。