Haskell RegEx Matching on UTF8 file
I wrote this function:
module PdfParser (parseOptions) where

import Text.Regex.PCRE
import Data.List.Split

parseOptions :: String -> [String]
parseOptions s = splitOn "\n" (s =~ regex :: String)
  where
    regex = "(?<=OPTIONS\n)((.|\n)*?)(?=INTERIEUR|INTÉRIEUR|EQUIPEMENTS DE SERIE)"
and this test:
module PdfParserSpec (spec) where

import Test.Hspec
import Test.QuickCheck
import PdfParser (parseOptions)

spec :: Spec
spec = do
  describe "PdfParser (parseOptions)" $ do
    it "return List of options" $ do
      referencialText <- readFile "test/assets/referential.txt"
      parseOptions referencialText `shouldBe` [
          "Peinture métallisée 550 €"
        , "Jantes alliage 17\" Viva Stella [RDIF21] 300 €"
        , "Chargeur sans fil 250 €"
        , "Roue de secours tôle [RSEC01] 150 €"]
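But when I read the text file, all my accented characters (é, è, etc.) show up as numeric escapes such as \233, and my regex no longer matches the right span.
Test result: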
test/PdfParserSpec.hs:12:7:
1) PdfParser, PdfParser (parseOptions), return List of options
     expected: ["Peinture m\233tallis\233e 550 \8364","Jantes alliage 17\" Viva Stella [RDIF21] 300 \8364","Chargeur sans fil 250 \8364","Roue de secours t\244le [RSEC01] 150 \8364"]
      but got: ["s alliage 17\" Viva Stella [RDIF21] 300 \8364","Chargeur sans fil 250 \8364","Roue de secours t\244le [RSEC01] 150 \8364","INT\201RIEUR","Sellerie Zen (Au lieu de Selleri"]
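A side note on the escapes in that output: as far as I can tell, hspec prints values through their Show instance, and show on a String writes every non-ASCII character as a decimal escape. So the escapes on their own do not prove that the file was read incorrectly. The sketch below only uses base; nonAscii is a hypothetical helper name and the sample line is copied from the test.

```haskell
module Main where

import Data.Char (ord)

-- Hypothetical helper: the non-ASCII characters of a string,
-- paired with their decimal code points.
nonAscii :: String -> [(Char, Int)]
nonAscii s = [(c, ord c) | c <- s, ord c > 127]

main :: IO ()
main = do
  let sample = "Peinture métallisée 550 €"  -- line taken from the test
  -- show escapes non-ASCII characters as decimal code points
  putStrLn (show sample)
  -- the code points behind those escapes
  print (map snd (nonAscii sample))
```

The expected list in the test output is escaped in the same way, which suggests the real problem is the shifted match rather than the characters themselves.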
My regex works on my file: https://regex101.com/r/HYBmMh/1
How can I fix this?
I switched from the Hackage package regex-pcre-builtin to pcre-light, and it works!
I had to encode my string as a UTF-8 ByteString and then pass the utf8 flag when compiling the regex:
module PdfParser (parseOptions) where

import Text.Regex.PCRE.Light (compile, utf8, match)
import Data.ByteString.UTF8 (toString, fromString)
import Data.List.Split
import Data.String.Utils (strip)

parseOptions :: String -> Maybe [String]
parseOptions s = (splitOn "\n" . strip . toString . (!!0)) <$> match regex (fromString s) []
  where
    -- Backslashes are doubled so that PCRE receives \s and \S.
    regex = compile (fromString "(?<=OPTIONS\n)([\\s\\S]*?)(?=INTÉRIEUR)") [utf8]
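Why the encoding step seems to matter (this is my guess at the cause, I did not check the regex-pcre source): a position counted in UTF-8 bytes differs from the same position counted in characters as soon as the text contains accents, which would explain a match that starts and ends a few characters off. A minimal sketch with base only; utf8Width and utf8Length are hypothetical helper names.

```haskell
module Main where

import Data.Char (ord)

-- Number of bytes one character occupies once encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Length of a string in UTF-8 bytes rather than in characters.
utf8Length :: String -> Int
utf8Length = sum . map utf8Width

main :: IO ()
main = do
  let sample = "Peinture métallisée 550 €"  -- line taken from the test
  -- character count next to UTF-8 byte count
  print (length sample, utf8Length sample)
```

With the utf8 flag, the pattern and the subject are both treated as UTF-8, so the match is cut at the right place before toString decodes it back to a String.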
Thanks for your comments :)