How to separate erroneously combined words in OCRed text?

I have the text of a long document that was OCRed by someone else, and it contains a lot of instances where the spacing wasn't recognized properly and two words are run together (ex: divisionbetween, hasalready, everyoneelse). Is there a relatively quick way using awk, sed, or the like to find strings that are not words and check whether they can be separated into legitimate words?

Or is there some other quick way to fix them? For instance, I notice that Chrome is able to flag the combined words as misspellings, and when you right-click, the suggested correction is pretty much always the one I want, but I don't know a quick way to just auto-fix them all (and there are thousands).

Thanks!

Matt, you're going to create some errors while fixing others when you try to do this with command-line tools, but if you have a dictionary of words then you could use GNU awk for patsplit(), with a multi-char RS in case any of your files have DOS line endings:

$ cat words
bar
disco
discontent
exchange
experts
foo
is
now
of
tent
winter

$ cat file
now is the freezing winter
of ExPeRtSeXcHaNgE discontent

$ cat tst.awk
BEGIN {
    RS = "\r?\n"
    minSubLgth = 2
    minWordLgth = minSubLgth * 2
}
NR==FNR {
    realWords[tolower($0)]
    next
}
{
    n = patsplit($0,words,"[[:alpha:]]{"minWordLgth",}",seps)
    printf "%s", seps[0]
    for (i=1; i<=n; i++) {
        word = words[i]
        lcword = tolower(word)
        if ( !(lcword in realWords) ) {
            found = 0
            for (j=length(lcword)-minSubLgth; j>=minSubLgth; j--) {
                head = substr(lcword,1,j)
                tail = substr(lcword,j+1)
                if ( (head in realWords) && (tail in realWords) ) {
                    found = 1
                    break
                }
            }
            word = (found ? "[[[" substr(word,1,j) " " substr(word,j+1) "]]]" : "<<<" word ">>>")
        }
        printf "%s%s", word, seps[i]
    }
    print ""
}

$ awk -f tst.awk words file
now is the <<<freezing>>> winter
of [[[ExPeRtS eXcHaNgE]]] discontent

The above identifies case-insensitive strings of letters that don't appear in the word list, then iteratively creates pairs of substrings from each such string and looks those substrings up in realWords[]. It'll be somewhat slow and approximate, and it only works when 2 words are combined rather than 3 or more, but maybe it'll be good enough. Think about the algorithm, as this may or may not be the best way to split the substrings (I didn't give it much thought); tweak it so it doesn't look at words with fewer than some number of letters (I used 4 above) and doesn't split into substrings with fewer than some other number of letters (I used 2 above); and you may or may not really want to highlight words that don't appear in realWords[] but that can't be split into ones that do (freezing above).
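For comparison, the same pairwise-split check is easy to express outside awk. Here's a minimal Python sketch of that algorithm (the names `load_words` and `try_split` are mine, not part of the awk answer), assuming a word list with one word per line:

```python
MIN_SUB = 2              # shortest substring we will accept as a word
MIN_WORD = MIN_SUB * 2   # shortest string worth trying to split

def load_words(lines):
    """Build a lowercase dictionary set from an iterable of words."""
    return {w.strip().lower() for w in lines if w.strip()}

def try_split(word, real_words):
    """Return (head, tail) if the word splits into two dictionary words,
    longest head tried first; return None if no split works."""
    lc = word.lower()
    for j in range(len(lc) - MIN_SUB, MIN_SUB - 1, -1):
        head, tail = lc[:j], lc[j:]
        if head in real_words and tail in real_words:
            # Preserve the original casing in the returned pieces.
            return word[:j], word[j:]
    return None

words = load_words(["bar", "disco", "discontent", "exchange",
                    "experts", "foo", "is", "now", "of", "tent", "winter"])
print(try_split("ExPeRtSeXcHaNgE", words))  # → ('ExPeRtS', 'eXcHaNgE')
print(try_split("freezing", words))         # → None
```

Like the awk version, it tries the longest head first and gives up on strings that can't be split into two known words.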

FWIW I downloaded a word list from https://github.com/dwyl/english-words/blob/master/words_alpha.txt (you may want to google for a better list, as this one seems to include some non-words, e.g. wasnll) and, using a version of the text from your question with some extra blanks removed, you can see some of the things it can catch, some it can't fix, and some it gets wrong:

$ cat file
I have the textof a long document that was OCRed by someoneelse that contains
a lot ofinstances where the spacingwasn't recognized properly and two words
are run together (ex: divisionbetween, hasalready, everyoneelse). Is there a
relatively quickway using awk, sed, or the like tofind strings that are not
words andcheck if they can separatedintolegitimate words?

Or is there someother quick way to fix them? Forinstance, Inotice that
Chrome is able toflag the combined words asmisspellings and when you right
click, thesuggested correction is pretty much always the oneIwant, but I
don't know a quickway to just auto-fix themall(and there are thousands).

$ awk -f tst.awk words_alpha.txt file
I have the [[[text of]]] a long document that was [[[OC Red]]] by [[[someone else]]] that contains
a lot [[[of instances]]] where the [[[spacing wasn]]]'t recognized properly and two words
are run together (ex: [[[division between]]], [[[has already]]], [[[everyone else]]]). Is there a
relatively [[[quick way]]] using awk, sed, or the like [[[to find]]] strings that are not
words [[[and check]]] if they can <<<separatedintolegitimate>>> words?

Or is there [[[some other]]] quick way to fix them? [[[For instance]]], [[[Ino tice]]] that
Chrome is able [[[to flag]]] the combined words [[[as misspellings]]] and when you right
click, [[[the suggested]]] correction is pretty much always the <<<oneIwant>>>, but I
don't know a [[[quick way]]] to just auto-fix [[[thema ll]]](and there are thousands).

FWIW that took about half a second to run under cygwin on my [underpowered] laptop.