Efficient shift invariant feature transform for text
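There are 3 continuous bit streams. At some point one starts reading them. After a while one stops, and now has 3 very long strings of the same length.
Each of these 3 strings should contain the sent message somewhere in the middle. Apart from the sent message, the strings consist of random bits.
The objective now is to find out how to overlay the 3 strings so that error correction can then be performed.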
hfkasjkfhjs<<this is a string><hjaksdf
jkdf::this is b strimg>>iowefjlasfjoie
jfaskflsjdflf<<this is a tring>>oweio
This is a simple example. What I want is this:
<<this is a string><
::this is b string>>
<<this is a tring>>
Now I can simply use majority voting and get the correct sequence:
<<this is a string>>
How can I achieve this efficiently?
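Once the three copies are overlaid, the voting step itself is simple. Below is a minimal Python sketch, assuming the copies are already aligned column by column and that a corrupted character occupies its slot instead of shifting the rest of the message. The name `majority_vote`, the `unknown` placeholder and the space standing in for the lost letter in the third sample are assumptions made for illustration.

```python
from collections import Counter
from itertools import zip_longest


def majority_vote(aligned, unknown="?"):
    """Column-wise majority vote over strings that are already aligned."""
    out = []
    for column in zip_longest(*aligned, fillvalue=None):
        # Ignore columns past the end of shorter strings.
        votes = Counter(ch for ch in column if ch is not None)
        ch, count = votes.most_common(1)[0]
        # Accept a character only if at least two copies agree on it.
        out.append(ch if count >= 2 else unknown)
    return "".join(out)


copies = [
    "<<this is a string><",
    "::this is b string>>",
    "<<this is a  tring>>",  # assumed: lost letter replaced by a space
]
print(majority_vote(copies))  # <<this is a string>>
```

The hard part of the question is therefore the alignment, not the vote.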
Exploratory programming in TXR Lisp:
Contents of fuzz-extract.tl:
;; Dilate the match mask: a position becomes X if it or a neighbor is X.
(defun fuzz (str)
  (window-map 1 " "
              (do if (memql #\X @rest)
                #\X #\space)
              str))

;; Try each cyclic shift of str2 against str1. At the first shift whose
;; fuzzed match mask has a run of thresh X's, return that region of str1,
;; widened by 2 on each side.
(defun correlate (str1 str2 thresh)
  (let ((len (length str1))
        (pat (mkstring thresh #\X)))
    (each ((offs (range* 0 len)))
      (let* ((str2-shf `@[str2 offs..:]@[str2 0..offs]`)
             (str2-dshf `@{str2-shf}@{str2-shf}`)
             (raw-diff (mapcar [iff eql (ret #\X) (ret #\space)]
                               str1 str2-dshf))
             (diff (fuzz raw-diff))
             (pos (search-str diff pat)))
        (if pos
          (let ((rng (+ (r^ #/X+/ pos diff) #R(-2 2))))
            (if (< (from rng) 0)
              (set rng 0..(to rng)))
            (return-from correlate [str1 rng])))))))

;; Count positions where lit-s equals big-s, starting at offset offs.
(defun count-same (big-s lit-s offs)
  (countq t [mapcar eql [big-s offs..:] lit-s]))

;; Offset into big-s at which lit-s matches the most characters.
(defun find-off (big-s lit-s)
  (let ((idx-count-pairs (collect-each ((i (range 0 (- (length big-s)
                                                       (length lit-s)))))
                           (list i (count-same big-s lit-s i)))))
    (first [find-max idx-count-pairs : second])))

;; Correlate each pair, align the extracted pieces against the first one,
;; then vote per character (space where all three differ).
(defun extract-from-three (str1 str2 str3 : (thresh 10))
  (let* ((ss1 (correlate str1 str2 thresh))
         (ss2 (correlate str2 str3 thresh))
         (ss3 (correlate str3 str1 thresh))
         (maxlen [[mapf max length length length] ss1 ss2 ss3])
         (pad (mkstring (trunc maxlen 2) #\space))
         (buf1 `@pad@ss1@pad`)
         (off1 (find-off buf1 ss2))
         (buf2 `@{"" off1}@ss2`)
         (off2 (find-off buf1 ss3))
         (buf3 `@{"" off2}@ss3`))
    (mapcar (do cond
              ((eql @1 @2) @1)
              ((eql @2 @3) @2)
              ((eql @3 @1) @3)
              (t #\space))
            buf1 buf2 buf3)))
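For readers who do not know TXR Lisp, the two search steps of the listing can be approximated in Python. This is a rough sketch of the idea as I read it, not a verified translation: `correlate` tries every cyclic shift of the second string, marks matching positions, widens each match by one position, and returns the first region of the first string with at least `thresh` marked positions in a row, padded by 2 on each side; `find_offset` slides a short string along a longer one and keeps the offset with the most equal characters. The sample inputs are made up (digits as noise, a message of distinct letters) so the results can be checked by hand.

```python
import re


def correlate(s1, s2, thresh):
    """Return the region of s1 that lines up with some cyclic shift of s2."""
    for offs in range(len(s1)):
        shifted = s2[offs:] + s2[:offs]
        raw = [a == b for a, b in zip(s1, shifted)]
        # Widen every match by one position so isolated errors
        # do not break up a long run.
        fuzzy = "".join(
            "X" if any(raw[max(i - 1, 0):i + 2]) else " "
            for i in range(len(raw))
        )
        run = re.search("X{%d,}" % thresh, fuzzy)
        if run:
            start = max(run.start() - 2, 0)
            return s1[start:run.end() + 2]
    return None  # no shift produced a long enough run


def count_same(big, small, offset):
    """Number of positions where small equals big starting at offset."""
    return sum(a == b for a, b in zip(big[offset:], small))


def find_offset(big, small):
    """Offset into big at which small matches the most characters."""
    last = max(len(big) - len(small), 0)
    return max(range(last + 1), key=lambda i: count_same(big, small, i))


s1 = "111abcdefghijklmnop11111"
s2 = "2222222abcdefghijklmnop2"
print(correlate(s1, s2, 10))                # 111abcdefghijklmnop111
print(find_offset(s1, "abcdeXghijklmnop"))  # 3
```

Both functions scan every shift, so each call costs time proportional to the product of the string lengths. For very long streams a cross-correlation computed with an FFT would be a natural next step, but that goes beyond what the listing does.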
Interactive session:
$ txr -i fuzz-extract.tl
1> (extract-from-three
"hfkasjkfhjs<<this is a string><hjaksdf"
"jkdf::this is b strimg>>iowefjlasfjoie"
"jfaskflsjdflf<<this is a tring>>oweio")
" f<<this is a string>> "
2> (trim-str *1)
"f<<this is a string>>"