How to find similar unicode text in a unicode string?

Question

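I have a large string (the haystack) and a needle. I want to find the text in the string that is closest to the needle. However, both the string and the needle are Unicode (Bengali). I have found some solutions, but they only work for English; I could not find one that works for Unicode (Bengali) text. Please see the following Romanian examples to better understand my problem.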

Source: "Cei bătrâni fac o băutură toxică pentru regina joviană".

Needle: "băutură pentru toxică "

Output: "băutură toxică pentru"

Source: "Cei bătrâni fac o băutură toxică pentru regina joviană".

Needle: "bătra pak o băuturărinan"

Output: "bătrâni fac o băutură"

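I found that I could do this with a similarity measure such as cosine or Manhattan similarity. However, I think implementing such an algorithm would be difficult. Could you suggest any PHP library functions that handle Unicode characters? TIA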

Answer 1

I think the fastest way is the SphinxSearch engine:

http://sphinxsearch.com/

It has a MySQL-like client. You can run a query like this:

mysql> SELECT * FROM test WHERE MATCH('băutură pentru toxică');

The output is a list of documents sorted by best match.

==============================================================

Or try building a word index table in PHP (this is a very simple algorithm that must be optimized for your needs):

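function near( $src, $needle) {
  // Index every source word by the sha1 of its lowercase form.
  // A repeated word overwrites the earlier entry, so only its last position is kept.
  $hashIndexes = [];
  $words = mb_split(' ', $src);
  foreach( $words as $k => $w ) {
    $w = mb_strtolower( $w, 'utf-8');
    $hashIndexes [sha1( $w )] = [ 'key' => $k, 'word' => $w ];
  }
  // Look up each needle word and collect the source positions of exact matches.
  $nWords = mb_split(' ',  mb_strtolower( $needle, 'utf-8'));
  $matches = [];
  foreach( $nWords as $k => $w ) {
    $hash = sha1( $w );
    if( isset( $hashIndexes [ $hash ]) && $w === $hashIndexes [ $hash ] ['word']) {
      $matches [] = $hashIndexes [ $hash ] ['key'];
    }
  }
  if( ! empty( $matches )) {
    // Return the source words from the first matched position to the last one.
    sort( $matches );
    $start = $matches [0];
    $last = end( $matches );
    $result = array_slice( $words, $start, $last - $start + 1);
    return implode( ' ', $result );
  } else {
    // No needle word occurs in the source.
    return '';
  }
}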

$src = "Cei bătrâni fac o băutură some other toxică pentru regina joviană";
$needle ="băutură pentru another toxică";

echo near( $src, $needle)  . "\n"; // prints: băutură some other toxică pentru

==============================================================

Optimizing this is a huge job (think Google, he he).

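You must strip symbols such as . , ... ? from the $words and $nWords arrays.
$hashIndexes [sha1( $w )] must be an array (because different words could share the same sha1 hash).
$hashIndexes [sha1( $w )] ['key'] must also be an array, holding the positions of all equal words in the text.
You must develop an algorithm that decides which of the ['key'] positions give the closest match.
And so on. This is a very hard task for anyone. Good luck!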
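
The points above could be sketched roughly as follows. This is only an illustration under a few assumptions, not a finished solution: the function names tokenize and closestSpan are made up here, words are used directly as array keys instead of sha1 hashes (which sidesteps the collision concern), and "closest" is assumed to mean the shortest run of source words that contains every needle word found in the source. It still matches whole words exactly, so a misspelled needle word is simply ignored.

```php
<?php
// Sketch only: tokenize() and closestSpan() are hypothetical helper names.

// Split text into words, dropping punctuation such as . , ... ?
// \p{L} = letters, \p{M} = combining marks (e.g. Bengali vowel signs), \p{N} = digits.
function tokenize(string $text): array {
    $parts = preg_split('/[^\p{L}\p{M}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

function closestSpan(string $src, string $needle): string {
    $words = tokenize($src);

    // Index: lowercase word => list of ALL positions where it occurs.
    $index = [];
    foreach ($words as $pos => $w) {
        $index[mb_strtolower($w, 'UTF-8')][] = $pos;
    }

    // Distinct needle words that actually occur in the source.
    $wanted = [];
    foreach (tokenize($needle) as $w) {
        $w = mb_strtolower($w, 'UTF-8');
        if (isset($index[$w])) {
            $wanted[$w] = true;
        }
    }
    if (empty($wanted)) {
        return '';
    }

    // Collect every occurrence of a wanted word: position => word.
    $hits = [];
    foreach (array_keys($wanted) as $w) {
        foreach ($index[$w] as $pos) {
            $hits[$pos] = $w;
        }
    }
    ksort($hits);
    $positions = array_keys($hits);

    // Sliding window over the hits: find the shortest span covering all wanted words.
    $need = count($wanted);
    $have = [];
    $covered = 0;
    $left = 0;
    $best = null;
    foreach ($positions as $pos) {
        $w = $hits[$pos];
        $have[$w] = ($have[$w] ?? 0) + 1;
        if ($have[$w] === 1) {
            $covered++;
        }
        while ($covered === $need) {
            $start = $positions[$left];
            if ($best === null || $pos - $start < $best[1] - $best[0]) {
                $best = [$start, $pos];
            }
            // Shrink the window from the left.
            $lw = $hits[$start];
            if (--$have[$lw] === 0) {
                $covered--;
            }
            $left++;
        }
    }

    return implode(' ', array_slice($words, $best[0], $best[1] - $best[0] + 1));
}

echo closestSpan(
    'Cei bătrâni fac o băutură toxică pentru regina joviană.',
    'băutură pentru toxică '
) . "\n";
```

Because the returned span is rebuilt from the tokenized words, punctuation from the source does not appear in the output. Handling near-miss spellings such as the second example in the question would need an additional fuzzy comparison per word, which this sketch does not attempt.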

I really recommend installing SphinxSearch or a similar text search engine.


php

string

unicode

similarity