使用 Levenshtein 距离重新排列单词
Rearrange words using Levenshtein distance
总结
我正在尝试在 php 中查找名称匹配百分比,但在此之前我需要根据第一个字符串重新排列字符串中的单词。
源代码是关于什么的?
我有两个字符串。首先,如果在字符串中找到 space,我将两个字符串都添加到数组中,然后将其添加到数组中。
$arraydataBaseName 和 $arraybankData 来自我的第一个数组,即 $arraydataBaseName 我正在搜索 $arraybankData 的所有值并获取密钥。我得到了正确的键排列,但无法将它们特定位置的值排列到新数组中。
$dataBaseName = "Jardine Lloyd Thompson";
$bankdata = "Thompson Thompson Jardine";
$replacedataBaseName = preg_replace("#[\s]+#", " ", $dataBaseName);
$replacebankData = preg_replace("#[\s]+#", " ", $bankdata);
$arraydataBaseName = explode(" ",$replacedataBaseName);
$arraybankData = explode(" ",$replacebankData);
echo "<br/>";
print_r($arraydataBaseName);
$a="";
$i="";
$arraysize = count($arraydataBaseName);
$push=array();
for($i=0;$i< $arraysize;$i++)
{
if(array_search($arraybankData[$i],$arraydataBaseName)>0)
{
${"$a$i"} = array_search($arraybankData[$i],$arraydataBaseName);
//echo ${"$a$i"};
array_push($push,${"$a$i"});
}
}
print_r($push);
案例 1:
输入
数据库名称 = Jardine Lloyd Thompson
银行名称 = Thompson Jardine Lloyd
输出
ExpectedOutput = Jardine Lloyd Thompson
案例二:##
输入
数据库名称 = Jardine Lloyd Thompson
银行名称 = Thoapson Jordine Llayd
如果在上面的 DatabaseName 中找不到单词,那么预期的搜索将基于 leventish 算法单词,该单词的距离更小,将被视为关键
输出
ExpectedOutput = Jordine Llayd Thoapson
Description of Problem
问题更新
当用户输入 $bankdata
包含更多无法匹配的单词时,我需要将它们附加到末尾。
我已经分解了案例 1 和案例 2 中的代码。
但很明显,如果 var_export 为假,您将使用相同的变量执行案例 2 代码。
//Case 1:
$DatabaseName = "Jardine Lloyd Thompson";
$BankName = "Thompson Jardine Lloyd";
//Split and sort them
$data = explode(" ", $DatabaseName);
$bank = explode(" ", $BankName);
sort($data);
sort($bank);
Var_export(($data == $bank)); //true
//Case 2
$DatabaseName = "Jardine Lloyd Thompson";
$BankName = "Thoapson Jordine Llayd";
//Split and sort
$data = explode(" ", $DatabaseName);
$bank = explode(" ", $BankName);
sort($data);
sort($bank);
// Loop and accumulate the levenshtein return
$lev = 0;
foreach($data as $key => $name){
$lev += levenshtein($name, $bank[$key]);
}
echo PHP_EOL . $lev; // 3 letters "off"
情况 1 和情况 2 在同一代码中的示例。
$DatabaseName = "Jardine Lloyd Thompson";
$BankName = "Thoapson Jordine Llayd";
$data = explode(" ", $DatabaseName);
$bank = explode(" ", $BankName);
sort($data);
sort($bank);
if($data == $bank){
echo "true";
exit;
// No need to do levenshtein
}
$lev = 0;
foreach($data as $key => $name){
$lev += levenshtein($name, $bank[$key]);
}
echo PHP_EOL . $lev;
这是一个简单的版本,随后逐字查找最佳匹配。
declare (strict_types=1);
$dataBaseName = 'Jardine Lloyd Thompson';
$bankdataRows =
[
'Thompson Jardine Lloyd',
'Blaaa Llayd Thoapson f***ing user input Jordine aso. ',
];
// assume the "database" is already stored trimmed since it is server-side controlled
$dbWords = preg_split("#[\s]+#", $dataBaseName);
foreach ($bankdataRows as $bankdata)
{
// here we trim the data received from client-side.
$bankWords = preg_split("#[\s]+#", trim($bankdata));
$result = [];
if(!empty($bankWords))
foreach ($dbWords as $dbWord)
{
$idx = null;
$least = PHP_INT_MAX;
foreach ($bankWords as $k => $bankWord)
if (($lv = levenshtein($bankWord, $dbWord)) < $least)
{
$least = $lv;
$idx = $k;
}
$result[] = $bankWords[$idx];
unset($bankWords[$idx]);
}
$result = array_merge($result, $bankWords);
var_dump($result);
}
结果
array(3) {
[0] =>
string(7) "Jardine"
[1] =>
string(5) "Lloyd"
[2] =>
string(8) "Thompson"
}
array(8) {
[0] =>
string(7) "Jordine"
[1] =>
string(5) "Llayd"
[2] =>
string(8) "Thoapson"
[3] =>
string(5) "Blaaa"
[4] =>
string(7) "f***ing"
[5] =>
string(4) "user"
[6] =>
string(5) "input"
[7] =>
string(4) "aso."
}
您可能希望扩展此方法,首先计算每个可能组合的 Levenshtein 距离,然后 select 最佳的整个匹配。
总结
我正在尝试在 php 中查找名称匹配百分比,但在此之前我需要根据第一个字符串重新排列字符串中的单词。
源代码是关于什么的?
我有两个字符串。首先,如果在字符串中找到 space,我将两个字符串都添加到数组中,然后将其添加到数组中。 $arraydataBaseName 和 $arraybankData 来自我的第一个数组,即 $arraydataBaseName 我正在搜索 $arraybankData 的所有值并获取密钥。我得到了正确的键排列,但无法将它们特定位置的值排列到新数组中。
$dataBaseName = "Jardine Lloyd Thompson";
$bankdata = "Thompson Thompson Jardine";
$replacedataBaseName = preg_replace("#[\s]+#", " ", $dataBaseName);
$replacebankData = preg_replace("#[\s]+#", " ", $bankdata);
$arraydataBaseName = explode(" ",$replacedataBaseName);
$arraybankData = explode(" ",$replacebankData);
echo "<br/>";
print_r($arraydataBaseName);
$a="";
$i="";
$arraysize = count($arraydataBaseName);
$push=array();
for($i=0;$i< $arraysize;$i++)
{
if(array_search($arraybankData[$i],$arraydataBaseName)>0)
{
${"$a$i"} = array_search($arraybankData[$i],$arraydataBaseName);
//echo ${"$a$i"};
array_push($push,${"$a$i"});
}
}
print_r($push);
案例 1:
输入
数据库名称 = Jardine Lloyd Thompson
银行名称 = Thompson Jardine Lloyd
输出
ExpectedOutput = Jardine Lloyd Thompson
案例二:##
输入
数据库名称 = Jardine Lloyd Thompson
银行名称 = Thoapson Jordine Llayd
如果在上面的 DatabaseName 中找不到单词,那么预期的搜索将基于 leventish 算法单词,该单词的距离更小,将被视为关键
输出
ExpectedOutput = Jordine Llayd Thoapson
Description of Problem
问题更新
当用户输入 $bankdata
包含更多无法匹配的单词时,我需要将它们附加到末尾。
我已经分解了案例 1 和案例 2 中的代码。
但很明显,如果 var_export 为假,您将使用相同的变量执行案例 2 代码。
//Case 1:
$DatabaseName = "Jardine Lloyd Thompson";
$BankName = "Thompson Jardine Lloyd";
//Split and sort them
$data = explode(" ", $DatabaseName);
$bank = explode(" ", $BankName);
sort($data);
sort($bank);
Var_export(($data == $bank)); //true
//Case 2
$DatabaseName = "Jardine Lloyd Thompson";
$BankName = "Thoapson Jordine Llayd";
//Split and sort
$data = explode(" ", $DatabaseName);
$bank = explode(" ", $BankName);
sort($data);
sort($bank);
// Loop and accumulate the levenshtein return
$lev = 0;
foreach($data as $key => $name){
$lev += levenshtein($name, $bank[$key]);
}
echo PHP_EOL . $lev; // 3 letters "off"
情况 1 和情况 2 在同一代码中的示例。
$DatabaseName = "Jardine Lloyd Thompson";
$BankName = "Thoapson Jordine Llayd";
$data = explode(" ", $DatabaseName);
$bank = explode(" ", $BankName);
sort($data);
sort($bank);
if($data == $bank){
echo "true";
exit;
// No need to do levenshtein
}
$lev = 0;
foreach($data as $key => $name){
$lev += levenshtein($name, $bank[$key]);
}
echo PHP_EOL . $lev;
这是一个简单的版本,随后逐字查找最佳匹配。
declare (strict_types=1);
$dataBaseName = 'Jardine Lloyd Thompson';
$bankdataRows =
[
'Thompson Jardine Lloyd',
'Blaaa Llayd Thoapson f***ing user input Jordine aso. ',
];
// assume the "database" is already stored trimmed since it is server-side controlled
$dbWords = preg_split("#[\s]+#", $dataBaseName);
foreach ($bankdataRows as $bankdata)
{
// here we trim the data received from client-side.
$bankWords = preg_split("#[\s]+#", trim($bankdata));
$result = [];
if(!empty($bankWords))
foreach ($dbWords as $dbWord)
{
$idx = null;
$least = PHP_INT_MAX;
foreach ($bankWords as $k => $bankWord)
if (($lv = levenshtein($bankWord, $dbWord)) < $least)
{
$least = $lv;
$idx = $k;
}
$result[] = $bankWords[$idx];
unset($bankWords[$idx]);
}
$result = array_merge($result, $bankWords);
var_dump($result);
}
结果
array(3) {
[0] =>
string(7) "Jardine"
[1] =>
string(5) "Lloyd"
[2] =>
string(8) "Thompson"
}
array(8) {
[0] =>
string(7) "Jordine"
[1] =>
string(5) "Llayd"
[2] =>
string(8) "Thoapson"
[3] =>
string(5) "Blaaa"
[4] =>
string(7) "f***ing"
[5] =>
string(4) "user"
[6] =>
string(5) "input"
[7] =>
string(4) "aso."
}
您可能希望扩展此方法,首先计算每个可能组合的 Levenshtein 距离,然后 select 最佳的整个匹配。