如何从字符串数组中以任何顺序匹配和突出显示所有术语?
How to match and highlight all terms in any order from an array of strings?
要求是:
- 从数组中查找包含 ALL ANY[ 中的术语的字符串(从这里开始称为 options) =100=]订单
- 正确突出显示匹配项 - 即。在每个匹配项前后插入一个字符串 - 我在这里使用
<u>
和 </u>
- 搜索查询和选项都可以是“任何”
为了简单起见,答案可以专注于通过仅包含 ASCII 字符的列表进行不区分大小写的搜索,并假设术语分隔符是普通的 space - 即。输入为“Foo bar baz”的查询意味着搜索词是 ['foo', 'bar', 'baz']
.
澄清一下:
- 所有术语必须在匹配选项中彼此分开存在 - 即。较短的术语不应仅作为较长术语的子串存在,并且两个术语不应重叠
- 选项中重复搜索词的出现次数必须至少与查询中出现的次数相同
最终的应用程序(也许并不奇怪)是一种自动完成功能。
TL;DR
Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)
jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlight
At this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.
There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer" it does outperform the others consistently in any realistic scenario I try though.
Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.
Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.
Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.
病毒排列
理论上,这可以通过“手动”构建一个 RegExp 来解决,该正则表达式包含由捕获组分隔的搜索词的所有可能排列,以捕获词之间的任何内容 - 搜索“foo bar”会导致 (foo)(.*?)(bar)|(bar)(.*?)(foo)
.
然后使用 string.replace()
一次性完成突出显示。如果字符串有任何变化,我们就会匹配。
演示:
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
meta_elm = document.getElementById('viral_permutations_meta');
res_elm = document.getElementById('viral_permutations_result');
res_elm.innerHTML = meta_elm.innerHTML = '';
t0 = performance.now();
//Split query in terms at delimiter and lowercase them
terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
return n;
}).map(function(term){
return term.toLowerCase();
});
meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
//Escape terms
terms_esc = terms.map(function(term) {
return term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&");
});
//Wrap terms in in individual capturing groups
terms_esc = terms.map(function(term) {
return '(' + term + ')';
});
//Find all permutations
permuted = permutate_array(terms_esc);
//Construct a group for each permutation
match_groups = [];
for (var i in permuted) {
match_groups.push(permuted[i].join('(.*?)'));
}
try {
//Construct the final regex
regex_string = match_groups.join('|');
//Display it
document.getElementById('viral_permutations_regex').innerHTML = regex_string;
meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
regex = new RegExp(regex_string, 'i');
//Loop through each option
for (i = 0; i < options.length; i++) {
option = options[i];
//Replace the terms with highlighted terms
highlighted = option.replace(regex, viral_permutations_replace);
//If anything was changed (or the query is empty) we have a match
if (highlighted !== option || terms.length === 0) {
//Append it to the result list
li = document.createElement('li');
li.innerHTML = highlighted;
res_elm.appendChild(li);
}
}
//Print some meta
t1 = performance.now();
meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
} catch(e) {
meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
}
}
//The replacement function
function viral_permutations_replace() {
var i, m, j, retval, m_cased, unmatched;
retval = '';
//Make a clone of the terms array (that we can modify without destroying the original)
unmatched = terms.slice(0);
//Loop arguments between the first (which is the full match) and
//the last 2 (which are the offset and the full option)
for (i = 1; i < arguments.length - 1; i++) {
m = arguments[i];
//Check that we have a string - most of the arguments will be undefined
if (typeof m !== 'string') continue;
//Lowercase the match
m_cased = m.toLowerCase();
//Append it to the return value - highlighted if it is among our terms
j = unmatched.indexOf(m_cased);
if (j >= 0) {
//Remove it from the unmatched terms array
unmatched.splice(j, 1);
retval += '<u>' + m + '</u>';
} else {
retval += m;
}
}
return retval;
}
//Helper function to return all possible permutations of an array
function permutate_array(arr) {
var perm, perm_intern;
perm_intern = function(perm, pre, post, n) {
var elem, i, j, ref, rest;
if (n > 0) {
for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
rest = post.slice(0);
elem = rest.splice(i, 1);
perm_intern(perm, pre.concat(elem), rest, n - 1);
}
} else {
perm.push(pre);
}
};
perm = [];
perm_intern(perm, [], arr, arr.length);
return perm;
}
viral_permutations();
<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>
感谢 trincot 指出我的原始版本偶尔会两次突出显示重复出现的术语 - 此代码段已修复。
失败因为:
- 随着术语的添加,正则表达式呈指数级增长。在 7 个术语(甚至是单个字母)时,它超过了 250kb,我的浏览器放弃
Error: regexp too big
...
其他一些没有奏效的值得注意的策略:
正在捕获每组中包含所有术语的组:
(foo|bar)(.*)(foo|bar)
失败因为:
- 将匹配包含重复项的选项 - fx。
The food in the food court
会匹配,但显然不应该匹配。
- 如果我们“仔细检查”所有条款,事实上,发现它将无法找到有效的匹配 - fx。
The food in the food bar
会找到 foo
两次,永远不会找到 bar
。
否定前瞻和反向引用:
(foo|bar|baz)(.*?)((?!)(?:foo|bar|baz))(.*?)((?!|)(?:foo|bar|baz))
失败因为:
- 只要在查询中重复术语,它就会达到不可能的条件,例如“给我找一个
foo
、bar
或 bar
,这不是 foo
也不是bar
".
- 我相当确定它还有其他问题,但当我意识到它在逻辑上有缺陷时我就停止了追求它。
正面前瞻
(?=.*foo)(?=.*bar)(?=.*baz)
失败因为:
- 很难(如果不是不可能)可靠地突出显示找到的匹配项。
- 我还没有找到任何方法来确保所有条款确实存在 - 即。它们可能都单独出现在选项中,但较短的术语可能仅作为较长术语的子串存在 - 或者术语可能重叠。
我试了一下,但我不确定这会有多大帮助。我的方法类似于您的分而治之技术。
我没有咬掉字符串的位,而是提前搜索每个术语并存储所有匹配项的集合,记录开始和结束位置。如果特定搜索词没有足够的匹配项,算法会立即放弃 'option'.
一旦它收集了所有可能的匹配项,它就会递归地尝试找到不重叠的组合。在那个递归中有很多数据结构的复制,我怀疑它可以比我在这里优化得更好。我也只能为一些变量名道歉,我一直在努力想出一个完全有意义的名字。
对于某些测试搜索,例如 a n a n a n a n ...
,它似乎比原来的分而治之技术更能应对,但我怀疑这可能只是因为在匹配不足时执行的早期救助找到特定搜索词。没有大量的真实数据,很难知道真正有价值的优化在哪里。
function search() {
var options = [
'ababababa',
'United States',
'United Kingdom',
'Afghanistan',
'Aland Islands',
'Albania',
'Algeria',
'American Samoa',
'Andorra',
'Angola',
'Anguilla',
'Antarctica',
'Antigua and Barbuda',
'Argentina',
'Armenia',
'Aruba',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bermuda',
'Bhutan',
'Bolivia, Plurinational State of',
'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina',
'Botswana',
'Bouvet Island',
'Brazil',
'British Indian Ocean Territory',
'Brunei Darussalam',
'Bulgaria',
'Burkina Faso',
'Burundi',
'Cambodia',
'Cameroon',
'Canada',
'Cape Verde',
'Cayman Islands',
'Central African Republic',
'Chad',
'Chile',
'China',
'Christmas Island',
'Cocos (Keeling) Islands',
'Colombia',
'Comoros',
'Congo',
'Congo, The Democratic Republic of The',
'Cook Islands',
'Costa Rica',
'Cote D\'ivoire',
'Croatia',
'Cuba',
'Curacao',
'Cyprus',
'Czech Republic',
'Denmark',
'Djibouti',
'Dominica',
'Dominican Republic',
'Ecuador',
'Egypt',
'El Salvador',
'Equatorial Guinea',
'Eritrea',
'Estonia',
'Ethiopia',
'Falkland Islands (Malvinas)',
'Faroe Islands',
'Fiji',
'Finland',
'France',
'French Guiana',
'French Polynesia',
'French Southern Territories',
'Gabon',
'Gambia',
'Georgia',
'Germany',
'Ghana',
'Gibraltar',
'Greece',
'Greenland',
'Grenada',
'Guadeloupe',
'Guam',
'Guatemala',
'Guernsey',
'Guinea',
'Guinea-bissau',
'Guyana',
'Haiti',
'Heard Island and Mcdonald Islands',
'Holy See (Vatican City State)',
'Honduras',
'Hong Kong',
'Hungary',
'Iceland',
'India',
'Indonesia',
'Iran, Islamic Republic of',
'Iraq',
'Ireland',
'Isle of Man',
'Israel',
'Italy',
'Jamaica',
'Japan',
'Jersey',
'Jordan',
'Kazakhstan',
'Kenya',
'Kiribati',
'Korea, Democratic People\'s Republic of',
'Korea, Republic of',
'Kuwait',
'Kyrgyzstan',
'Lao People\'s Democratic Republic',
'Latvia',
'Lebanon',
'Lesotho',
'Liberia',
'Libya',
'Liechtenstein',
'Lithuania',
'Luxembourg',
'Macao',
'Macedonia, The Former Yugoslav Republic of',
'Madagascar',
'Malawi',
'Malaysia',
'Maldives',
'Mali',
'Malta',
'Marshall Islands',
'Martinique',
'Mauritania',
'Mauritius',
'Mayotte',
'Mexico',
'Micronesia, Federated States of',
'Moldova, Republic of',
'Monaco',
'Mongolia',
'Montenegro',
'Montserrat',
'Morocco',
'Mozambique',
'Myanmar',
'Namibia',
'Nauru',
'Nepal',
'Netherlands',
'New Caledonia',
'New Zealand',
'Nicaragua',
'Niger',
'Nigeria',
'Niue',
'Norfolk Island',
'Northern Mariana Islands',
'Norway',
'Oman',
'Pakistan',
'Palau',
'Palestinian Territory, Occupied',
'Panama',
'Papua New Guinea',
'Paraguay',
'Peru',
'Philippines',
'Pitcairn',
'Poland',
'Portugal',
'Puerto Rico',
'Qatar',
'Reunion',
'Romania',
'Russian Federation',
'Rwanda',
'Saint Barthelemy',
'Saint Helena, Ascension and Tristan da Cunha',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Martin (French part)',
'Saint Pierre and Miquelon',
'Saint Vincent and The Grenadines',
'Samoa',
'San Marino',
'Sao Tome and Principe',
'Saudi Arabia',
'Senegal',
'Serbia',
'Seychelles',
'Sierra Leone',
'Singapore',
'Sint Maarten (Dutch part)',
'Slovakia',
'Slovenia',
'Solomon Islands',
'Somalia',
'South Africa',
'South Georgia and The South Sandwich Islands',
'South Sudan',
'Spain',
'Sri Lanka',
'Sudan',
'Suriname',
'Svalbard and Jan Mayen',
'Swaziland',
'Sweden',
'Switzerland',
'Syrian Arab Republic',
'Taiwan, Province of China',
'Tajikistan',
'Tanzania, United Republic of',
'Thailand',
'Timor-leste',
'Togo',
'Tokelau',
'Tonga',
'Trinidad and Tobago',
'Tunisia',
'Turkey',
'Turkmenistan',
'Turks and Caicos Islands',
'Tuvalu',
'Uganda',
'Ukraine',
'United Arab Emirates',
'United Kingdom',
'United States',
'United States Minor Outlying Islands',
'Uruguay',
'Uzbekistan',
'Vanuatu',
'Venezuela, Bolivarian Republic of',
'Viet Nam',
'Virgin Islands, British',
'Virgin Islands, U.S.',
'Wallis and Futuna',
'Western Sahara',
'Yemen',
'Zambia',
'Zimbabwe'
];
var terms = document.getElementById('search').value.trim().toLowerCase().split(/\s+/);
if (!terms[0]) {
terms = [];
}
document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);
var startTime = performance.now();
// Term counts is a map storing how many times each search term appears in the query
var termCounts = {};
terms.forEach(function(term) {
termCounts[term] = (termCounts[term] || 0) + 1;
});
// An array of search terms with the duplicates removed
var uniqueTerms = Object.keys(termCounts);
// Loop through each option and map to either a highlight version or null
options = options.map(function(optionText) {
var matches = {},
lastMatchIndex = {},
option = optionText.toLowerCase();
uniqueTerms.forEach(function(term) {
// This array will be populated with start/end position of each match for this term
matches[term] = [];
// The index of the last match... which could be deduced from the matches but this is slightly easier
lastMatchIndex[term] = -1;
});
var incompleteMatchTerms = uniqueTerms.slice(),
nextMatchTerm;
// This is probably a premature optimization but doing it this
// way ensures we check that each search term occurs at least
// once as quickly as possible.
while (nextMatchTerm = incompleteMatchTerms.shift()) {
var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);
if (nextMatchIndex === -1) {
// Haven't found enough matches for this term, so the option doesn't match
if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
return null;
}
}
else {
// Found another match, put the term back on the queue
// for another round
incompleteMatchTerms.push(nextMatchTerm);
lastMatchIndex[nextMatchTerm] = nextMatchIndex;
matches[nextMatchTerm].push({
start: nextMatchIndex,
end: nextMatchIndex + nextMatchTerm.length
});
}
}
// Pass in the original array of terms... we attempt to highlight in the order of the original query
var highlights = performHighlight(terms, matches);
if (!highlights) {
return null;
}
// We need the highlights sorted so that we do the replacing from the end of the string
highlights.sort(function(h1, h2) {
return h2.start - h1.start;
});
highlights.forEach(function(highlight) {
optionText = optionText.slice(0, highlight.start)
+ '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
+ optionText.slice(highlight.end);
});
return optionText;
function performHighlight(terms, allMatches) {
// If there are no terms left to match we've got a hit
if (terms.length === 0) {
return [];
}
var nextTerms = terms.slice(),
term = nextTerms.shift(),
matches = allMatches[term].slice(),
match;
while (match = matches.shift()) {
var nextMatches = {};
// We need to purge any entries from nextMatches that overlap the current match
uniqueTerms.forEach(function(nextTerm) {
var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];
nextMatches[nextTerm] = nextMatch.filter(function(match2) {
return match.start >= match2.end || match.end <= match2.start;
});
});
var highlights = performHighlight(nextTerms, nextMatches);
if (highlights) {
highlights.push(match);
return highlights;
}
}
return null;
}
});
document.getElementById('results').innerHTML = options.map(function(option) {
if (option) {
return '<li>' + option + '</li>';
}
return '';
}).join('');
document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}
<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>
更新:
根据 Mikk3lRo 在评论中的反馈,我做了一些性能调整并提出了这个:
https://jsfiddle.net/skirtle/ndeuqn02/1/
核心算法是一样的,但为了提高性能,我让它变得更难理解了。大多数更改与尽可能避免创建新对象有关。
由于算法会预先搜索很多它可能永远不需要的东西,所以其他算法总是有机会更快,特别是在简单的情况下。其中许多情况可以单独处理,但我还没有尝试过那种优化。
在 Chrome 中,它现在在许多不同的场景中都优于其他实现,尽管这是一个不公平的比较,因为它们尚未以相同的方式进行调整。对于简单搜索,其他实现在 Firefox 中的速度往往稍快一些,但时间都在同一个范围内。
一些特别有趣的搜索:
a ab ba baba
。我添加了一个新的 'option' 并调整了 CSS 来演示这一点。这些算法的不同之处在于它们选择的执行突出显示的方式。我的算法倾向于查询中术语的顺序,而不是基于术语的长度。如果我不担心顺序,还有进一步的优化可用,但我认为它们只会在极端的重叠情况下有所帮助。
t r i s t a n d a c u n h a
。请注意字母之间的空格,这些是 14 个单独的搜索词。如果您一次键入一个术语,分而治之 将很快开始挣扎,但它确实会在最后恢复。 Wipe 和 Shadow 可以应付更长的时间,但是当您输入字母 c
时,它们会从悬崖上掉下来。我认为这是回溯的指数爆炸,但我还没有证实这一点。我确信通过一些工作可以在简单的情况下解决它,但在回溯是由无法解决的重叠引起的情况下修复可能会更棘手。
我相信所有的实现都可以通过更多的调整和一些精心设计的特殊情况处理来加速。对于真实场景,哪个实际上是 'best' 我不确定,但我目前的感觉是我的算法可能只有一个狭窄的最佳点,在真正公平的测试中它会优于其他算法。一种不执行所有预先搜索的算法似乎很难被真正的搜索击败。
更新 2
我已经尝试了我之前方法的不同实现:
https://jsfiddle.net/skirtle/ndeuqn02/9/
请注意,我只更新了我自己的实现,其他的仍然是过时的。
我想我会尽量避免不必要的搜索,方法是懒惰地执行它们,而不是预先全部执行。它仍然缓存它们,以便在算法需要回溯时可以重用它们。我不知道这是否会产生显着差异,因为对短字符串执行少量额外搜索可能不会增加太多开销。
我也尝试过去掉函数递归。虽然它看起来确实有效,但我觉得存在错误的风险很高(需要大量单元测试才能确保它确实有效)。我不相信这部分真的很成功,因为涉及的数据结构让它很难理解。它看起来确实很快,但还不足以证明其复杂性。
我还尝试了其他方法来构建最终的亮点。所有这些排序和切片似乎都在消耗性能,但同样,为了避免这种情况,代码会变得更加复杂。不过,其中一些收益可能适用于其他算法。
我在这里介绍的另一个想法是查询词的预搜索分析(仅依赖于查询而不依赖于选项)。它检查术语是否可以重叠,对于不可能重叠的任何术语(例如 cat dog
),它使用更简单的算法来获取匹配项。这个想法也可能应用于其他算法。
如评论中所述,运行 对选项进行某种预搜索分析的想法也是可能的,但我并未在此处实际实施。很难知道哪种搜索索引最有益,因为它取决于内存使用情况和选项的具体情况。但是,尝试将少量信息从一次搜索转移到下一次搜索可能更实用。
例如,如果有人搜索 united states
,那么他们最后输入的内容很可能是最后的 s
,而他们之前的搜索是 united state
。基于此的两个潜在优化是:
united states
的匹配选项将是 united state
的匹配选项的子集,因此如果我们记住该列表,就可以避免检查所有其他选项。这可以用于任何算法。
- 就我的算法而言,匹配缓存可以从一次搜索保留到下一次搜索。虽然
state
的缓存条目不会被任何使用,但 united
的条目从一次搜索到下一次搜索将完全相同,从而允许我们跳过该术语的算法中昂贵的部分。
分而治之
比单正则表达式 病毒排列 策略稍微复杂一些 - 这种递归算法一个接一个地搜索每个术语,从最长的术语开始。
每次找到匹配项时,它会将 "bite" 一分为三(除非在开头或结尾),将匹配的 "bite" 标记为已消耗,并尝试匹配下一个最长的术语在任何未消耗的 "bites".
当找不到最长的未匹配项时,它会回溯并尝试在不同的位置(甚至在不同的"bite")匹配前一个项。
如果它回到最长的任期并且找不到另一个匹配它的位置,它将return false。
这意味着在大多数情况下它可以很快 return 不匹配,仅仅是因为它们甚至不包含最长的术语。
当然,如果它用完了 - 即。成功匹配最短的 - 它会通过将所有 "bites" 重新组合在一起来 return 突出显示的匹配项。
演示:
已更新以提高性能:基本算法完全相同,但可以完全避免对 arr.slice()
的一些非常昂贵的调用。
function divide_and_conquer_replace(query, options, separator) {
var terms, terms_esc;
//The inner replacement function
function divide_and_conquer_inner(bites, depth) {
var this_term, i, bite, match, new_bites, found_all_others;
depth = depth ? depth : 1;
//Get the longest remaining term
this_term = terms_esc[terms_esc.length - depth];
//Loop all the bites
for (i = 0; i < bites.length; i++) {
bite = bites[i];
//Reset the lastIndex since we're reusing the RegExp objects
this_term.lastIndex = 0;
//Check that we have a string (ie. do not attempt to match bites
//that are already consumed)
if (typeof bite === 'string') {
//Find the next matching position (if any)
while (match = this_term.exec(bite)) {
new_bites = (i > 0) ? bites.slice(0, i) : [];
if (match.index > 0) {
new_bites.push(bite.slice(0, match.index));
}
new_bites.push(['<u>' + match[0] + '</u>']);
if (this_term.lastIndex < bite.length) {
new_bites.push(bite.slice(this_term.lastIndex));
}
if (i < bites.length - 1) {
new_bites = new_bites.concat(bites.slice(i + 1));
}
if (terms_esc.length > depth) {
//Attempt to find all other terms
found_all_others = divide_and_conquer_inner(new_bites, depth + 1);
//If we found all terms we'll pass the modified string all the
//way up to the original callee
if (found_all_others) {
return found_all_others;
}
//Otherwise try to match current term somewhere else
this_term.lastIndex = match.index + 1;
} else {
//If no terms remain we have a match
return new_bites.join('');
}
}
}
}
//If we reach this point at least one term was not found
return null;
};
// Split query in terms at delimiter
terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
//Sort terms according to length - longest term last
terms.sort(function(a, b) {
return a.length - b.length;
});
//Escape terms
//And store RegExp's instead of strings
terms_esc = terms.map(function (term) {
return term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&");
}).map(function (term) {
return new RegExp(term, 'gi');
});
//Loop through each option
return options.map(function(option){
return divide_and_conquer_inner([option]);
}).filter(Boolean);
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';
function divide_and_conquer(){
var query = document.getElementById('divide_and_conquer').value;
var res_elm = document.getElementById('divide_and_conquer_result');
var t0 = performance.now();
var results = divide_and_conquer_replace(query, options, separator);
var t1 = performance.now();
document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
res_elm.innerHTML = '';
for (var result of results) {
res_elm.innerHTML += '<li>' + result + '</li>';
}
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
当查询仅由(通常非常短的)字符串组成时,此策略会出现性能问题,而这些字符串在许多选项中都存在/大部分存在 - 例如 a a a a a a a a
...
在现实场景中,它目前优于其他提议的算法 - 请参阅添加到问题中的 link to jsperf。
我建议分而治之的想法略有不同:不是将字符串分成块(字节),而是 "wipe out" 匹配的字符,然后对该字符执行进一步搜索细绳。要擦除的字符将是分隔符,因为它保证不会出现在任何术语中。
这里是:
function trincotWipeSearch(query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
wipe: separator.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&"), 'gi')
}));
function getOffsets(termIndex, text) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, wipe } = items[termIndex];
regex.lastIndex = 0;
while (match = regex.exec(text)) {
let index = match.index;
// Wipe characters and recurse to find other terms
let offsets = getOffsets(termIndex+1,
text.substr(0, index) + wipe + text.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
regex.lastIndex = match.index + 1;
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option);
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotWipeSearch(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
我已经从时间测量中排除了 DOM I/O。
这里是 jsfiddle 并排比较两种算法。添加第三种算法以与其他算法进行比较应该不难。
当分隔符可以是任何正则表达式时
...那么上面的功能就不能用了。克服这个问题的一种方法是引入一个 "shadow" 字符串,与选项字符串一样长,但其中只有 2 个不同的可能字符(例如 .
和 x
):
两者之一表示选项字符串中的相应字符(即在同一位置)已与一个术语匹配,因此不再可用于另一个术语的匹配。
另一个字符表示选项字符串中的相应字符仍可用于包含在术语匹配中。
显然这会使函数变慢一点,因为在检查此影子字符串后可能需要拒绝一些匹配项:
function trincotShadowMarks (query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
used: 'x'.repeat(term.length),
free: '.'.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&"), 'gi')
}));
function getOffsets(termIndex, text, shadow) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, used, free } = items[termIndex];
regex.lastIndex = 0;
while (regex.lastIndex > -1 && (match = regex.exec(text))) {
let index = match.index;
// Is this match not overlapping with another match?
if (!shadow.substr(index, size).includes('x')) {
// Mark position as used and recurse to find other terms
let offsets = getOffsets(termIndex+1, text,
shadow.substr(0, index) + used + shadow.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
}
regex.lastIndex = shadow.indexOf(free, match.index + 1);
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option, '.'.repeat(option.length));
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotShadowMarks(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
更新 2
由于在 Vue 中恢复工作字符串的问题,放弃了缩小集合的概念。
现在方法简单如下:
- 预处理选项集以保持显示与工作同步。
- 处理条款。
- 通过迭代来减少(过滤)选项集,循环遍历条款,在不匹配时短路。
- 使用缩减集,遍历每个选项,找到匹配范围。
- 在每个匹配范围周围插入 HTML 个字符串。
代码已注释。
原始 javascript(记录 filtered/manipulated 选项数组):https://jsfiddle.net/pvLj9uxe/14/
新的 Vue 实现:https://jsfiddle.net/15prcpxn/30/
计算速度似乎相当快 - DOM 更新导致它失败。
添加到比较中*:https://jsfiddle.net/ektyx133/4/
*警告:预处理选项(视为 "static")是策略的一部分,因此已在基准之外进行处理。
var separator = /\s|\*|,/;
// this function enhances the raw options array
function enhanceOptions(options) {
return options.map(option => ({
working: option.toLowerCase(), // for use in filtering the set and matching
display: option // for displaying
}))
}
// this function changes the input to lower case, splits the input into terms, removes empty strings from the array, and enhances the terms with the size and wiping string
function processInput(input) {
return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
value: term.toLowerCase(),
size: term.length,
wipe: " ".repeat(term.length)
})).sort((a, b) => b.size - a.size);
}
// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
let options = enhancedOptions,
l = terms.length;
// filter the options - consider recursion instead
options = options.filter(option => {
let i = 0,
working = option.working,
term;
while (i < l) {
if (!~working.indexOf((term = terms[i]).value)) return false;
working = working.replace(term.value, term.wipe);
i++;
}
return true;
})
// generate the display string array
let displayOptions = options.map(option => {
let rangeSet = [],
working = option.working,
display = option.display;
// find the match ranges
terms.forEach(term => {
working = working.replace(term.value, (match, offset) => { // duplicate the wipe string replacement from the filter, but grab the offsets
rangeSet.push({
start: offset,
end: offset + term.size
});
return term.wipe;
})
})
// sort the match ranges, last to first
rangeSet.sort((a, b) => b.start - a.start);
// insert the html tags within the string around each match range
rangeSet.forEach(range => {
display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
})
return display;
})
return displayOptions;
}
旧尝试
https://jsfiddle.net/15prcpxn/25/
我的尝试,使用 Vue 进行渲染(方法是顺序的,因此您可以毫不费力地将它们全部放入一个整体函数中 - 输入将是术语和完整的选项集;输出将是过滤选项集,和突出显示的范围)。
- 将输入拆分为单独的项
- 按长度对术语进行排序(最长的术语在前,这样当您有诸如
"abc ab"
和术语 "a abc"
之类的选项时,即一个术语是另一个术语的子串,它将能够匹配 "abc"
)
- Copy/change 项小写
- 将选项(我们的"display set")复制为小写(我们的"working set")
- 对于每个术语,从 "working set" 中删除不匹配的工作选项,并同时从 "display set" 中删除显示选项 - 这样做时,从幸存的工作选项中删除匹配的术语字符串字符串,例如选项
"abc"
中的匹配项 "a"
产生 "bc"
[实际实现是相反的:对于每个项,在有匹配项时并行地重新创建 "working set"将显示选项添加到 "display set",然后将这些集传递给下一个术语] - 这为我们提供了过滤后的显示集
- 将过滤后的显示集复制为小写,为我们提供新的过滤后工作集
- 对于剩余过滤工作集中的每个工作选项,通过记录范围(即开始和结束,例如在选项
"abc"
中匹配术语 "a"
创建一个范围集:start = 0, end = 1
) 其中每个术语通过匹配的偏移量(开始)和 term/match 的长度进行匹配。用与该术语长度相等的空格(或其他未使用的字符)替换匹配的字符串,并将其提供给下一个术语,例如选项 "abc"
中的匹配项 "a"
产生 " bc"
- 这保留了工作选项的长度,确保过滤后的工作集(小写)与过滤后的显示集(原始大小写)相匹配.范围集的总数将等于筛选选项集中剩余选项的数量。
- 此外,对每个范围集中的范围进行排序(按降序排列,反向工作),以允许插入字符串。
- 对于过滤后的显示集中的每个选项,(反向操作以免在操作字符串时干扰索引)通过切片在匹配范围周围插入
<u>
</u>
标记向上显示选项,例如选项 "abc"
中的匹配项 "a"
:new option = "<u>" + "a" + "</u>" + "bc"
- 渲染它
当有很多匹配/无用的术语时(例如,当您输入单个字符时),性能会很差。对于最终用途,我可能会延迟输入计算。
我应该能够将其中一些步骤汇总为更少的步骤,这可能会提高性能。我明天再去看看。
Vue 大概还通过虚拟 DOM 等进行一些优化,因此它不一定反映香草 Javascript / DOM 渲染。
这是一个与我之前的回答完全不同的方法——我无法将以下所有内容添加到(大小限制),所以...这是一个单独的答案。
广义后缀树:预处理选项
generalized suffix tree 是一种理论上允许以有效方式在一组字符串中搜索子字符串的结构。所以我想我会尝试一下。
以有效的方式构建这样一棵树绝非易事,从 this awesome explanation of Ukkonen's algorithm, which concerns building a suffix tree 一个短语(选项)可以看出。
我从实现 found here 中获得了灵感,它需要一些适应:
- 应用更好的编码风格(例如摆脱非显式声明的全局变量)
- 无需在文本后添加分隔符即可使其正常工作。这真的很棘手,我希望我没有错过一些边界条件
- 使其适用于多个字符串(即广义)
所以这里是:
"use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also:
class Node {
constructor() {
this.edges = {};
this.suffixLink = null;
}
addEdge(ch, textId, start, end, node) {
this.edges[ch] = { textId, start, end, node };
}
}
class Nikkonen extends Node {
constructor() {
super(); // root node of the tree
this.texts = [];
}
findNode(s) {
if (!s.length) return;
let node = this,
len,
suffixSize = 0,
edge;
for (let i = 0; i < s.length; i += len) {
edge = node.edges[s.charAt(i)];
if (!edge) return;
len = Math.min(edge.end - edge.start, s.length - i);
if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
node = edge.node;
}
return { edge, len };
}
findAll(term, termId = 1) {
const { edge, len } = this.findNode(term) || {};
if (!edge) return {}; // not found
// Find all leaves
const matches = new Map;
(function recurse({ node, textId, start, end }, suffixLen) {
suffixLen += end - start;
const edges = Object.values(node.edges);
if (!edges.length) { // leaf node: calculate the match
if (!(matches.has(textId))) matches.set(textId, []);
matches.get(textId).push({ offset: end - suffixLen, termId });
return;
}
edges.forEach( edge => recurse(edge, suffixLen) );
})(edge, term.length - len);
return matches;
}
addText(text) {
// Implements Nikkonen's algorithm for building the tree
// Inspired by https://felix-halim.net/misc/suffix-tree/
const root = this,
active = {
node: root,
textId: this.texts.length,
start: 0,
end: 0,
},
texts = this.texts;
// Private functions
function getChar(textId, i) {
return texts[textId].charAt(i) || '$' + textId;
}
function addEdge(fromNode, textId, start, end, node) {
fromNode.addEdge(getChar(textId, start), textId, start, end, node);
}
function testAndSplit() {
const ch = getChar(active.textId, active.end);
if (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)],
splitPoint = edge.start + active.end - active.start;
if (ch === getChar(edge.textId, splitPoint)) return;
const newNode = new Node();
addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
return newNode;
}
if (!(ch in active.node.edges)) return active.node;
}
function canonize() {
while (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)];
if (edge.end - edge.start > active.end - active.start) break;
active.start += edge.end - edge.start;
active.node = edge.node;
}
}
function update() {
let prevNewNode = root,
newNode;
while (newNode = testAndSplit()) {
addEdge(newNode, active.textId, active.end, text.length+1, new Node());
// Rule 2: add suffix-link from previously inserted node
if (prevNewNode !== root) {
prevNewNode.suffixLink = newNode;
}
prevNewNode = newNode;
// Rule 3: follow suffixLink after split
active.node = active.node.suffixLink || root;
canonize(); // because active.node changed
}
if (prevNewNode !== root) {
prevNewNode.suffixLink = active.node;
}
}
texts.push(text);
if (!root.suffixLink) root.suffixLink = new Node();
for (let i = 0; i < text.length; i++) {
addEdge(root.suffixLink, active.textId, i, i+1, root);
}
// Main Ukkonen loop: add each character from left to right to the tree
while (active.end <= text.length) {
update();
active.end++;
canonize(); // because active.end changed
}
}
}
function trincotSuffixTree(query, options, suffixTree, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// create Map keyed by term with count info
const termMap = new Map(terms.map( (term, termId) => [term, { termId, count: 0, leftOver: 0, size: term.length }] ));
terms.forEach( (term) => termMap.get(term).count++ );
function getNonOverlaps(offsets, leftOver, lastIndex = 0, offsetIndex = 0) {
// All terms found?
if (!leftOver) return [];
let giveUpAt = Infinity;
// While still enough matches left over:
while (offsetIndex + leftOver <= offsets.length) {
const { termId, offset } = offsets[offsetIndex++];
if (offset < lastIndex) continue; // overlap, try next
if (offset >= giveUpAt) break; // Looking further makes no sense
const termInfo = termMap.get(terms[termId]);
//console.log('termId', termId, 'offset', offset, 'size', termInfo.size, 'lastIndex', lastIndex);
if (!termInfo.leftOver) continue; // too many of the same term, try next
termInfo.leftOver--;
const result = getNonOverlaps(offsets, leftOver - 1, offset + termInfo.size, offsetIndex);
// If success, then completely backtrack out of recursion.
if (result) return result.concat([offset + termInfo.size, offset]);
termInfo.leftOver++; // restore after failed recursive search and try next
// If a term-match at a given offset could not lead to a solution (in recursion),
// and if we keep those matched character postions all unmatched and only start matching after
// the end of that location, it will certainly not lead to a solution either.
giveUpAt = Math.min(giveUpAt, offset + termInfo.size);
}
}
let allTermsAllOptionsOffsets;
// Loop through the unique terms:
for (let [term, termInfo] of termMap) {
// Get the offsets of the matches of this term in all options (in the preprocessed tree)
const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
//console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
if (!allTermsAllOptionsOffsets) {
allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
} else {
// Merge with all previously found offsets for other terms (intersection)
for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
let newOffsets = thisTermAllOptionsOffsets.get(optionId);
if (!newOffsets || newOffsets.length < termInfo.count) {
// this option does not have enough occurrences of this term
allTermsAllOptionsOffsets.delete(optionId);
} else {
allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
}
}
if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
}
}
// Per option, see if (and where) the offsets can serve non-overlapping matches for each term
const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
// Indicate how many of each term must (still) be matched:
termMap.forEach( obj => obj.leftOver = obj.count );
return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
})
// Remove options that could not provide non-overlapping offsets
.filter( ([_, offsets]) => offsets )
// Sort the remaining options in their original order
.sort( (a,b) => a[0] - b[1] )
// Replace optionId, by the corresponding text and apply mark-up at the offsets
.map( ([optionId, offsets]) => {
let option = options[optionId];
offsets.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
});
//console.log(JSON.stringify(matches));
return matches;
}
function trincotPreprocess(options) {
const nikkonen = new Nikkonen();
// Add all the options (lowercased) to the suffic tree
options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
return nikkonen;
}
const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
let preprocessed;
function processInput() {
if (!preprocessed) { // Only first time
const t0 = performance.now();
preprocessed = trincotPreprocess(options);
const spentTime = performance.now() - t0;
// Output the time spent on preprocessing
pretime.textContent = spentTime.toFixed(2);
}
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotSuffixTree(query, options, preprocessed, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Suffix Tree Search</h3>
Preprocessing Time: <span id="pretime"></span>ms (only done once)<br>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
这个方法背后有很多代码,所以我想它可能不会对小数据集表现出有趣的性能,而对于更大的数据集,它会消耗内存:树比原始选项数组占用更多内存.
要求是:
- 从数组中查找包含 ALL ANY[ 中的术语的字符串(从这里开始称为 options) =100=]订单
- 正确突出显示匹配项 - 即。在每个匹配项前后插入一个字符串 - 我在这里使用
<u>
和</u>
- 搜索查询和选项都可以是“任何”
为了简单起见,答案可以专注于通过仅包含 ASCII 字符的列表进行不区分大小写的搜索,并假设术语分隔符是普通的 space - 即。输入为“Foo bar baz”的查询意味着搜索词是 ['foo', 'bar', 'baz']
.
澄清一下:
- 所有术语必须在匹配选项中彼此分开存在 - 即。较短的术语不应仅作为较长术语的子串存在,并且两个术语不应重叠
- 选项中重复搜索词的出现次数必须至少与查询中出现的次数相同
最终的应用程序(也许并不奇怪)是一种自动完成功能。
TL;DR
Most recent fiddle comparing the proposed algorithms side by side:
https://jsfiddle.net/Mikk3lRo/ndeuqn02/7/
(feel free to update this link if you add a new algorithm)jsPerf comparing algorithms in a somewhat more realistic / representative way - a few strings are basically "entered" one character at a time on each rep:
https://jsperf.com/comparison-of-algorithms-to-search-and-highlightAt this point it is clear (thanks to trincot's excellent base-comparison) that the majority of time used by the original implementations was spent on DOM-output. Its significance has been minimized as much as possible in the fiddle.
There is still a clear difference in performance between the algorithms in each search, but not one of them is consistently fastest on every keystroke. After revisiting and cleaning up my own "Divide and Conquer" it does outperform the others consistently in any realistic scenario I try though.
Tigregalis introduced the idea of a pre-run optimization, which seems reasonable given that options are unlikely to change between keystrokes. I have added (a function for) this to all methods here. The only algorithm where I saw an obvious benefit from it was in Skirtle's Permutations, but I'll encourage each answerer to consider if it might be useful for their own algorithms.
Some algorithms will be much easier to adapt than others. It is still my opinion that this will be more important than the minor performance differences in a real implementation.
Note that the current version of Tigregalis' Shrinking Set has a bug - I've excluded it from fiddle and jsperf until that is fixed.
病毒排列
理论上,这可以通过“手动”构建一个 RegExp 来解决,该正则表达式包含由捕获组分隔的搜索词的所有可能排列,以捕获词之间的任何内容 - 搜索“foo bar”会导致 (foo)(.*?)(bar)|(bar)(.*?)(foo)
.
然后使用 string.replace()
一次性完成突出显示。如果字符串有任何变化,我们就会匹配。
演示:
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var terms, terms_esc;
function viral_permutations() {
var t0, t1, i, permuted, res_elm, meta_elm, regex_string, regex, li, option, match_groups, highlighted;
meta_elm = document.getElementById('viral_permutations_meta');
res_elm = document.getElementById('viral_permutations_result');
res_elm.innerHTML = meta_elm.innerHTML = '';
t0 = performance.now();
//Split query in terms at delimiter and lowercase them
terms = document.getElementById('viral_permutations').value.split(/\s/).filter(function(n) {
return n;
}).map(function(term){
return term.toLowerCase();
});
meta_elm.innerHTML += 'Terms: ' + JSON.stringify(terms) + '<br>';
//Escape terms
terms_esc = terms.map(function(term) {
return term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&");
});
//Wrap terms in in individual capturing groups
terms_esc = terms.map(function(term) {
return '(' + term + ')';
});
//Find all permutations
permuted = permutate_array(terms_esc);
//Construct a group for each permutation
match_groups = [];
for (var i in permuted) {
match_groups.push(permuted[i].join('(.*?)'));
}
try {
//Construct the final regex
regex_string = match_groups.join('|');
//Display it
document.getElementById('viral_permutations_regex').innerHTML = regex_string;
meta_elm.innerHTML += 'RegExp length: ' + regex_string.length + '<br>';
regex = new RegExp(regex_string, 'i');
//Loop through each option
for (i = 0; i < options.length; i++) {
option = options[i];
//Replace the terms with highlighted terms
highlighted = option.replace(regex, viral_permutations_replace);
//If anything was changed (or the query is empty) we have a match
if (highlighted !== option || terms.length === 0) {
//Append it to the result list
li = document.createElement('li');
li.innerHTML = highlighted;
res_elm.appendChild(li);
}
}
//Print some meta
t1 = performance.now();
meta_elm.innerHTML += 'Time: ' + (Math.round((t1 - t0) * 100) / 100) + 'ms';
} catch(e) {
meta_elm.innerHTML += '<span style="color:red">' + e.message + '</span>';
}
}
//The replacement function
function viral_permutations_replace() {
var i, m, j, retval, m_cased, unmatched;
retval = '';
//Make a clone of the terms array (that we can modify without destroying the original)
unmatched = terms.slice(0);
//Loop arguments between the first (which is the full match) and
//the last 2 (which are the offset and the full option)
for (i = 1; i < arguments.length - 1; i++) {
m = arguments[i];
//Check that we have a string - most of the arguments will be undefined
if (typeof m !== 'string') continue;
//Lowercase the match
m_cased = m.toLowerCase();
//Append it to the return value - highlighted if it is among our terms
j = unmatched.indexOf(m_cased);
if (j >= 0) {
//Remove it from the unmatched terms array
unmatched.splice(j, 1);
retval += '<u>' + m + '</u>';
} else {
retval += m;
}
}
return retval;
}
//Helper function to return all possible permutations of an array
function permutate_array(arr) {
var perm, perm_intern;
perm_intern = function(perm, pre, post, n) {
var elem, i, j, ref, rest;
if (n > 0) {
for (i = j = 0, ref = post.length; 0 <= ref ? j < ref : j > ref; i = 0 <= ref ? ++j : --j) {
rest = post.slice(0);
elem = rest.splice(i, 1);
perm_intern(perm, pre.concat(elem), rest, n - 1);
}
} else {
perm.push(pre);
}
};
perm = [];
perm_intern(perm, [], arr, arr.length);
return perm;
}
viral_permutations();
<input type="text" id="viral_permutations" onkeyup="viral_permutations()">
<p id="viral_permutations_meta"></p>
<pre style="width:100%;overflow:auto;background:#eee" id="viral_permutations_regex"></pre>
<ul style="height:300px;overflow:auto" id="viral_permutations_result"></ul>
感谢 trincot 指出我的原始版本偶尔会两次突出显示重复出现的术语 - 此代码段已修复。
失败因为:
- 随着术语的添加,正则表达式呈指数级增长。在 7 个术语(甚至是单个字母)时,它超过了 250kb,我的浏览器放弃
Error: regexp too big
...
其他一些没有奏效的值得注意的策略:
正在捕获每组中包含所有术语的组:
(foo|bar)(.*)(foo|bar)
失败因为:
- 将匹配包含重复项的选项 - fx。
The food in the food court
会匹配,但显然不应该匹配。 - 如果我们“仔细检查”所有条款,事实上,发现它将无法找到有效的匹配 - fx。
The food in the food bar
会找到foo
两次,永远不会找到bar
。
否定前瞻和反向引用:
(foo|bar|baz)(.*?)((?!)(?:foo|bar|baz))(.*?)((?!|)(?:foo|bar|baz))
失败因为:
- 只要在查询中重复术语,它就会达到不可能的条件,例如“给我找一个
foo
、bar
或bar
,这不是foo
也不是bar
". - 我相当确定它还有其他问题,但当我意识到它在逻辑上有缺陷时我就停止了追求它。
正面前瞻
(?=.*foo)(?=.*bar)(?=.*baz)
失败因为:
- 很难(如果不是不可能)可靠地突出显示找到的匹配项。
- 我还没有找到任何方法来确保所有条款确实存在 - 即。它们可能都单独出现在选项中,但较短的术语可能仅作为较长术语的子串存在 - 或者术语可能重叠。
我试了一下,但我不确定这会有多大帮助。我的方法类似于您的分而治之技术。
我没有咬掉字符串的位,而是提前搜索每个术语并存储所有匹配项的集合,记录开始和结束位置。如果特定搜索词没有足够的匹配项,算法会立即放弃 'option'.
一旦它收集了所有可能的匹配项,它就会递归地尝试找到不重叠的组合。在那个递归中有很多数据结构的复制,我怀疑它可以比我在这里优化得更好。我也只能为一些变量名道歉,我一直在努力想出一个完全有意义的名字。
对于某些测试搜索,例如 a n a n a n a n ...
,它似乎比原来的分而治之技术更能应对,但我怀疑这可能只是因为在匹配不足时执行的早期救助找到特定搜索词。没有大量的真实数据,很难知道真正有价值的优化在哪里。
function search() {
var options = [
'ababababa',
'United States',
'United Kingdom',
'Afghanistan',
'Aland Islands',
'Albania',
'Algeria',
'American Samoa',
'Andorra',
'Angola',
'Anguilla',
'Antarctica',
'Antigua and Barbuda',
'Argentina',
'Armenia',
'Aruba',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bermuda',
'Bhutan',
'Bolivia, Plurinational State of',
'Bonaire, Sint Eustatius and Saba',
'Bosnia and Herzegovina',
'Botswana',
'Bouvet Island',
'Brazil',
'British Indian Ocean Territory',
'Brunei Darussalam',
'Bulgaria',
'Burkina Faso',
'Burundi',
'Cambodia',
'Cameroon',
'Canada',
'Cape Verde',
'Cayman Islands',
'Central African Republic',
'Chad',
'Chile',
'China',
'Christmas Island',
'Cocos (Keeling) Islands',
'Colombia',
'Comoros',
'Congo',
'Congo, The Democratic Republic of The',
'Cook Islands',
'Costa Rica',
'Cote D\'ivoire',
'Croatia',
'Cuba',
'Curacao',
'Cyprus',
'Czech Republic',
'Denmark',
'Djibouti',
'Dominica',
'Dominican Republic',
'Ecuador',
'Egypt',
'El Salvador',
'Equatorial Guinea',
'Eritrea',
'Estonia',
'Ethiopia',
'Falkland Islands (Malvinas)',
'Faroe Islands',
'Fiji',
'Finland',
'France',
'French Guiana',
'French Polynesia',
'French Southern Territories',
'Gabon',
'Gambia',
'Georgia',
'Germany',
'Ghana',
'Gibraltar',
'Greece',
'Greenland',
'Grenada',
'Guadeloupe',
'Guam',
'Guatemala',
'Guernsey',
'Guinea',
'Guinea-bissau',
'Guyana',
'Haiti',
'Heard Island and Mcdonald Islands',
'Holy See (Vatican City State)',
'Honduras',
'Hong Kong',
'Hungary',
'Iceland',
'India',
'Indonesia',
'Iran, Islamic Republic of',
'Iraq',
'Ireland',
'Isle of Man',
'Israel',
'Italy',
'Jamaica',
'Japan',
'Jersey',
'Jordan',
'Kazakhstan',
'Kenya',
'Kiribati',
'Korea, Democratic People\'s Republic of',
'Korea, Republic of',
'Kuwait',
'Kyrgyzstan',
'Lao People\'s Democratic Republic',
'Latvia',
'Lebanon',
'Lesotho',
'Liberia',
'Libya',
'Liechtenstein',
'Lithuania',
'Luxembourg',
'Macao',
'Macedonia, The Former Yugoslav Republic of',
'Madagascar',
'Malawi',
'Malaysia',
'Maldives',
'Mali',
'Malta',
'Marshall Islands',
'Martinique',
'Mauritania',
'Mauritius',
'Mayotte',
'Mexico',
'Micronesia, Federated States of',
'Moldova, Republic of',
'Monaco',
'Mongolia',
'Montenegro',
'Montserrat',
'Morocco',
'Mozambique',
'Myanmar',
'Namibia',
'Nauru',
'Nepal',
'Netherlands',
'New Caledonia',
'New Zealand',
'Nicaragua',
'Niger',
'Nigeria',
'Niue',
'Norfolk Island',
'Northern Mariana Islands',
'Norway',
'Oman',
'Pakistan',
'Palau',
'Palestinian Territory, Occupied',
'Panama',
'Papua New Guinea',
'Paraguay',
'Peru',
'Philippines',
'Pitcairn',
'Poland',
'Portugal',
'Puerto Rico',
'Qatar',
'Reunion',
'Romania',
'Russian Federation',
'Rwanda',
'Saint Barthelemy',
'Saint Helena, Ascension and Tristan da Cunha',
'Saint Kitts and Nevis',
'Saint Lucia',
'Saint Martin (French part)',
'Saint Pierre and Miquelon',
'Saint Vincent and The Grenadines',
'Samoa',
'San Marino',
'Sao Tome and Principe',
'Saudi Arabia',
'Senegal',
'Serbia',
'Seychelles',
'Sierra Leone',
'Singapore',
'Sint Maarten (Dutch part)',
'Slovakia',
'Slovenia',
'Solomon Islands',
'Somalia',
'South Africa',
'South Georgia and The South Sandwich Islands',
'South Sudan',
'Spain',
'Sri Lanka',
'Sudan',
'Suriname',
'Svalbard and Jan Mayen',
'Swaziland',
'Sweden',
'Switzerland',
'Syrian Arab Republic',
'Taiwan, Province of China',
'Tajikistan',
'Tanzania, United Republic of',
'Thailand',
'Timor-leste',
'Togo',
'Tokelau',
'Tonga',
'Trinidad and Tobago',
'Tunisia',
'Turkey',
'Turkmenistan',
'Turks and Caicos Islands',
'Tuvalu',
'Uganda',
'Ukraine',
'United Arab Emirates',
'United Kingdom',
'United States',
'United States Minor Outlying Islands',
'Uruguay',
'Uzbekistan',
'Vanuatu',
'Venezuela, Bolivarian Republic of',
'Viet Nam',
'Virgin Islands, British',
'Virgin Islands, U.S.',
'Wallis and Futuna',
'Western Sahara',
'Yemen',
'Zambia',
'Zimbabwe'
];
var terms = document.getElementById('search').value.trim().toLowerCase().split(/\s+/);
if (!terms[0]) {
terms = [];
}
document.getElementById('terms').innerText = 'Terms: ' + JSON.stringify(terms);
var startTime = performance.now();
// Term counts is a map storing how many times each search term appears in the query
var termCounts = {};
terms.forEach(function(term) {
termCounts[term] = (termCounts[term] || 0) + 1;
});
// An array of search terms with the duplicates removed
var uniqueTerms = Object.keys(termCounts);
// Loop through each option and map to either a highlight version or null
options = options.map(function(optionText) {
var matches = {},
lastMatchIndex = {},
option = optionText.toLowerCase();
uniqueTerms.forEach(function(term) {
// This array will be populated with start/end position of each match for this term
matches[term] = [];
// The index of the last match... which could be deduced from the matches but this is slightly easier
lastMatchIndex[term] = -1;
});
var incompleteMatchTerms = uniqueTerms.slice(),
nextMatchTerm;
// This is probably a premature optimization but doing it this
// way ensures we check that each search term occurs at least
// once as quickly as possible.
while (nextMatchTerm = incompleteMatchTerms.shift()) {
var nextMatchIndex = option.indexOf(nextMatchTerm, lastMatchIndex[nextMatchTerm] + 1);
if (nextMatchIndex === -1) {
// Haven't found enough matches for this term, so the option doesn't match
if (termCounts[nextMatchTerm] > matches[nextMatchTerm].length) {
return null;
}
}
else {
// Found another match, put the term back on the queue
// for another round
incompleteMatchTerms.push(nextMatchTerm);
lastMatchIndex[nextMatchTerm] = nextMatchIndex;
matches[nextMatchTerm].push({
start: nextMatchIndex,
end: nextMatchIndex + nextMatchTerm.length
});
}
}
// Pass in the original array of terms... we attempt to highlight in the order of the original query
var highlights = performHighlight(terms, matches);
if (!highlights) {
return null;
}
// We need the highlights sorted so that we do the replacing from the end of the string
highlights.sort(function(h1, h2) {
return h2.start - h1.start;
});
highlights.forEach(function(highlight) {
optionText = optionText.slice(0, highlight.start)
+ '<u>' + optionText.slice(highlight.start, highlight.end) + '</u>'
+ optionText.slice(highlight.end);
});
return optionText;
function performHighlight(terms, allMatches) {
// If there are no terms left to match we've got a hit
if (terms.length === 0) {
return [];
}
var nextTerms = terms.slice(),
term = nextTerms.shift(),
matches = allMatches[term].slice(),
match;
while (match = matches.shift()) {
var nextMatches = {};
// We need to purge any entries from nextMatches that overlap the current match
uniqueTerms.forEach(function(nextTerm) {
var nextMatch = term === nextTerm ? matches : allMatches[nextTerm];
nextMatches[nextTerm] = nextMatch.filter(function(match2) {
return match.start >= match2.end || match.end <= match2.start;
});
});
var highlights = performHighlight(nextTerms, nextMatches);
if (highlights) {
highlights.push(match);
return highlights;
}
}
return null;
}
});
document.getElementById('results').innerHTML = options.map(function(option) {
if (option) {
return '<li>' + option + '</li>';
}
return '';
}).join('');
document.getElementById('time').innerText = Math.round((performance.now() - startTime) * 100) / 100 + 'ms';
}
<h1>Permutations</h1>
<input type="text" id="search" onkeyup="search()" autocomplete="off">
<p id="terms"></p>
<p id="time"></p>
<ul id="results"></ul>
更新:
根据 Mikk3lRo 在评论中的反馈,我做了一些性能调整并提出了这个:
https://jsfiddle.net/skirtle/ndeuqn02/1/
核心算法是一样的,但为了提高性能,我让它变得更难理解了。大多数更改与尽可能避免创建新对象有关。
由于算法会预先搜索很多它可能永远不需要的东西,所以其他算法总是有机会更快,特别是在简单的情况下。其中许多情况可以单独处理,但我还没有尝试过那种优化。
在 Chrome 中,它现在在许多不同的场景中都优于其他实现,尽管这是一个不公平的比较,因为它们尚未以相同的方式进行调整。对于简单搜索,其他实现在 Firefox 中的速度往往稍快一些,但时间都在同一个范围内。
一些特别有趣的搜索:
a ab ba baba
。我添加了一个新的 'option' 并调整了 CSS 来演示这一点。这些算法的不同之处在于它们选择的执行突出显示的方式。我的算法倾向于查询中术语的顺序,而不是基于术语的长度。如果我不担心顺序,还有进一步的优化可用,但我认为它们只会在极端的重叠情况下有所帮助。t r i s t a n d a c u n h a
。请注意字母之间的空格,这些是 14 个单独的搜索词。如果您一次键入一个术语,分而治之 将很快开始挣扎,但它确实会在最后恢复。 Wipe 和 Shadow 可以应付更长的时间,但是当您输入字母c
时,它们会从悬崖上掉下来。我认为这是回溯的指数爆炸,但我还没有证实这一点。我确信通过一些工作可以在简单的情况下解决它,但在回溯是由无法解决的重叠引起的情况下修复可能会更棘手。
我相信所有的实现都可以通过更多的调整和一些精心设计的特殊情况处理来加速。对于真实场景,哪个实际上是 'best' 我不确定,但我目前的感觉是我的算法可能只有一个狭窄的最佳点,在真正公平的测试中它会优于其他算法。一种不执行所有预先搜索的算法似乎很难被真正的搜索击败。
更新 2
我已经尝试了我之前方法的不同实现:
https://jsfiddle.net/skirtle/ndeuqn02/9/
请注意,我只更新了我自己的实现,其他的仍然是过时的。
我想我会尽量避免不必要的搜索,方法是懒惰地执行它们,而不是预先全部执行。它仍然缓存它们,以便在算法需要回溯时可以重用它们。我不知道这是否会产生显着差异,因为对短字符串执行少量额外搜索可能不会增加太多开销。
我也尝试过去掉函数递归。虽然它看起来确实有效,但我觉得存在错误的风险很高(需要大量单元测试才能确保它确实有效)。我不相信这部分真的很成功,因为涉及的数据结构让它很难理解。它看起来确实很快,但还不足以证明其复杂性。
我还尝试了其他方法来构建最终的亮点。所有这些排序和切片似乎都在消耗性能,但同样,为了避免这种情况,代码会变得更加复杂。不过,其中一些收益可能适用于其他算法。
我在这里介绍的另一个想法是查询词的预搜索分析(仅依赖于查询而不依赖于选项)。它检查术语是否可以重叠,对于不可能重叠的任何术语(例如 cat dog
),它使用更简单的算法来获取匹配项。这个想法也可能应用于其他算法。
如评论中所述,运行 对选项进行某种预搜索分析的想法也是可能的,但我并未在此处实际实施。很难知道哪种搜索索引最有益,因为它取决于内存使用情况和选项的具体情况。但是,尝试将少量信息从一次搜索转移到下一次搜索可能更实用。
例如,如果有人搜索 united states
,那么他们最后输入的内容很可能是最后的 s
,而他们之前的搜索是 united state
。基于此的两个潜在优化是:
united states
的匹配选项将是united state
的匹配选项的子集,因此如果我们记住该列表,就可以避免检查所有其他选项。这可以用于任何算法。- 就我的算法而言,匹配缓存可以从一次搜索保留到下一次搜索。虽然
state
的缓存条目不会被任何使用,但united
的条目从一次搜索到下一次搜索将完全相同,从而允许我们跳过该术语的算法中昂贵的部分。
分而治之
比单正则表达式 病毒排列 策略稍微复杂一些 - 这种递归算法一个接一个地搜索每个术语,从最长的术语开始。
每次找到匹配项时,它会将 "bite" 一分为三(除非在开头或结尾),将匹配的 "bite" 标记为已消耗,并尝试匹配下一个最长的术语在任何未消耗的 "bites".
当找不到最长的未匹配项时,它会回溯并尝试在不同的位置(甚至在不同的"bite")匹配前一个项。
如果它回到最长的任期并且找不到另一个匹配它的位置,它将return false。
这意味着在大多数情况下它可以很快 return 不匹配,仅仅是因为它们甚至不包含最长的术语。
当然,如果它用完了 - 即。成功匹配最短的 - 它会通过将所有 "bites" 重新组合在一起来 return 突出显示的匹配项。
演示:
已更新以提高性能:基本算法完全相同,但可以完全避免对 arr.slice()
的一些非常昂贵的调用。
function divide_and_conquer_replace(query, options, separator) {
var terms, terms_esc;
//The inner replacement function
function divide_and_conquer_inner(bites, depth) {
var this_term, i, bite, match, new_bites, found_all_others;
depth = depth ? depth : 1;
//Get the longest remaining term
this_term = terms_esc[terms_esc.length - depth];
//Loop all the bites
for (i = 0; i < bites.length; i++) {
bite = bites[i];
//Reset the lastIndex since we're reusing the RegExp objects
this_term.lastIndex = 0;
//Check that we have a string (ie. do not attempt to match bites
//that are already consumed)
if (typeof bite === 'string') {
//Find the next matching position (if any)
while (match = this_term.exec(bite)) {
new_bites = (i > 0) ? bites.slice(0, i) : [];
if (match.index > 0) {
new_bites.push(bite.slice(0, match.index));
}
new_bites.push(['<u>' + match[0] + '</u>']);
if (this_term.lastIndex < bite.length) {
new_bites.push(bite.slice(this_term.lastIndex));
}
if (i < bites.length - 1) {
new_bites = new_bites.concat(bites.slice(i + 1));
}
if (terms_esc.length > depth) {
//Attempt to find all other terms
found_all_others = divide_and_conquer_inner(new_bites, depth + 1);
//If we found all terms we'll pass the modified string all the
//way up to the original callee
if (found_all_others) {
return found_all_others;
}
//Otherwise try to match current term somewhere else
this_term.lastIndex = match.index + 1;
} else {
//If no terms remain we have a match
return new_bites.join('');
}
}
}
}
//If we reach this point at least one term was not found
return null;
};
// Split query in terms at delimiter
terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
//Sort terms according to length - longest term last
terms.sort(function(a, b) {
return a.length - b.length;
});
//Escape terms
//And store RegExp's instead of strings
terms_esc = terms.map(function (term) {
return term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&");
}).map(function (term) {
return new RegExp(term, 'gi');
});
//Loop through each option
return options.map(function(option){
return divide_and_conquer_inner([option]);
}).filter(Boolean);
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
var separator = ' ';
function divide_and_conquer(){
var query = document.getElementById('divide_and_conquer').value;
var res_elm = document.getElementById('divide_and_conquer_result');
var t0 = performance.now();
var results = divide_and_conquer_replace(query, options, separator);
var t1 = performance.now();
document.getElementById('divide_and_conquer_meta').innerHTML = 'Time: ' + (t1 - t0).toFixed(2) + 'ms';
res_elm.innerHTML = '';
for (var result of results) {
res_elm.innerHTML += '<li>' + result + '</li>';
}
};
divide_and_conquer();
<input type="text" id="divide_and_conquer" onkeyup="divide_and_conquer()">
<p id="divide_and_conquer_meta"></p>
<ul style="height:300px;overflow:auto" id="divide_and_conquer_result"></ul>
当查询仅由(通常非常短的)字符串组成时,此策略会出现性能问题,而这些字符串在许多选项中都存在/大部分存在 - 例如 a a a a a a a a
...
在现实场景中,它目前优于其他提议的算法 - 请参阅添加到问题中的 link to jsperf。
我建议分而治之的想法略有不同:不是将字符串分成块(字节),而是 "wipe out" 匹配的字符,然后对该字符执行进一步搜索细绳。要擦除的字符将是分隔符,因为它保证不会出现在任何术语中。
这里是:
function trincotWipeSearch(query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
wipe: separator.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&"), 'gi')
}));
function getOffsets(termIndex, text) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, wipe } = items[termIndex];
regex.lastIndex = 0;
while (match = regex.exec(text)) {
let index = match.index;
// Wipe characters and recurse to find other terms
let offsets = getOffsets(termIndex+1,
text.substr(0, index) + wipe + text.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
regex.lastIndex = match.index + 1;
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option);
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotWipeSearch(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
我已经从时间测量中排除了 DOM I/O。
这里是 jsfiddle 并排比较两种算法。添加第三种算法以与其他算法进行比较应该不难。
当分隔符可以是任何正则表达式时
...那么上面的功能就不能用了。克服这个问题的一种方法是引入一个 "shadow" 字符串,与选项字符串一样长,但其中只有 2 个不同的可能字符(例如 .
和 x
):
两者之一表示选项字符串中的相应字符(即在同一位置)已与一个术语匹配,因此不再可用于另一个术语的匹配。
另一个字符表示选项字符串中的相应字符仍可用于包含在术语匹配中。
显然这会使函数变慢一点,因为在检查此影子字符串后可能需要拒绝一些匹配项:
function trincotShadowMarks (query, options, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// Escape terms, and enrich with size of original term
// and a string of the same length filled with the separator char
const items = terms.map(term => ({
size: term.length,
used: 'x'.repeat(term.length),
free: '.'.repeat(term.length),
regex: new RegExp(term.replace(/[-[\]{}()*+?.,\^$|#\s]/g, "\$&"), 'gi')
}));
function getOffsets(termIndex, text, shadow) {
// All terms found?
if (termIndex >= terms.length) return [];
let match;
const { regex, size, used, free } = items[termIndex];
regex.lastIndex = 0;
while (regex.lastIndex > -1 && (match = regex.exec(text))) {
let index = match.index;
// Is this match not overlapping with another match?
if (!shadow.substr(index, size).includes('x')) {
// Mark position as used and recurse to find other terms
let offsets = getOffsets(termIndex+1, text,
shadow.substr(0, index) + used + shadow.substr(index + size));
if (offsets !== undefined) {
// Solution found, backtrack all the way
return offsets.concat([index, index + size]);
}
}
regex.lastIndex = shadow.indexOf(free, match.index + 1);
}
}
// Loop through each option
return options.map( option => {
// Get the offsets of the matches
let offsets = getOffsets(0, option, '.'.repeat(option.length));
if (offsets) {
// Apply the offsets to add the markup
offsets
.sort( (a,b) => b - a )
.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
}
}).filter(Boolean); // get only the non-empty results
}
var options = ['United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
function processInput() {
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotShadowMarks(query, options, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Wipe Search</h3>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
更新 2
由于在 Vue 中恢复工作字符串的问题,放弃了缩小集合的概念。
现在方法简单如下:
- 预处理选项集以保持显示与工作同步。
- 处理条款。
- 通过迭代来减少(过滤)选项集,循环遍历条款,在不匹配时短路。
- 使用缩减集,遍历每个选项,找到匹配范围。
- 在每个匹配范围周围插入 HTML 个字符串。
代码已注释。
原始 javascript(记录 filtered/manipulated 选项数组):https://jsfiddle.net/pvLj9uxe/14/
新的 Vue 实现:https://jsfiddle.net/15prcpxn/30/
计算速度似乎相当快 - DOM 更新导致它失败。
添加到比较中*:https://jsfiddle.net/ektyx133/4/
*警告:预处理选项(视为 "static")是策略的一部分,因此已在基准之外进行处理。
var separator = /\s|\*|,/;
// this function enhances the raw options array
function enhanceOptions(options) {
return options.map(option => ({
working: option.toLowerCase(), // for use in filtering the set and matching
display: option // for displaying
}))
}
// this function changes the input to lower case, splits the input into terms, removes empty strings from the array, and enhances the terms with the size and wiping string
function processInput(input) {
return input.trim().toLowerCase().split(separator).filter(term => term.length).map(term => ({
value: term.toLowerCase(),
size: term.length,
wipe: " ".repeat(term.length)
})).sort((a, b) => b.size - a.size);
}
// this function filters the data set, then finds the match ranges, and finally returns an array with HTML tags inserted
function filterAndHighlight(terms, enhancedOptions) {
let options = enhancedOptions,
l = terms.length;
// filter the options - consider recursion instead
options = options.filter(option => {
let i = 0,
working = option.working,
term;
while (i < l) {
if (!~working.indexOf((term = terms[i]).value)) return false;
working = working.replace(term.value, term.wipe);
i++;
}
return true;
})
// generate the display string array
let displayOptions = options.map(option => {
let rangeSet = [],
working = option.working,
display = option.display;
// find the match ranges
terms.forEach(term => {
working = working.replace(term.value, (match, offset) => { // duplicate the wipe string replacement from the filter, but grab the offsets
rangeSet.push({
start: offset,
end: offset + term.size
});
return term.wipe;
})
})
// sort the match ranges, last to first
rangeSet.sort((a, b) => b.start - a.start);
// insert the html tags within the string around each match range
rangeSet.forEach(range => {
display = display.slice(0, range.start) + '<u>' + display.slice(range.start, range.end) + '</u>' + display.slice(range.end)
})
return display;
})
return displayOptions;
}
旧尝试
https://jsfiddle.net/15prcpxn/25/
我的尝试,使用 Vue 进行渲染(方法是顺序的,因此您可以毫不费力地将它们全部放入一个整体函数中 - 输入将是术语和完整的选项集;输出将是过滤选项集,和突出显示的范围)。
- 将输入拆分为单独的项
- 按长度对术语进行排序(最长的术语在前,这样当您有诸如
"abc ab"
和术语"a abc"
之类的选项时,即一个术语是另一个术语的子串,它将能够匹配"abc"
) - Copy/change 项小写
- 将选项(我们的"display set")复制为小写(我们的"working set")
- 对于每个术语,从 "working set" 中删除不匹配的工作选项,并同时从 "display set" 中删除显示选项 - 这样做时,从幸存的工作选项中删除匹配的术语字符串字符串,例如选项
"abc"
中的匹配项"a"
产生"bc"
[实际实现是相反的:对于每个项,在有匹配项时并行地重新创建 "working set"将显示选项添加到 "display set",然后将这些集传递给下一个术语] - 这为我们提供了过滤后的显示集 - 将过滤后的显示集复制为小写,为我们提供新的过滤后工作集
- 对于剩余过滤工作集中的每个工作选项,通过记录范围(即开始和结束,例如在选项
"abc"
中匹配术语"a"
创建一个范围集:start = 0, end = 1
) 其中每个术语通过匹配的偏移量(开始)和 term/match 的长度进行匹配。用与该术语长度相等的空格(或其他未使用的字符)替换匹配的字符串,并将其提供给下一个术语,例如选项"abc"
中的匹配项"a"
产生" bc"
- 这保留了工作选项的长度,确保过滤后的工作集(小写)与过滤后的显示集(原始大小写)相匹配.范围集的总数将等于筛选选项集中剩余选项的数量。 - 此外,对每个范围集中的范围进行排序(按降序排列,反向工作),以允许插入字符串。
- 对于过滤后的显示集中的每个选项,(反向操作以免在操作字符串时干扰索引)通过切片在匹配范围周围插入
<u>
</u>
标记向上显示选项,例如选项"abc"
中的匹配项"a"
:new option = "<u>" + "a" + "</u>" + "bc"
- 渲染它
当有很多匹配/无用的术语时(例如,当您输入单个字符时),性能会很差。对于最终用途,我可能会延迟输入计算。
我应该能够将其中一些步骤汇总为更少的步骤,这可能会提高性能。我明天再去看看。
Vue 大概还通过虚拟 DOM 等进行一些优化,因此它不一定反映香草 Javascript / DOM 渲染。
这是一个与我之前的回答完全不同的方法——我无法将以下所有内容添加到(大小限制),所以...这是一个单独的答案。
广义后缀树:预处理选项
generalized suffix tree 是一种理论上允许以有效方式在一组字符串中搜索子字符串的结构。所以我想我会尝试一下。
以有效的方式构建这样一棵树绝非易事,从 this awesome explanation of Ukkonen's algorithm, which concerns building a suffix tree 一个短语(选项)可以看出。
我从实现 found here 中获得了灵感,它需要一些适应:
- 应用更好的编码风格(例如摆脱非显式声明的全局变量)
- 无需在文本后添加分隔符即可使其正常工作。这真的很棘手,我希望我没有错过一些边界条件
- 使其适用于多个字符串(即广义)
所以这里是:
"use strict";
// Implementation of a Generalized Suffix Tree using Ukkonen's algorithm
// See also:
class Node {
constructor() {
this.edges = {};
this.suffixLink = null;
}
addEdge(ch, textId, start, end, node) {
this.edges[ch] = { textId, start, end, node };
}
}
class Nikkonen extends Node {
constructor() {
super(); // root node of the tree
this.texts = [];
}
findNode(s) {
if (!s.length) return;
let node = this,
len,
suffixSize = 0,
edge;
for (let i = 0; i < s.length; i += len) {
edge = node.edges[s.charAt(i)];
if (!edge) return;
len = Math.min(edge.end - edge.start, s.length - i);
if (this.texts[edge.textId].substr(edge.start, len) !== s.substr(i, len)) return;
node = edge.node;
}
return { edge, len };
}
findAll(term, termId = 1) {
const { edge, len } = this.findNode(term) || {};
if (!edge) return {}; // not found
// Find all leaves
const matches = new Map;
(function recurse({ node, textId, start, end }, suffixLen) {
suffixLen += end - start;
const edges = Object.values(node.edges);
if (!edges.length) { // leaf node: calculate the match
if (!(matches.has(textId))) matches.set(textId, []);
matches.get(textId).push({ offset: end - suffixLen, termId });
return;
}
edges.forEach( edge => recurse(edge, suffixLen) );
})(edge, term.length - len);
return matches;
}
addText(text) {
// Implements Nikkonen's algorithm for building the tree
// Inspired by https://felix-halim.net/misc/suffix-tree/
const root = this,
active = {
node: root,
textId: this.texts.length,
start: 0,
end: 0,
},
texts = this.texts;
// Private functions
function getChar(textId, i) {
return texts[textId].charAt(i) || '$' + textId;
}
function addEdge(fromNode, textId, start, end, node) {
fromNode.addEdge(getChar(textId, start), textId, start, end, node);
}
function testAndSplit() {
const ch = getChar(active.textId, active.end);
if (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)],
splitPoint = edge.start + active.end - active.start;
if (ch === getChar(edge.textId, splitPoint)) return;
const newNode = new Node();
addEdge(active.node, edge.textId, edge.start, splitPoint, newNode);
addEdge(newNode, edge.textId, splitPoint, edge.end, edge.node);
return newNode;
}
if (!(ch in active.node.edges)) return active.node;
}
function canonize() {
while (active.start < active.end) {
const edge = active.node.edges[getChar(active.textId, active.start)];
if (edge.end - edge.start > active.end - active.start) break;
active.start += edge.end - edge.start;
active.node = edge.node;
}
}
function update() {
let prevNewNode = root,
newNode;
while (newNode = testAndSplit()) {
addEdge(newNode, active.textId, active.end, text.length+1, new Node());
// Rule 2: add suffix-link from previously inserted node
if (prevNewNode !== root) {
prevNewNode.suffixLink = newNode;
}
prevNewNode = newNode;
// Rule 3: follow suffixLink after split
active.node = active.node.suffixLink || root;
canonize(); // because active.node changed
}
if (prevNewNode !== root) {
prevNewNode.suffixLink = active.node;
}
}
texts.push(text);
if (!root.suffixLink) root.suffixLink = new Node();
for (let i = 0; i < text.length; i++) {
addEdge(root.suffixLink, active.textId, i, i+1, root);
}
// Main Ukkonen loop: add each character from left to right to the tree
while (active.end <= text.length) {
update();
active.end++;
canonize(); // because active.end changed
}
}
}
function trincotSuffixTree(query, options, suffixTree, separator) {
// Split query in terms at delimiter
const terms = query.split(separator).filter(Boolean);
if (!terms.length) return options;
// Sort terms by descending size
terms.sort( (a,b) => b.length - a.length );
// create Map keyed by term with count info
const termMap = new Map(terms.map( (term, termId) => [term, { termId, count: 0, leftOver: 0, size: term.length }] ));
terms.forEach( (term) => termMap.get(term).count++ );
function getNonOverlaps(offsets, leftOver, lastIndex = 0, offsetIndex = 0) {
// All terms found?
if (!leftOver) return [];
let giveUpAt = Infinity;
// While still enough matches left over:
while (offsetIndex + leftOver <= offsets.length) {
const { termId, offset } = offsets[offsetIndex++];
if (offset < lastIndex) continue; // overlap, try next
if (offset >= giveUpAt) break; // Looking further makes no sense
const termInfo = termMap.get(terms[termId]);
//console.log('termId', termId, 'offset', offset, 'size', termInfo.size, 'lastIndex', lastIndex);
if (!termInfo.leftOver) continue; // too many of the same term, try next
termInfo.leftOver--;
const result = getNonOverlaps(offsets, leftOver - 1, offset + termInfo.size, offsetIndex);
// If success, then completely backtrack out of recursion.
if (result) return result.concat([offset + termInfo.size, offset]);
termInfo.leftOver++; // restore after failed recursive search and try next
// If a term-match at a given offset could not lead to a solution (in recursion),
// and if we keep those matched character postions all unmatched and only start matching after
// the end of that location, it will certainly not lead to a solution either.
giveUpAt = Math.min(giveUpAt, offset + termInfo.size);
}
}
let allTermsAllOptionsOffsets;
// Loop through the unique terms:
for (let [term, termInfo] of termMap) {
// Get the offsets of the matches of this term in all options (in the preprocessed tree)
const thisTermAllOptionsOffsets = suffixTree.findAll(term, termInfo.termId);
//console.log('findAll:', JSON.stringify(Array.from(thisTermAllOptionsOffsets)));
if (!thisTermAllOptionsOffsets.size) return []; // No option has this term, so bail out
if (!allTermsAllOptionsOffsets) {
allTermsAllOptionsOffsets = thisTermAllOptionsOffsets;
} else {
// Merge with all previously found offsets for other terms (intersection)
for (let [optionId, offsets] of allTermsAllOptionsOffsets) {
let newOffsets = thisTermAllOptionsOffsets.get(optionId);
if (!newOffsets || newOffsets.length < termInfo.count) {
// this option does not have enough occurrences of this term
allTermsAllOptionsOffsets.delete(optionId);
} else {
allTermsAllOptionsOffsets.set(optionId, offsets.concat(newOffsets));
}
}
if (!allTermsAllOptionsOffsets.size) return []; // No option has all terms, so bail out
}
}
// Per option, see if (and where) the offsets can serve non-overlapping matches for each term
const matches = Array.from(allTermsAllOptionsOffsets, ([optionId, offsets]) => {
// Indicate how many of each term must (still) be matched:
termMap.forEach( obj => obj.leftOver = obj.count );
return [optionId, getNonOverlaps(offsets.sort( (a, b) => a.offset - b.offset ), terms.length)];
})
// Remove options that could not provide non-overlapping offsets
.filter( ([_, offsets]) => offsets )
// Sort the remaining options in their original order
.sort( (a,b) => a[0] - b[1] )
// Replace optionId, by the corresponding text and apply mark-up at the offsets
.map( ([optionId, offsets]) => {
let option = options[optionId];
offsets.map((index, i) => {
option = option.substr(0, index)
+ (i%2 ? "<u>" : "</u>")
+ option.substr(index);
});
return option;
});
//console.log(JSON.stringify(matches));
return matches;
}
function trincotPreprocess(options) {
const nikkonen = new Nikkonen();
// Add all the options (lowercased) to the suffic tree
options.map(option => option.toLowerCase()).forEach(nikkonen.addText.bind(nikkonen));
return nikkonen;
}
const options = ['abbbba', 'United States', 'United Kingdom', 'Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia, Plurinational State of', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of The', 'Cook Islands', 'Costa Rica', 'Cote D\'ivoire', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-bissau', 'Guyana', 'Haiti', 'Heard Island and Mcdonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Korea, Democratic People\'s Republic of', 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', 'Lao People\'s Democratic Republic', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, The Former Yugoslav Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Moldova, Republic of', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthelemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and The Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and The South Sandwich Islands', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, Bolivarian Republic of', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'];
/*
* I/O and performance measurements
*/
let preprocessed;
function processInput() {
if (!preprocessed) { // Only first time
const t0 = performance.now();
preprocessed = trincotPreprocess(options);
const spentTime = performance.now() - t0;
// Output the time spent on preprocessing
pretime.textContent = spentTime.toFixed(2);
}
var query = this.value.toLowerCase();
const t0 = performance.now();
const matches = trincotSuffixTree(query, options, preprocessed, ' ');
const spentTime = performance.now() - t0;
// Output the time spent
time.textContent = spentTime.toFixed(2);
// Output the matches
result.innerHTML = '';
for (var match of matches) {
// Append it to the result list
var li = document.createElement('li');
li.innerHTML = match;
result.appendChild(li);
}
}
findTerms.addEventListener('keyup', processInput);
processInput.call(findTerms);
ul {
height:300px;
font-size: smaller;
overflow: auto;
}
Input terms: <input type="text" id="findTerms"><br>
<h3>Trincot's Suffix Tree Search</h3>
Preprocessing Time: <span id="pretime"></span>ms (only done once)<br>
Time: <span id="time"></span>ms<br>
<ul id="result"></ul>
这个方法背后有很多代码,所以我想它可能不会对小数据集表现出有趣的性能,而对于更大的数据集,它会消耗内存:树比原始选项数组占用更多内存.