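Find occurrences of adjacent substrings in the text
I have the text of a Word document and an array of strings. The goal is to find all occurrences of those strings in the document's text. I tried the Aho-Corasick implementation from "Aho-Corasick string matching in C#", but the default implementation does not fit my needs.
A typical part of the text looks like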
“Activation” means a written notice from Lender to the Bank substantially in the form of Exhibit A.
“Activation Notice” means a written notice from Lender to the Bank substantially in the form of Exhibit A and Activation.
“Business Day" means each day (except Saturdays and Sundays) on which banks are open for general business and Activation Notice.
The keywords array looks like
var keywords = new[] {"Activation", "Activation Notice"};
The default implementation of the Aho-Corasick algorithm returns the following occurrence counts
Activation - 4
Activation Notice - 2
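For 'Activation Notice' this is the correct result. But for 'Activation' the correct count should also be 2, because I do not want to count occurrences that fall inside the adjacent keyword 'Activation Notice'.
Is there a suitable algorithm for this case?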
I'll assume you got your results the way the linked example does.
StringSearchResult[] results = searchAlg.FindAll(textToSearch);
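For those results, if you assume that the only overlaps are subsets (one match lying entirely inside another), you can sort by index and collect the results you want in a single pass.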
using System;
using System.Collections.Generic;

public class SearchResultComparer : IComparer<StringSearchResult>
{
    // IComparer<T> requires the method to be named Compare.
    public int Compare(StringSearchResult x, StringSearchResult y)
    {
        // Try ordering by the start index.
        int compare = x.Index.CompareTo(y.Index);
        if (compare == 0)
        {
            // In case of ties, reverse order by keyword length (longest first).
            compare = y.Keyword.Length.CompareTo(x.Keyword.Length);
        }
        return compare;
    }
}
// ...
// Use the generic interface so that Array.Sort<T>(T[], IComparer<T>) is selected.
IComparer<StringSearchResult> searchResultComparer = new SearchResultComparer();
Array.Sort(results, searchResultComparer);
int activeEndIndex = -1;
List<StringSearchResult> nonOverlappingResults = new List<StringSearchResult>();
foreach (StringSearchResult r in results)
{
    if (r.Index < activeEndIndex)
    {
        // This range starts before the active range ends.
        // Since it's an overlap, skip it.
        continue;
    }
    // Save this result, track when it ends.
    nonOverlappingResults.Add(r);
    activeEndIndex = r.Index + r.Keyword.Length;
}
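Because of the index ordering, the loop is guaranteed to keep only non-overlapping ranges. Some ranges will be rejected, though, and that can happen for only two reasons.
- The candidate starts at the same index as the active range. Because the sort breaks such ties so that the longest comes first, the candidate must be shorter than the active range and can be skipped.
- The candidate starts after the start of the active range. Because the only overlaps are subsets, and this candidate overlaps the active range, it is a subset that starts later but still ends at or before the end of the active range.
So the only candidates rejected are subsets, and each of them must end at or before the end of the active range. The active range therefore remains the only thing that needs to be checked for overlap.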
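To try the sort-and-sweep idea without the linked library, here is a minimal self-contained sketch. The names `Match`, `NonOverlappingSearch`, `FindAll`, `KeepOutermost` and `CountByKeyword` are hypothetical and not part of the Aho-Corasick implementation; the naive `IndexOf` scan is only a stand-in that should yield the same set of matches as the library's `FindAll`. It still relies on the assumption that the only overlaps are subsets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the library's StringSearchResult.
public sealed class Match
{
    public int Index { get; }
    public string Keyword { get; }

    public Match(int index, string keyword)
    {
        Index = index;
        Keyword = keyword;
    }
}

public static class NonOverlappingSearch
{
    // Naive scan that reports every occurrence of every keyword,
    // including occurrences nested inside longer keywords.
    public static List<Match> FindAll(string text, IEnumerable<string> keywords)
    {
        var matches = new List<Match>();
        foreach (string keyword in keywords)
        {
            if (keyword.Length == 0) continue; // an empty keyword matches everywhere
            int index = text.IndexOf(keyword, StringComparison.Ordinal);
            while (index >= 0)
            {
                matches.Add(new Match(index, keyword));
                index = text.IndexOf(keyword, index + 1, StringComparison.Ordinal);
            }
        }
        return matches;
    }

    // Sort by start index (longest keyword first on ties), then keep a match
    // only if it starts at or after the end of the last kept match.
    public static List<Match> KeepOutermost(IEnumerable<Match> matches)
    {
        var sorted = matches
            .OrderBy(m => m.Index)
            .ThenByDescending(m => m.Keyword.Length);

        var kept = new List<Match>();
        int activeEndIndex = -1;
        foreach (Match m in sorted)
        {
            if (m.Index < activeEndIndex) continue; // lies inside the active range
            kept.Add(m);
            activeEndIndex = m.Index + m.Keyword.Length;
        }
        return kept;
    }

    // Counts the surviving matches per keyword.
    // Keywords with no surviving match are absent from the dictionary.
    public static Dictionary<string, int> CountByKeyword(string text, string[] keywords)
    {
        return KeepOutermost(FindAll(text, keywords))
            .GroupBy(m => m.Keyword)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Calling `NonOverlappingSearch.CountByKeyword(text, keywords)` on the three sample sentences from the question with the `keywords` array shown there should give 2 for 'Activation' and 2 for 'Activation Notice', since the two 'Activation' matches that start at the same index as an 'Activation Notice' match are skipped.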