How to find the longest substring that occurs in every element of an array?
I have a collection of posts by several authors. Each author has a unique signature or link that occurs in all of their texts.
Example for Author1:
$texts=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
Expected output for Author1 is: @jhsad.sadas.com
Example for Author2:
$texts=['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
Expected output for Author2 is:
This is the
*author's* signature.
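Pay particular attention to the fact that there are no reliable identifying characters (or positions) that mark the start or end of a signature. It may be a URL, a Twitter mention, or any kind of plain text of any length, made up of any sequence of characters, and it may occur at the start, end, or middle of the string.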
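I am looking for a method to extract the longest substring that exists in all $texts elements for a single author.
For the purposes of this task, it is expected that every author will have a signature substring that exists in every post/text.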
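Idea:
I am thinking of converting the words to vectors and finding the similarity between each pair of texts. We could use cosine similarity to find the signatures. I think the solution must be something along these lines.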
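To make the cosine-similarity idea concrete, here is a minimal sketch using plain word counts as the vectors. The helper names wordCounts() and cosineSimilarity() are hypothetical, and the approach is only an assumption about what "converting words to vectors" could mean here. Note that the score tells you how alike two texts are; it does not by itself locate the shared substring.

```php
<?php
// Hypothetical helper: count how often each whitespace-delimited word occurs.
function wordCounts(string $text): array
{
    return array_count_values(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

// Hypothetical helper: cosine similarity of two word-count vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0;
    foreach ($a as $word => $count) {
        $dot += $count * ($b[$word] ?? 0); // words missing from $b contribute 0
    }
    $square = function ($n) {
        return $n * $n;
    };
    $normA = sqrt(array_sum(array_map($square, $a)));
    $normB = sqrt(array_sum(array_map($square, $b)));
    return ($normA && $normB) ? $dot / ($normA * $normB) : 0.0;
}

// Two of the Author1 texts from the question, written on one line each.
$texts = [
    'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd @jhsad.sadas.com sdsdADSA sada',
    'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
];

echo cosineSimilarity(wordCounts($texts[0]), wordCounts($texts[1])), "\n";
```

Because the two texts share only the signature word, the score is small but non-zero, which hints that similarity alone is a weak signal for extracting the signature itself.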
mickmackusa's commented code captures the essence of what is desired, but I would like to see whether there are other ways to achieve the desired result.
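You can use preg_match() and a regular expression to achieve this, provided the signature is an @-mention as in the Author1 example.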
$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";
preg_match("/\@[^\s]+/", $str, $match);
var_dump($match); //Will output the signature
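The snippet above inspects a single string. As a hedged extension (my assumption, not part of the original answer), the same pattern can be run over every text and the results intersected, so that only a mention present in all posts survives. The variable name $common is hypothetical.

```php
<?php
// The Author1 texts from the question, written on one line each.
$texts = [
    'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd @jhsad.sadas.com sdsdADSA sada',
    'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
];

$common = null;
foreach ($texts as $text) {
    // Collect every "@" followed by non-whitespace characters in this text.
    preg_match_all('/@\S+/', $text, $matches);
    $found = array_unique($matches[0]);
    // Keep only the mentions that have appeared in every text so far.
    $common = $common === null ? $found : array_intersect($common, $found);
}

var_export(array_values($common));
```

This still only covers signatures that start with @; it does nothing for plain-text signatures such as the Author2 example.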
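Here is my thinking:
- Sort an author's collection of posts by string length (ascending) so that you work from the smaller texts to the larger ones.
- Split each post's text on one or more whitespace characters so that you only handle wholly non-whitespace substrings during processing.
- Find the matching substrings that occur in each subsequent post by comparing against an ever-narrowing array of substrings ($overlaps).
- Group the consecutive matching substrings by analyzing their index values.
- "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing whitespace characters, of course).
- Sort the reconstituted strings by string length (descending) so that the longest string is assigned index 0.
- Print the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.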
Code: (Demo)
$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
$posts['Author2'] = ['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b); // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";
    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY); // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY)); // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "\n";
    // batch consecutive substrings
    $group = null;
    $last = null;
    $consecutives = []; // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i; // a gap in the keys starts a new group
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    $potential_signatures = []; // clear previous iteration's data
    foreach ($consecutives as $words) {
        // match potential signatures in first text for measurement:
        $pattern = '/' . implode('\s+', array_map(function($word) {
            return preg_quote($word, '/'); // make special characters (and the delimiter) literal
        }, $words)) . '/';
        if (preg_match_all($pattern, $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]); // keep matches from every group, not only the last one
        }
    }
    usort($potential_signatures, function($a, $b) {
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}
Output:
Author: Author1
array (
0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
11 => '@jhsad.sadas.com',
)
array (
11 =>
array (
0 => '@jhsad.sadas.com',
),
)
Assumed Signature: @jhsad.sadas.com
Author: Author2
array (
0 => 'Finally, this is unwanted stuff. This is the
*author\'s* signature.',
1 => 'This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
2 => 'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
)
array (
2 => 'is',
5 => 'This',
6 => 'is',
7 => 'the',
8 => '*author\'s*',
9 => 'signature.',
)
array (
2 =>
array (
0 => 'is',
),
5 =>
array (
0 => 'This',
1 => 'is',
2 => 'the',
3 => '*author\'s*',
4 => 'signature.',
),
)
Assumed Signature: This is the
*author's* signature.
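Since the question asks for other approaches, here is a minimal character-level sketch that takes "longest substring present in every element" literally. It is a brute-force illustration under my own assumptions, not the method of either answer; the function name longestCommonSubstring() is hypothetical, and the nested loops are only suitable for short posts.

```php
<?php
// Hypothetical helper: brute-force longest substring present in every text.
// Works on bytes (strlen/substr), so multibyte text may need the mb_* functions.
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // Candidates can only come from the shortest text.
    usort($texts, function ($a, $b) {
        return strlen($a) <=> strlen($b);
    });
    $shortest = array_shift($texts);
    $max = strlen($shortest);

    // Try the longest candidates first so the first hit is the answer.
    for ($length = $max; $length > 0; --$length) {
        for ($start = 0; $start + $length <= $max; ++$start) {
            $candidate = substr($shortest, $start, $length);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2; // missing from one text; try the next candidate
                }
            }
            return $candidate;
        }
    }
    return '';
}

// The Author1 texts from the question, written on one line each.
$texts = [
    'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd @jhsad.sadas.com sdsdADSA sada',
    'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
];

echo trim(longestCommonSubstring($texts)), "\n";
```

The trim() call matters: neighbouring characters that happen to match in every post, such as the spaces around the signature here, are part of the raw result. Shared punctuation before a signature would be included in the same way, which is one reason the word-based approach above may give cleaner output.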