How to find the longest substring that occurs in every element of an array?

I have a collection of articles by several authors. Each author has a unique signature or link that appears in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1 is: @jhsad.sadas.com


Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2 is:

This is the
 *author's* signature.

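Please note in particular that there are no reliable identifying characters (or positions) that mark the start or end of a signature. It can be a URL, a Twitter mention, or plain text of any kind and any length, containing any sequence of characters, and it can occur at the start, end, or middle of the string.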

I am looking for a method to extract the longest substring that is present in all of a single author's $texts elements.

For the purposes of this task, assume that every author has a signature substring that exists in every post/text.

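Idea: I am thinking of converting the words to vectors and measuring the similarity between each pair of texts. We could use cosine similarity to find the signature. I think the solution must be something along these lines.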
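To make the idea concrete, here is a minimal sketch of bag-of-words cosine similarity, assuming words are split on white-space and compared case-sensitively. cosineSimilarity() is a hypothetical helper name, not something from an existing library. Note that it only yields a similarity score for a pair of texts; on its own it does not tell you which substring the texts share.

```php
<?php
// Hypothetical helper: cosine similarity of two texts using raw word counts.
// Assumption: "words" are white-space separated, compared case-sensitively.
function cosineSimilarity(string $a, string $b): float
{
    $countsA = array_count_values(preg_split('/\s+/', $a, -1, PREG_SPLIT_NO_EMPTY));
    $countsB = array_count_values(preg_split('/\s+/', $b, -1, PREG_SPLIT_NO_EMPTY));

    // Dot product over the words of $a; words missing from $b contribute 0.
    $dot = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
    }

    // Euclidean length of each count vector.
    $square = function ($c) {
        return $c * $c;
    };
    $normA = sqrt(array_sum(array_map($square, $countsA)));
    $normB = sqrt(array_sum(array_map($square, $countsB)));

    // Empty texts have no direction, so treat them as not similar.
    if ($normA == 0 || $normB == 0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}
```

Two texts that share only a short signature will typically score low, because the differing body words dominate the vectors.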

mickmackusa's commented code captures the essence of what I want, but I would like to see whether there are other ways to achieve the desired result.

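You can use preg_match() with a regular expression to achieve this, provided the signature starts with @ and contains no white-space.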

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";

preg_match("/\@[^\s]+/", $str, $match);

var_dump($match); //Will output the signature
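To apply the same pattern across all of an author's texts, one option is to collect the @-prefixed tokens from each text and keep only those that occur in every one. This is a sketch under the same assumption as the snippet above (the signature is a single @-prefixed token), which the question says is not guaranteed. commonMentions() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the @-prefixed tokens found in every text.
// Assumption: the signature is one token that starts with "@".
function commonMentions(array $texts): array
{
    $common = null;
    foreach ($texts as $text) {
        // Collect every "@..." token up to the next white-space character.
        preg_match_all('/@\S+/', $text, $matches);
        $found = array_unique($matches[0]);

        // The first text seeds the list; later texts narrow it down.
        $common = $common === null ? $found : array_intersect($common, $found);
    }
    return array_values($common ?? []);
}

var_dump(commonMentions(['a @x.com b @y.com', '@x.com c', 'd @z.org @x.com']));
```

If more than one token survives, you still need a rule (for example, length) to pick between them.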

Here is my thinking:

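  1. Sort the author's collection of posts by string length (ascending) so that you progress from smaller texts to larger texts.
  2. Split each post's text on one or more white-space characters, so that you only handle wholly non-white-space substrings during processing.
  3. Find the substrings of the first post that also occur in each subsequent post, intersecting against an ever-narrowing array of substrings (overlaps).
  4. Group the consecutive matching substrings by analyzing their index values.
  5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
  6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned index 0.
  7. Print the substring that is assumed to be the author's signature (as a best guess), based on commonality and length.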

Code: (Demo)

$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

$posts['Author2'] = ['This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
        'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
        'Finally, this is unwanted stuff. This is the
 *author\'s* signature.'];

foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    
    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b);  // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY);  // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY));  // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "\n";
    
    // batch consecutive substrings
    $group = null;
    $consecutives = [];  // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i;
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    
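    $potential_signatures = [];  // clear previous iteration's data
    foreach ($consecutives as $words) {
        // match potential signatures in first text for measurement:
        if (preg_match_all('/\Q' . implode('\E\s+\Q', $words) . '\E/', $texts[0], $out)) {  // make each word literal using \Q & \E
            // append instead of overwriting, so candidates from every group are kept
            $potential_signatures = array_merge($potential_signatures, $out[0]);
        }
    }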
    usort($potential_signatures, function($a,$b){
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    
    echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}

Output:

Author: Author1
array (
  0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
  1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
  2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
  11 => '@jhsad.sadas.com',
)
array (
  11 => 
  array (
    0 => '@jhsad.sadas.com',
  ),
)
Assumed Signature: @jhsad.sadas.com

Author: Author2
array (
  0 => 'Finally, this is unwanted stuff. This is the
 *author\'s* signature.',
  1 => 'This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
  2 => 'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
)
array (
  2 => 'is',
  5 => 'This',
  6 => 'is',
  7 => 'the',
  8 => '*author\'s*',
  9 => 'signature.',
)
array (
  2 => 
  array (
    0 => 'is',
  ),
  5 => 
  array (
    0 => 'This',
    1 => 'is',
    2 => 'the',
    3 => '*author\'s*',
    4 => 'signature.',
  ),
)
Assumed Signature: This is the
 *author's* signature.
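
For comparison, the question literally asks for the longest substring shared by all texts, so a character-level brute force is another option. The sketch below is not part of the code above; longestCommonSubstring() is a hypothetical helper name. It assumes byte-based functions (strlen, substr, strpos) are acceptable, and its running time grows roughly with the cube of the shortest text's length, so it only suits short texts. Unlike the word-based approach, it requires the white-space inside the signature to be identical in every post, and the result can include adjacent characters that happen to be common too (such as a trailing space), hence the trim() in the usage line.

```php
<?php
// Hypothetical helper: brute-force longest substring found in every text.
// Assumption: byte-wise comparison is fine (swap in mb_* for multibyte text).
function longestCommonSubstring(array $texts): string
{
    if ($texts === []) {
        return '';
    }

    // Only substrings of the shortest text can occur in all texts.
    usort($texts, function ($a, $b) {
        return strlen($a) <=> strlen($b);
    });
    $shortest = array_shift($texts);
    $length = strlen($shortest);

    // Try the longest candidates first, so the first hit is the answer.
    for ($size = $length; $size > 0; --$size) {
        for ($start = 0; $start + $size <= $length; ++$start) {
            $candidate = substr($shortest, $start, $size);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2;  // missing from one text: try the next start
                }
            }
            return $candidate;
        }
    }
    return '';
}

echo trim(longestCommonSubstring([
    'foo bar @sig.example.com baz',
    'x @sig.example.com',
    'lorem @sig.example.com ipsum',
])), "\n";
```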