My Loop of the Loop is painstakingly slow

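I have an object, $Posts, which contains a Title field and a SimTitles field, among others. I need to compare each title against every other title and record a similarity score in its SimTitles field. So if I have 80 $Posts, that takes 6,400 iterations, because every title has to be compared with every other title.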

Apart from the Measure-TitleSimilarity routine, which I believe is already optimized, can anyone see a way to speed up this double loop that I am missing?

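Edit: I have included the Measure-TitleSimilarity function. I am actually passing arrays to the function. The whole topic of quantifying arrays for similarity is fascinating. I tried using Title.ToCharArray() and raising the magic number to a higher value. That also produces matches between two completely different titles, as long as the characters are the same (for example, 'Mother Teresa' closely matches 'Earthmovers' or 'Thermometer', but obviously does not mean the same thing). Cosine similarity, although only one approach, seems the easiest to work with. @Mclayton and @bryancook - I saw your suggestions, but I can't grasp how to keep track of which similar words no longer need to be looked at.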
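To illustrate the character-level false match described above, here is a minimal sketch. Get-SetCosine is a hypothetical helper, not the function from this post; it assumes every distinct token has weight 1 and ignores case, and it needs PowerShell 5 or later for the ::new() syntax.

```powershell
# Hypothetical helper (not from the post): cosine similarity of two token
# lists treated as sets, i.e. every distinct token has weight 1.
function Get-SetCosine {
    param([string[]] $Left, [string[]] $Right)

    $a = [System.Collections.Generic.HashSet[string]]::new($Left,  [System.StringComparer]::OrdinalIgnoreCase)
    $b = [System.Collections.Generic.HashSet[string]]::new($Right, [System.StringComparer]::OrdinalIgnoreCase)
    if ($a.Count -eq 0 -or $b.Count -eq 0) { return 0.0 }

    # With 0/1 weights the dot product is the number of shared tokens and
    # each magnitude is the square root of the set size.
    $shared = 0
    foreach ($token in $a) { if ($b.Contains($token)) { $shared++ } }

    return [Math]::Round($shared / [Math]::Sqrt($a.Count * $b.Count), 3)
}

$t1 = 'Mother Teresa'
$t2 = 'Thermometer'

# Word level: the two titles share no words.
$wordScore = Get-SetCosine ($t1.Split(' ')) ($t2.Split(' '))

# Character level: every letter of 'Thermometer' also occurs in 'Mother Teresa'.
$chars1 = $t1.ToCharArray() | ForEach-Object { "$_" }
$chars2 = $t2.ToCharArray() | ForEach-Object { "$_" }
$charScore = Get-SetCosine $chars1 $chars2

"word score: $wordScore, char score: $charScore"
```

Under these assumptions the word-level score is 0 while the character-level score comes out high, which is why raising the threshold alone does not rescue the ToCharArray() approach.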

Function Get-SimTitles([psobject]$NewPosts) {

  $CKTitles = $NewPosts.title

  foreach ($Ck in $CkTitles) {
    $NewPosts | & { 
      process { 
        if ((Measure-TitleSimilarity $Ck.split(' ') $_.title.split(' ')) -gt .2) {
          $_.SimTitles = $_.SimTitles + 1 
        } 
      } 
    } 
  }
}

Function Measure-TitleSimilarity
{
## Based on VectorSimilarity by .AUTHOR Lee Holmes 
## Modified slightly to match use

[CmdletBinding()]
param(
    
    [Parameter(Position = 0)]
    $Title1,

    [Parameter(Position = 1)]   
    $Title2
    
        
) 

$allkeys = @($Title1) + @($Title2) |  Sort-Object -Unique

$set1Hash = @{}
$set2Hash = @{}
$setsToProcess = @($Title1, $Set1Hash), @($Title2, $Set2Hash)

foreach($set in $setsToProcess)
{
    $set[0] | Foreach-Object {
         $value = 1 
         $set[1][$_] = $value
    }
}

$dot = 0
$mag1 = 0
$mag2 = 0

foreach($key in $allkeys)
{
    $dot += $set1Hash[$key] * $set2Hash[$key]
    $mag1 +=  ($set1Hash[$key] * $set1Hash[$key])
    $mag2 +=  ($set2Hash[$key] * $set2Hash[$key])
}

$mag1 = [Math]::Sqrt($mag1)
$mag2 = [Math]::Sqrt($mag2)

return [Math]::Round($dot / ($mag1 * $mag2), 3)

}

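You can cut the processing time in half by eliminating duplicate comparisons, i.e. once you have compared "title1" with "title2", you don't need to compare "title2" with "title1" - you already know the answer. So your inner loop should not start from the beginning of the array.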
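To put numbers on that, here is a small sketch. The count of 80 comes from the question; the sample titles and variable names are made up for illustration.

```powershell
# Comparison counts for the 80 posts mentioned in the question.
$n = 80
$allOrdered  = $n * $n              # every title against every title, including itself
$uniquePairs = $n * ($n - 1) / 2    # each unordered pair once, no self-comparison

# Loop shape that visits each unique pair exactly once:
# the inner loop starts just after the outer index.
$titles = 'a b c', 'a b d', 'x y z', 'a x q'   # made-up sample data
$comparisons = 0
for ($i = 0; $i -lt $titles.Count - 1; $i++) {
    for ($j = $i + 1; $j -lt $titles.Count; $j++) {
        # compare $titles[$i] with $titles[$j] here and score both sides
        $comparisons++
    }
}

"ordered: $allOrdered, unique pairs: $uniquePairs, sample comparisons: $comparisons"
```

For 80 posts this is 3,160 comparisons instead of 6,400, slightly better than half because a title is never compared with itself.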

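I have tried other ways of measuring title similarity, including comparing overall word frequency against word frequency within a specific title. I think comparing titles is a topic for a separate post. I still like the idea of only looping once.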

@MikeSh - based on your answer, this is what I came up with.

Function Get-SimTitles([psobject]$NewPosts) {

  $end = $NewPosts.Count - 1

  For ($i = 0; $i -lt $end; $i++) {

    $k = $i + 1
    # Only look at posts after $i, so each pair is compared once.
    # Skip pairs from the same source, then keep pairs above the threshold.
    $k..$end | Where-Object { $NewPosts[$i].source -ne $NewPosts[$_].source } |
      Where-Object { (Measure-TitleSimilarity $NewPosts[$i].title.split(' ') $NewPosts[$_].title.split(' ')) -gt .35 } |
      & { process { $NewPosts[$_].SimTitles = $NewPosts[$_].SimTitles + 1; $NewPosts[$i].SimTitles += 1 } }
  }

}

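Partial Answer

This incorporates some of the suggestions from the comments:

  • Mathias R. Jessen - "You don't have to compare each title to every title - instead, you only need to compare all unique pairs"

  • My comment - "You can split your titles into word arrays once, before you start comparing, and then loop over those instead of splitting them every time"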

$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";

function ConvertTo-WordSets( [psobject] $Posts )
{

    # preprocess each post to break its title into word counts 
    # so we don't need to do it every time we compare 2 posts

    foreach( $post in $Posts )
    {
        $set = new-object PSCustomObject -Property ([ordered] @{
            "Post"   = $post
            "Title"  = $post.Title.Trim()
            "Words"  = $null
            "Counts" = $null
        });
        $set.Words  = $set.Title.Split(" ");
        $set.Counts = $set.Words `
            | group-object `
            | foreach-object `
                -Begin   { $counts = @{} } `
                -Process { $counts.Add($_.Name, $_.Count) } `
                -End     { $counts };
        write-output $set;
    }

}

function Get-SimTitles( [psobject] $NewPosts )
{

    # instead of comparing every object to every object, just compare unique combinations
    # e.g. X compared to Y is the same as Y compared to X so score them both at the same time
    # (and we don't need to compare an object to itself either)

    for( $i = 0; $i -lt $NewPosts.Length; $i++ )
    {
        $left = $NewPosts[$i];
        for( $j = $i + 1; $j -lt $NewPosts.Length; $j++ )
        {
            $right = $NewPosts[$j];
            if ((Measure-TitleSimilarity2 $left $right) -gt .5)
            {
                $left.Post.SimTitles  = $left.Post.SimTitles + 1;
                $right.Post.SimTitles = $right.Post.SimTitles + 1;
            } 
        } 
    }

}

Function Measure-TitleSimilarity2
{
    param
    (
        [Parameter(Position = 0)]
        $Left,
        [Parameter(Position = 1)]   
        $Right
    ) 

    # we can use the pre-processed word counts now

    $allkeys = $Left.Words + $Right.Words | Sort-Object -Unique

    $dot = 0
    $mag1 = 0
    $mag2 = 0

    foreach($key in $allkeys)
    {
        $dot  += $Left.Counts[$key] * $Right.Counts[$key]
        $mag1 += $Left.Counts[$key] * $Left.Counts[$key]
        $mag2 += $Right.Counts[$key] * $Right.Counts[$key]
    }

    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)

    return [Math]::Round($dot / ($mag1 * $mag2), 3)

}

Performance

Neither this nor the original is particularly fast, even for moderately sized samples, but this one is roughly 4 times faster.

# get some test data
$sentences = (Invoke-WebRequest -Uri "https://raw.githubusercontent.com/SteveMansfield/MNREAD-sentences/master/XMNREAD01.txt").Content;
$sentences = $sentences.Trim("`n").Split("`n") | foreach-object { $_.Substring(1, $_.Length - 3) };

$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };
Measure-Command { Get-SimTitles $posts; }

# build some test data
$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };

Measure-Command {
    $wordSets = @( ConvertTo-WordSets $Posts );
    Get-SimTitles $wordSets;
}
Size    Original    This one
  10      0.2         0.02
  20      0.4         0.1
  50      1.9         0.5
 100      8.7         1.9
 200     38           9
 500    246          82

(Times are in seconds)