My Loop of the Loop is painstakingly slow
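I have an object $Posts that contains, among other fields, a Title and a SimTitles field. I need to compare each title with every other title and record a similarity score in its SimTitles field. So with 80 $Posts, that means 6,400 iterations, because every title has to be compared with all the others.
Apart from the Measure-TitleSimilarity routine, which I believe is already optimized, can anyone see a way to speed up this double loop that I'm missing?
Edit: I have included the function Measure-TitleSimilarity. I'm actually passing arrays to the function. The whole subject of quantifying arrays for similarity is fascinating. I tried using Title.ToCharArray(), which pushes the magic number (the threshold) much higher. It also produces matches between two completely different titles as long as the characters are the same (for example, 'Mother Teresa' closely matches 'Earthmovers' or 'Thermometer', which obviously don't mean the same thing). Cosine similarity, although it is only one approach, seemed the easiest to work with.
@Mclayton and @bryancook - I saw your suggestions, but I can't work out how to keep track of what no longer needs to be checked for similar words.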
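The character-level false match described above can be reproduced with a small sketch. This is not the code from the post: `Get-TokenCosine` is a hypothetical helper written for illustration, and it counts repeated tokens rather than treating each token as present or absent.

```powershell
# Hypothetical helper (not from the post): cosine similarity over token counts.
function Get-TokenCosine {
    param([string[]] $A, [string[]] $B)

    # Count how often each token occurs in each input.
    $countA = @{}
    foreach ($t in $A) { $countA[$t] = 1 + [int]$countA[$t] }
    $countB = @{}
    foreach ($t in $B) { $countB[$t] = 1 + [int]$countB[$t] }

    # Dot product over the tokens that both inputs share.
    $dot = 0.0
    foreach ($k in $countA.Keys) {
        if ($countB.ContainsKey($k)) { $dot += $countA[$k] * $countB[$k] }
    }

    # Squared magnitudes of the two count vectors.
    $magA = 0.0
    foreach ($v in $countA.Values) { $magA += $v * $v }
    $magB = 0.0
    foreach ($v in $countB.Values) { $magB += $v * $v }

    if ($magA -eq 0 -or $magB -eq 0) { return 0.0 }
    return $dot / [Math]::Sqrt($magA * $magB)
}

$t1 = 'mother teresa'
$t2 = 'thermometer'

# Character tokens: the two titles share most of their letters, so the score is above 0.9.
$charScore = Get-TokenCosine ($t1.ToCharArray() | ForEach-Object { "$_" }) ($t2.ToCharArray() | ForEach-Object { "$_" })

# Word tokens: the titles share no words, so the score is 0.
$wordScore = Get-TokenCosine $t1.Split(' ') $t2.Split(' ')
```

Splitting on words keeps unrelated titles apart here, at the cost of missing near-matches such as plurals or misspellings.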
Function Get-SimTitles([psobject]$NewPosts) {
    $CkTitles = $NewPosts.title
    foreach ($Ck in $CkTitles) {
        $NewPosts | & {
            process {
                # Count every post whose title is similar enough to the current title
                if ((Measure-TitleSimilarity $Ck.split(' ') $_.title.split(' ')) -gt .2) {
                    $_.SimTitles = $_.SimTitles + 1
                }
            }
        }
    }
}
Function Measure-TitleSimilarity
{
    ## Based on VectorSimilarity by .AUTHOR Lee Holmes
    ## Modified slightly to match use
    [CmdletBinding()]
    param(
        [Parameter(Position = 0)]
        $Title1,
        [Parameter(Position = 1)]
        $Title2
    )
    $allkeys = @($Title1) + @($Title2) | Sort-Object -Unique
    $set1Hash = @{}
    $set2Hash = @{}
    $setsToProcess = @($Title1, $Set1Hash), @($Title2, $Set2Hash)
    foreach($set in $setsToProcess)
    {
        $set[0] | Foreach-Object {
            $value = 1
            $set[1][$_] = $value
        }
    }
    $dot = 0
    $mag1 = 0
    $mag2 = 0
    foreach($key in $allkeys)
    {
        $dot += $set1Hash[$key] * $set2Hash[$key]
        $mag1 += ($set1Hash[$key] * $set1Hash[$key])
        $mag2 += ($set2Hash[$key] * $set2Hash[$key])
    }
    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)
    return [Math]::Round($dot / ($mag1 * $mag2), 3)
}
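You can halve the processing time by eliminating duplicate comparisons.
I.e., once you have compared "title1" with "title2", you don't need to compare "title2" with "title1": you already know the answer.
So your inner loop should not start from the beginning of the array.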
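As a rough illustration of the saving, the sketch below enumerates only the index pairs where the second index is greater than the first, using the 80 posts from the question. `Get-UniquePairs` is a hypothetical helper name, not something from the thread.

```powershell
# Hypothetical helper: emit each unordered index pair (Left < Right) exactly once.
function Get-UniquePairs([int] $Count) {
    for ($i = 0; $i -lt $Count - 1; $i++) {
        # Start the inner loop just past $i instead of at 0.
        for ($j = $i + 1; $j -lt $Count; $j++) {
            [pscustomobject]@{ Left = $i; Right = $j }
        }
    }
}

$pairs = @(Get-UniquePairs 80)
$pairs.Count    # 3160, i.e. 80 * 79 / 2, versus 80 * 80 = 6400 for the full double loop
```

The count drops slightly below half because comparing a title with itself is skipped as well.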
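I have tried other ways of measuring title similarity, including comparing overall word frequencies against the word frequencies within a particular title. I think comparing titles is a topic for a separate post. I still like the idea of only looping once.
@MikeSh - Based on your answer, this is what I came up with.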
Function Get-SimTitles([psobject]$NewPosts) {
    $i = 0
    $end = $NewPosts.Count - 1
    For ($i = 0; $i -lt $end; $i++) {
        $k = $i + 1
        # Only compare post $i with the posts after it, skipping posts from the same source
        $k..$end | Where-Object { $NewPosts[$i].source -ne $NewPosts[$_].source } |
            Where-Object { (Measure-TitleSimilarity $NewPosts[$i].title.split(' ') $NewPosts[$_].title.split(' ')) -gt .35 } |
            & { process { $NewPosts[$_].SimTitles = $NewPosts[$_].SimTitles + 1; $NewPosts[$i].SimTitles += 1 } }
    }
}
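Partial answer
This incorporates some of the suggestions from the comments:
Mathias R. Jessen - "You don't have to compare every title to every title; instead, you only need to compare all unique pairs"
My comment - "You can split your titles into word arrays once before you start comparing, then loop over those, instead of splitting them every time"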
$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";
function ConvertTo-WordSets( [psobject] $Posts )
{
    # preprocess each post to break its title into word counts
    # so we don't need to do it every time we compare 2 posts
    foreach( $post in $Posts )
    {
        $set = new-object PSCustomObject -Property ([ordered] @{
            "Post"   = $post
            "Title"  = $post.Title.Trim()
            "Words"  = $null
            "Counts" = $null
        });
        $set.Words = $set.Title.Split(" ");
        $set.Counts = $set.Words `
            | group-object `
            | foreach-object `
                -Begin   { $counts = @{} } `
                -Process { $counts.Add($_.Name, $_.Count) } `
                -End     { $counts };
        write-output $set;
    }
}
function Get-SimTitles( [psobject] $NewPosts )
{
    # instead of comparing every object to every object, just compare unique combinations
    # e.g. X compared to Y is the same as Y compared to X so score them both at the same time
    # (and we don't need to compare an object to itself either)
    for( $i = 0; $i -lt $NewPosts.Length; $i++ )
    {
        $left = $NewPosts[$i];
        for( $j = $i + 1; $j -lt $NewPosts.Length; $j++ )
        {
            $right = $NewPosts[$j];
            if ((Measure-TitleSimilarity2 $left $right) -gt .5)
            {
                $left.Post.SimTitles = $left.Post.SimTitles + 1;
                $right.Post.SimTitles = $right.Post.SimTitles + 1;
            }
        }
    }
}
# named Measure-TitleSimilarity2 to match the call in Get-SimTitles
Function Measure-TitleSimilarity2
{
    param
    (
        [Parameter(Position = 0)]
        $Left,
        [Parameter(Position = 1)]
        $Right
    )
    # we can use the pre-processed word counts now
    $allkeys = $Left.Words + $Right.Words | Sort-Object -Unique
    $dot = 0
    $mag1 = 0
    $mag2 = 0
    foreach($key in $allkeys)
    {
        $dot += $Left.Counts[$key] * $Right.Counts[$key]
        $mag1 += $Left.Counts[$key] * $Left.Counts[$key]
        $mag2 += $Right.Counts[$key] * $Right.Counts[$key]
    }
    $mag1 = [Math]::Sqrt($mag1)
    $mag2 = [Math]::Sqrt($mag2)
    return [Math]::Round($dot / ($mag1 * $mag2), 3)
}
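Performance
Neither this version nor the original is particularly fast, even for moderately sized samples, but this one is about 4 times faster.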
# get some test data
$sentences = (Invoke-WebRequest -Uri "https://raw.githubusercontent.com/SteveMansfield/MNREAD-sentences/master/XMNREAD01.txt").Content;
$sentences = $sentences.Trim("`n").Split("`n") | foreach-object { $_.Substring(1, $_.Length - 3) };
$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };
Measure-Command { Get-SimTitles $posts; }
# build some test data
$posts = $sentences `
    | select-object -First 200 `
    | foreach-object {
        new-object PSCustomObject -Property ([ordered] @{
            "Title"     = $_
            "SimTitles" = 0
        })
    };
Measure-Command {
    $wordSets = @( ConvertTo-WordSets $Posts );
    Get-SimTitles $wordSets;
}
Size | Original | This one |
---|---|---|
10 | 0.2 | 0.02 |
20 | 0.4 | 0.1 |
50 | 1.9 | 0.5 |
100 | 8.7 | 1.9 |
200 | 38 | 9 |
500 | 246 | 82 |
(times in seconds)
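To see where the "about 4 times" figure comes from, the per-size speedup can be computed from the timings above. This is only a sketch over the reported numbers; the property name `Revised` is a label chosen here for the "This one" column.

```powershell
# Timings in seconds, copied from the table above.
$timings = @(
    [pscustomobject]@{ Size = 10;  Original = 0.2; Revised = 0.02 }
    [pscustomobject]@{ Size = 20;  Original = 0.4; Revised = 0.1  }
    [pscustomobject]@{ Size = 50;  Original = 1.9; Revised = 0.5  }
    [pscustomobject]@{ Size = 100; Original = 8.7; Revised = 1.9  }
    [pscustomobject]@{ Size = 200; Original = 38;  Revised = 9    }
    [pscustomobject]@{ Size = 500; Original = 246; Revised = 82   }
)

# Speedup = original time divided by revised time, rounded to one decimal place.
$speedups = $timings | ForEach-Object {
    [pscustomobject]@{
        Size    = $_.Size
        Speedup = [Math]::Round([double]$_.Original / [double]$_.Revised, 1)
    }
}

$speedups | Format-Table
```

Apart from the smallest sample, the ratios fall between 3 and 4.6. Both columns also grow by roughly 4 times when the sample size doubles from 100 to 200, which is what a pairwise comparison would be expected to do.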