Hashset handling to avoid getting stuck in a loop during iteration

Question

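I am working on an image mining project. To avoid adding duplicate URLs while gathering them, I use a HashSet instead of an array. I have reached the part of the code that iterates over the HashSet containing the main URLs. Inside the iteration I download the page for each main URL, add the URLs found on it to the HashSet, and carry on. During the iteration I should exclude every URL that has already been scanned, and also exclude (remove) every URL that ends with jpg, until the HashSet's count reaches 0. The problem is that this iteration loops endlessly. Take a URL (call it X):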

1- I scan the page at URL X
2- I get all the URLs on page X (by applying filters)
3- I add those URLs to the HashSet using UnionWith
4- I remove the scanned URL X

The problem appears when one of those URLs, Y, is scanned and brings X back again.
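The cycle can be reproduced without any network access. What follows is a minimal sketch, assuming a hypothetical two-page site where X links to Y and Y links back to X. The `Links` map and the `CycleDemo.CountScans` name are illustrative stand-ins for the real download and filter step, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CycleDemo
{
    // Hypothetical link graph standing in for downloaded pages: X -> Y, Y -> X.
    static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        ["X"] = new[] { "Y" },
        ["Y"] = new[] { "X" },
    };

    // Runs the remove-after-scan loop, capped at maxSteps so the demo
    // terminates. Returns how many page scans were performed.
    public static int CountScans(int maxSteps)
    {
        var candidates = new HashSet<string> { "X" };
        int scans = 0;
        while (candidates.Count != 0 && scans < maxSteps)
        {
            string current = candidates.First();
            candidates.UnionWith(Links[current]); // Y re-adds X once X has been removed
            candidates.Remove(current);           // the set forgets that current was scanned
            scans++;
        }
        return scans;
    }
}
```

With only two pages, `CycleDemo.CountScans(10)` still uses all ten steps: each removal lets the other page add the URL back, so the set never becomes empty.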

Should I use a Dictionary and mark each key as "scanned"? I will try it and post the result here. Sorry, this only occurred to me after I posted the question...

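I managed to solve it for one URL, but other URLs still seem to produce the loop. So how should I handle the HashSet to avoid duplicates even after links have been removed? I hope my point is clear.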

while (URL_Can.Count != 0)
{
    tempURL = URL_Can.First();

    if (tempURL.EndsWith("jpg"))
    {
        // Image URL: keep it for saving and drop it from the candidates.
        URL_CanToSave.Add(tempURL);
        URL_Can.Remove(tempURL);
    }
    else
    {
        // Attempted guard: if this page links back to the previously scanned URL, drop both.
        if (ExtractUrlsfromLink(client, tempURL, filterlink1).Contains(toAvoidLoopinLinks))
        {
            URL_Can.Remove(tempURL);
            URL_Can.Remove(toAvoidLoopinLinks);
        }
        else
        {
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink1));
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink2));
            URL_Can.Remove(tempURL);

            richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
        }
    }

    // Only the most recently scanned URL is remembered here.
    toAvoidLoopinLinks = tempURL;
}

Answer 1

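Thanks, everyone. I managed to solve this by using a Dictionary instead of a HashSet. The key holds the URL and the value holds an int: 1 if the URL has been scanned, 0 if it has not been processed yet. My code is below. I used a second Dictionary, URL_CanToSave, to hold the URLs that end with jpg (my target). The while loop can keep going until every URL on the website has been exhausted, depending on the values you put in the filter string variables used to parse the URLs.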

So, to break out of the loop early, you can cap the number of image URLs collected in URL_CanToSave.
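Before the full listing, the idea can be shown on an in-memory link graph. This is a minimal sketch, not the answer's actual code: the `links` map is a hypothetical stand-in for `ExtractUrlsfromLink`, and `maxImages` plays the role of the fixed cap. A URL stays in the dictionary after it is scanned, so rediscovering it adds nothing. Unlike the listing, the sketch also leaves image URLs in the dictionary and only flags them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlaggedCrawl
{
    // links: hypothetical stand-in for downloading a page and extracting its URLs.
    public static List<string> CollectImages(
        string start, IDictionary<string, string[]> links, int maxImages)
    {
        // Key = URL, value = 0 (not scanned yet) or 1 (scanned).
        var state = new Dictionary<string, int> { [start] = 0 };
        var images = new List<string>();

        while (images.Count < maxImages)
        {
            // Pick any URL that is still unscanned; stop when none is left.
            string current = state.Where(p => p.Value == 0)
                                  .Select(p => p.Key)
                                  .FirstOrDefault();
            if (current == null) break;

            state[current] = 1; // mark instead of removing, so it is never re-queued

            if (current.EndsWith("jpg"))
            {
                images.Add(current);
                continue;
            }

            string[] found;
            if (!links.TryGetValue(current, out found)) continue;
            foreach (string url in found)
            {
                // Only unseen URLs are added; known ones keep their flag.
                if (!state.ContainsKey(url)) state[url] = 0;
            }
        }
        return images;
    }
}
```

Because scanned URLs are flagged rather than removed, a page that links back to an earlier page cannot restart the cycle, and the cap only matters for sites with more images than you want.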

return Task.Factory.StartNew(() =>
{
    try
    {
        string tempURL;
        int i = 0;

        // Each Dictionary value is 1 or 0: 1 means scanned, 0 means not yet.
        // Iterate until every key has been scanned, or break out early based on
        // how many image URLs have been collected in the other Dictionary.
        while (URL_Can.Values.Any(value => value == 0))
        {
            // Take one key into a temporary variable.
            tempURL = URL_Can.ElementAt(i).Key;

            // Check whether it ends with the target file extension, here an image file.
            if (tempURL.EndsWith("jpg"))
            {
                // Indexer rather than Add, so a rediscovered image URL does not throw.
                URL_CanToSave[tempURL] = 0;
                URL_Can.Remove(tempURL);
            }
            // If it is not an image, download the page at that URL and keep analysing.
            else
            {
                // Only if the URL has not been scanned before.
                if (URL_Can[tempURL] != 1)
                {
                    // Add2Dic adds to the Dictionary without adding a key a second
                    // time (this solves the main problem). ExtractUrlsfromLink
                    // returns a Dictionary: it downloads the document at the URL and
                    // parses out all its links. Add or remove filter strings to suit.
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);

                    URL_Can[tempURL] = 1; // mark the link as scanned

                    richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                }
            }

            statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());

            // Another trick: wrap the index so the iteration keeps going until
            // every collected link has been scanned.
            i++;
            if (i >= URL_Can.Count) { i = 0; }

            if (URL_CanToSave.Count >= 150) { break; }
        }

        richTextBox2.PerformSafely(() => richTextBox2.Clear());
        textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());

        return ProcessCompleted = true;
    }
    catch (Exception aih)
    {
        MessageBox.Show(aih.Message);
        return ProcessCompleted = false;
    }

    // Unreachable as written: both paths above return, and url is not
    // defined in the code shown here.
    {
        richTextBox2.PerformSafely(() => richTextBox2.AppendText(url + "\n"));
    }
});
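The listing calls `Add2Dic` without showing its body. The following is a minimal sketch of what such a helper might look like, assuming it only needs to copy URLs that the target does not already hold. `DictionaryMerge.AddMissing` is a hypothetical name, and the third boolean argument passed in the calls above is not explained in the answer, so it is left out here.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryMerge
{
    // Hypothetical stand-in for Add2Dic: copies only keys that target does not
    // already hold, so scanned URLs (value 1) are never reset to 0.
    // Returns the number of keys actually added.
    public static int AddMissing(
        IDictionary<string, int> source, IDictionary<string, int> target)
    {
        int added = 0;
        foreach (KeyValuePair<string, int> pair in source)
        {
            if (!target.ContainsKey(pair.Key))
            {
                target.Add(pair.Key, 0); // new URLs always start as unscanned
                added++;
            }
        }
        return added;
    }
}
```

The `ContainsKey` check matters twice over: calling `Add` with an existing key on a `Dictionary` throws an `ArgumentException`, and assigning through the indexer instead would silently flip a scanned URL back to unscanned.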


Tags: url, hashset, mining