Buffer size is not sufficient or queue is full when making many Http Requests

Question

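I am currently trying to fetch a lot of data about video games from Wikipedia using their public API. I have gotten part of the way: I can currently get all the pageids I need along with their associated article titles. But then I need to get their unique identifiers (Qxxxx, where the x's are digits), and that takes a long time... possibly because I have to make a single query for each title (there are 22031), or because I don't understand Wikipedia queries.

So I thought "Why not just make multiple queries at once?" and started working on that, but I have run into the problem in the title. After the program has been running for a while (usually 3-4 minutes), the application crashes with the error in the title. I think this is because my approach is just bad: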

ConcurrentBag<Entry> entrybag = new ConcurrentBag<Entry>(entries);
Console.WriteLine("Getting Wikibase Item Ids...");
Parallel.ForEach<Entry>(entrybag, (entry) =>
{
    entry.WikibaseItemId = GetWikibaseItemId(entry).Result;
});

The method being called looks like this:

async static Task<String> GetWikibaseItemId(Entry entry)
{
    using (var client = new HttpClient(new HttpClientHandler { AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate }))
    {
        client.BaseAddress = new Uri("https://en.wikipedia.org/w/api.php");
        entry.Title.Replace("+", "Plus");
        entry.Title.Replace("&", "and");
        String queryString = "?action=query&prop=pageprops&ppprop=wikibase_item&format=json&redirects=1&titles=" + entry.Title;
        HttpResponseMessage response = await client.GetAsync(queryString);

        response.EnsureSuccessStatusCode();

        String result = response.Content.ReadAsStringAsync().Result;
        dynamic deserialized = JsonConvert.DeserializeObject(result);
        String data = deserialized.ToString();
        try
        {
            if (data.Contains("wikibase_item"))
            {
                return deserialized["query"]["pages"]["" + entry.PageId + ""]["pageprops"]["wikibase_item"].ToString();
            }
            else
            {
                return "NONE";
            }
        }
        catch (RuntimeBinderException)
        {
            return "NULL";
        }
        catch (Exception)
        {
            return "ERROR";
        }
    }
}

For good measure, here is the Entry class:

public class Entry
{
    public EntryCategory Category { get; set; }
    public int PageId { get; set; }
    public String Title { get; set; }
    public String WikibaseItemId { get; set; }
}

Can anyone help? Do I just need to change the way I query, or is it something else?

Answer 1

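Initiating some 22000 http requests in parallel from one process is simply too much. If your machine had unlimited resources and internet bandwidth, this would come close to a denial-of-service attack.

What you are seeing is either TCP/IP port exhaustion or queue contention. To solve it, process your array in smaller chunks: for example, take 10 items, process those in parallel, then take the next ten, and so on.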
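The chunking idea above could be sketched roughly like this. It is only an illustration, not a tuned implementation: `ChunkedRunner` and `ForEachInBatchesAsync` are hypothetical names introduced here, and the batch size of 10 is just the example figure from the paragraph above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ChunkedRunner
{
    // Hypothetical helper: runs `work` over `items` in consecutive batches.
    // Every item in a batch is started, then the whole batch is awaited
    // before the next batch begins, so at most `batchSize` calls are in flight.
    public static async Task ForEachInBatchesAsync<T>(
        IReadOnlyList<T> items, int batchSize, Func<T, Task> work)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        if (work == null) throw new ArgumentNullException(nameof(work));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int start = 0; start < items.Count; start += batchSize)
        {
            int end = Math.Min(start + batchSize, items.Count);
            var batch = new List<Task>(end - start);

            // Start every task in this batch.
            for (int i = start; i < end; i++)
            {
                batch.Add(work(items[i]));
            }

            // Wait for the whole batch to finish before moving on.
            await Task.WhenAll(batch);
        }
    }
}
```

Assuming `entries` is a `List<Entry>`, the call from the question might then become `await ChunkedRunner.ForEachInBatchesAsync(entries, 10, async e => e.WikibaseItemId = await GetWikibaseItemId(e));`, which also avoids blocking on `.Result`.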

Specifically, Wikimedia sites have a recommendation to process requests serially:

There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down. Most sysadmins reserve the right to unceremoniously block you if you do endanger the stability of their site.

If you make your requests in series rather than in parallel (i.e. wait for the one request to finish before sending a new request, such that you're never making more than one request at the same time), then you should definitely be fine.
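A minimal sketch of the serial approach described in that quote, assuming a single shared `HttpClient` and the query parameters from the question. `SerialFetcher`, `BuildQueryUrl`, `FetchAllAsync` and the optional `fetch` delegate are hypothetical names introduced for illustration; the delegate exists only so the loop can be exercised without touching the network.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class SerialFetcher
{
    // One HttpClient reused for every request instead of one per call.
    private static readonly HttpClient Client = new HttpClient();

    // Builds the query URL from the question, escaping the title so that
    // characters such as '&' and '+' are sent as part of the title.
    public static string BuildQueryUrl(string title)
    {
        return "https://en.wikipedia.org/w/api.php"
             + "?action=query&prop=pageprops&ppprop=wikibase_item"
             + "&format=json&redirects=1&titles="
             + Uri.EscapeDataString(title);
    }

    // Fetches the raw JSON response for each title, one request at a time.
    // `fetch` is a hypothetical hook; by default it performs a real HTTP GET.
    public static async Task<Dictionary<string, string>> FetchAllAsync(
        IEnumerable<string> titles, Func<string, Task<string>> fetch = null)
    {
        fetch = fetch ?? (url => Client.GetStringAsync(url));
        var results = new Dictionary<string, string>();

        foreach (string title in titles)
        {
            // Awaiting inside the loop means the next request is not sent
            // until the current one has completed.
            results[title] = await fetch(BuildQueryUrl(title));
        }

        return results;
    }
}
```

Note that `String.Replace` returns a new string, so the two `entry.Title.Replace(...)` calls in the question have no effect as written; escaping the title as above is one way to handle `+` and `&` without renaming them.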

Be sure to check their API terms of service to find out whether parallel requests are acceptable and, if so, how many.

c#

concurrency

wikipedia-api