Wikipedia pageviews analysis

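I have taken on the challenge of analyzing Wikipedia pageviews. This is my first project with this much data, and I'm a bit lost. When I download a file from link and unzip it, I can see it has a table-like structure, with rows that look like this: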

1   |  2                             |3|4

en.m The_Beatles_in_the_United_States 2 0

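I'm having trouble figuring out what exactly each column contains. My guesses:

Language version and additional information (.m = mobile?)

Article name

The last two columns are what concern me most. The last one only ever contains "0", and I don't know what it represents. I assume the third one is the number of views, but I'm not sure.

I would be grateful if someone could help me understand exactly what is in each column or recommend some reading on this topic. Thanks!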

Line format:

  • wiki code (subproject.project)
  • article title
  • monthly total (with interpolation when data is missing)
  • hourly counts

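(From pagecounts-ez, which is the same dataset, just less filtered.)

It is clearly buggy, though: it uses the first two parts of the domain name as the wiki code, which doesn't work for mobile domains (which have the form <language>.m.<project>.org).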

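After spending more time on this, I finally found the solution. I'm posting it in case someone runs into the same problem in the future. Wikipedia does explain what can be found in the data. The explanations are hard to find, but you can read about the topic here and here.

Based on that, you can see that each row has the following structure: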

  • domain_code
  • page_title
  • count_views
  • total_response_size (no longer maintained)
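A minimal sketch of splitting one row into those four fields. It assumes the fields are separated by whitespace and that titles never contain spaces (as in the sample row from the question). `PageviewRow` and `parse_line` are illustrative names of my own, not part of any Wikimedia tooling.

```python
from typing import NamedTuple


class PageviewRow(NamedTuple):
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line: str) -> PageviewRow:
    """Split one row into its four fields.

    Assumption: whitespace-separated fields, no spaces inside the title.
    Raises ValueError if the line does not have exactly four fields.
    """
    domain_code, page_title, count_views, response_size = line.split()
    return PageviewRow(domain_code, page_title, int(count_views), int(response_size))


row = parse_line("en.m The_Beatles_in_the_United_States 2 0")
print(row.domain_code, row.page_title, row.count_views, row.total_response_size)
```

Converting the two numeric columns to `int` up front makes later sums and sorting straightforward.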

Some explanation of each column:

Column 1:

Domain name of the request, abbreviated. (...) Domain_code now can also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as second part of the domain name (just like with full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org".
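To make the quoted rule concrete, here is a rough sketch of decoding a domain_code. It is built on assumptions: the suffix table contains only `v` (the one abbreviation the quote confirms), a code with no project suffix is treated as wikipedia (my reading of the `en.m` sample row), and any code where a second part of `m` means something other than mobile is not handled. `decode_domain_code` and `DomainInfo` are hypothetical names.

```python
from typing import NamedTuple, Optional

# Deliberately incomplete: only 'v' is confirmed by the quoted text.
# Fill in the other abbreviations from the official documentation.
PROJECT_SUFFIXES = {"v": "wikiversity"}

# Markers that are inserted as the second part of the code.
ACCESS_MARKERS = {"m": "mobile", "zero": "zero"}


class DomainInfo(NamedTuple):
    language: str
    access: str             # "desktop", "mobile" or "zero"
    project: Optional[str]  # None when the suffix is not in the table


def decode_domain_code(code: str) -> DomainInfo:
    parts = code.split(".")
    language, rest = parts[0], parts[1:]

    access = "desktop"
    if rest and rest[0] in ACCESS_MARKERS:
        access = ACCESS_MARKERS[rest[0]]
        rest = rest[1:]

    if not rest:
        project = "wikipedia"  # assumption: no suffix means wikipedia
    else:
        project = PROJECT_SUFFIXES.get(rest[0])
    return DomainInfo(language, access, project)


print(decode_domain_code("en.m.v"))  # language 'en', mobile, wikiversity
print(decode_domain_code("en.m"))    # language 'en', mobile, wikipedia
```

Looking at the second part for the marker, rather than taking the first two parts as the wiki code, avoids the mobile-domain problem mentioned in the other answer.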

Column 2:

For page-level files, it holds the title of the unnormalized part after /wiki/ in the request Url (e.g. Main_Page, Berlin). For project-level files, it is - .

Column 3:

The number of times this page has been viewed in the respective hour.

Column 4:

The total response size caused by the requests for this page in the respective hour.

If I understand it correctly, the response size was discontinued because of its low accuracy, which is why the column contains only 0s:

The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.
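Since column 3 is a count for a single hour, getting a total for a longer period means summing over several hourly files. A small sketch of that, assuming whitespace-separated rows; `total_views` is a made-up helper name, and the rows other than the one from the question are invented purely for illustration.

```python
from collections import Counter
from typing import Iterable


def total_views(lines: Iterable[str]) -> Counter:
    """Sum count_views per (domain_code, page_title) over any number of rows."""
    totals: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip rows that do not match the four-column layout
        domain_code, page_title, count_views, _response_size = fields
        totals[(domain_code, page_title)] += int(count_views)
    return totals


# Invented sample rows standing in for two hourly files.
hour_1 = ["en.m The_Beatles_in_the_United_States 2 0", "en Main_Page 10 0"]
hour_2 = ["en.m The_Beatles_in_the_United_States 3 0"]

totals = total_views(hour_1 + hour_2)
print(totals[("en.m", "The_Beatles_in_the_United_States")])  # 5
```

The fourth field is read but ignored, since it is always 0 in these files.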

Hope someone finds this useful.