Query GitHub data using Hadoop

hadoop
github
apache-pig
bigdata
hdfs

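I am trying to use Hadoop to query the GitHub data provided by the GHTorrent API. How can I ingest this much data (4-5 TB) into HDFS? Also, their databases are real time. Is it possible to process real-time data in Hadoop using tools such as Pig, Hive, or HBase?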

Go through this presentation. It describes how you can connect to their MySQL or MongoDB instance and fetch data. Basically, you have to share your public key; they will add that key to their repository, and then you can SSH in. As an alternative, you can download their periodic dumps from this link.

Important links:

Query MongoDB programmatically
Connect to the MySQL instance
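
If you go the periodic-dump route, one way to get the files into HDFS is to extract the dump locally and copy the files across with `hdfs dfs -put`. Here is a minimal sketch, assuming the dump has already been unpacked into a local directory of CSV files; the directory names, the `*.csv` pattern, and the `hdfs_put_commands` helper are hypothetical and only for illustration. It builds the commands and prints them as a dry run rather than executing them.

```python
import shlex
from pathlib import Path


def hdfs_put_commands(local_dir, hdfs_dir, pattern="*.csv"):
    """Build the `hdfs dfs` commands that copy local dump files into HDFS.

    local_dir, hdfs_dir and pattern are illustrative assumptions, not
    paths or names defined by GHTorrent.
    """
    # Create the target directory first (-p: no error if it already exists).
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    # One -put per file, in sorted order; -f overwrites an existing copy.
    for path in sorted(Path(local_dir).glob(pattern)):
        commands.append(["hdfs", "dfs", "-put", "-f", str(path), hdfs_dir])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in hdfs_put_commands("ghtorrent-dump", "/data/ghtorrent"):
        print(shlex.join(cmd))
```

To actually run the copy on a machine with a Hadoop client installed, you could pass each command to `subprocess.run(cmd, check=True)`. For several terabytes it is probably worth copying table by table rather than all at once, so that a failed transfer can be retried on its own.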

To process real-time data, you cannot use Pig or Hive; those are batch-processing tools. Consider using Apache Spark.
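
To make the batch versus real-time distinction a bit more concrete: Spark Streaming handles a live feed by cutting it into small batches and updating the results after each one. The sketch below illustrates only that micro-batch idea in plain Python. It does not use Spark, and the event shape (`{"type": ...}`) and the function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of up to batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(events, batch_size=2):
    """Update per-type event counts after each micro-batch.

    Yields a snapshot of the totals after every batch, so results are
    available while the stream is still being consumed.
    """
    totals = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(event["type"] for event in batch)
        yield dict(totals)


if __name__ == "__main__":
    # Hypothetical events; a real feed would arrive continuously.
    feed = [{"type": "PushEvent"}, {"type": "ForkEvent"}, {"type": "PushEvent"}]
    for snapshot in running_counts(feed, batch_size=2):
        print(snapshot)
```

A batch tool such as Pig or Hive would instead wait for the whole data set to be in place and compute the counts once at the end.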
