How to query an embedded database stored in HDFS from a MapReduce job?

I am trying to query the GeoLite database from a Hadoop MapReduce mapper to resolve the country of an IP address. I tried two approaches:

1. Using File, which only works with the local file system, so I get a file-not-found exception:

File database = new File("hdfs://localhost:9000/input/GeoLite2-City.mmdb"); // <<< HERE
DatabaseReader reader = new DatabaseReader.Builder(database).build();

2. Using a stream, but I get this error at runtime:

Error: Java Heap Space

Path pt = new Path("hdfs://localhost:9000/input/GeoLite2-City.mmdb");
FileSystem fs = FileSystem.get(new Configuration());

FSDataInputStream stream = fs.open(pt);
DatabaseReader reader = new DatabaseReader.Builder(stream).build();

InetAddress ipAddress = InetAddress.getByName(address.getHostAddress());
CityResponse response = null;
try {
    response = reader.city(ipAddress);
} catch (GeoIp2Exception ex) {
    ex.printStackTrace();
    return;
}

My question: how can a mapper in Hadoop query the GeoLite database?

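I solved it with the distributed cache: the GeoLite database file is cached locally for every mapper in the MapReduce job, so each mapper can open it as an ordinary local file.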

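    @Override
    public void setup(Context context) {
        Configuration conf = context.getConfiguration();
        try {
            // Local paths of the files the distributed cache copied to this node.
            // cachefiles and reader are fields of the mapper class.
            cachefiles = DistributedCache.getLocalCacheFiles(conf);
            // cachefiles[0] is taken to be the GeoLite database; it is a local path,
            // so java.io.File can open it.
            File database = new File(cachefiles[0].toString());
            // Build the reader once per task and reuse it in map().
            reader = new DatabaseReader.Builder(database).build();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
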
    public void map(Object key, Text line, Context context) throws IOException,
            InterruptedException {

        .....

        InetAddress ipAddress = InetAddress.getByName(address.getHostAddress());
        CityResponse response = null;
        try {
            response = reader.city(ipAddress);
        } catch (GeoIp2Exception ex) {
            ex.printStackTrace();
            return;
        }
        ......
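
As a side note on why approach 1 fails: java.io.File only understands local paths, so an hdfs:// string is treated as an odd-looking local path that does not exist. The sketch below reproduces this with the JDK alone, without Hadoop. The class name HdfsPathCheck and the helper isRemoteUri are hypothetical names used only for illustration, and the helper assumes it is given a well-formed URI string.

```java
import java.io.File;
import java.net.URI;

class HdfsPathCheck {
    // Hypothetical helper: true when the string carries a URI scheme other than "file".
    // Assumes a well-formed URI string; plain local paths such as /tmp/x have no scheme.
    public static boolean isRemoteUri(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme != null && !scheme.equals("file");
    }

    public static void main(String[] args) {
        String location = "hdfs://localhost:9000/input/GeoLite2-City.mmdb";

        // java.io.File does not interpret the scheme; it resolves the string
        // against the local file system, where no such file exists.
        File asLocalFile = new File(location);

        System.out.println("scheme: " + URI.create(location).getScheme());
        System.out.println("exists locally: " + asLocalFile.exists());
        System.out.println("remote: " + isRemoteUri(location));
    }
}
```

The distributed cache avoids the problem because the framework copies the file to each node first, so the path handed to File in setup() is a genuine local path.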