Hive UDF - extremely slow when parsing IP addresses
I have a column that contains IP addresses. I need to resolve them to countries/cities:
select IPUtils('199.999.999.999')
and it returns ['Asia', 'Hongkong', 'xxx', 'Hongkong']
I wrote a Hive UDF to do this, but it runs very slowly, as shown below:
INFO : 2021-09-08 18:51:10,817 Stage-2 map = 100%, reduce = 30%, Cumulative CPU 9074.06 sec
The map phase is at 100%, while the reduce progress increases by only 1% every 15 minutes.
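The UDF reads a file from the project's resources folder, so could it be reading the file over and over again? The UDF is shown below; any help is welcome: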
public class IPUtil extends UDF {
    public List<String> evaluate(String ip) {
        try {
            ClassLoader classloader = Thread.currentThread().getContextClassLoader();
            // I put the mmdb file in resource folder of the java project
            InputStream is = classloader.getResourceAsStream("GeoLite2-City.mmdb");
            DatabaseReader reader = new DatabaseReader.Builder(is).build();
            InetAddress ipAddress = InetAddress.getByName(ip);
            CityResponse response = reader.city(ipAddress);
            Country country = response.getCountry();
            Subdivision subdivision = response.getMostSpecificSubdivision();
            City city = response.getCity();
            Continent continent = response.getContinent();
            List<String> list = new LinkedList<String>();
            list.add(continent.getNames().get("zh-CN"));
            list.add(country.getNames().get("zh-CN"));
            list.add(subdivision.getNames().get("zh-CN"));
            list.add(city.getNames().get("zh-CN"));
            return list;
        } catch (UnknownHostException e) {
            e.printStackTrace();
            return null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } catch (GeoIp2Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    @Test
    public void test() throws Exception {
        System.out.println(evaluate("175.45.20.138"));
    }
}
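Move these two lines
    InputStream is = classloader.getResourceAsStream("GeoLite2-City.mmdb");
    DatabaseReader reader = new DatabaseReader.Builder(is).build();
into the class initialization, so that the reader is built once and reused instead of being rebuilt on every call to evaluate.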
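The effect of that change can be illustrated with a minimal sketch that uses only the Java standard library. Everything in it is hypothetical: `FakeGeoReader` is a stand-in for the GeoIP2 `DatabaseReader` that only counts how often it is constructed, and `PerCallLookup`, `CachedLookup`, and `LookupDemo` are illustrative names that are not part of Hive or GeoIP2. It is not the real UDF, only the shape of the fix.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive resource such as a GeoIP database reader.
// It only counts constructions; it performs no real lookups.
final class FakeGeoReader {
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    FakeGeoReader() {
        // In the real UDF this is where the .mmdb file would be read and parsed.
        BUILD_COUNT.incrementAndGet();
    }

    List<String> lookup(String ip) {
        // Placeholder result; a real reader would resolve the address.
        return Arrays.asList("continent-of-" + ip, "country-of-" + ip);
    }
}

// Slow shape: a new reader is built on every call, as in the question.
final class PerCallLookup {
    List<String> evaluate(String ip) {
        FakeGeoReader reader = new FakeGeoReader();
        return reader.lookup(ip);
    }
}

// Fast shape: the reader is built on first use and then reused for every row.
final class CachedLookup {
    private FakeGeoReader reader; // one reader per instance, not one per row

    List<String> evaluate(String ip) {
        if (ip == null) {
            return null; // nothing to look up
        }
        if (reader == null) {
            reader = new FakeGeoReader(); // the expensive step happens only once
        }
        return reader.lookup(ip);
    }
}

final class LookupDemo {
    public static void main(String[] args) {
        PerCallLookup slow = new PerCallLookup();
        for (int i = 0; i < 5; i++) {
            slow.evaluate("10.0.0." + i);
        }
        System.out.println("per-call builds: " + FakeGeoReader.BUILD_COUNT.get());

        FakeGeoReader.BUILD_COUNT.set(0);
        CachedLookup fast = new CachedLookup();
        for (int i = 0; i < 5; i++) {
            fast.evaluate("10.0.0." + i);
        }
        System.out.println("cached builds: " + FakeGeoReader.BUILD_COUNT.get());
    }
}
```

The sketch builds the reader lazily inside `evaluate` rather than in a field initializer. That is one possible choice, not the only one: a field initializer or a static holder would also avoid the per-row rebuild. In a real Hive UDF the cached field is often declared `transient`, on the assumption that the UDF instance may be serialized together with the query plan and the reader itself is not serializable.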