Hive UDF - extremely slow when parsing IP addresses

Question

I have a column containing IP addresses. I need to resolve them to countries/cities: select IPUtils('199.999.999.999') returns ['Asia', 'Hongkong', 'xxx', 'Hongkong']

I wrote a Hive UDF to do this, but it runs extremely slowly, as shown below:

INFO : 2021-09-08 18:51:10,817 Stage-2 map = 100%, reduce = 30%, Cumulative CPU 9074.06 sec

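The map phase is at 100%, while the reduce progress increases by only 1% every 15 minutes.

The UDF reads a file from the project's resources folder, so maybe it is reading the file over and over again? The UDF is shown below; any help is appreciated: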

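import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.GeoIp2Exception;
import com.maxmind.geoip2.model.CityResponse;
import com.maxmind.geoip2.record.City;
import com.maxmind.geoip2.record.Continent;
import com.maxmind.geoip2.record.Country;
import com.maxmind.geoip2.record.Subdivision;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.junit.Test;

import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedList;
import java.util.List;

public class IPUtil extends UDF {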

    public List<String>  evaluate(String  ip){
        try{
            ClassLoader classloader = Thread.currentThread().getContextClassLoader();

            // I put the mmdb file in resource folder of the java project
            InputStream is = classloader.getResourceAsStream("GeoLite2-City.mmdb");
            DatabaseReader reader = new DatabaseReader.Builder(is).build();

            InetAddress ipAddress = InetAddress.getByName(ip);
            CityResponse response = reader.city(ipAddress);
            Country country = response.getCountry();
            Subdivision subdivision = response.getMostSpecificSubdivision();
            City city = response.getCity();
            Continent continent = response.getContinent();

            List<String> list = new LinkedList<String>();

            list.add(continent.getNames().get("zh-CN"));
            list.add(country.getNames().get("zh-CN"));
            list.add(subdivision.getNames().get("zh-CN"));
            list.add(city.getNames().get("zh-CN"));

            return list;

        } catch (UnknownHostException e) {
            e.printStackTrace();
            return null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } catch (GeoIp2Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    @Test
    public void test()throws Exception{
        System.out.println(evaluate("175.45.20.138"));
    }
}

Answer 1

Move this

InputStream is = classloader.getResourceAsStream("GeoLite2-City.mmdb");
DatabaseReader reader = new DatabaseReader.Builder(is).build();

into class initialization, so that the database is loaded once rather than on every call to evaluate().
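
As a rough illustration of that idea, here is a minimal sketch of the "build once, reuse for every row" pattern. It uses only the JDK so it can run on its own: FakeReader is a hypothetical stand-in for DatabaseReader, and CachedLookupSketch and LOADS are made-up names that exist only to show how many times the expensive load happens. In the real UDF the holder would presumably wrap new DatabaseReader.Builder(is).build() instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedLookupSketch {

    // Demonstration only: counts how often the "expensive" reader is constructed.
    public static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical stand-in for DatabaseReader: costly to build, cheap to query.
    public static final class FakeReader {
        FakeReader() {
            LOADS.incrementAndGet(); // the real code would parse the .mmdb file here
        }

        List<String> lookup(String ip) {
            // Placeholder values; a real reader would return names for this IP.
            return Arrays.asList("continent", "country", "subdivision", "city");
        }
    }

    // Initialization-on-demand holder: the JVM builds READER exactly once,
    // the first time Holder is touched, and does so in a thread-safe way.
    private static final class Holder {
        static final FakeReader READER = new FakeReader();
    }

    // Called once per row; it only reuses the already-built reader.
    public List<String> evaluate(String ip) {
        if (ip == null) {
            return null;
        }
        return Holder.READER.lookup(ip);
    }
}
```

With this shape, calling evaluate a thousand times still constructs the reader only once per JVM, whereas the version in the question rebuilds it for every row. A static initializer or a lazily filled instance field would be equally reasonable; the holder idiom is just one way to get one-time initialization.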

