PostgreSQL parse countries in array against the countries table

Question

We have a content table and a country table. Country is pretty simple: a country_name column defined as a string: Albania, Belgium, China, Denmark, etc.

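Content is a table with half a million rows of various data, including a countries column defined as a text array (text[]). Each value there holds a number of countries concatenated together, like: {"denmark,finland,france,germany,ireland,gb,italy,netherlands,poland,russia,spain,sweden,australia,brazil,canada,china,india,indonesia,japan,malaysia,vietnam,mexico,"south korea",thailand,usa,singapore,uae"}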

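The internal team updates records by the thousand, and we are not sure the countries are all spelled correctly. So the task is to reconcile them against country_name in the country table.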

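I am doing replace(replace(country_array::text,'{',''),'}','') as country_text and thinking about using UNPIVOT to check each column against the country table. Is there any other, easier way to make sure the countries array in the Content table has valid country names from the country table?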

Thanks

Answer 1

If you suspect that some countries are misspelled, then there is no doubt that such examples exist.

Start by getting a list of the countries that are not in the reference table:

-- Count each array element that has no match in the reference table
select c_country, count(*)
from content c cross join lateral
     unnest(c.countries) c_country left join
     countries co
     on co.country_name = c_country
where co.country_name is null  -- keep only the unmatched elements
group by c_country
order by count(*) desc;

Then you can go in and fix the data.
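One way such a fix might look, as a minimal sketch: once a misspelling has been identified, array_replace() can rewrite it in place. The misspelling 'grmany' and the scratch rows are hypothetical, made up only so the statement can be tried in isolation.

```sql
-- Scratch data (hypothetical rows) so the statement can be run on its own.
create temp table content (id int, countries text[]);
insert into content values (1, '{denmark,grmany}'), (2, '{france}');

-- Replace one known misspelling in every array that contains it.
update content
set countries = array_replace(countries, 'grmany', 'germany')
where 'grmany' = any (countries);
```

The where clause restricts the update to rows that actually contain the bad value, so untouched rows are not rewritten.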

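There is nothing a priori wrong with storing values in arrays. However, if you were designing the database from scratch, I would probably recommend a contentCountries table and a countryId. This would ensure explicit relationships.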
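A rough sketch of what that junction-table design could look like. All table and column names here (content_id, country_id, content_countries) are assumptions chosen for illustration, and the content table is reduced to its key; try it in a scratch schema, since the names would clash with existing tables.

```sql
-- Hypothetical reference table: one row per valid country.
create table countries (
    country_id   serial primary key,
    country_name text not null unique
);

-- Hypothetical content table, reduced to its key for the sketch.
create table content (
    content_id serial primary key
);

-- Junction table: one row per (content, country) pair.
-- The foreign keys reject any country that is not in the reference table.
create table content_countries (
    content_id int not null references content (content_id),
    country_id int not null references countries (country_id),
    primary key (content_id, country_id)
);
```

With this layout a misspelled country cannot be stored at all, because there is no country_id for it to point to.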

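In your case, you should probably fix the ingestion process so that the values are known to be correct on input. That may be sufficient, given that you already have a lot of data and just need to fix it.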
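If the array column stays, one possible way to enforce this on input is a trigger that rejects unknown names. This is only a sketch: the function and trigger names are made up, the scratch tables stand in for the real ones, and the execute function syntax assumes PostgreSQL 11 or later.

```sql
-- Scratch tables (hypothetical) standing in for the real ones.
create temp table countries (country_name text primary key);
create temp table content (id int, countries text[]);
insert into countries values ('albania'), ('denmark');

-- Hypothetical trigger function: collect array elements that are missing
-- from the reference table and refuse the row if there are any.
create function check_content_countries() returns trigger
language plpgsql as $$
declare
    bad text[];
begin
    select array_agg(u.elem)
    into bad
    from unnest(new.countries) as u(elem)
    where not exists (
        select 1 from countries co where co.country_name = u.elem
    );

    if bad is not null then
        raise exception 'unknown countries: %', bad;
    end if;
    return new;
end;
$$;

create trigger content_countries_check
before insert or update of countries on content
for each row execute function check_content_countries();
```

The trigger only checks rows as they are written, so existing bad rows still have to be cleaned up separately, and it does not guard against rows later being deleted from the reference table.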

Answer 2

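You can unnest() each array to a set of rows and make sure that all values occur in the country table. The following query gives you the array elements that are missing from the reference table: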

select *
from 
    content c
    cross join lateral unnest(c.countries) as t(country_name)
    left join country y on y.country_name = t.country_name
where y.country_name is null

Demo on DB Fiddle

Country table:

id | country_name
-: | :-----------
 1 | albania     
 2 | denmark

Content table:

id | countries        
-: | :----------------
 1 | {albania,denmark}
 1 | {albania,france}

Query result:

id | countries        | country_name
-: | :--------------- | :-----------
 1 | {albania,france} | france
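If you only need to know which content rows are affected, not which element is wrong, a variation using the array containment operator <@ might also work. This is a sketch reusing the demo data above; the cast to text is an assumption to keep both sides of the operator the same array type in case country_name is varchar.

```sql
-- Same sample data as the demo above, in scratch tables.
create temp table country (id int, country_name text);
create temp table content (id int, countries text[]);
insert into country values (1, 'albania'), (2, 'denmark');
insert into content values (1, '{albania,denmark}'), (1, '{albania,france}');

-- Rows whose array is NOT fully contained in the list of valid names.
select c.*
from content c
where not c.countries <@ (
    select array_agg(y.country_name::text) from country y
);
```

On the demo data this returns only the {albania,france} row. Unlike the unnest() query, it does not say which element is missing.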

sql

arrays

postgresql

unnest

lateral-join