How to find duplicates in a string using Spark SQL?

I want to find duplicates in a string, and I would like to know whether there is a way to do this with Spark SQL. Below is the query I wrote.

spark.sql("""select case when lower(value) like '%,code,%' or lower(value) like '%,code%' or lower(value) like '%code,%' then 'Y' else 'N' end as value from input""").show(false)

Input:

val sample = Seq(("code,code")).toDF("value")

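The problem is: what if the value 'code' appears twice? In that case, my requirement is to return the flag 'N'. Is there any way to do this using Spark SQL? Please share your suggestions. TIA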

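You can determine the number of occurrences by comparing the string's length before and after removing 'code' with replace. The examples below use MySQL user variables: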

set @a = 'code,aaa';
set @b = 'code,code';
set @c = 'aaa,bbb,ccc';

select @a, length(@a) - length(replace(@a,'code','')) lengthdiff,
        case when length(@a) - length(replace(@a,'code','')) = 4 then 'y'
             when length(@a) - length(replace(@a,'code','')) > 4 then 'n'
        else 'n'
        end  occurances
;

+----------+------------+------------+
| @a       | lengthdiff | occurances |
+----------+------------+------------+
| code,aaa |          4 | y          |
+----------+------------+------------+
1 row in set (0.001 sec)

select @b, length(@b) - length(replace(@b,'code','')) lengthdiff,
        case when length(@b) - length(replace(@b,'code','')) = 4 then 'y'
             when length(@b) - length(replace(@b,'code','')) > 4 then 'n'
        else 'n'
        end  occurances
;

+-----------+------------+------------+
| @b        | lengthdiff | occurances |
+-----------+------------+------------+
| code,code |          8 | n          |
+-----------+------------+------------+
1 row in set (0.000 sec)

select @c, length(@c) - length(replace(@c,'code','')) lengthdiff,
        case when length(@c) - length(replace(@c,'code','')) = 4 then 'y'
             when length(@c) - length(replace(@c,'code','')) > 4 then 'n'
        else 'n'
        end  occurances
;

+-------------+------------+------------+
| @c          | lengthdiff | occurances |
+-------------+------------+------------+
| aaa,bbb,ccc |          0 | n          |
+-------------+------------+------------+
1 row in set (0.001 sec)
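
Since the question asks about Spark SQL while the queries above use MySQL user variables, here is a minimal sketch of the same length-difference idea in Spark SQL. It assumes a Spark version where the built-in `replace` function is available, and it reuses the `input` view and `value` column from the question. The `CodeFlag` object, the `flags` helper, and the local session settings are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CodeFlag {
  // Hypothetical helper: maps each input string to 'Y' when 'code'
  // occurs exactly once (case-insensitively), and to 'N' otherwise.
  def flags(spark: SparkSession, values: Seq[String]): Map[String, String] = {
    import spark.implicits._
    values.toDF("value").createOrReplaceTempView("input")

    // Removing every 'code' shortens the string by 4 characters per
    // occurrence, so a difference of exactly length('code') means one match.
    spark.sql("""
      select value,
             case when length(lower(value))
                       - length(replace(lower(value), 'code', '')) = length('code')
                  then 'Y' else 'N' end as flag
      from input
    """).collect().map(r => r.getString(0) -> r.getString(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("code-flag")
      .getOrCreate()

    flags(spark, Seq("code,aaa", "code,code", "aaa,bbb,ccc")).foreach(println)
    spark.stop()
  }
}
```

With the three sample strings used above, this should flag `code,aaa` as `Y` and both `code,code` and `aaa,bbb,ccc` as `N`. Note that, like the MySQL version, this counts substring matches, so a value such as `barcode,aaa` would also be counted as one occurrence of `code`.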