How to find duplicates in a string using Spark SQL?
I want to find duplicates in a string, and I would like to know whether this can be done with Spark SQL. Below is the query I wrote.
spark.sql("""select case when lower(value) like '%,code,%' or lower(value) like '%,code%' or lower(value) like '%code,%' then 'Y' else 'N' end as value from input""").show(false)
Input:
val sample = Seq(("code,code")).toDF("value")
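The problem is: what if the value 'code' appears twice? In that case my requirement is to return the flag 'N'. Is there any way to do this with Spark SQL? Please share your suggestions. TIA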
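You can determine the number of occurrences using length and replace: removing every 'code' from the string shortens it by 4 characters per occurrence, so a length difference of exactly 4 means the value appears once. The example below uses MySQL-style session variables.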
set @a = 'code,aaa';
set @b = 'code,code';
set @c = 'aaa,bbb,ccc';
select @a, length(@a) - length(replace(@a,'code','')) lengthdiff,
case when length(@a) - length(replace(@a,'code','')) = 4 then 'y'
when length(@a) - length(replace(@a,'code','')) > 4 then 'n'
else 'n'
end occurances
;
+----------+------------+------------+
| @a | lengthdiff | occurances |
+----------+------------+------------+
| code,aaa | 4 | y |
+----------+------------+------------+
1 row in set (0.001 sec)
select @b, length(@b) - length(replace(@b,'code','')) lengthdiff,
case when length(@b) - length(replace(@b,'code','')) = 4 then 'y'
when length(@b) - length(replace(@b,'code','')) > 4 then 'n'
else 'n'
end occurances
;
+-----------+------------+------------+
| @b | lengthdiff | occurances |
+-----------+------------+------------+
| code,code | 8 | n |
+-----------+------------+------------+
1 row in set (0.000 sec)
select @c, length(@c) - length(replace(@c,'code','')) lengthdiff,
case when length(@c) - length(replace(@c,'code','')) = 4 then 'y'
when length(@c) - length(replace(@c,'code','')) > 4 then 'n'
else 'n'
end occurances
;
+-------------+------------+------------+
| @c | lengthdiff | occurances |
+-------------+------------+------------+
| aaa,bbb,ccc | 0 | n |
+-------------+------------+------------+
1 row in set (0.001 sec)
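Since the question asks about Spark SQL, here is a minimal sketch of the same length/replace idea run through spark.sql. It assumes a local SparkSession and Spark 2.3 or later (for the SQL replace function). The object name ExactlyOnceFlag, the helper flag, and the column name exactly_once are hypothetical and chosen only for illustration. The temp view input and the column value come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExactlyOnceFlag {
  // Hypothetical helper: reads the temp view `input` and flags rows in which
  // the substring 'code' occurs exactly once (case-insensitively).
  def flag(spark: SparkSession): DataFrame = spark.sql(
    """select value,
      |       -- each occurrence of 'code' removed shortens the string by length('code')
      |       length(lower(value)) - length(replace(lower(value), 'code', '')) as lengthdiff,
      |       case when length(lower(value)) - length(replace(lower(value), 'code', '')) = length('code')
      |            then 'Y' else 'N' end as exactly_once
      |from input""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder().master("local[*]").appName("exactly-once-flag").getOrCreate()
    import spark.implicits._

    // Same three sample values as in the answer above.
    Seq("code,aaa", "code,code", "aaa,bbb,ccc").toDF("value").createOrReplaceTempView("input")

    flag(spark).show(false)
    spark.stop()
  }
}
```

For the three sample values this should give the same flags as the output above: Y for code,aaa and N for the other two. Note that this counts substrings, so a token such as barcode would also be counted. If only whole comma-separated tokens should match, one possible alternative (assuming Spark 2.4 or later for higher-order functions) is a condition like size(filter(split(lower(value), ','), x -> x = 'code')) = 1.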