Derive DOB / date of birth from merged numbers in PySpark or Python?

Question

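I am trying to derive a DOB in date data type format (YYYY-MM-DD) from a column in PySpark. The column holds the information in reverse order (day, month, year), merged into one large integer, with the leading zeros dropped from single-digit days and months. That last point means the length of the value varies between 6 and 8 digits. For 7 digits, there are two cases that cannot be derived because of ambiguity, and I am happy to output null for those.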

Examples:

831945 = 8 March 1945
1232000 = 12 March 2000
11102000 = 11 October 2000

Ambiguous examples:

111YYYY = 1 November or 11 January
112YYYY = 1 December or 11 February
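To make the ambiguity concrete, here is a minimal plain-Python sketch (not PySpark) that tries every split of the leading digits into a day and a month and keeps the ones that form a real calendar date. The function name `candidate_dates` is made up for illustration, and it assumes the value is always day, then month, then a 4-digit year, with no leading zeros on day or month.

```python
from datetime import date


def candidate_dates(merged: int) -> list:
    """Return every valid date the merged day-month-year digits could encode."""
    digits = str(merged)
    prefix, year = digits[:-4], int(digits[-4:])
    found = []
    # Try every way of cutting the prefix into a day part and a month part.
    for cut in range(1, len(prefix)):
        day_txt, month_txt = prefix[:cut], prefix[cut:]
        # Leading zeros were dropped in the data, so no part can start with "0".
        if day_txt[0] == "0" or month_txt[0] == "0":
            continue
        try:
            found.append(date(year, int(month_txt), int(day_txt)))
        except ValueError:
            continue  # not a real date, e.g. month 23
    return found


print(candidate_dates(1232000))  # one candidate: 12 March 2000
print(candidate_dates(1111990))  # two candidates: 1 November and 11 January 1990
```

A value with exactly one candidate is unambiguous; two candidates would map to null. Under this sketch, values such as 2111990 also come back with two candidates, which the day rule for a second digit of 1 followed by 1 or 2 already covers.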

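The code logic is fairly complicated and a bit beyond me. I think I can derive new year, month and day columns before deriving the DOB: first derive the day and year columns, removing those digits from the merged-number column as I go so that only the month is left.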

Year = last 4 digits.

Day:

if len 6 then day = first digit
if len 7 and 2nd digit = 0 or >= 2 then day = first two digits
if len 7 and 2nd digit = 1 and 3rd digit = 0 then day = first digit
if len 7 and 2nd digit = 1 and 3rd digit = 1 or 2 then output null
if len 7 and 2nd digit = 1 and 3rd digit >= 3 then day = first two digits
if len 8 then day = first two digits
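The rules above can be sketched as a small plain-Python function, which may be handy for checking the logic on a few values before porting it to PySpark. The name `split_merged` and the `(day, month, year)` tuple it returns are my own choices for illustration, not part of the question.

```python
def split_merged(merged: int):
    """Split merged digits into (day, month, year) ints, or None when ambiguous."""
    s = str(merged)
    year = int(s[-4:])
    if len(s) == 6:
        # One-digit day, one-digit month.
        return int(s[0]), int(s[1]), year
    if len(s) == 8:
        # Two-digit day, two-digit month.
        return int(s[:2]), int(s[2:4]), year
    if len(s) != 7:
        return None  # outside the 6-8 digit range described in the question
    second, third = s[1], s[2]
    if second == "1" and third == "0":
        # One-digit day followed by month 10.
        return int(s[0]), int(s[1:3]), year
    if second == "1" and third in ("1", "2"):
        # Could be a one-digit day with month 11/12 or a two-digit day with month 1/2.
        return None
    # 2nd digit is 0 or >= 2, or 2nd digit is 1 and 3rd is >= 3:
    # two-digit day, one-digit month.
    return int(s[:2]), int(s[2]), year


print(split_merged(831945))    # (8, 3, 1945)
print(split_merged(1232000))   # (12, 3, 2000)
print(split_merged(11102000))  # (11, 10, 2000)
print(split_merged(1111990))   # None
```

This only splits the digits; it does not check that the result is a real calendar date.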

Before:

merged digits  | Day | Year
1232000        |     |

After:

merged digits  | Day | Year
3              | 12  | 2000

Just an idea. Thanks for your help and ideas!

Answer 1

Following your logic:

from pyspark.sql import functions as F

# The column is an integer, so cast it to string before using length/substring.
s = F.col("merged_digits").cast("string")
d2 = F.substring(s, 2, 1)  # second digit
d3 = F.substring(s, 3, 1)  # third digit

# Array of [day, month, year]; rows that match no branch (the ambiguous ones) stay null.
dmy = F.when(
    F.length(s) == 6,  # one-digit day, one-digit month
    F.array(
        F.lpad(F.substring(s, 1, 1), 2, "0"),
        F.lpad(F.substring(s, 2, 1), 2, "0"),
        F.substring(s, 3, 4),
    ),
).when(
    F.length(s) == 8,  # two-digit day, two-digit month
    F.array(
        F.substring(s, 1, 2),
        F.substring(s, 3, 2),
        F.substring(s, 5, 4),
    ),
).when(
    # length 7: two-digit day, one-digit month
    (d2 == "0") | (d2.cast("int") >= 2) | ((d2 == "1") & (d3.cast("int") >= 3)),
    F.array(
        F.substring(s, 1, 2),
        F.lpad(F.substring(s, 3, 1), 2, "0"),
        F.substring(s, 4, 4),
    ),
).when(
    # length 7: one-digit day, month 10
    (d2 == "1") & (d3 == "0"),
    F.array(
        F.lpad(F.substring(s, 1, 1), 2, "0"),
        F.substring(s, 2, 2),
        F.substring(s, 4, 4),
    ),
)

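The output I create is an array whose first element is the day, the second the month, and the last the year, with every element left-padded with 0 so that they all share the same format.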
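To get from that zero-padded [day, month, year] array to an actual date, the parts still need to be joined and parsed. Here is a rough plain-Python sketch of that last step; `parts_to_date` is a hypothetical helper name, and treating an impossible date as null is my assumption, not something stated in the answer.

```python
from datetime import datetime


def parts_to_date(parts):
    """Turn a zero-padded [day, month, year] list of strings into a date, or None."""
    if parts is None:
        return None  # ambiguous input was already mapped to null
    try:
        return datetime.strptime("-".join(parts), "%d-%m-%Y").date()
    except ValueError:
        return None  # not a real calendar date, e.g. 31 February


print(parts_to_date(["12", "03", "2000"]))  # 2000-03-12
print(parts_to_date(["31", "02", "2000"]))  # None
```

In PySpark itself, joining the array with `F.concat_ws` and parsing it with `F.to_date` and a matching format string would presumably play the same role, but I have not run that against this data.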

python

date

string-to-datetime

pyspark