PySpark - How to get capitalized names?

How can I get the names that are written in capital letters?

from pyspark.sql import types as T
import pyspark.sql.functions as F

test = spark.createDataFrame(
[
(1,'2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
(2,'2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
(3,'2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
(4,'2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
(5,'2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
(6,'2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
(7,'2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
(8,'2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
(9,'2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
],
T.StructType(
[
T.StructField("id_mt", T.IntegerType(), True),  # ids are ints in the data; StringType fails schema verification
T.StructField("date_send", T.StringType(), True),
T.StructField("message", T.StringType(), True),
]
),
)

Can you tell me what logic I could use to pick out the capitalized names?

The answer should be a new column called 'names' (the expected output was posted as an image).

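We made the Fugue project to port native Python or Pandas code to Spark or Dask. It lets you keep the logic readable by expressing it in native Python. Fugue can then port it to Spark with one function call.

I think this specific case is hard in Spark but easy in Python. I'll walk through the solution.

First we make a Pandas DataFrame for quick testing: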

import pandas as pd
df = pd.DataFrame([(1,'2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
(2,'2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
(3,'2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
(4,'2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
(5,'2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
(6,'2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
(7,'2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
(8,'2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
(9,'2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")], columns=["id_mt", "date_send","message"])

Now we create a native Python function to extract the name. get_name_for_one_string operates on a single string, and get_names takes in the whole DataFrame.

from typing import Any, Dict, List, Optional
import re

def get_name_for_one_string(message: str) -> Optional[str]:
    # replace every run of non-letter characters (digits, punctuation, whitespace) with one space
    message = re.sub(r"\s*[^A-Za-z]+\s*", " ", message)
    # split into words
    items = message.split(" ")
    # keep words that are all caps and longer than 2 characters
    item = [x for x in items if (x.upper() == x and len(x) > 2)]
    # return the first match, or None when the message has no such word
    if len(item) > 0:
        return item[0]
    else:
        return None

def get_names(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # add a "names" entry to every row
    for row in df:
        row["names"] = get_name_for_one_string(row["message"])
    return df
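
To see what each step of get_name_for_one_string does, here is a rough trace on message 8 from the sample data, using only the standard library. The variable names cleaned, tokens and candidates are illustrative and not part of the original function.

```python
import re

message = (
    "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. "
    "Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."
)

# Step 1: collapse every run of non-letters into a single space
cleaned = re.sub(r"\s*[^A-Za-z]+\s*", " ", message)

# Step 2: split on spaces (the trailing "." leaves one empty token at the end)
tokens = cleaned.split(" ")

# Step 3: keep tokens that are all caps and longer than 2 characters
candidates = [t for t in tokens if t.upper() == t and len(t) > 2]

print(candidates)  # ['ANTONI', 'CHAT']
```

Both ANTONI and CHAT pass the filter, so the function returns ANTONI only because it comes first. A message where another all-caps word appears before the name would return that word instead, and two-letter tokens such as AB are dropped by the length check.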

Now we can use the Fugue transform function to apply it to the Pandas DataFrame, and Fugue will handle the conversions:

from fugue import transform
transform(df, get_names, schema="*,names:str")

That works, so now we can bring it to Spark just by specifying the engine.

import fugue_spark
transform(df, get_names, schema="*,names:str", engine="spark").show()
+-----+-------------------+--------------------+-------+
|id_mt|          date_send|             message|  names|
+-----+-------------------+--------------------+-------+
|    1|2021-10-04 09:05:14|For the 2nd copy ...|   null|
|    2|2021-10-04 09:10:05|. MARCIOG, let's ...|MARCIOG|
|    3|2021-10-04 09:27:27|, we do not ident...|   null|
|    4|2021-10-04 14:55:26|Mr, SUELI. enjoy ...|  SUELI|
|    5|2021-10-06 09:15:11|. DEPREZC, let's ...|DEPREZC|
|    6|2022-02-03 08:00:12|Mr. SARA. We have...|   SARA|
|    7|2021-10-04 09:26:00|, we do not ident...|   null|
|    8|2018-10-09 12:31:33|Mr.(a) ANTONI, re...| ANTONI|
|    9|2018-10-09 15:14:51|Follow code of ba...|   null|
+-----+-------------------+--------------------+-------+

Note that you need .show() because Spark evaluates lazily. The transform function can take in both Pandas and Spark DataFrames. If you use the Spark engine, the output will also be a Spark DataFrame.
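
Because get_names only uses built-in Python types, the logic can also be sanity-checked without Fugue, Pandas, or Spark. The snippet below is a minimal sketch of such a check: the functions are condensed copies of the ones above, and the two rows are shortened versions of messages 4 and 9, so treat them as illustrative inputs rather than the full data.

```python
import re
from typing import Any, Dict, List, Optional

def get_name_for_one_string(message: str) -> Optional[str]:
    # same logic as above, condensed: letters only, all caps, longer than 2
    words = re.sub(r"\s*[^A-Za-z]+\s*", " ", message).split(" ")
    caps = [w for w in words if w.upper() == w and len(w) > 2]
    return caps[0] if caps else None

def get_names(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        row["names"] = get_name_for_one_string(row["message"])
    return df

# shortened sample rows (assumption: abbreviated from messages 4 and 9)
rows = [
    {"id_mt": 4, "message": "Mr, SUELI. enjoy the holiday with money in your account."},
    {"id_mt": 9, "message": "Doubts call 0800-999-9999. AB Tech."},
]

result = get_names(rows)
print([r["names"] for r in result])  # ['SUELI', None]
```

Testing the plain function first keeps any extraction bug separate from engine or schema issues once the same function is handed to transform.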