How to convert HTML text into plain text using PySpark? Removing HTML tags from a string
I have a text file with a column 'descn' that contains text in HTML format. I want to convert that HTML text into plain text using PySpark. Please help me with this.
File name:
mdcl_insigt.txt
Input:
PROTEUSÂ <div><br></div><div>We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.</div>
It should be converted like this. Output:
PROTEUS We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.
You can try regexp_replace():
from pyspark.sql.functions import regexp_replace
df = df.withColumn("parsed_descn", regexp_replace("descn", "<[^>]+>", ""))
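This regular expression is not perfect and can fail on some inputs. For example, it leaves HTML entities such as &amp; untouched, and it also removes any literal text that happens to sit between a < and a >. Test it against more of your data and refine it as needed.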
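As one possible way to make this more robust, here is a minimal sketch that uses Python's standard-library html.parser module instead of a regular expression. The names _TextExtractor and html_to_text are hypothetical, and the function has not been run against your file; it only shows the idea of keeping text nodes and dropping tags.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(value):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    if value is None:
        return None
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    # split() with no arguments splits on any run of whitespace, so joining
    # with a single space collapses repeated spaces and newlines.
    return " ".join("".join(parser.parts).split())
```

To apply it to the 'descn' column you could wrap it as a Python UDF, for example udf(html_to_text, StringType()) with udf from pyspark.sql.functions and StringType from pyspark.sql.types, and pass the result to withColumn. A Python UDF is generally slower than the built-in regexp_replace, so this trade-off only makes sense if the regex proves too fragile for your data.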
When I tried it on regexr, it worked for your sample string.
PySpark output:
from pyspark.sql import functions as F
df.withColumn("parsed", F.regexp_replace("descn", "<[^>]+>", "")).select("parsed").collect()
[Row(parsed='PROTEUSÂ We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS] protocol for managing of RP [specimens]. Please e-mail me with work around options.')]
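The output above still contains the stray "Â" from the input. That character typically appears when a UTF-8 non-breaking space is decoded as Latin-1, so it is probably an encoding issue in how the file was read (an assumption, since the original bytes are not shown). The sketch below demonstrates, in plain Python, two extra patterns that could follow the tag removal: one for the debris and one for repeated whitespace. The name clean_text is hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")             # same tag pattern as in the answer
DEBRIS = re.compile(r"[\u00c2\u00a0]+")  # stray "Â" and non-breaking spaces
SPACES = re.compile(r"\s+")              # runs of whitespace


def clean_text(value):
    """Strip tags, replace encoding debris with a space, collapse whitespace."""
    without_tags = TAG.sub("", value)
    without_debris = DEBRIS.sub(" ", without_tags)
    return SPACES.sub(" ", without_debris).strip()
```

The same pattern strings could be passed to additional regexp_replace calls on the column. Note that removing every "Â" would also damage text that legitimately uses that letter, so reading the file with the correct encoding is the cleaner fix if that turns out to be the cause.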