Removing punctuation marks from text in Scala - Spark
Here is a sample of my data:
case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ().
I want to remove all punctuation except the dot (.), and also remove words with length <= 2. For example, my expected output is:
case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid .
This should be implemented in Scala. I tried:
replaceAll( """\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")
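But it doesn't work well. Can anyone help me?
How about this:
replaceAll("(\\(|\\)|'|/)", "")
Then you just add more punctuation marks to remove, separated by |, and make sure to escape characters such as ( and ) with a double backslash.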
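To make the escaping rule concrete, here is a minimal sketch. The object name, the helper names and the sample string are made up for illustration. It also shows that a character class does the same job without any escaping of the parentheses.

```scala
object EscapedAlternation {
  // In a regular (non-triple-quoted) string literal every regex backslash is doubled.
  def viaAlternation(s: String): String = s.replaceAll("(\\(|\\)|'|/)", "")

  // Same effect with a character class; inside [...] parentheses need no escaping.
  def viaCharClass(s: String): String = s.replaceAll("[()'/]", "")

  def main(args: Array[String]): Unit = {
    val sample = "doors (especially battery/memory card door), it's fine"
    println(viaAlternation(sample)) // doors especially batterymemory card door, its fine
    println(viaCharClass(sample))   // same result
  }
}
```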
You can try filtering the string like this:
val example = "Hey there! It's me, myself and I."
example.filterNot(x => x == ',' || x == '!' || x == 'm')
res3: String = Hey there It's e yself and I.
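The same idea scales better with a Set of characters, since a Set[Char] can be passed directly as the predicate. This is only a sketch: the object name and the particular set of characters are assumptions, and the dot is left out of the set on purpose so that it survives.

```scala
object FilterPunctuation {
  // Assumed list of characters to drop; extend as needed. The dot is not included.
  val unwanted: Set[Char] = Set(',', '!', '\'', '"', '(', ')', '/', '-')

  // A Set[Char] is itself a Char => Boolean, so it can be used as the filter predicate.
  def strip(text: String): String = text.filterNot(unwanted)

  def main(args: Array[String]): Unit =
    println(strip("Hey there! It's me, myself and I.")) // Hey there Its me myself and I.
}
```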
Try this, it should work:
val str = """
|case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
|xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ().
""".stripMargin('|')
println(str)
val pat = """[^\w\s\.$]"""
val pat2 = """\s\w{2}\s"""
println(str.replaceAll(pat, "").replaceAll(pat2, ""))
Output:
case time especially its purse read manual care follow care instructions make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dockchance back base xm3020 . traveling bag connect laptop extra speaker . amount paid .
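Note the merged "dockchance" in the output: pat2 consumes the whitespace on both sides of a two-character word, and it only matches words of exactly two characters. A possible variant, sketched here with made-up names, uses word boundaries and removes only the trailing whitespace:

```scala
object ShortWords {
  // The pattern from the answer: eats the whitespace on both sides of the word.
  def original(s: String): String = s.replaceAll("""\s\w{2}\s""", "")

  // Word boundaries instead of surrounding whitespace, one or two word characters,
  // and only the whitespace after the word is removed.
  def bounded(s: String): String = s.replaceAll("""\b\w{1,2}\b\s*""", "")

  def main(args: Array[String]): Unit = {
    println(original("speaker dock it chance")) // speaker dockchance
    println(bounded("speaker dock it chance"))  // speaker dock chance
  }
}
```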
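Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct}, and that a character can be removed from a character class with something like [a-z&&[^def]]. From there, it is easy to define a regex that removes all punctuation except the dot: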
s.replaceAll("""[\p{Punct}&&[^.]]""", "")
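As a quick check of the class subtraction, here is a small sketch (the object name, method name and sample string are illustrative only):

```scala
object PunctExceptDot {
  // \p{Punct} intersected with "anything but a dot": all punctuation except '.'.
  def stripPunct(s: String): String = s.replaceAll("""[\p{Punct}&&[^.]]""", "")

  def main(args: Array[String]): Unit =
    println(stripPunct("door (open), it's fine.")) // door open its fine.
}
```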
Removing words of length <= 2 can be done like this:
s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
Combining the two gives:
s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Note how I added \s* to remove the extra whitespace.
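Also, note that the regex above removes "$" altogether, because it is a punctuation character (it is part of Java's \p{Punct} class).
If this is not desirable (as your expected output seems to indicate), be more precise about what you consider punctuation.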
For example, you might want to treat only the following characters as punctuation to remove: ? ! : ( )
s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Alternatively, you can just add "$" to the list of "not-punctuation" characters, along with the dot:
s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
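One thing to watch with the combined expression: in a single pass, \b sees a boundary at an apostrophe, so "it's" is matched as the short word "it", then the apostrophe, then the short word "s", and disappears entirely instead of becoming "its". The sketch below (object and method names are made up) compares the single pass with a two-pass variant that strips punctuation first and then short words. The final whitespace collapse and trim are my own addition.

```scala
object CleanText {
  private val punctExceptDot = """[\p{Punct}&&[^.$]]"""
  private val shortWord = """\b\p{IsLetter}{1,2}\b"""

  // Single pass, equivalent to the combined expression above.
  def onePass(s: String): String =
    s.replaceAll("(" + punctExceptDot + "|" + shortWord + """)\s*""", "")

  // Two passes: punctuation first, so "it's" becomes "its" before the length check,
  // then short words, then collapse whitespace runs and trim (an added tidy-up step).
  def twoPass(s: String): String =
    s.replaceAll(punctExceptDot, "")
      .replaceAll(shortWord, "")
      .replaceAll("""\s+""", " ")
      .trim

  def main(args: Array[String]): Unit = {
    println(onePass("it's purse")) // purse
    println(twoPass("it's purse")) // its purse
  }
}
```

Which behaviour is right depends on whether contractions such as "it's" should count as short words. The expected output in the question keeps "its", which is what the two-pass version produces.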