Remove all punctuation from string, except if it's between digits

Question

I have a text that contains words and numbers. Here is a representative example:

string = "This is a 1example of the text. But, it only is 2.5 percent of all data"

I want to convert it to something like this:

"This is a  1 example of the text But it only is  2.5  percent of all data"

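So: remove the punctuation (it could be . or , or anything else in string.punctuation) and put a space between a digit and a word when they are concatenated. But keep floats such as the 2.5 in my example.

I used the following code: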

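import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Put a space before every capital letter, then collapse repeated whitespace
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start, but not there yet!
#item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
# Split on runs of digits (the capturing group keeps them) and rejoin with spaces
item = ' '.join(re.split(r'(\d+)', item))
print(item)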

The result is:

 >> "This is a  1 example of the text. But, it only is  2 . 5  percent of all data"

I'm almost there, but I can't figure out the last piece.

Answer 1

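I'm out of touch with Python, but I know a bit about regular expressions. May I suggest using an or? I would use this regex: "(\d+)([a-zA-Z])|([a-zA-Z])(\d+)", and then use this as the replacement string: " "
If some special cases bother you, you can pass the backreferences to a procedure and handle them one by one, perhaps by checking whether your "" can be converted to a float. TCL has this kind of functionality built in, and Python should too.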
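
A minimal Python sketch of this idea, for illustration only. It uses two separate substitutions instead of a single alternation so that the backreferences stay simple; the names space_digit_letter and is_float are hypothetical helpers, not something from the answer above.

```python
import re


def is_float(token):
    """Hypothetical helper: True if Python's float() accepts the token."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def space_digit_letter(text):
    """Insert a space wherever a digit touches an ASCII letter, in either order."""
    # digit followed by letter, e.g. "1example" -> "1 example"
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)
    # letter followed by digit, e.g. "abc123" -> "abc 123"
    return re.sub(r"([A-Za-z])(\d)", r"\1 \2", text)
```

Because "." is neither a digit nor a letter, a float such as 2.5 is left alone by both substitutions. This only handles the spacing; the punctuation still has to be removed separately.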

Answer 2

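I tried this and it works fine.

a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
a = a.replace(". ", " ").replace(", ", " ")
print(a)

Note that in the replace calls each punctuation mark is followed by a space. I simply replace "punctuation plus space" with a single space.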

Answer 3

Code:

from itertools import groupby

s1 = "This is a 1example of the text. But, it only is 2.5 percent of all data"
s2 = [''.join(g) for _, g in groupby(s1, str.isalpha)]
s3 = ' '.join(s2).replace("   ", "  ").replace("  ", " ")

# you can keep adding a replace for each punctuation mark
s4 = s3.replace(". ", " ").replace(", "," ").replace("; "," ").replace("- "," ").replace("? "," ").replace("! "," ").replace(" ("," ").replace(") "," ").replace('" '," ").replace(' "'," ").replace('... '," ").replace('/ '," ").replace(' “'," ").replace('” '," ").replace('] '," ").replace(' ['," ")

s5 = s4.replace("  ", " ")
print(s5)

Output:

'This is a 1 example of the text But it only is 2.5 percent of all data'

P.S.: You can look up a list of punctuation marks and keep adding them as .replace() calls.
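
As a possible alternative to the long .replace() chain, here is a rough sketch that keeps the groupby step but then drops every token that consists only of punctuation. The function name clean_with_groupby and the use of string.punctuation as the list of marks are my assumptions, not part of the answer above.

```python
from itertools import groupby
import string


def clean_with_groupby(text):
    """Split letters from non-letters, then drop punctuation-only tokens."""
    # Alternating runs of letters and non-letters, as in the code above
    runs = ["".join(g) for _, g in groupby(text, str.isalpha)]
    # Joining with spaces and splitting again isolates each run as a token
    tokens = " ".join(runs).split()
    punct = set(string.punctuation)
    # "2.5" survives because it also contains digits; "." and "," do not
    kept = [t for t in tokens if not all(ch in punct for ch in t)]
    return " ".join(kept)
```

Punctuation glued directly to a number, such as the parentheses in "(2.5)", stays in place with this approach, so it is not a complete solution.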

Answer 4

OK guys, here is an answer (the best? I don't know, but it seems to work):

import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# if two strings are concatenated and the second starts with a capital letter
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# if a word starts with a digit, like "1example"
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Magic line that strips punctuation from both ends of each token, so floats survive
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
item = item.replace("  ", " ")
print(item)

Answer 5

You can use regex lookarounds like this:

(?<!\d)[.,;:](?!\d)

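The idea is to use a character class to collect the punctuation you want to remove, and lookarounds to match only punctuation that has no digit on either side.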

import re

regex = r"(?<!\d)[.,;:](?!\d)"

test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"

# Remove every match; punctuation next to a digit is left alone
result = re.sub(regex, "", test_str)
print(result)

The result is:

This is a 1example of the text But it only is 2.5 percent of all data
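
The same lookaround idea can probably be stretched to cover everything in string.punctuation plus the digit/word spacing the question asks for. This is only a sketch: the name clean, the choice to replace punctuation with a space rather than an empty string (so that "text.But" does not become "textBut"), and the stricter rule "keep punctuation only when it has a digit on both sides" are my assumptions, not part of the answer above.

```python
import re
import string

# Character class built from string.punctuation (escaped for safe use in a regex)
PUNCT = "[" + re.escape(string.punctuation) + "]"
# Match punctuation that lacks a digit before it OR lacks a digit after it,
# i.e. everything except punctuation sitting between two digits.
NOT_BETWEEN_DIGITS = re.compile(rf"(?<!\d){PUNCT}|{PUNCT}(?!\d)")
# Zero-width boundary between a digit and an ASCII letter, in either order
DIGIT_LETTER_BOUNDARY = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")


def clean(text):
    text = NOT_BETWEEN_DIGITS.sub(" ", text)
    text = DIGIT_LETTER_BOUNDARY.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions
    return " ".join(text.split())
```

With this variant a sentence-final period after a number, as in "it is 2.5.", is removed as well, while separators inside numbers such as 3,000 are kept.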

Answer 6

Here is a regex approach:

([^ ]?)(?:[^\P{punct}.]|(?<!\d)\.(?!\d))([^ ]?)

Replace inside a callback:

if length of $1 > 0 and length of $2 > 0
replace with $1 + space + $2
else replace with $1$2

Expanded:

 ( [^ ]? )                     # (1)
 (?:
      [^\P{punct}.] 
   |  
      (?<! \d )
      \.
      (?! \d )
 )
 ( [^ ]? )                     # (2)

If you don't want any logic for the characters adjacent to the punctuation,
use (?:[^\P{punct}.]|(?<!\d)\.(?!\d)) and replace with nothing.
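
Python's built-in re module does not support \p{punct}, so a Python version needs a stand-in. The sketch below assumes string.punctuation (minus the period) is an acceptable substitute; the names OTHER_PUNCT, PATTERN, _join and strip_punct are made up for illustration.

```python
import re
import string

# Assumption: string.punctuation without "." plays the role of [^\P{punct}.]
OTHER_PUNCT = "[" + re.escape(string.punctuation.replace(".", "")) + "]"
PATTERN = re.compile(rf"([^ ]?)(?:{OTHER_PUNCT}|(?<!\d)\.(?!\d))([^ ]?)")


def _join(match):
    """Callback: keep the neighbours, adding a space only if both exist."""
    left, right = match.group(1), match.group(2)
    if left and right:
        return left + " " + right
    return left + right


def strip_punct(text):
    return PATTERN.sub(_join, text)
```

For example, strip_punct("the text.But, it") gives "the text But it". A run of several consecutive punctuation marks may need a second pass, since the neighbouring groups can themselves capture a punctuation character.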
