Python RegEx to remove social security number from string generated with speech-to-text
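For GDPR compliance reasons, I am trying to remove social security numbers (SSNs) from messy data generated with speech-to-text. Here is a sample string (translated to English, which explains why 'and' appears where the SSN is listed):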
sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
My goal is to remove the part "thirteen ... forty " while keeping any other numbers that may appear in the string, resulting in:
sample1_wo_ssn = "hello my name is sofie my social security number is and I live on mountain street number twelve"
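The length of the social security number can vary (3-10 separated numbers) depending on how the data was generated.
My approach:
- Replace the written-out numbers with digits using a dictionary
- Use a regex to find where 3 or more numbers occur with only whitespace or "and" separating them, and remove them together with any numbers that follow those 3.
Here is my code: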
import re
number_dict = {
'zero': '0',
'one': '1',
'two': '2',
'three': '3',
'four': '4',
'five': '5',
'six': '6',
'seven': '7',
'eight': '8',
'nine': '9',
'ten': '10',
'eleven': '11',
'twelve': '12',
'thirteen': '13',
'fourteen': '14',
'fifteen': '15',
'sixteen': '16',
'seventeen': '17',
'eighteen': '18',
'nineteen': '19',
'twenty': '20',
'thirty': '30',
'forty': '40',
'fifty': '50',
'sixty': '60',
'seventy': '70',
'eighty': '80',
'ninety': '90'
}
sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
sample1_temp = [number_dict.get(item,item) for item in sample1.split()]
sample1_numb = ' '.join(sample1_temp)
re_results = re.findall(r'(\d+ (and\s)?\d+ (and\s)?\d+\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?)', sample1_numb)
print(re_results)
Output:
[('13 0 4 5 and 70 18 7 and 40 and ', '', '', '', '5', 'and ', '70', '', '18', '', '7', 'and ', '40', 'and ', '', '', '', '', '')]
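This is where I am stuck.
In this example, I could do something like sample1_wh_ssn = re.sub(re_results[0][0],'',sample1_numb) to get the desired result, but this will not generalize.
Any help would be greatly appreciated.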
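A side note on the output above: when a pattern contains more than one capturing group, re.findall returns one tuple of groups per match, which is why the list is mostly empty strings. The following is a minimal sketch, not the asker's code, that assumes the digit-converted string from the question and uses non-capturing groups with a quantifier instead of spelling out each optional number by hand. The name ssn_like is made up for illustration.

```python
import re

# Digit-converted text, as produced by the dictionary step in the question
text = "my social security number is 13 0 4 5 and 70 18 7 and 40 and I live on mountain street number 12"

# (?:...) groups do not capture, so findall returns whole matches, not tuples.
# {2,} repeats "separator + number" at least twice, i.e. 3 or more numbers in total.
ssn_like = re.compile(r'\d+(?:\s+(?:and\s+)?\d+){2,}')

matches = ssn_like.findall(text)   # list of matched substrings
cleaned = ssn_like.sub('', text)   # same pattern used directly for removal

print(matches)
print(cleaned)
```

Because the same compiled pattern is passed to sub, there is no need to feed a previous match back in as a pattern, which is the part that would not generalize.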
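Here is an implementation of your current logic, namely:
- Convert number words for 1 to 99 into digits
- Remove all instances of 3 or more numbers separated by whitespace (with an optional "and" in between)
- Convert the remaining one- or two-digit numbers back to words.
Credits:
- Converting words to numbers: Is there a way to convert number words to Integers? by recursive
- Converting numbers to words: How do I tell Python to convert integers into words by kindall
See the Python code: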
import re

# Number words for 0-19 and for the tens
number_words = [ "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
number_words_tens = [ "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety" ]
# Matches a single number word (a units word, or a tens word on its own)
number_words_rx = re.compile(r'\b(?:(?:{0})?(?:{1})|(?:{0}))\b'.format("|".join(number_words_tens),"|".join(number_words)))
# Matches 3 or more numbers separated by whitespace, with an optional "and" in between
main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')

# Lookup list indexed by value (0-99) for converting digits back to words;
# copy the list so that number_words itself is not mutated
numbers_1_99 = list(number_words)
numbers_1_99.extend(tens if ones == "zero" else (tens + "-" + ones)  # stackoverflow.com/a/8982279/3832970
    for tens in "twenty thirty forty fifty sixty seventy eighty ninety".split()
    for ones in number_words[0:10])

def text2int(textnum, numwords={}):  # stackoverflow.com/a/493788/3832970
    # Build the lookup table once; the mutable default argument acts as a cache
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):
            numwords[word] = (1, idx)
        for idx, word in enumerate(tens):
            numwords[word] = (1, idx * 10)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
            raise Exception("Illegal word: " + word)
        scale, increment = numwords[word]
        current = current + increment
    return result + current

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
# Step 1: number words -> digits
sample1 = number_words_rx.sub(lambda x: str(text2int(x.group())), sample1)
# Step 2: remove runs of 3 or more numbers
re_results = main_rx.sub('', sample1)
# Step 3: remaining digits -> number words
print( re.sub(r'\d{1,2}', lambda x: numbers_1_99[int(x.group())], re_results) )
Output: hello my name is sofie my social security number is and I live on mountain street number twelve
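As a variation on the same idea, the run of number words can be matched directly, so that numbers outside the run never need to be converted to digits and back. This is only a sketch under assumptions: the helper name redact_number_runs and its min_count parameter are made up for illustration, and it only handles the single-word numbers that appear in the sample (no hyphenated forms such as "twenty-one").

```python
import re

# Hypothetical helper, not part of the answer above.
UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Longest alternatives first, so "seventeen" is tried before "seven".
WORD = "(?:" + "|".join(sorted(UNITS + TENS, key=len, reverse=True)) + ")"

def redact_number_runs(text, min_count=3):
    """Remove runs of at least min_count number words separated by
    whitespace, with an optional 'and' between neighbours."""
    run = re.compile(
        r"\s*\b{w}(?:\s+(?:and\s+)?{w}){{{n},}}\b".format(w=WORD, n=min_count - 1)
    )
    return run.sub("", text)

sample1 = ("hello my name is sofie my social security number is thirteen zero "
           "four five and seventy eighteen seven and forty and I live on "
           "mountain street number twelve")
print(redact_number_runs(sample1))
```

A single number word such as the street number, or two number words in a row, falls below min_count and is left in place.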