How do I find a specific number pattern within a column of strings and replace that value with a text version of that ordinal number?
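Please bear with me, I'm new to Python. I'm building a function that I can use to clean up text from various surveys. I feel like I'm close to converting the numeric version of an ordinal number into its text version, but I'm not quite there. Here is the function I'm trying to build (note that I tried two ways of finding the regex pattern on the nbr = line of the function, and I explain below the error I got with each):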
import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47, 56, 59, 134, 454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    nbr = re.findall('(\d+)[st|nd|rd|th]', words)  # words.str.findall('(\d+)[st|nd|rd|th]')
    newText = words
    for n in nbr:
        ordinal_words = num2words(n, ordinal=True)
        newText = words.replace(r'\d+[st|nd|rd|th]', ordinal_words)
    return newText

my_df['the_string_clean'] = replace_ordinal_numbers(str(my_df['the_string']))
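Errors:
When I run words.str.findall on the "nbr =" line of the function, I get this error: AttributeError: 'str' object has no attribute 'str'
When I run re.findall instead, I get the data frame back, but the 'the_string_clean' column does not reflect the string on each row. Instead, I get: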
   record  the_string                                 the_string_clean
0      47  This is the first string                   "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
1      56  This is the 2nd string                     "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
2      59  nothing to see here                        "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
3     134  4th string has the date: today is the 8th  "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
4     454  this has a typo10th                        "0 This is the first string 1 This is the 2nd string 2 nothing to see here 3 4th string has the date: today is the 8th 4 This has a typo10th Name: the_string, dtype: object"
Expected output:
record  the_string                                 the_string_clean
47      this is the first string                   this is the first string
56      this is the 2nd string                     this is the second string
59      nothing to see here                        nothing to see here
134     4th string has the date: today is the 8th  fourth string has the date: today is the eighth
454     this has a typo10th                        this has a typotenth
I hope I've been clear enough. I'm new to Python, so any help would be greatly appreciated.
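You can simplify your replace_ordinal_numbers function by using re.sub and calling num2words in a lambda function as the replacement. Then just use Series.apply (called on the my_df['the_string'] column) to run the function on each string in the column: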
import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47, 56, 59, 134, 454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    # Group 1 captures the digits; the non-capturing group matches the suffix.
    # int() makes the conversion explicit before the digits reach num2words.
    return re.sub(r'(\d+)(?:st|nd|rd|th)',
                  lambda m: num2words(int(m.group(1)), ordinal=True),
                  words)

# Apply the function to each string in the column, one row at a time.
my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)
my_df
Output:
   record  the_string
0      47  this is the first string
1      56  this is the second string
2      59  nothing to see here
3     134  fourth string has the date: today is the eighth
4     454  this has a typotenth
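Two details of the original attempt are worth spelling out. Calling str(my_df['the_string']) hands the function one single string holding the printed form of the whole Series, which is why every row ended up with the same value and why .str was not available on it. In addition, str.replace does a literal substitution and does not interpret regex patterns, so the replace call inside the loop never changed anything. The sketch below illustrates the second point using only the standard library; ORDINALS and to_ordinal_word are hypothetical stand-ins for num2words(n, ordinal=True), covering just a few values so the example runs without third-party packages.

```python
import re

# Hypothetical stand-in for num2words(n, ordinal=True), limited to a few
# values so the example has no third-party dependencies.
ORDINALS = {2: "second", 4: "fourth", 8: "eighth", 10: "tenth"}

def to_ordinal_word(match):
    """Return the ordinal word for the digits captured in group 1."""
    number = int(match.group(1))
    # Fall back to the matched text when the number is not in the table.
    return ORDINALS.get(number, match.group(0))

text = "4th string has the date: today is the 8th"

# str.replace searches for the literal characters of its first argument,
# so a regex pattern passed to it finds nothing in this text.
unchanged = text.replace(r"\d+(?:st|nd|rd|th)", "fourth")

# re.sub interprets the pattern and calls the function once per match.
converted = re.sub(r"(\d+)(?:st|nd|rd|th)", to_ordinal_word, text)

print(unchanged)
print(converted)
```

Because the replacement is computed per match, each ordinal in a string gets its own word, which a single str.replace call with one fixed replacement cannot do.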
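Note that you need to use an alternation, (?:st|nd|rd|th), in the regex to match one of st, nd, rd or th. The character class you were using, [st|nd|rd|th], matches exactly one character from the set dhnrst| instead, so it would match "2n" or "3|" rather than a full ordinal suffix.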
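As a quick check of the difference, here is a minimal sketch comparing the two patterns on a few sample strings. The sample values and the matched_text helper are made up for illustration.

```python
import re

# Character class: digits followed by ONE character out of s, t, |, n, d, r, h.
char_class = re.compile(r"\d+[st|nd|rd|th]")
# Alternation: digits followed by one of the four two-letter suffixes.
alternation = re.compile(r"\d+(?:st|nd|rd|th)")

def matched_text(pattern, s):
    """Return the text matched at the start of s, or None if there is no match."""
    m = pattern.match(s)
    return m.group(0) if m else None

samples = ["2nd", "10th", "3|", "5d", "7x"]

class_matches = [matched_text(char_class, s) for s in samples]
alt_matches = [matched_text(alternation, s) for s in samples]

print(class_matches)  # ['2n', '10t', '3|', '5d', None]
print(alt_matches)    # ['2nd', '10th', None, None, None]
```

The character class consumes only the first letter of the suffix and also accepts strings such as "3|" and "5d", while the alternation matches the complete suffix and nothing else.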