How do I find a specific number pattern within a column of strings and replace that value with a text version of that ordinal number?

Question

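Forgive me, I'm new to Python. I'm building a function that I can use to clean up text from various surveys. I feel like I'm close to converting the numeric version of ordinal numbers to the text version, but I'm not quite there. Here is the function I'm trying to build (note that I tried two ways of finding the regex pattern on the *nbr = * line of the function, and I hit errors with both, as explained below):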

import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    nbr = re.findall('(\d+)[st|nd|rd|th]', words) #words.str.findall('(\d+)[st|nd|rd|th]')
    
    newText = words
    for n in nbr:
        ordinal_words = num2words(n, ordinal=True)
        newText = words.replace(r'\d+[st|nd|rd|th]', ordinal_words)
    return newText

my_df['the_string_clean'] = replace_ordinal_numbers(str(my_df['the_string']))

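Errors: when I run words.str.findall on the "nbr =" line of the function, I get: AttributeError: 'str' object has no attribute 'str'. When I run re.findall instead, I get the dataframe back, but the 'the_string_clean' column doesn't reflect the string on each row. Instead, I get: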

    record  the_string                  the_string_clean
0   47      This is the first string    "0This is the first string 1This is the 2nd string 2nothing to 
                                        see here 3 4th string has the date: today is the 8th 4This has 
                                        a typo10th"
Name: the_string, dtype: object
1   56      This is the 2nd string      "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
2   59       nothing to see here        "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
3   134      4th string has the         "0This is the first string 1This is the 2nd string 2 nothing to
             date: today is the 8th     see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object
4   454      this has a typo10th        "0This is the first string 1This is the 2nd string 2 nothing to 
                                        see here3 4th string has the date: today is the 8th 4This has a 
                                        typo10th"
Name: the_string, dtype: object

Expected output: this is what I expect to get:

record    the_string                                 the_string_clean
47        this is the first string                   this is the first string
56        this is the 2nd string                     this is the second string
59        nothing to see here                        nothing to see here
134       4th string has the date: today is the 8th  fourth string has the date: today is the eighth
454       this has a typo10th                        this has a typotenth

Hopefully I was clear enough. I'm new to Python, so any help would be appreciated.

Answer 1

You can simplify your replace_ordinal_numbers function by using re.sub and calling num2words in a lambda function as the replacement. Then just call apply on the column (Series.apply) to run the function once per row:

import pandas as pd
from num2words import num2words
import re

my_df = pd.DataFrame({"record": [47,56,59,134,454],
                      "the_string": ["this is the first string",
                                     "this is the 2nd string",
                                     "nothing to see here",
                                     "4th string has the date: today is the 8th",
                                     "this has a typo10th"]})

def replace_ordinal_numbers(words):
    return re.sub(r'(\d+)(?:st|nd|rd|th)', lambda m: num2words(m.group(1), ordinal=True), words)

my_df['the_string'] = my_df['the_string'].apply(replace_ordinal_numbers)

my_df

Output

   record                                       the_string
0      47                         this is the first string
1      56                        this is the second string
2      59                              nothing to see here
3     134  fourth string has the date: today is the eighth
4     454                             this has a typotenth
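
To see the callable-replacement mechanism in isolation, here is a minimal sketch that runs without pandas or num2words. The ORDINAL_WORDS table and the to_ordinal_word helper are hypothetical stand-ins for num2words(n, ordinal=True) and cover only a few values; they are assumptions for illustration, not part of either library. The last line also shows why the original loop had no effect: str.replace treats its first argument as literal text, not as a regex.

```python
import re

# Hypothetical stand-in for num2words(n, ordinal=True), limited to a few
# values so the example runs with the standard library only.
ORDINAL_WORDS = {2: "second", 4: "fourth", 8: "eighth", 10: "tenth"}

ORDINAL_PATTERN = re.compile(r"(\d+)(?:st|nd|rd|th)")

def to_ordinal_word(match):
    """Return the word for the matched number, or the matched text if unknown."""
    number = int(match.group(1))
    return ORDINAL_WORDS.get(number, match.group(0))

def replace_ordinal_numbers(text):
    # re.sub calls to_ordinal_word once per match and splices in its result.
    return ORDINAL_PATTERN.sub(to_ordinal_word, text)

print(replace_ordinal_numbers("4th string has the date: today is the 8th"))
print(replace_ordinal_numbers("this has a typo10th"))

# str.replace looks for the literal characters of the pattern, finds none,
# and returns the string unchanged.
print("the 2nd".replace(r"\d+(?:st|nd|rd|th)", "second"))
```

Because the replacement is computed per match, a string with several ordinals gets each one converted in a single pass, with no need to collect the numbers first.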

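Note that you need to use an alternation, (?:st|nd|rd|th), in the regex to match one of st, nd, rd or th. The character class you were using, [st|nd|rd|th], matches exactly one character from the set d, h, n, r, s, t or |, so it consumes only the first letter of the suffix.

Also, str(my_df['the_string']) converts the entire column into a single string (its printed representation, index included), which is why every row ended up with the same text. apply calls the function once per row instead.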
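
The difference between the two patterns can be checked directly with the standard re module. This is a small sketch; the sample strings and the full_matches helper are made up for illustration.

```python
import re

# The pattern from the question: a character class matches ONE character.
CHAR_CLASS = re.compile(r"(\d+)[st|nd|rd|th]")
# The corrected pattern: a non-capturing group with alternation.
ALTERNATION = re.compile(r"(\d+)(?:st|nd|rd|th)")

def full_matches(pattern, text):
    """Return the complete text of every match, not just the captured digits."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = "the 2nd and the 10th"
print(full_matches(CHAR_CLASS, sample))   # ['2n', '10t']
print(full_matches(ALTERNATION, sample))  # ['2nd', '10th']

# The character class leaves the second suffix letter behind after substitution.
print(CHAR_CLASS.sub("second", "the 2nd"))   # the secondd
print(ALTERNATION.sub("second", "the 2nd"))  # the second

# It also matches things that are not ordinals at all.
print(full_matches(CHAR_CLASS, "model 5s"))   # ['5s']
print(full_matches(ALTERNATION, "model 5s"))  # []
```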
