Extract leading substring before parenthesis or digits in dataframe with regex

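I am looking for a way to extract just the country name, without the extra names or numbers that follow it.

My goal is to extract, into a new column, the leading substring that is not inside parentheses and has no trailing space or digits.

For example: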

String                            New string
 Bolivia (Plurinational State of)  Bolivia
 United States of America20        United States of America

The data looks like this:

**Country**                               **Energy Supply** 
Antigua and Barbuda                           8000000   
Bolivia (Plurinational State of)              50000
Iran (Islamic Republic of)                    20000  
Sint Maarten (Dutch part)                     58000
United States of America20                    65000
China, Macao Special AdministrativeRegion4    52000
.....more cases....                        ....more cases....

My code looks like this:

df['newcontry']=df['Country'].str.extract(r'(\w*\s)')

And it returns this:

**Country**                               **Energy Supply**   newcontry
    Antigua and Barbuda                           8000000      Antigua
    Bolivia (Plurinational State of)              50000        Bolivia
    Iran (Islamic Republic of)                    20000        Iran
    Sint Maarten (Dutch part)                     58000        Sint
    United States of America20                    65000        United
    China, Macao Special AdministrativeRegion4    52000        China

What can I change to fix this?

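Assuming you only need the leading chunk of the string, you can use the pattern r"^(.+?) ?(?:\d|\(|$)". The lazy group (.+?) captures the chunk you are interested in, and the alternation (?:\d|\(|$) stops the match at the first digit, the first opening parenthesis, or the end of the string, whichever comes first. The optional space before the alternation keeps a trailing space out of the captured group.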

>>> import pandas as pd
>>> df = pd.DataFrame({"Country": ["Bolivia (Plurinational State of)", "United States of America20", "Antigua and Barbuda"]})
>>> df
                            Country
0  Bolivia (Plurinational State of)
1        United States of America20
2               Antigua and Barbuda
>>> df["Country"].str.extract(r"^(.+?) ?(?:\d|\(|$)")
                          0
0                   Bolivia
1  United States of America
2       Antigua and Barbuda
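
To make the pieces of that pattern easier to follow, here is a minimal sketch that spells it out with `re.VERBOSE` in plain Python. The names `LEADING_NAME` and `leading_name` are made up for this illustration, and the sample strings are taken from the question.

```python
import re

# Verbose spelling of the pattern above; it is meant to match the same text.
LEADING_NAME = re.compile(
    r"""
    ^(.+?)        # group 1: lazily capture as few characters as possible
    [ ]?          # optional single space before the stop marker (not captured)
    (?:\d|\(|$)   # stop at the first digit, opening parenthesis, or end of string
    """,
    re.VERBOSE,
)


def leading_name(text):
    """Return the leading chunk of `text`, or None if the pattern does not match."""
    match = LEADING_NAME.search(text)
    return match.group(1) if match else None


samples = [
    "Antigua and Barbuda",
    "Bolivia (Plurinational State of)",
    "Sint Maarten (Dutch part)",
    "United States of America20",
    "China, Macao Special AdministrativeRegion4",
]

for sample in samples:
    print(f"{sample!r} -> {leading_name(sample)!r}")
```

Because the group is lazy, it grows one character at a time and stops as soon as the rest of the pattern can match, which is why multi-word names such as "Sint Maarten" are kept whole instead of being cut at the first space.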

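Another option is to remove the trailing content you don't want. Note that on pandas 2.0 and later, str.replace treats the pattern as a literal string by default, so regex=True has to be passed explicitly:

df['newcontry'] = df['Country'].str.replace(r' ?(?:\(|\d).*', '', regex=True)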
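
As a rough end-to-end sketch, the two approaches can be compared on the sample rows from the question. The column name `newcontry_replace` is invented here only so that both results can sit side by side; `expand=False` is used so that `str.extract` returns a Series that can be assigned directly to a column.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "Country": [
            "Antigua and Barbuda",
            "Bolivia (Plurinational State of)",
            "Iran (Islamic Republic of)",
            "Sint Maarten (Dutch part)",
            "United States of America20",
            "China, Macao Special AdministrativeRegion4",
        ],
        "Energy Supply": [8000000, 50000, 20000, 58000, 65000, 52000],
    }
)

# Approach 1: capture the leading chunk up to the first digit, "(" or end of string.
df["newcontry"] = df["Country"].str.extract(r"^(.+?) ?(?:\d|\(|$)", expand=False)

# Approach 2: delete everything from the first "(" or digit onward,
# including one optional space before it. The column name is hypothetical.
df["newcontry_replace"] = df["Country"].str.replace(
    r" ?(?:\(|\d).*", "", regex=True
)

print(df[["Country", "newcontry", "newcontry_replace"]])
```

On these rows both columns should hold the same values. The two approaches can differ on other inputs, for example a string that starts with a digit, so it is worth checking them against the full data.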