Extracting data from strings when included in a list

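I am trying to sort the geographic information held in 4 columns of a Pandas DataFrame so that administrative divisions of the same kind are always stored in the same column.

I have built 5 lists of strings containing the information for the 5 geographic levels I want to store.

I tried to populate the consistent columns by comparing the original 4 inconsistent columns against my 5 consistent lists, but the nan values in the original columns either trigger errors in my code or return too many nan values in the resulting columns. A minimal code example is below.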

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['nan', 'Rome', 'Civitavecchia'],
                            ['Asti', 'nan', 'Piedmont'],
                            ['Bozen', 'Sudtirol', 'nan']]),
                  columns=['a', 'b', 'c'])


town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']

# first attempt raises "ValueError: pattern contains no capture groups":
df['a'].str.extractall('|'.join(town))

# second attempt:
# this yields only two of the six expected non-NaN results

df['geo1'] = np.where(df.a.isin(town), df.a, np.nan)
df['geo1'] = np.where(df.b.isin(town), df.b, np.nan)
df['geo1'] = np.where(df.c.isin(town), df.c, np.nan)

df['geo2'] = np.where(df.a.isin(province), df.a, np.nan)
df['geo2'] = np.where(df.b.isin(province), df.b, np.nan)
df['geo2'] = np.where(df.c.isin(province), df.c, np.nan)

df['geo3'] = np.where(df.a.isin(region), df.a, np.nan)
df['geo3'] = np.where(df.b.isin(region), df.b, np.nan)
df['geo3'] = np.where(df.c.isin(region), df.c, np.nan)


dftarget = pd.DataFrame(np.array([['Civitavecchia', 'Rome', 'nan'],
                                  ['nan', 'Asti', 'Piedmont'],
                                  ['nan', 'Bozen', 'Sudtirol']]),
                        columns=['geo1', 'geo2', 'geo3'])

My target output is described in dftarget.

Try this approach, using f-string formatting. You need the inner parentheses to define your capture group. Without them you get the "pattern contains no capture groups" error.

df['c'].str.extract(f'({"|".join(town)})')

Output:

               0
0  Civitavecchia
1            NaN
2            NaN
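
The snippet above only covers one column and one level. As a rough sketch of how the same capture-group idea could be extended to all three columns and all three levels, something like the following might work. The helper name extract_level and the levels dict are made up for illustration, and anchoring the pattern with ^ and $ and escaping the names with re.escape are my own assumptions (to avoid partial matches), not part of the suggestion above.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['nan', 'Rome', 'Civitavecchia'],
                            ['Asti', 'nan', 'Piedmont'],
                            ['Bozen', 'Sudtirol', 'nan']]),
                  columns=['a', 'b', 'c'])

town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']


def extract_level(frame, names):
    """Hypothetical helper: first value per row that matches one of `names`."""
    # Escape each name and wrap the alternation in ONE capture group.
    alternatives = '|'.join(re.escape(name) for name in names)
    pattern = f'^({alternatives})$'  # anchors are an assumption: whole-cell matches only
    # Run the extraction on every column; non-matching cells become NaN.
    hits = frame.apply(lambda col: col.str.extract(pattern, expand=False))
    # Pull the first non-NaN hit of each row into the first column and keep it.
    return hits.bfill(axis=1).iloc[:, 0]


# Assumed mapping of output column -> list of names for that level.
levels = {'geo1': town, 'geo2': province, 'geo3': region}
result = pd.DataFrame({col: extract_level(df, names)
                       for col, names in levels.items()})
print(result)
```

Because the hits from all three source columns are combined before anything is assigned, nothing gets overwritten. In the question's second attempt each geo column is assigned three times, so only the result of the last np.where line (the one on column c) survives. The sketch assumes every column holds strings and that each row has at most one value per level.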

IIUC, you can stack the data, map it, and then pivot:

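# create a common mapping from place name to target column
d = {}
for t in town: d[t] = 'geo1'
for p in province: d[p] = 'geo2'
for r in region: d[r] = 'geo3'

# stack data for one-go map
a = (df.stack().to_frame(name='data')
       .reset_index(level=1, drop=True)
    )

# map each value to its target column; unmatched values such as 'nan' become NaN
a['col'] = a['data'].map(d)

# drop the unmatched rows and pivot back to one column per level
a.dropna().pivot(values='data', columns='col')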

Output:

col           geo1   geo2      geo3
0    Civitavecchia   Rome       NaN
1              NaN   Asti  Piedmont
2              NaN  Bozen  Sudtirol
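
One thing to watch with the pivot: a level that never occurs in the data produces no column at all, and a row with no match disappears from the index. A minimal sketch of wrapping the stack / map / pivot steps in a function and reindexing afterwards is below. The function name sort_levels and the levels dict are hypothetical, and the final reindex is my addition rather than part of the approach above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['nan', 'Rome', 'Civitavecchia'],
                            ['Asti', 'nan', 'Piedmont'],
                            ['Bozen', 'Sudtirol', 'nan']]),
                  columns=['a', 'b', 'c'])

# Assumed mapping of output column -> list of names for that level.
levels = {'geo1': ['Civitavecchia'],
          'geo2': ['Rome', 'Asti', 'Bozen'],
          'geo3': ['Piedmont', 'Sudtirol']}


def sort_levels(frame, levels):
    """Hypothetical helper: one output column per level, same rows as `frame`."""
    # Invert the dict: place name -> output column.
    lookup = {name: col for col, names in levels.items() for name in names}
    # Long format: one row per original cell, indexed by original row label.
    stacked = (frame.stack().to_frame(name='data')
                    .reset_index(level=1, drop=True))
    stacked['col'] = stacked['data'].map(lookup)
    # Cells that match no level (for example the 'nan' strings) are dropped.
    # pivot raises ValueError if a row has two values for the same level.
    wide = stacked.dropna(subset=['col']).pivot(columns='col', values='data')
    # Restore rows and columns that the pivot may have lost.
    return wide.reindex(index=frame.index, columns=list(levels))


sorted_df = sort_levels(df, levels)
print(sorted_df)
```

On the sample data this should give the same table as above. The reindex only matters when a level or a row has no match at all.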