Extracting data from strings when they are included in a list
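I am trying to sort the geographic information held in 4 columns of a Pandas DataFrame so that the same kind of administrative division is always stored in the same column.
I have built 5 lists of strings containing the names for the 5 geographic levels I want to store.
I tried to populate consistent columns by comparing the original 4 inconsistent columns against my 5 consistent lists, but the nan values in the original columns either trigger errors in my code or leave too many nan values in the resulting columns. A minimal code example (reduced to 3 columns and 3 lists) is below.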
import numpy as np
import pandas as pd
# missing values are stored as the literal string 'nan' in this example
df = pd.DataFrame(np.array([['nan', 'Rome', 'Civitavecchia'],
                            ['Asti', 'nan', 'Piedmont'],
                            ['Bozen', 'Sudtirol', 'nan']]),
                  columns=['a', 'b', 'c'])
town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']
# first attempt raises "ValueError: pattern contains no capture groups":
df['a'].str.extractall('|'.join(town))
# second attempt:
# this only yields two of the six non-nan results expected
df['geo1'] = np.where(df.a.isin(town), df.a, np.nan)
df['geo1'] = np.where(df.b.isin(town), df.b, np.nan)
df['geo1'] = np.where(df.c.isin(town), df.c, np.nan)
df['geo2'] = np.where(df.a.isin(province), df.a, np.nan)
df['geo2'] = np.where(df.b.isin(province), df.b, np.nan)
df['geo2'] = np.where(df.c.isin(province), df.c, np.nan)
df['geo3'] = np.where(df.a.isin(region), df.a, np.nan)
df['geo3'] = np.where(df.b.isin(region), df.b, np.nan)
df['geo3'] = np.where(df.c.isin(region), df.c, np.nan)
# desired result
dftarget = pd.DataFrame(np.array([['Civitavecchia', 'Rome', 'nan'],
                                  ['nan', 'Asti', 'Piedmont'],
                                  ['nan', 'Bozen', 'Sudtirol']]),
                        columns=['geo1', 'geo2', 'geo3'])
My target output is shown in dftarget.
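A note on why the second attempt loses results: each np.where call writes a fresh column, so the third assignment to a geo column discards whatever the first two found. A minimal sketch of one way around that, assuming real NaN values instead of 'nan' strings (the strings behave the same here because they match no list); the names levels and result are illustrative and not part of the original code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 'Rome', 'Civitavecchia'],
                   ['Asti', np.nan, 'Piedmont'],
                   ['Bozen', 'Sudtirol', np.nan]],
                  columns=['a', 'b', 'c'])

# Hypothetical helper mapping: target column -> names belonging to that level.
levels = {'geo1': ['Civitavecchia'],
          'geo2': ['Rome', 'Asti', 'Bozen'],
          'geo3': ['Piedmont', 'Sudtirol']}

result = pd.DataFrame(index=df.index)
for target, names in levels.items():
    # Start from an all-NaN column and only overwrite cells that match,
    # so a hit found in an earlier source column is never thrown away.
    filled = np.full(len(df), np.nan, dtype=object)
    for col in ['a', 'b', 'c']:
        filled = np.where(df[col].isin(names), df[col], filled)
    result[target] = filled

print(result)
```

The only change from the original attempt is the third argument of np.where: the current state of the output column instead of np.nan.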
Try this approach, using f-string formatting. You need the inner parentheses to define a capture group. Without them you get the "pattern contains no capture groups" error.
df['c'].str.extract(f'({"|".join(town)})')
Output:
               0
0  Civitavecchia
1            NaN
2            NaN
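The call above only looks at column c. A rough sketch of how the same capture-group pattern could be run over every column and merged into one result, assuming real NaN values in the frame; the re.escape step and the names pattern and geo1 are additions for illustration, not part of the answer:

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 'Rome', 'Civitavecchia'],
                   ['Asti', np.nan, 'Piedmont'],
                   ['Bozen', 'Sudtirol', np.nan]],
                  columns=['a', 'b', 'c'])
town = ['Civitavecchia']

# Escape each name so regex metacharacters in place names are taken literally,
# then wrap the alternation in parentheses to form the capture group.
pattern = f"({'|'.join(map(re.escape, town))})"

# Start empty and fill from each column in turn; combine_first only fills
# positions that are still missing, so earlier matches are kept.
geo1 = pd.Series(np.nan, index=df.index, dtype=object)
for col in df.columns:
    geo1 = geo1.combine_first(df[col].str.extract(pattern, expand=False))

print(geo1)
```

Keep in mind that str.extract matches substrings, so a short name could match inside a longer one; isin compares whole cell values and avoids that if the cells hold exactly one name each.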
IIUC, you can stack the data, map it, and then pivot:
# create a common mapping from place name to target column
d = {}
for t in town: d[t] = 'geo1'
for p in province: d[p] = 'geo2'
for r in region: d[r] = 'geo3'
# stack data for one-go map
a = (df.stack().to_frame(name='data')
       .reset_index(level=1, drop=True)
    )
# map each value to its target column; unknown values (e.g. 'nan') become NaN
a['col'] = a['data'].map(d)
# return data
a.dropna().pivot(values='data', columns='col')
Output:
col           geo1   geo2      geo3
0    Civitavecchia   Rome       NaN
1              NaN   Asti  Piedmont
2              NaN  Bozen  Sudtirol
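For reference, here is a self-contained variant of the same reshape idea that swaps stack for melt, as a sketch only. It assumes real NaN values in the frame, and the names levels, name_to_col, long and wide are made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 'Rome', 'Civitavecchia'],
                   ['Asti', np.nan, 'Piedmont'],
                   ['Bozen', 'Sudtirol', np.nan]],
                  columns=['a', 'b', 'c'])

# Hypothetical helper mapping: target column -> names belonging to that level.
levels = {'geo1': ['Civitavecchia'],
          'geo2': ['Rome', 'Asti', 'Bozen'],
          'geo3': ['Piedmont', 'Sudtirol']}

# Invert it into place name -> target column for a single map call.
name_to_col = {name: target for target, names in levels.items() for name in names}

# Long format: one row per cell, keeping the original row label as the index.
long = df.melt(ignore_index=False, value_name='data')
long['col'] = long['data'].map(name_to_col)

# Drop cells that belong to no level, then spread back to one column per level.
wide = (long.dropna(subset=['col'])
            .pivot(columns='col', values='data')
            .reindex(index=df.index, columns=list(levels)))

print(wide)
```

The final reindex is there so that every original row and every geo column shows up even when a level never occurs in the data. pivot raises a ValueError if one row contains two names from the same level, so such rows would need to be resolved first.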