Pandas: Extract numeric values that follow keyword substrings with inconsistent delimiters and irregular order, and turn the keywords and numbers into columns

I have an irregularly formatted database whose dataframe looks like this:

Area Dimensions
foo Length: 2m; Width: 3m; Height: 4m; Slope- 3
bar Width: 6m; Length: 4m; Height: 3m; Slope: 6
baz Height: 4m; Slope: 4; Volume = 24m3
qux Vol: 42m3
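
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: the exact whitespace inside the `Dimensions` strings is my assumption, copied from the table as displayed.

```python
import pandas as pd

# Sample data copied from the table above; spacing inside the strings is assumed
df = pd.DataFrame({
    "Area": ["foo", "bar", "baz", "qux"],
    "Dimensions": [
        "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
        "Width: 6m; Length: 4m; Height: 3m; Slope: 6",
        "Height: 4m; Slope: 4; Volume = 24m3",
        "Vol: 42m3",
    ],
})
print(df)
```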

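The separator between entries is always a semicolon, but the colon after a keyword may be replaced by another symbol such as a dash or an equals sign. The order of the values is also inconsistent, so str.split did not work. I want to extract as much information as possible from the Dimensions column and keep a 0/Null value for anything that is not specified.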

I would like the result to look like this:

Area Length Width Height Slope Volume
foo 2 3 4 3 NULL
bar 4 6 3 6 NULL
baz NULL NULL 4 4 24
qux NULL NULL NULL NULL 42

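New version:

The main improvement in the new version is that building the keyword-value table is much simpler. The text-extraction regex has also been simplified so that it no longer needs a predefined set of keywords.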
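
To see what the simplified pattern captures, here is a small sketch that applies it with Python's built-in `re` module to two of the sample strings. The `samples` list is just for illustration; the pattern itself is the one passed to str.findall() in this answer.

```python
import re

# (\w+)            keyword, e.g. Length
# \W+              any run of non-word characters (": ", "- ", " = ")
# (\d+(?:\.\d+)?)  integer or decimal number
# \w*              optional trailing unit such as m or m3
# (?:;|$)          entry ends at a semicolon or at the end of the string
pattern = re.compile(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')

samples = [
    "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
    "Height: 4m; Slope: 4; Volume = 24m3",
]
for text in samples:
    print(pattern.findall(text))
# [('Length', '2'), ('Width', '3'), ('Height', '4'), ('Slope', '3')]
# [('Height', '4'), ('Slope', '4'), ('Volume', '24')]
```

Because the keyword is matched with `\w+` rather than a fixed list, any new keyword in the data becomes its own column without touching the regex.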

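Use str.findall() + map, as follows:

  1. Extract the Dimensions keywords and values into a list of key-value tuples with str.findall()
  2. map these key-value tuples to dict and create a dataframe
  3. Join the Area column with the newly created keyword-value dataframe using .join()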

import pandas as pd

# assumes `df` is the dataframe shown in the question
# replace 'Vol' with 'Volume'
# extract `Dimensions` keywords and numeric values into tuples of paired values
dim_extract = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                               .str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
              )

# map key-value pairs to `dict` and create a dataframe
keyword_df = pd.DataFrame(map(dict, dim_extract))

# Optionally convert the extracted dimension values from string to float or integer format 
#keyword_df = keyword_df.apply(pd.to_numeric)                    # convert to float
#keyword_df = keyword_df.apply(pd.to_numeric).astype('Int64')    # convert to integer

# join `Area` column with newly created keyword dataframe
df_out = df[['Area']].join(keyword_df)

Result:

print(df_out)

  Area Length Width Height Slope Volume
0  foo      2     3      4     3    NaN
1  bar      4     6      3     6    NaN
2  baz    NaN   NaN      4     4     24
3  qux    NaN   NaN    NaN   NaN     42
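
If it helps, the steps of the new version can be wrapped into one function. This is only a sketch under a few assumptions of mine: the helper name `extract_dimensions` and its parameters are hypothetical, numeric conversion is always applied, and `index=df.index` is passed so the join still lines up when `df` does not have the default 0..n-1 index (for example after filtering).

```python
import pandas as pd

def extract_dimensions(df, text_col="Dimensions", key_col="Area"):
    """Hypothetical helper: keyword/number pairs in text_col become columns."""
    pairs = (df[text_col].str.replace(r'Vol\b', 'Volume', regex=True)
                         .str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)'))
    # One dict per row; reuse df's index so .join() aligns row by row
    keyword_df = pd.DataFrame([dict(p) for p in pairs], index=df.index)
    # Strings -> numbers; keywords missing from a row stay NaN
    keyword_df = keyword_df.apply(pd.to_numeric)
    return df[[key_col]].join(keyword_df)

# Same sample data as in the question, but with a non-default index
df = pd.DataFrame({
    "Area": ["foo", "bar", "baz", "qux"],
    "Dimensions": [
        "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
        "Width: 6m; Length: 4m; Height: 3m; Slope: 6",
        "Height: 4m; Slope: 4; Volume = 24m3",
        "Vol: 42m3",
    ],
}, index=[10, 20, 30, 40])

out = extract_dimensions(df)
print(out)
```

Without `index=df.index`, the keyword dataframe would get a fresh 0..n-1 index and the join would produce all-NaN rows for a frame indexed like this one.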

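Old version:

Use str.findall() + .explode() + .pivot(), as follows:

  1. Extract the Dimensions keywords and values into a list of key-value tuples with str.findall()
  2. Transform each element of the list into a row with .explode()
  3. Split the keyword and the value of each tuple into separate columns
  4. Transform the Dimensions keywords into columns with .pivot()
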
# replace 'Vol' with 'Volume'
# extract `Dimensions` keywords and numeric values into tuples of paired values
df['extract'] = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                                 .str.findall(r'(Length|Width|Height|Slope|Volume)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
                )

# Transform each element in the list to a row
df2 = df.explode('extract')

# Separate the `Dimensions` keywords and values from a tuple into individual columns 
df2['col_name'], df2['col_val'] = zip(*df2['extract'])

# Optionally convert the extracted dimension values from string to float or integer format 
#df2['col_val'] = df2['col_val'].astype(float)
#df2['col_val'] = df2['col_val'].astype(int)

# Transform the `Dimensions` keywords into columns 
df_out = df2.pivot(index='Area', columns='col_name', values='col_val').rename_axis(columns=None).reset_index()

Result:

print(df_out)

  Area Height Length Slope Volume Width
0  bar      3      4     6    NaN     6
1  baz      4    NaN     4     24   NaN
2  foo      4      2     3    NaN     3
3  qux    NaN    NaN   NaN     42   NaN
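
Note that .pivot() sorts both the rows (by Area) and the columns alphabetically, which is why this output is ordered differently from the new version. If the original order matters, one possible tweak is a .reindex() after the pivot. This is a sketch that assumes the Area values are unique; the `keywords` list and the `long_df` name are mine, not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["foo", "bar", "baz", "qux"],
    "Dimensions": [
        "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
        "Width: 6m; Length: 4m; Height: 3m; Slope: 6",
        "Height: 4m; Slope: 4; Volume = 24m3",
        "Vol: 42m3",
    ],
})

# Desired column order (assumed); also used to build the keyword alternation
keywords = ['Length', 'Width', 'Height', 'Slope', 'Volume']
regex = r'(' + '|'.join(keywords) + r')\W+(\d+(?:\.\d+)?)\w*(?:;|$)'

pairs = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                         .str.findall(regex))

# One row per (keyword, value) tuple, then split the tuple into two columns
long_df = df[['Area']].assign(extract=pairs).explode('extract')
long_df['col_name'], long_df['col_val'] = zip(*long_df['extract'])

df_out = (long_df.pivot(index='Area', columns='col_name', values='col_val')
                 .reindex(index=df['Area'].tolist(), columns=keywords)  # restore order
                 .rename_axis(index='Area', columns=None)
                 .reset_index())
print(df_out)
```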