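Pandas: extract numeric values that follow keyword substrings with inconsistent delimiters and irregular order, and turn the keywords and numbers into columns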

Question

I have an irregularly formatted database, and its DataFrame looks like this:

Area	Dimensions
foo	Length: 2m; Width: 3m; Height: 4m; Slope- 3
bar	Width: 6m; Length: 4m; Height: 3m; Slope: 6
baz	Height: 4m; Slope: 4; Volume = 24m3
qux	Vol: 42m3

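The separator between entries is always a semicolon, but the colon between a keyword and its value may be replaced by some other symbol, such as a dash or an equals sign. The order of the values is also inconsistent, so str.split did not work. I want to extract as much information as possible from the Dimensions column and keep 0/Null for values that are not specified.

I would like it to look like this: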

Area	Length	Width	Height	Slope	Volume
foo	2	3	4	3	NULL
bar	4	6	3	6	NULL
baz	NULL	NULL	4	4	24
qux	NULL	NULL	NULL	NULL	42

Answer 1

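New version:

The main improvement in the new version is that building the keyword-value table is greatly simplified. The text-extraction regex is also simplified so that it no longer needs a predefined set of keywords.

Use `str.findall()` + `map`, as follows:

1. Extract the `Dimensions` keywords and values into a list of key-value tuples with `str.findall()`
2. `map` these key-value tuples to `dict` and create a dataframe
3. Join the `Area` column with the new dataframe using `.join()`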
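The snippets in this answer operate on a DataFrame named `df`. A minimal reconstruction of the sample from the question, assuming both columns hold plain strings, might look like this:

```python
import pandas as pd

# Sample data copied from the question; every value is a plain string
df = pd.DataFrame({
    "Area": ["foo", "bar", "baz", "qux"],
    "Dimensions": [
        "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
        "Width: 6m; Length: 4m; Height: 3m; Slope: 6",
        "Height: 4m; Slope: 4; Volume = 24m3",
        "Vol: 42m3",
    ],
})
```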

import pandas as pd

# replace 'Vol' with 'Volume'
# extract `Dimensions` keywords and numeric values into tuples of paired values
dim_extract = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                               .str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
              )

# map key-value pairs to `dict` and create a dataframe
keyword_df = pd.DataFrame(map(dict, dim_extract))

# Optionally convert the extracted dimension values from string to float or integer format 
#keyword_df = keyword_df.apply(pd.to_numeric)                    # convert to float
#keyword_df = keyword_df.apply(pd.to_numeric).astype('Int64')    # convert to integer

# join `Area` column with newly created keyword dataframe
df_out = df[['Area']].join(keyword_df)

Result:

print(df_out)

  Area Length Width Height Slope Volume
0  foo      2     3      4     3    NaN
1  bar      4     6      3     6    NaN
2  baz    NaN   NaN      4     4     24
3  qux    NaN   NaN    NaN   NaN     42
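To see what the pattern captures, here is a small sketch that applies the same regex to one of the question's strings with the standard `re` module. The names `PATTERN`, `sample`, `pairs`, and `record` are only illustrative.

```python
import re

# Same pattern as in the answer:
#   (\w+)            keyword, e.g. "Height"
#   \W+              any run of non-word delimiters (": ", "- ", " = ")
#   (\d+(?:\.\d+)?)  integer or decimal value
#   \w*              optional trailing unit such as "m" or "m3"
#   (?:;|$)          entry ends at a semicolon or at the end of the string
PATTERN = r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)'

sample = "Height: 4m; Slope: 4; Volume = 24m3"

pairs = re.findall(PATTERN, sample)
print(pairs)    # [('Height', '4'), ('Slope', '4'), ('Volume', '24')]

# dict() turns the list of pairs into one record, which is what
# map(dict, ...) does for every row before the DataFrame is built
record = dict(pairs)
print(record)   # {'Height': '4', 'Slope': '4', 'Volume': '24'}
```

The captured values are strings, which is why the answer offers an optional `pd.to_numeric` conversion.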

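Old version:

Use `str.findall()` + `.explode()` + `.pivot()`, as follows:

1. Extract the `Dimensions` keywords and values into a list of key-value tuples with `str.findall()`
2. Transform each element in the list to a row with `.explode()`
3. Further separate the keyword and the value in each tuple into individual columns
4. Transform the `Dimensions` keywords into columns with `.pivot()`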

# replace 'Vol' with 'Volume'
# extract `Dimensions` keywords and numeric values into tuples of paired values
df['extract'] = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                                 .str.findall(r'(Length|Width|Height|Slope|Volume)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
                )

# Transform each element in the list to a row
df2 = df.explode('extract')

# Separate the `Dimensions` keywords and values from a tuple into individual columns 
df2['col_name'], df2['col_val'] = zip(*df2['extract'])

# Optionally convert the extracted dimension values from string to float or integer format 
#df2['col_val'] = df2['col_val'].astype(float)
#df2['col_val'] = df2['col_val'].astype(int)

# Transform the `Dimensions` keywords into columns 
df_out = df2.pivot(index='Area', columns='col_name', values='col_val').rename_axis(columns=None).reset_index()

Result:

print(df_out)

  Area Height Length Slope Volume Width
0  bar      3      4     6    NaN     6
1  baz      4    NaN     4     24   NaN
2  foo      4      2     3    NaN     3
3  qux    NaN    NaN   NaN     42   NaN
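As the output above shows, the pivoted result comes back with rows and columns in alphabetical order, which differs from the layout requested in the question. One possible way to restore that layout is `reindex`. The sketch below rebuilds the printed result by hand so that it runs on its own; the names `wanted_rows`, `wanted_cols`, and `ordered` are illustrative and not part of the original answer.

```python
import pandas as pd

# Pivoted result as printed above (values are strings, None marks a missing value)
df_out = pd.DataFrame({
    "Area":   ["bar", "baz", "foo", "qux"],
    "Height": ["3", "4", "4", None],
    "Length": ["4", None, "2", None],
    "Slope":  ["6", "4", "3", None],
    "Volume": [None, "24", None, "42"],
    "Width":  ["6", None, "3", None],
})

# Row and column order taken from the desired output in the question
wanted_rows = ["foo", "bar", "baz", "qux"]
wanted_cols = ["Length", "Width", "Height", "Slope", "Volume"]

# Index by Area, reorder both axes in one call, then restore Area as a column
ordered = (df_out.set_index("Area")
                 .reindex(index=wanted_rows, columns=wanted_cols)
                 .reset_index())
print(ordered)
```

If the original frame is still available, `df['Area']` could supply `wanted_rows` instead of a hard-coded list.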
