Pandas: Extract numeric values that follow keyword substrings with inconsistent delimiters and irregular order, and turn the keywords and numbers into columns
I have an irregularly formatted database, and its dataframe looks like this:
Area | Dimensions |
---|---|
foo | Length: 2m; Width: 3m; Height: 4m; Slope- 3 |
bar | Width: 6m; Length: 4m; Height: 3m; Slope: 6 |
baz | Height: 4m; Slope: 4; Volume = 24m3 |
qux | Vol: 42m3 |
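The delimiter between entries is always a semicolon, but the colon between a keyword and its value may be replaced by some other symbol, such as a dash or an equals sign. The order of the values is also inconsistent, so str.split did not work. I want to extract as much information as possible from the Dimensions column and keep 0/Null for values that are not specified.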
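For anyone who wants to reproduce this, the sample data can be rebuilt roughly as below. It is only a sketch: the name `df` is assumed to match the code further down, and the `str.split` call is one guess at the kind of attempt that falls short, since splitting by position puts different keywords in the same column.

```python
import pandas as pd

# Sample data copied from the table in the question; the name `df` is an assumption.
df = pd.DataFrame({
    "Area": ["foo", "bar", "baz", "qux"],
    "Dimensions": [
        "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
        "Width: 6m; Length: 4m; Height: 3m; Slope: 6",
        "Height: 4m; Slope: 4; Volume = 24m3",
        "Vol: 42m3",
    ],
})

# Splitting on ';' yields positional columns, so column 0 mixes
# 'Length', 'Width', 'Height' and 'Vol' depending on the row.
parts = df["Dimensions"].str.split(";", expand=True)
print(parts[0].tolist())
```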
I would like it to look like this:
Area | Length | Width | Height | Slope | Volume |
---|---|---|---|---|---|
foo | 2 | 3 | 4 | 3 | NULL |
bar | 4 | 6 | 3 | 6 | NULL |
baz | NULL | NULL | 4 | 4 | 24 |
qux | NULL | NULL | NULL | NULL | 42 |
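New version:
The main improvement in the new version is that building the keyword-value table is much simpler. The text-extraction regex is also simplified so that it no longer needs a predefined set of keywords.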
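To see what the keyword-free regex captures, it can be tried on a single string with the standard `re` module. This is a minimal sketch; the `sample` string is made up from the question's data for illustration.

```python
import re

# Pattern used in the new version:
#   (\w+)            keyword
#   \W+              any run of non-word separators (': ', '- ', ' = ')
#   (\d+(?:\.\d+)?)  integer or decimal number
#   \w*              optional trailing unit such as 'm' or 'm3'
#   (?:;|$)          entry ends at a semicolon or at the end of the string
pattern = r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)'

# Illustrative input combining the three separator styles from the question
sample = "Height: 4m; Slope- 4; Volume = 24m3"

pairs = re.findall(pattern, sample)
print(pairs)        # list of (keyword, value) tuples
print(dict(pairs))  # the same pairs as a dict, ready to become one dataframe row
```

Because the keyword is matched with `\w+`, any new keyword that shows up in the data becomes a column without touching the pattern.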
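Use str.findall() + map, as follows:
- Extract the Dimensions keywords and values into a list of key-value tuples with str.findall()
- map these key-value tuples to dict and create a dataframe
- Join the Area column to the newly created keyword-value dataframe with .join()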
import pandas as pd

# replace 'Vol' with 'Volume'
# extract `Dimensions` keywords and numeric values into tuples of paired values
dim_extract = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                               .str.findall(r'(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
              )
# map key-value pairs to `dict` and create a dataframe
keyword_df = pd.DataFrame(map(dict, dim_extract))
# Optionally convert the extracted dimension values from string to float or integer format
#keyword_df = keyword_df.apply(pd.to_numeric) # convert to float
#keyword_df = keyword_df.apply(pd.to_numeric).astype('Int64') # convert to integer
# join `Area` column with newly created keyword dataframe
df_out = df[['Area']].join(keyword_df)
Result:
print(df_out)
  Area Length Width Height Slope Volume
0  foo      2     3      4     3    NaN
1  bar      4     6      3     6    NaN
2  baz    NaN   NaN      4     4     24
3  qux    NaN   NaN    NaN   NaN     42
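The steps of the new version can also be wrapped into one function. The sketch below makes a few assumptions: the helper name `dimensions_to_columns` is hypothetical, the optional integer conversion is switched on, and `index=df.index` is passed so that the join still lines up if the input does not have a default 0..n-1 index.

```python
import pandas as pd

def dimensions_to_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Area plus one nullable-integer column per keyword."""
    pairs = (df["Dimensions"]
             .str.replace(r"Vol\b", "Volume", regex=True)
             .str.findall(r"(\w+)\W+(\d+(?:\.\d+)?)\w*(?:;|$)"))
    # One dict per row; keywords missing from a row become NaN
    keyword_df = pd.DataFrame(list(map(dict, pairs)), index=df.index)
    # 'Int64' keeps missing values as <NA>; it would fail on decimal values,
    # in which case stopping after pd.to_numeric (float) is the safer choice
    keyword_df = keyword_df.apply(pd.to_numeric).astype("Int64")
    return df[["Area"]].join(keyword_df)

# Sample data from the question
df = pd.DataFrame({
    "Area": ["foo", "bar", "baz", "qux"],
    "Dimensions": [
        "Length: 2m; Width: 3m; Height: 4m; Slope- 3",
        "Width: 6m; Length: 4m; Height: 3m; Slope: 6",
        "Height: 4m; Slope: 4; Volume = 24m3",
        "Vol: 42m3",
    ],
})
result = dimensions_to_columns(df)
print(result)
```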
Old version:
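Use str.findall() + .explode() + .pivot(), as follows:
- Extract the Dimensions keywords and values into a list of key-value tuples with str.findall()
- Transform each element of the list into a row with .explode()
- Split the keyword and the value of each tuple into separate columns
- Turn the Dimensions keywords into columns with .pivot()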
# replace 'Vol' with 'Volume'
# extract `Dimensions` keywords and numeric values into tuples of paired values
df['extract'] = (df['Dimensions'].str.replace(r'Vol\b', 'Volume', regex=True)
                                 .str.findall(r'(Length|Width|Height|Slope|Volume)\W+(\d+(?:\.\d+)?)\w*(?:;|$)')
                )
# Transform each element in the list to a row
df2 = df.explode('extract')
# Separate the `Dimensions` keywords and values from a tuple into individual columns
df2['col_name'], df2['col_val'] = zip(*df2['extract'])
# Optionally convert the extracted dimension values from string to float or integer format
#df2['col_val'] = df2['col_val'].astype(float)
#df2['col_val'] = df2['col_val'].astype(int)
# Transform the `Dimensions` keywords into columns
df_out = df2.pivot(index='Area', columns='col_name', values='col_val').rename_axis(columns=None).reset_index()
Result:
print(df_out)
  Area Height Length Slope Volume Width
0  bar      3      4     6    NaN     6
1  baz      4    NaN     4     24   NaN
2  foo      4      2     3    NaN     3
3  qux    NaN    NaN   NaN     42   NaN
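Note that the pivoted result comes back with rows and columns in alphabetical order rather than in the order asked for in the question. If the original order matters, one possible fix is to reindex afterwards. The sketch below rebuilds the printed result by hand and assumes the desired orders are known up front; `original_areas` and `wanted_columns` are illustrative names.

```python
import pandas as pd

# Pivoted result as printed above (values are still strings at this point)
df_out = pd.DataFrame({
    "Area":   ["bar", "baz", "foo", "qux"],
    "Height": ["3", "4", "4", None],
    "Length": ["4", None, "2", None],
    "Slope":  ["6", "4", "3", None],
    "Volume": [None, "24", None, "42"],
    "Width":  ["6", None, "3", None],
})

# Assumed target orders, taken from the question's input and desired output
original_areas = ["foo", "bar", "baz", "qux"]
wanted_columns = ["Length", "Width", "Height", "Slope", "Volume"]

# reindex reorders both axes in one call
restored = (df_out.set_index("Area")
                  .reindex(index=original_areas, columns=wanted_columns)
                  .reset_index())
print(restored)
```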