How can I include new columns in a dataframe while replacing NaN and avoiding empty events

Question

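I am trying to extract two features from df['http_path'] and enrich them. I split on the ? separator. Where an event (row) has no recorded value, I want NaN to be filled in so the rows can be processed further; I will then replace the NaN for events that carry no information and iterate over the rows. To avoid duplicating events, I want to keep the A and B information for these events and concat it to df. I tried the following code: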

http_path = https://example.org/path/to/file?param=42#fragment
#http_path = ...A?B            ^^^^^^^^^^^^^ ^^^^^^^^

# new columns extracted from single column http_path
#api = A or /path/to/file
#param = B or param=42

http_path = df.http_path.str.split('?', n=1)   # split on the first ? separator only
api_param_df = pd.DataFrame([row if len(row) == 2 else row+[np.nan] for row in http_path.values], columns=["api", "param"])
df = pd.concat([df, api_param_df], axis=1)

An example is shown below:

http_path	API URL	URL parameters
https://example.org/path/to/file?param=42#fragment	path/to/file	param=42#fragment
https://example.org/path/to/file	path/to/file	NaN

Is there an elegant way to do this?

Answer 1

You can use str.extract with the regex (?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?:

import pandas as pd

df = pd.DataFrame({'http_path': ['https://example.org/path/to/file?param=42#fragment', 'https://example.org/path/to/file']})
df
#                                           http_path
#0  https://example.org/path/to/file?param=42#frag...
#1                   https://example.org/path/to/file

# raw string so that \? reaches the regex engine unchanged
df.http_path.str.extract(r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?')

#            api              param
#0  path/to/file  param=42#fragment
#1  path/to/file                NaN
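
To get these columns back onto the original frame, as the question asks, one option is to join the extracted frame on the index and then fill the missing values. The following is a minimal sketch rather than a definitive implementation; using an empty string as the placeholder for a missing query string is an assumption, so substitute whatever value suits your later processing.

```python
import pandas as pd

df = pd.DataFrame({'http_path': [
    'https://example.org/path/to/file?param=42#fragment',
    'https://example.org/path/to/file',
]})

pattern = r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?'

# str.extract returns a frame with the same index as df,
# so join lines the new columns up row by row.
extracted = df['http_path'].str.extract(pattern)
df = df.join(extracted)

# Replace the NaN left by rows without a query string.
# The empty-string placeholder is an assumption, not part of the answer above.
df['param'] = df['param'].fillna('')
```

Because the join is index-based, no rows are duplicated or dropped: every original event keeps exactly one api value and one param value.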

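In the regex pattern:

(?:https?://[^/]+/)? optionally matches the domain but does not capture it
(?P<api>[^?]+) matches everything up to the ?
\? matches a literal ?
(?P<param>.+) matches everything after the ?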

Note that we also make \? and the second capture group optional, so that when the http path has no query parameters, it returns NaN.
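
The behaviour of each group can also be checked outside pandas with the standard re module. This is only an illustrative sketch: the helper name split_http_path is hypothetical, and an unmatched optional group shows up here as None, which is what pandas reports as NaN.

```python
import re

PATTERN = re.compile(r'(?:https?://[^/]+/)?(?P<api>[^?]+)\??(?P<param>.+)?')

def split_http_path(http_path):
    """Return (api, param); param is None when there is no query string."""
    match = PATTERN.match(http_path)
    if match is None:
        # e.g. an empty string, which has nothing for the api group to match
        return None, None
    return match.group('api'), match.group('param')

with_query = split_http_path('https://example.org/path/to/file?param=42#fragment')
without_query = split_http_path('https://example.org/path/to/file')

print(with_query)     # ('path/to/file', 'param=42#fragment')
print(without_query)  # ('path/to/file', None)
```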

Tags: python, regex, feature-extraction, dataframe, pandas