Complex partial string matching in pandas
Given a dataframe with a `json_path` column and the following structure and values:

json_path | Reporting Group | Entity/Grouping |
---|---|---|
data.attributes.total.children.[0] | Christian Family | Abraham Family |
data.attributes.total.children.[0].children.[0] | Christian Family | In Estate |
data.attributes.total.children.[0].children.[0].children.[0].children.[0] | Christian Family | Cash |
data.attributes.total.children.[0].children.[0].children.[1].children.[0] | Christian Family | Investment Grade Fixed Income |

How can I filter for the rows whose `json_path` contains `children` four times? That is, I want to keep the rows at index positions 2-3:

json_path | Reporting Group | Entity/Grouping |
---|---|---|
data.attributes.total.children.[0].children.[0].children.[0].children.[0] | Christian Family | Cash |
data.attributes.total.children.[0].children.[0].children.[1].children.[0] | Christian Family | Investment Grade Fixed Income |
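
I know how to get a partial match, but the integers in the square brackets will be inconsistent, so my instinct is to count the occurrences of `children` (i.e. `children` appearing 4 times) and filter on that.

Any suggestions or resources on how to accomplish this?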
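
As you said, a naive approach is to count the occurrences of `.children` and compare the count with 4, which gives a boolean mask that can be used to filter the rows: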
df[df['json_path'].str.count(r'\.children').eq(4)]
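
To make the counting approach easier to try out, here is a minimal sketch that rebuilds the sample frame from the question and applies the mask step by step. The variable names `child_counts`, `mask` and `filtered` are my own and only there for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "json_path": [
            "data.attributes.total.children.[0]",
            "data.attributes.total.children.[0].children.[0]",
            "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
            "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
        ],
        "Reporting Group": ["Christian Family"] * 4,
        "Entity/Grouping": [
            "Abraham Family",
            "In Estate",
            "Cash",
            "Investment Grade Fixed Income",
        ],
    }
)

# Count how many times the literal ".children" occurs in each path.
child_counts = df["json_path"].str.count(r"\.children")

# Boolean mask: True where the path contains exactly four occurrences.
mask = child_counts.eq(4)

# Keep only the rows selected by the mask (index 2 and 3 for this sample).
filtered = df[mask]
print(filtered)
```

Because the pattern ignores the bracketed integers entirely, it does not matter whether a level is `[0]` or `[1]`.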

A more robust approach is to check for 4 consecutive occurrences of `children`:
df[df['json_path'].str.contains(r'(\.children\.\[\d+\]){4}')]

index | json_path | Reporting Group | Entity/Grouping |
---|---|---|---|
2 | data.attributes.total.children.[0].children.[0].children.[0].children.[0] | Christian Family | Cash |
3 | data.attributes.total.children.[0].children.[0].children.[1].children.[0] | Christian Family | Investment Grade Fixed Income |
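
To illustrate how the two checks differ, here is a small sketch on a few paths. Two of them are hypothetical and not from the question: one with four scattered `.children` segments and one with five consecutive levels. I use a non-capturing group `(?:...)` here, since as far as I know `str.contains` emits a `UserWarning` when the pattern contains capturing groups. Note that `{4}` matches four or more levels in a row, so if you need exactly four, one option (an assumption on my part about what you want) is to combine both masks.

```python
import pandas as pd

paths = pd.Series(
    [
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
        # Hypothetical path: four ".children" segments, but not consecutive.
        "data.children.[0].meta.children.[0].meta.children.[0].meta.children.[0]",
        # Hypothetical path: five consecutive levels.
        "data.children.[0].children.[0].children.[0].children.[0].children.[0]",
    ],
    name="json_path",
)

# True where at least four ".children.[n]" levels appear back to back.
four_in_a_row = paths.str.contains(r"(?:\.children\.\[\d+\]){4}")

# True where ".children" occurs exactly four times, anywhere in the path.
exactly_four = paths.str.count(r"\.children").eq(4)

# Requiring both keeps only paths with exactly four consecutive levels.
both = four_in_a_row & exactly_four
print(paths[both])
```

On this sample the counting mask alone accepts the scattered path, the consecutive check alone accepts the five-level path, and only the combination rejects both.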