Split/explode cells into multiple rows based on conditions in pandas dataframe
The code for the input dataframe is:
import pandas as pd
df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (!AAA))','Column2': 'XYZ + XXX/YYY'}])
Input dataframe:
+---------------------------------+---------------------------------+
| Column1                         | Column2                         |
+---------------------------------+---------------------------------+
| ((CC ) + (A11/ABC/ZZ) + (!AAA)) | XYZ + XXX/YYY                   |
+---------------------------------+---------------------------------+
Input list:
list = ['AAA', 'BBB', 'CCC']
Conditions:
'+' should remain as such (similar to AND condition)
'/' means split the data into multiple cells (similar to OR condition)
'!' means replace with other elements in the corresponding list (similar to NOT condition)
Because of the '!' symbol, the row becomes:
+------------------------------------+---------------------------------+
| Column1                            | Column2                         |
+------------------------------------+---------------------------------+
| ((CC ) + (A11/ABC/ZZ) + (BBB/CCC)) | XYZ + XXX/YYY                   |
+------------------------------------+---------------------------------+
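The '!' rule described above can be sketched in plain Python before any pandas is involved. This is only an illustration: the helper name `negate` and the variable `items` are hypothetical, and it assumes each negated name is a single word that appears in the input list.

```python
import re

items = ['AAA', 'BBB', 'CCC']  # the input list from the question


def negate(expr, items):
    """Replace each '!X' with the other list elements joined by '/'."""
    def repl(match):
        name = match.group(1)
        if name not in items:  # unknown name: leave the text untouched
            return match.group(0)
        return '/'.join(i for i in items if i != name)
    return re.sub(r'!(\w+)', repl, expr)


print(negate('((CC ) + (A11/ABC/ZZ) + (!AAA))', items))
```

After this step the '!' term is just another '/' group, so only the '/' splitting remains to be done.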
Please help me split the single row into multiple rows using pandas, as shown below:
+---------------------------------+---------------------------------+
| Column1                         | Column2                         |
+---------------------------------+---------------------------------+
| CC + A11 + BBB                  | XYZ + XXX                       |
+---------------------------------+---------------------------------+
| CC + ABC + BBB                  | XYZ + XXX                       |
+---------------------------------+---------------------------------+
| CC + ZZ + BBB                   | XYZ + XXX                       |
+---------------------------------+---------------------------------+
| CC + A11 + CCC                  | XYZ + XXX                       |
+---------------------------------+---------------------------------+
| CC + ABC + CCC                  | XYZ + XXX                       |
+---------------------------------+---------------------------------+
| CC + ZZ + CCC                   | XYZ + XXX                       |
+---------------------------------+---------------------------------+
| CC + A11 + BBB                  | XYZ + YYY                       |
+---------------------------------+---------------------------------+
| CC + ABC + BBB                  | XYZ + YYY                       |
+---------------------------------+---------------------------------+
| CC + ZZ + BBB                   | XYZ + YYY                       |
+---------------------------------+---------------------------------+
| CC + A11 + CCC                  | XYZ + YYY                       |
+---------------------------------+---------------------------------+
| CC + ABC + CCC                  | XYZ + YYY                       |
+---------------------------------+---------------------------------+
| CC + ZZ + CCC                   | XYZ + YYY                       |
+---------------------------------+---------------------------------+
See whether this meets your requirements. The comments explain how it works.
#!/usr/bin/env python
import re
import pandas as pd  # the original answer was tested with pandas 0.19.2; updated here for Python 3

df = pd.DataFrame([{'Column1': '((CC ) + (A11/ABC/ZZ) + (!AAA))',
                    'Column2': 'XYZ + XXX/YYY'}])  # your input dataframe
items = ['AAA', 'BBB', 'CCC']  # your input list (renamed so it does not shadow the built-in list)

to_replace = dict()
for item in items:  # prepare the dictionary for the '!' replacements
    # raw string, so that \b is a regex word boundary rather than a backspace character
    to_replace['!' + item + r'\b'] = '/'.join([i for i in items if i != item])
df = df.replace(to_replace, regex=True)  # do all the '!' replacements

def expanded(s):  # expand series s to multiple string list around '/'
    l = s.str.replace('[()]', '', regex=True).tolist()  # drop the parentheses
    while True:  # in each loop cycle, handle one A/B/C... expression
        xl = []  # expanded list for this cycle
        for s in l:  # for each string in the list so far
            m = re.search(r'\w+(/\w+)+', s)  # look for a A/B/C... expression
            if m:  # if there is, add the individual expansions to the list
                xl.extend([m.string[:m.start()] + i + m.string[m.end():]
                           for i in m.group().split('/')])
            else:  # if not, we're done
                return l
        l = xl  # expanded list for this cycle is now the current list

def expand(c):  # expands the column named c to multiple rows
    new = expanded(df[c])  # get the new contents
    xdf = pd.concat(len(new) // len(df[c]) * [df])  # create required rows (integer division)
    xdf[c] = sorted(new)  # set the new contents
    return xdf  # return new dataframe

df = expand('Column1')
df = expand('Column2')
print(df)
Output:
Column1 Column2
0 CC + A11 + BBB XYZ + XXX
0 CC + A11 + CCC XYZ + XXX
0 CC + ABC + BBB XYZ + XXX
0 CC + ABC + CCC XYZ + XXX
0 CC + ZZ + BBB XYZ + XXX
0 CC + ZZ + CCC XYZ + XXX
0 CC + A11 + BBB XYZ + YYY
0 CC + A11 + CCC XYZ + YYY
0 CC + ABC + BBB XYZ + YYY
0 CC + ABC + CCC XYZ + YYY
0 CC + ZZ + BBB XYZ + YYY
0 CC + ZZ + CCC XYZ + YYY
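As an alternative way to look at the same expansion, the '/' groups can be treated as a Cartesian product of choices. The sketch below is not the answer's method; it uses `itertools.product`, and the helper names `alternatives` and `explode_row` are hypothetical. It assumes the parentheses are purely cosmetic (no nested grouping that changes meaning), that '+' only separates terms, and that the '!' replacement has already been applied.

```python
import itertools
import re


def alternatives(cell):
    """Return every '+'-joined combination of the '/' alternatives in cell."""
    # Strip the parentheses, then split into '+'-separated terms.
    terms = [t.strip() for t in re.sub(r'[()]', '', cell).split('+')]
    # Each term contributes one list of choices; a term without '/' has one choice.
    choices = [[a.strip() for a in t.split('/')] for t in terms]
    return [' + '.join(combo) for combo in itertools.product(*choices)]


def explode_row(row):
    """Cross every combination of each column with those of the other columns."""
    cols = list(row)
    per_col = [alternatives(row[c]) for c in cols]
    return [dict(zip(cols, combo)) for combo in itertools.product(*per_col)]


row = {'Column1': '((CC ) + (A11/ABC/ZZ) + (BBB/CCC))',
       'Column2': 'XYZ + XXX/YYY'}
rows = explode_row(row)
print(len(rows))
```

Passing `rows` to `pd.DataFrame(rows)` gives a dataframe with one row per combination. The row order differs from the output above, because here Column2 varies fastest.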