Slicing a dataframe using matches to build a new dataframe with Pandas?
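I'm trying to get my code to take a dataframe, find every occurrence of "Start: ", and then iterate over each occurrence to create 'slices'. The first row of a slice is the "Start: " match, and the slice captures every row up to and including the last row whose "End: " entry matches the string that follows "Start: ".
I want to put these into a new dataframe, with each 'slice' separated by an empty row.
When I scale my sheet up to a larger size (750,000 rows), I can't seem to get it to work without it being ridiculously slow.
I'm not sure how else to approach the problem, or how to make it faster so that large dataframes don't slow it down.
My df:
The code that I think is the problem, or the wrong approach, is: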
new_df = pd.DataFrame({}, columns = df.columns)
new_df = new_df.append(pd.Series(), ignore_index = True)
for value in list_of_commands:
    if 'Start: ' in value:
        value_to_match = value[6:]
        idx_start = df[df[col_name_source].str.contains(value_to_match, na = False)].first_valid_index()
        idx_end = df[df[col_name_source].str.contains(value_to_match, na = False)].last_valid_index()
        new_df = pd.concat([new_df, df.loc[idx_start:idx_end, :]])
        new_df = new_df.append(pd.Series(), ignore_index = True)
The whole program:
import pandas as pd

sheets_index = [
    ('Numbers_', '0'), ('Numbers_', '1'), ('Numbers_', '2'), ('Numbers_', '3'),
    ('Numbers_1', '0'), ('Numbers_1', '1'), ('Numbers_1', '2'), ('Numbers_1', '3'),
    ('Numbers_TEST', '0'), ('Numbers_TEST', '1'), ('Numbers_TEST', '2'), ('Numbers_TEST', '3'),
    ('Numbers_TEST', '4'), ('Numbers_TEST', '5'), ('Numbers_TEST', '6'), ('Numbers_TEST', '7'), ('Numbers_TEST', '8')
]
index = pd.MultiIndex.from_tuples(sheets_index, names=['Id1','Id2'])
df = pd.DataFrame(
    {
        'TYPE': ['AA','aa','Aa','aA','DD','dd','Dd','dD','11','AA','11','aa','11','Aa','11','aA','11'],
        'DATE': ['BB','bb','Bb','bB','CC','cc','Cc','cC','22','BB','22','bb','22','Bb','22','bB','22'],
        'OTHER': ['CC','cc','Cc','cC','BB','bb','Bb','bB','33','CC','33','cc','33','Cc','33','cC','33'],
        'SOURCE': ['DD','dd','Dd','dD','AA','aa','Aa','aA','XX','Start: Test_function1','Start: Test_function2','dd','','End: Test_function1','','zz','End: Test_function2']
    },
    index=index
)

command_list = ["AA", "dd", "DD"]
warning_list = ["Dd", "dD"]
ingenium_list = ["CC", "BB"]
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_source = 'SOURCE'

df_filtered_command = df[df[col_name_type].isin(command_list)]
df_filtered_warnings = df[df[col_name_type].isin(warning_list)]
df_filtered_other = df[df[col_name_other].isin(ingenium_list)]
df_final_command = df_filtered_command[(df_filtered_command[col_name_source].str.endswith('001', na=False)) |
                                       (df_filtered_command[col_name_source].str.contains("a"))]

list_of_commands = df[col_name_source].dropna().tolist()
new_df = pd.DataFrame({}, columns = df.columns)
new_df = new_df.append(pd.Series(), ignore_index = True)
for value in list_of_commands:
    if 'Start: ' in value:
        value_to_match = value[6:]
        idx_start = df[df[col_name_source].str.contains(value_to_match, na = False)].first_valid_index()
        idx_end = df[df[col_name_source].str.contains(value_to_match, na = False)].last_valid_index()
        new_df = pd.concat([new_df, df.loc[idx_start:idx_end, :]])
        new_df = new_df.append(pd.Series(), ignore_index = True)

print(f'\n {new_df} \n')
pd.concat() is expensive! Try filling a list of dfs and doing a single pd.concat() at the end.
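To make that suggestion concrete, here is a rough sketch rather than a drop-in fix. It collects each slice in a plain Python list and calls pd.concat() once at the end. It also avoids re-scanning the whole column with str.contains for every "Start: " row by recording the last position of each "End: " marker in a single pass. The function name collect_slices, its parameters, and the small demo frame are made up for illustration. It assumes the name after "Start: " matches the name after "End: " exactly (the question's code uses substring matching), and it drops the original index. Note also that DataFrame.append was removed in pandas 2.0, so the loop in the question will not run on current versions.

```python
import pandas as pd


def collect_slices(df, source_col="SOURCE", start_tag="Start: ", end_tag="End: "):
    """Return every Start/End slice of df, each followed by one blank row."""
    # Work on a plain list of strings; scanning it is much cheaper than
    # filtering the whole dataframe once per marker.
    source = df[source_col].fillna("").astype(str).tolist()

    # Single pass: remember the last row position of each "End: <name>" marker.
    last_end = {}
    for pos, text in enumerate(source):
        if text.startswith(end_tag):
            last_end[text[len(end_tag):]] = pos

    # One-row all-NA frame used as the separator between slices.
    blank = pd.DataFrame([[pd.NA] * df.shape[1]], columns=df.columns, dtype=object)

    pieces = []
    for pos, text in enumerate(source):
        if text.startswith(start_tag):
            end = last_end.get(text[len(start_tag):])
            # Skip Start markers with no matching End at or after them.
            if end is not None and end >= pos:
                pieces.append(df.iloc[pos:end + 1])
                pieces.append(blank)

    if not pieces:
        return pd.DataFrame(columns=df.columns)
    # The only concat call, made after all slices have been collected.
    return pd.concat(pieces, ignore_index=True)


# Small demo frame modelled on the SOURCE column from the question.
demo = pd.DataFrame({
    "TYPE": ["11", "AA", "11", "aa", "11", "Aa", "11", "aA", "11"],
    "SOURCE": ["XX", "Start: Test_function1", "Start: Test_function2", "dd", "",
               "End: Test_function1", "", "zz", "End: Test_function2"],
})
result = collect_slices(demo)
print(result)
```

Because the slices are taken by position with iloc, this works the same whether the frame has a MultiIndex or a default index. If you need the original index labels in the output, you would have to carry them along as ordinary columns first (for example with reset_index()).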