Slicing a dataframe using matches to build a new dataframe with Pandas?

Question

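I'm trying to make my code take a dataframe, find every occurrence of "START:", and then iterate over each occurrence to create 'slices'. The first row of each slice is the "START:" match, and the slice captures every row up to and including the last row, which is the "END:" entry whose string matches the one that follows "START:".

I want to put these into a new dataframe in which each 'slice' is separated by an empty row.

When I scale my sheet up to a larger size (750,000 rows), I can't seem to get it to work without it being ridiculously slow.

I'm not sure how else to approach the problem, or how to make it faster so that large dataframes don't slow it down.

My df:

The code that I think is the problem, or the wrong approach, is: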


new_df = pd.DataFrame({}, columns = df.columns)

new_df = new_df.append(pd.Series(), ignore_index = True)

for value in list_of_commands:
    if 'Start: ' in value:
        value_to_match = value[6:]
        idx_start = df[df[col_name_source].str.contains(value_to_match, na = False)].first_valid_index()
        idx_end = df[df[col_name_source].str.contains(value_to_match, na = False)].last_valid_index()
        new_df = pd.concat([new_df, df.loc[idx_start:idx_end, :]]) 
        new_df = new_df.append(pd.Series(), ignore_index = True)

The whole program:

import pandas as pd
sheets_index = [
    ('Numbers_', '0'), ('Numbers_', '1'), ('Numbers_', '2'), ('Numbers_', '3'),
    ('Numbers_1', '0'), ('Numbers_1', '1'), ('Numbers_1', '2'), ('Numbers_1', '3'),
    ('Numbers_TEST', '0'), ('Numbers_TEST', '1'), ('Numbers_TEST', '2'), ('Numbers_TEST', '3'),
    ('Numbers_TEST', '4'), ('Numbers_TEST', '5'), ('Numbers_TEST', '6'), ('Numbers_TEST', '7'), ('Numbers_TEST', '8')
]
index = pd.MultiIndex.from_tuples(sheets_index, names=['Id1','Id2'])

df = pd.DataFrame(
{
    'TYPE': ['AA','aa','Aa','aA','DD','dd','Dd','dD','11','AA','11','aa','11','Aa','11','aA','11'],
    'DATE': ['BB','bb','Bb','bB','CC','cc','Cc','cC','22','BB','22','bb','22','Bb','22','bB','22'],
    'OTHER': ['CC','cc','Cc','cC','BB','bb','Bb','bB','33','CC','33','cc','33','Cc','33','cC','33'],
    'SOURCE': ['DD','dd','Dd','dD','AA','aa','Aa','aA','XX','Start: Test_function1','Start: Test_function2','dd','','End: Test_function1','','zz','End: Test_function2']
},
    index=index
)

command_list = ["AA", "dd", "DD"]
warning_list = ["Dd", "dD"]
ingenium_list = ["CC", "BB"]

col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_source = 'SOURCE'
df_filtered_command = df[df[col_name_type].isin(command_list)]
df_filtered_warnings = df[df[col_name_type].isin(warning_list)]
df_filtered_other = df[df[col_name_other].isin(ingenium_list)]

df_final_command = df_filtered_command[(df_filtered_command[col_name_source].str.endswith('001', na=False)) | 
(df_filtered_command[col_name_source].str.contains("a"))]

list_of_commands = df[col_name_source].dropna().tolist()

new_df = pd.DataFrame({}, columns = df.columns)

new_df = new_df.append(pd.Series(), ignore_index = True)

for value in list_of_commands:
    if 'Start: ' in value:
        value_to_match = value[6:]
        idx_start = df[df[col_name_source].str.contains(value_to_match, na = False)].first_valid_index()
        idx_end = df[df[col_name_source].str.contains(value_to_match, na = False)].last_valid_index()
        new_df = pd.concat([new_df, df.loc[idx_start:idx_end, :]]) 
        new_df = new_df.append(pd.Series(), ignore_index = True)

print(f'\n {new_df} \n')

Answer 1

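pd.concat() is expensive! Every call copies all the rows accumulated so far, so calling it (or append) inside the loop gets slower as new_df grows. Try filling a list of dfs and doing a single pd.concat() at the end.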
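
A minimal sketch of that idea, under a few assumptions that are not in the question: the helper name collect_slices is made up for illustration, the labels are matched by exact "Start: " / "End: " prefixes rather than with str.contains, and the small demo frame only stands in for the real data. The dictionary of "End: " positions is an extra step beyond what this answer suggests; it avoids rescanning the whole column for every "Start: " row.

```python
import pandas as pd


def collect_slices(df, col, start_prefix="Start: ", end_prefix="End: "):
    """Gather every Start/End slice in a list and concatenate only once."""
    values = df[col].fillna("").astype(str).tolist()

    # Position of the last "End: <name>" row for each name (assumed exact prefix match).
    end_pos = {
        v[len(end_prefix):]: i
        for i, v in enumerate(values)
        if v.startswith(end_prefix)
    }

    # One empty separator row, reused between slices.
    blank = pd.DataFrame([[None] * df.shape[1]], columns=df.columns, dtype=object)

    pieces = [blank]
    for i, v in enumerate(values):
        if not v.startswith(start_prefix):
            continue
        j = end_pos.get(v[len(start_prefix):])
        if j is None or j < i:
            continue  # no matching "End: " row after this "Start: " row
        pieces.append(df.iloc[i:j + 1])  # positional slice, end row included
        pieces.append(blank)

    # Single concat at the end instead of one per iteration.
    return pd.concat(pieces, ignore_index=True)


# Hypothetical stand-in for the real sheet.
demo = pd.DataFrame({
    "TYPE": ["11", "AA", "11", "aa", "11"],
    "SOURCE": ["XX", "Start: f1", "dd", "End: f1", "zz"],
})
result = collect_slices(demo, "SOURCE")
print(result)
```

With the frame from the question, the call would be collect_slices(df, col_name_source). Because ignore_index=True is used, the original MultiIndex is dropped, much as the repeated append(..., ignore_index = True) calls in the question reset it.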


filter

dataframe

python-3.x

pandas