zip_longest to filter a dataframe and output multiple csv files

I have a dataframe like the one shown below

ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0

data = pd.read_clipboard(sep=',')

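My objective is to

a) Filter the dataframe by year>=2021 and output==1

b) Generate multiple csv files, one for each unique combination of region and supplier. For example, data for ANZ and AB should be stored in a separate file. Similarly, KOREA and ABR data should be stored in a separate file. This has to be done for each unique combination of region and supplier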

So, I tried the following

column_name = "region"
col_name = "supplier"
region_values = data[column_name].unique()
supplier_values = data[col_name].unique() 

for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == i[0] & Year>=2021 & output == 1 & {col_name} == i[1]")
    output_path = ATTACHMENT_DIR / f"{i}_ge_2021.csv"
    data_output.to_csv(output_path, index=False)

However, this results in the error shown below

KeyError: 'i'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\computation\scope.py in resolve(self, key, is_local)
    205                 # e.g., df[df > 0]
--> 206                 return self.temps[key]
    207             except KeyError as err:

KeyError: 'i'

The above exception was the direct cause of the following exception:

UndefinedVariableError                    Traceback (most recent call last)
C:\Users\aksha~1\AppData\Local\Temp/ipykernel_31264/2689222803.py in <module>
      1 for i in itertools.zip_longest(subregion_values,disti_values,fillvalue="ANZ"):
----> 2     data_output = data.query(f"{column_name} == i[0] & Year>=2021 & output == 1 & {col_name} == i[1]")

I expect my output to be 5 csv files in the folder, because 5 unique combinations of region and supplier satisfy the year and output filter criteria

Update - zip_longest - incorrect output
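One likely reason for the incorrect output is that zip_longest pairs the two lists of unique values by position, so the pairs it yields are not necessarily combinations that occur together in a row. The sketch below uses only the standard library and rebuilds the sample rows by hand; dict.fromkeys stands in for Series.unique() (an assumption made so the example runs without pandas), and the names rows, paired, wanted and missed are illustrative.

```python
import itertools

# (Region, Supplier, year, output) copied from the sample data in the question.
rows = [
    ("ANZ", "AB", 2021, 1),
    ("ANZ", "ABC", 2022, 1),
    ("ANZ", "ABC", 2022, 1),
    ("ANZ", "ABE", 2021, 0),
    ("ANZ", "ABE", 2021, 1),
    ("ANZ", "ABQ", 2021, 1),
    ("ANZ", "ABW", 2021, 1),
    ("AUS", "ABO", 2020, 1),
    ("KOREA", "ABR", 2019, 0),
]

# Order-preserving unique values, mimicking what Series.unique() would return.
region_values = list(dict.fromkeys(r[0] for r in rows))
supplier_values = list(dict.fromkeys(r[1] for r in rows))

# Pairs produced by the loop in the question: matched by position only.
paired = list(itertools.zip_longest(region_values, supplier_values, fillvalue="ANZ"))

# Combinations that really occur in rows passing the year/output filter.
wanted = sorted({(r, s) for r, s, year, output in rows if year >= 2021 and output == 1})

# Wanted combinations that the positional pairing never visits.
missed = [pair for pair in wanted if pair not in paired]

print(paired)
print(missed)
```

With this sample, the second and third positional pairs are ("AUS", "ABC") and ("KOREA", "ABE"), so the ANZ rows for ABC and ABE are never selected.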

Use @ to pass variables to query; for the column names, f-strings are the right tool:

# i, j are the same as i[0], i[1]
for i, j in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == @i & year>=2021 & output == 1 & {col_name} == @j")

Your solution also works with @:

for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == @i[0] & year>=2021 & output == 1 & {col_name} == @i[1]")
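As a self-contained illustration of the @ form, here is a minimal sketch that rebuilds the sample data from a string instead of the clipboard. The helper name filter_pair and the CSV_TEXT constant are hypothetical, and the column names are set to "Region" and "Supplier" so that they match the header of the sample data exactly.

```python
import io

import pandas as pd

CSV_TEXT = """ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))

column_name = "Region"    # matches the header's capitalisation
col_name = "Supplier"


def filter_pair(frame, region, supplier):
    """Hypothetical helper: rows for one region/supplier pair passing the filter."""
    # Column names are spliced in with the f-string; the values are looked up
    # by query itself through the @ prefix (local variables region, supplier).
    return frame.query(
        f"{column_name} == @region & year >= 2021 & output == 1 & {col_name} == @supplier"
    )


result = filter_pair(data, "ANZ", "ABC")
print(result)
```

The f-string only fills in the column names here; the compared values never become part of the query text, so no quoting is needed for them.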

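It is also possible to use f-strings for the values, but then repr has to be applied to the elements of i so that they are quoted inside the query string: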

for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == {repr(i[0])} & year>=2021 & output == 1 & {col_name} == {repr(i[1])}")
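The repr variant can be sketched end to end as follows. Two things here are assumptions rather than part of the answer: the pairs are taken from the filtered rows with drop_duplicates instead of zip_longest, so that only combinations present in the data are visited, and the file name pattern and temporary output directory (a stand-in for ATTACHMENT_DIR) are made up for the example.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

CSV_TEXT = """ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))
column_name, col_name = "Region", "Supplier"

# Region/supplier pairs that occur in rows passing the filter
# (assumption: this replaces the zip_longest pairing).
filtered = data.query("year >= 2021 & output == 1")
pairs = list(
    filtered[[column_name, col_name]].drop_duplicates().itertuples(index=False, name=None)
)

out_dir = Path(tempfile.mkdtemp())  # stand-in for ATTACHMENT_DIR
written = []
for region, supplier in pairs:
    # repr() adds the quotes, e.g. repr("ANZ") gives 'ANZ' including the quote marks.
    expr = (
        f"{column_name} == {repr(region)} & year >= 2021 & output == 1 "
        f"& {col_name} == {repr(supplier)}"
    )
    subset = data.query(expr)
    path = out_dir / f"{region}_{supplier}_ge_2021.csv"  # hypothetical naming scheme
    subset.to_csv(path, index=False)
    written.append(path.name)

print(expr)
print(sorted(written))
```

Without repr, the value would be pasted into the query unquoted and parsed as a column or variable name, which is the same kind of failure as the original UndefinedVariableError.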