Using zip_longest to filter a dataframe and output multiple csv files
I have a dataframe like the one shown below:
ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0
data = pd.read_clipboard(sep=',')
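My objective is to:
a) Filter the dataframe by year>=2021 and output==1
b) Generate multiple csv files, one for each unique combination of region and supplier. For example, the data for ANZ and AB should be stored in a separate file. Similarly, the KOREA and ABR data should be stored in a separate file. This has to be done for every unique combination of region and supplier.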
So, I tried the following:
column_name = "region"
col_name = "supplier"
region_values = data[column_name].unique()
supplier_values = data[col_name].unique()
for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == i[0] & Year>=2021 & output == 1 & {col_name} == i[1]")
    output_path = ATTACHMENT_DIR / f"{i}_ge_2021.csv"
    data_output.to_csv(output_path, index=False)
However, this results in the error shown below:
KeyError: 'i'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\computation\scope.py in resolve(self, key, is_local)
205 # e.g., df[df > 0]
--> 206 return self.temps[key]
207 except KeyError as err:
KeyError: 'i'
The above exception was the direct cause of the following exception:
UndefinedVariableError Traceback (most recent call last)
C:\Users\aksha~1\AppData\Local\Temp/ipykernel_31264/2689222803.py in
1 for i in itertools.zip_longest(subregion_values,disti_values,fillvalue="ANZ"):
----> 2 data_output = data.query(f"{column_name} == i[0] & Year>=2021 & output == 1 & {col_name} == i1")
I expect 5 csv files to be generated in the folder, because 5 unique combinations of region and supplier meet the year and output filter criteria.
Update - zip_longest - incorrect output
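To illustrate why the output may look incorrect, here is a minimal sketch, assuming the sample data above is loaded with the column names exactly as they appear in the header (`Region`, `Supplier`, `year`, `output`). `zip_longest` pairs the two lists of unique values by position, so it does not necessarily produce the region and supplier combinations that actually occur in the data. The names `csv_text`, `zipped_pairs` and `actual_pairs` are only for illustration.

```python
import io
import itertools

import pandas as pd

# Sample data from the question, read from a string instead of the clipboard.
csv_text = """ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0
"""
data = pd.read_csv(io.StringIO(csv_text))

region_values = data["Region"].unique()      # 3 unique regions
supplier_values = data["Supplier"].unique()  # 7 unique suppliers

# zip_longest pairs values by position and pads the shorter list with "ANZ".
zipped_pairs = [
    (str(region), str(supplier))
    for region, supplier in itertools.zip_longest(
        region_values, supplier_values, fillvalue="ANZ"
    )
]

# The combinations that really exist in the filtered rows.
filtered = data[(data["year"] >= 2021) & (data["output"] == 1)]
actual_pairs = list(
    filtered[["Region", "Supplier"]]
    .drop_duplicates()
    .itertuples(index=False, name=None)
)

print(zipped_pairs)
print(actual_pairs)
```

With this data, the positional pairing includes pairs such as `('AUS', 'ABC')` and `('KOREA', 'ABE')`, which never occur together in any row, while real combinations such as `('ANZ', 'ABC')` are missing.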
Use @ to pass the variables to query. The column names are fine as they are, because the f-string already substitutes them correctly:
# i, j are the same as i[0], i[1]
for i, j in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == @i & year>=2021 & output == 1 & {col_name} == @j")
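As a quick check of the @ approach on a single pair, here is a minimal sketch. It assumes the column names match the sample header exactly (`Region` and `Supplier`, capitalised), and the names `region`, `supplier` and `matched_ids` are only for illustration.

```python
import io

import pandas as pd

# Sample data from the question, read from a string instead of the clipboard.
csv_text = """ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Assumption: column names are spelled as in the header of the sample data.
column_name = "Region"
col_name = "Supplier"

# One example pair; in the loop these would come from the iteration.
region, supplier = "ANZ", "ABC"

# The f-string fills in the column names, while @region and @supplier are
# left as-is in the string and resolved by query from the Python variables.
subset = data.query(
    f"{column_name} == @region & year>=2021 & output == 1 & {col_name} == @supplier"
)
matched_ids = subset["ID"].tolist()
print(matched_ids)
```

Without the @ prefix, query treats `region` as a column name, which is why the original attempt fails with an undefined variable error for `i`.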
Your solution also works with @:
for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == @i[0] & year>=2021 & output == 1 & {col_name} == @i[1]")
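It is also possible to use f-strings for the variables, but then you need to wrap the values of i in repr, so that each value ends up in the query string as a quoted string literal:
for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == {repr(i[0])} & year>=2021 & output == 1 & {col_name} == {repr(i[1])}")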
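If the goal is one file per region and supplier combination that actually occurs in the filtered data, a different technique from `zip_longest` may fit better: filter first, then iterate over `groupby`. The following is only a sketch under some assumptions: the column names match the sample header (`Region`, `Supplier`), a temporary directory `output_dir` stands in for `ATTACHMENT_DIR`, and the file name pattern is made up for illustration.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Sample data from the question, read from a string instead of the clipboard.
csv_text = """ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Hypothetical stand-in for ATTACHMENT_DIR from the question.
output_dir = Path(tempfile.mkdtemp())

# Step a) keep only rows with year >= 2021 and output == 1.
filtered = data[(data["year"] >= 2021) & (data["output"] == 1)]

# Step b) groupby yields one sub-frame per (Region, Supplier) pair
# that is present in the filtered rows.
written = []
for (region, supplier), group in filtered.groupby(["Region", "Supplier"]):
    output_path = output_dir / f"{region}_{supplier}_ge_2021.csv"
    group.to_csv(output_path, index=False)
    written.append(output_path.name)

print(written)
```

With the sample data this writes 5 files, one for each of the ANZ suppliers AB, ABC, ABE, ABQ and ABW. Combinations with no rows left after filtering, such as KOREA and ABR, produce no file.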