Pandas: Need to increment duplicate file names starting at 1
I have a column of file names with many duplicates, and I need to number them incrementally starting at 001, 002, and so on, e.g. filename_001.pdf, filename_002.pdf.
import pandas as pd

df_files = pd.DataFrame([[1000, 'filename.pdf'],
                         [1001, 'filename.pdf'],
                         [1002, 'a_file.txt'],
                         [1003, 'a_file.txt'],
                         [1004, 'a_file.txt']],
                        columns=['ID', 'filename'])
All the methods I have found start numbering at 2.
First, extract the extension and the file name without the extension:
import os

df_files['ext'] = [os.path.splitext(f)[-1] for f in df_files['filename']]
df_files['Filestub'] = [os.path.splitext(f)[0] for f in df_files['filename']]
The following method increments successfully, but it does not start at 1 and does not use a zero-padded three-digit convention (e.g. 00X).
df_files['NumberedCopy'] = df_files['filename'].where(
    ~df_files['filename'].duplicated(),
    df_files['Filestub'] + "_"
    + df_files.groupby('Filestub').cumcount().add(1).astype(str) + df_files['ext'])
Output [wrong]:
     ID      filename  Filestub   ext    NumberedCopy
0  1000  filename.pdf  filename  .pdf    filename.pdf
1  1001  filename.pdf  filename  .pdf  filename_2.pdf
2  1002    a_file.txt    a_file  .txt      a_file.txt
3  1003    a_file.txt    a_file  .txt    a_file_2.txt
4  1004    a_file.txt    a_file  .txt    a_file_3.txt
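The first row of each group stays unnumbered because `Series.duplicated()` defaults to `keep='first'`, which does not flag the first occurrence, so `where` keeps the original name there. Passing `keep=False` flags every member of a repeated group instead. A minimal sketch of the difference, using a made-up series `names` (not part of the question's data) with one unique entry added for contrast:

```python
import pandas as pd

# Illustrative data only: two repeated groups plus one unique name.
names = pd.Series(['filename.pdf', 'filename.pdf',
                   'a_file.txt', 'a_file.txt', 'a_file.txt',
                   'unique.csv'])

# Default keep='first': the first occurrence of each value is NOT flagged.
first_kept = names.duplicated()

# keep=False: every value that occurs more than once is flagged.
all_marked = names.duplicated(keep=False)

print(first_kept.tolist())  # [False, True, False, True, True, False]
print(all_marked.tolist())  # [True, True, True, True, True, False]
```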
Desired output:
     ID      filename  Filestub   ext      NumberedCopy
0  1000  filename.pdf  filename  .pdf  filename_001.pdf
1  1001  filename.pdf  filename  .pdf  filename_002.pdf
2  1002    a_file.txt    a_file  .txt    a_file_001.txt
3  1003    a_file.txt    a_file  .txt    a_file_002.txt
4  1004    a_file.txt    a_file  .txt    a_file_003.txt
Try:
numbered = df_files["Filestub"] + "_" + df_files.groupby("Filestub").cumcount().add(1).astype(str).str.zfill(3) + df_files["ext"]
df_files["NumberedCopy"] = numbered.where(df_files["Filestub"].duplicated(keep=False), df_files["filename"])
>>> df_files
     ID      filename   ext  Filestub      NumberedCopy
0  1000  filename.pdf  .pdf  filename  filename_001.pdf
1  1001  filename.pdf  .pdf  filename  filename_002.pdf
2  1002    a_file.txt  .txt    a_file    a_file_001.txt
3  1003    a_file.txt  .txt    a_file    a_file_002.txt
4  1004    a_file.txt  .txt    a_file    a_file_003.txt
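The same approach could be wrapped in a small reusable function. The sketch below is only one possible packaging: the helper name `number_duplicates` and its `width` parameter are hypothetical, not part of pandas. Like the snippet above, it groups on the name without its extension, so `report.pdf` and `report.txt` would share one counter.

```python
import os
import pandas as pd


def number_duplicates(filenames: pd.Series, width: int = 3) -> pd.Series:
    """Append a zero-padded counter to names whose stub occurs more than once.

    Hypothetical helper for illustration; names with a unique stub are
    returned unchanged.
    """
    # Split each name into stub and extension, keeping the original index.
    stub = pd.Series([os.path.splitext(f)[0] for f in filenames],
                     index=filenames.index)
    ext = pd.Series([os.path.splitext(f)[1] for f in filenames],
                    index=filenames.index)

    # cumcount() is 0-based within each group, so add 1 to start at 1,
    # then left-pad with zeros to the requested width.
    counter = stub.groupby(stub).cumcount().add(1).astype(str).str.zfill(width)
    numbered = stub + "_" + counter + ext

    # Use the numbered name only where the stub is repeated at all.
    return numbered.where(stub.duplicated(keep=False), filenames)


sample = pd.Series(['filename.pdf', 'filename.pdf',
                    'a_file.txt', 'a_file.txt', 'a_file.txt',
                    'unique.csv'])
result = number_duplicates(sample)
print(result.tolist())
```

With the question's data this could be applied as `df_files['NumberedCopy'] = number_duplicates(df_files['filename'])`. A width of 3 only pads up to 999 copies per stub; larger counts still work but are no longer fixed-width.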