os.path.basename to outfile

Question

For each input file processed (see the code below), I am trying to use "os.path.basename" to write to a new output file. I know I am missing something obvious...?

import os
import glob
import gzip

dbpath = '/home/university/Desktop/test'

for infile in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
    print("current file is: " + infile)

    # ** the lines in question **
    outfile = os.path.basename('/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz').rsplit('.xaa.fastq.gz')[0]
    file = open(outfile, 'w+')
    # **

    gzsuppl = Chem.ForwardSDMolSupplier(gzip.open(infile))
    for m in gzsuppl:
        if m is None: continue
        ...etc

file.close()
print(count)

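It is not clear to me how to capture the variable [0] (i.e. everything upstream of .xaa.fastq.gz) and use it as the basename for the new output file. Unfortunately, it just writes the new output file as "??????" rather than the actual sequence of 6 letters. Thanks for any help given.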

Answer 1

This seems to get everything upstream of the .xaa.fastq.gz in the paths returned from glob() in your sample code:

import os

filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'
filepath = os.path.normpath(filepath)  # Changes path separators for Windows.

# This section was adapted from answer 
folders = []
while 1:
    filepath, folder = os.path.split(filepath)
    if folder:
        folders.append(folder)
    else:
        if filepath:
            folders.append(filepath)
        break
folders.reverse()

if len(folders) > 1:
    # The last element of folders should contain the original filename.
    filename_prefix = os.path.basename(folders[-1]).split('.')[0]
    outfile = os.path.join(*(folders[:-1] + [filename_prefix + '.rest_of_filename']))
    print(outfile)  # -> \home\university\Desktop\test\GD \AAML\DEAAML.rest_of_filename

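Of course, what ends up in outfile is not the final path plus filename, since I don't know what the rest of the filename is supposed to be and just put a placeholder ('.rest_of_filename') in its place.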
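As a shorter variant of the same idea, a single os.path.split call can separate the directory from the filename without walking the whole path. This is only a sketch: it reuses the example path and the '.rest_of_filename' placeholder from above, and it should give the same result for this path.

```python
import os

filepath = os.path.normpath('/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz')

# os.path.split separates the directory part from the final path component.
directory, filename = os.path.split(filepath)

# Everything before the first dot of the filename, as in the loop version above.
filename_prefix = filename.split('.')[0]

# '.rest_of_filename' is still only a placeholder for the real extension.
outfile = os.path.join(directory, filename_prefix + '.rest_of_filename')
print(outfile)
```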

Answer 2

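I'm not familiar with the kind of input data you're working with, but here's what I can tell you:

The "something obvious" you're missing is that outfile has no connection to infile. Your outfile line uses ?????? rather than the actual filename because that's what you asked for. It is glob.glob that turns the pattern into a list of matches.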
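A small illustration of that point: os.path.basename only manipulates the string it is given and never expands wildcards. The second path below is a made-up example of what glob.glob might return for the pattern, not a file known to exist.

```python
import os

pattern = '/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz'

# basename is pure string manipulation, so the wildcards survive untouched.
from_pattern = os.path.basename(pattern).rsplit('.xaa.fastq.gz')[0]
print(from_pattern)  # -> ??????

# Hypothetical path of the kind glob.glob would return for that pattern.
infile = '/home/university/Desktop/test/GD/AAML/DEAAML.xaa.fastq.gz'
from_infile = os.path.basename(infile).rsplit('.xaa.fastq.gz', 1)[0]
print(from_infile)  # -> DEAAML
```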

Here's how I would write that aspect of the outfile line:
```
outfile = infile.rsplit('.xaa.fastq.gz', 1)[0]
```
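(The , 1 ensures that it will never split more than once, no matter how crazy a filename gets. It's just a good habit to get into when using split or rsplit like this.)
You're also setting yourself up for a bug, because the glob pattern *.gz can match files that don't end in .xaa.fastq.gz. Any stray .gz file that happens to be in the folder listing will leave outfile with the same path as infile, and you will end up writing to your input file.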
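Both behaviours can be seen on plain strings. The filenames below are invented for illustration only.

```python
SUFFIX = '.xaa.fastq.gz'

# Without a limit, every occurrence of the suffix is a split point.
odd_name = 'sample.xaa.fastq.gz.xaa.fastq.gz'  # hypothetical filename
unlimited = odd_name.rsplit(SUFFIX)[0]
limited = odd_name.rsplit(SUFFIX, 1)[0]
print(unlimited)  # -> sample
print(limited)    # -> sample.xaa.fastq.gz

# A .gz file without the suffix comes back unchanged, so outfile == infile.
stray = 'GD/AAML/notes.gz'  # hypothetical stray file matched by *.gz
unchanged = stray.rsplit(SUFFIX, 1)[0]
print(unchanged == stray)  # -> True
```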

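There are three solutions to this problem that fit your use case:
1. Use *.xaa.fastq.gz instead of *.gz in your glob. I don't recommend this, because it's easy for a typo to sneak in and make the two differ again, which would silently reintroduce the bug.
2. Write your output to a different folder than the one you read your input from.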
```
outfile = os.path.join(outpath, os.path.relpath(infile, dbpath))

outparent = os.path.dirname(outfile)
if not os.path.exists(outparent):
    os.makedirs(outparent)
```
3. Add an assert outfile != infile line, so that in the "this should never actually happen" case the program ends with a meaningful error message instead of silently doing the wrong thing.
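Options 1 and 3 can be combined in a rough sketch like the following. The names SUFFIX, GLOB_PATTERN and derive_outfile are made up for this example; defining the suffix once is one way to reduce the typo risk mentioned in option 1. Keep in mind that assert statements are skipped when Python runs with -O.

```python
import os

# Defined once so the glob pattern and the stripping step cannot drift apart.
SUFFIX = '.xaa.fastq.gz'
GLOB_PATTERN = os.path.join('G[D|E]', '????', '*' + SUFFIX)


def derive_outfile(infile):
    """Return infile without SUFFIX; fail loudly if nothing was stripped."""
    outfile = infile.rsplit(SUFFIX, 1)[0]
    assert outfile != infile, "refusing to overwrite input file: " + infile
    return outfile
```

Calling derive_outfile on a path that lacks the suffix raises AssertionError with the offending path in the message, rather than handing back the input path.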
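The indentation you posted may be wrong, but it looks like you're opening a bunch of files and then closing only the last one. My advice is to use this instead, so it's impossible to get wrong: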
```
with open(outfile, 'w+') as file:
    # put things which use `file` here
```
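The name file is already taken (it is a built-in in Python 2), and the variable names you chose are unhelpful. I would rename infile to inpath, outfile to outpath, and file to outfile. That way, you can tell from the variable name alone whether each one is a path (i.e. a string) or a Python file object, and there is no risk of accessing file before you (re)define it and getting a very confusing error message.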
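Putting the suggestions together, the loop might look roughly like this. It is a sketch under several assumptions: it targets Python 3, outroot is a made-up name for the separate output folder, the '.out' extension is a placeholder, and copying the decompressed lines stands in for the real per-record processing.

```python
import glob
import gzip
import os

SUFFIX = '.xaa.fastq.gz'  # suffix taken from the question


def process_all(dbpath, outroot):
    """Write one output file per matching input; return the output paths."""
    written = []
    pattern = os.path.join(dbpath, 'G[D|E]', '????', '*' + SUFFIX)
    for inpath in glob.glob(pattern):
        # Mirror the input's location under outroot, minus the suffix.
        relpath = os.path.relpath(inpath, dbpath)
        outpath = os.path.join(outroot, relpath[:-len(SUFFIX)] + '.out')
        assert os.path.abspath(outpath) != os.path.abspath(inpath)

        os.makedirs(os.path.dirname(outpath), exist_ok=True)
        # Both files are closed on leaving the with block, even on error.
        with gzip.open(inpath, 'rt') as infile, open(outpath, 'w') as outfile:
            for line in infile:  # placeholder for the real processing
                outfile.write(line)
        written.append(outpath)
    return written
```

Here inpath and outpath are always strings, while infile and outfile are always open file objects.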
