TypeError: can only concatenate str (not "bytes") to str

Question

I am running cellranger mkref with a custom GTF file and hit a strange Python error:

Traceback (most recent call last):
  File "/home/user/cellranger-6.0.1/lib/python/cellranger/reference.py", line 750, in validate_gtf
    subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/home/user/cellranger-6.0.1/external/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/home/user/cellranger-6.0.1/external/anaconda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['gtf_to_gene_index', '/home/user/cellranger-6.0.1/indexes', '/home/user/cellranger-6.0.1/indexes/tmp74f_vsxg.json']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/cellranger-6.0.1/bin/rna/mkref", line 139, in <module>
    main()
  File "/home/user/cellranger-6.0.1/bin/rna/mkref", line 130, in main
    reference_builder.build_gex_reference()
  File "/home/user/cellranger-6.0.1/lib/python/cellranger/reference.py", line 613, in build_gex_reference
    self.validate_gtf()
  File "/home/user/cellranger-6.0.1/lib/python/cellranger/reference.py", line 753, in validate_gtf
    raise GexReferenceError("Error detected in GTF file: " + exc.output) from exc
TypeError: can only concatenate str (not "bytes") to str

I also have a similar GTF file that cellranger accepts without any problem. I compared the two files (the first file was in fact made from the second):

File 1: text/plain; charset=us-ascii
File 2: text/plain; charset=us-ascii

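I also inspected both files with cat -vE and they look the same. I tried converting the file to UTF-8 and searching for tokens such as b'word', but found nothing.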

How should I change the file? Thanks in advance!

Answer 1

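Python 3 treats a string of bytes as a different type of object from a string of characters. The distinction matters because the same text can be encoded as bytes in different ways. For example, in UTF-8 the character ä is the two bytes c3 a4 (hex), while in ISO-8859-1 (Latin-1) it is the single byte e4.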
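To make the byte values above concrete, here is a small sketch that encodes the same character both ways. The variable names are only illustrative.

```python
text = "ä"

# The same character becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")         # two bytes: c3 a4
latin1_bytes = text.encode("iso-8859-1")  # one byte: e4

print(utf8_bytes.hex(), latin1_bytes.hex())

# Decoding with the wrong codec does not fail here, it silently
# produces the wrong characters, which is why the expected encoding matters.
mojibake = utf8_bytes.decode("iso-8859-1")
print(repr(mojibake))
```

This is why Python refuses to mix str and bytes implicitly: it cannot know which encoding you meant.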

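As @Theophrastus said in a comment, subprocess.check_output() returns bytes, matching the low-level API. You need to decode the result into characters using the expected encoding. For example: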

>>> import subprocess
>>> raw = subprocess.check_output("ls")
>>> raw
b'\xc3\xa4iti\n'
>>> out = raw.decode('utf-8')
>>> out
'äiti\n'

Note that byte strings are displayed with a b'' prefix, while character strings are just '' with no letter in front.
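Applied to the traceback in the question: the failing line concatenates a str with exc.output, and the output attribute of a CalledProcessError raised by check_output() is bytes unless text mode was requested. The sketch below reproduces that pattern. The child command is a hypothetical stand-in for gtf_to_gene_index, and run_naive and run_decoded are names I made up for illustration. This is not the cellranger code itself.

```python
import subprocess
import sys

# Hypothetical stand-in for the failing tool: prints a message, exits with 1.
cmd = [sys.executable, "-c", "import sys; print('bad GTF line'); sys.exit(1)"]


def run_naive():
    """Same pattern as the traceback: str + bytes raises TypeError."""
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        return "Error detected in GTF file: " + exc.output  # exc.output is bytes


def run_decoded():
    """Decode the captured output before concatenating it."""
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        message = exc.output.decode("utf-8", errors="replace")
        return "Error detected in GTF file: " + message
```

Calling run_naive() ends in the same TypeError as in the question, while run_decoded() returns the tool's message. Note that in the question the TypeError appears to be hiding the real complaint about the GTF file, so the file contents may still need fixing once the message becomes visible.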

Since Python 3.6 you can also pass encoding="utf-8" directly to check_output(), in which case it returns str instead of bytes.
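A minimal sketch of the encoding argument, assuming Python 3.6 or later. To keep it independent of what is in the current directory, the child process here is a Python one-liner that writes UTF-8 bytes rather than ls. The child_code string is purely illustrative.

```python
import subprocess
import sys

# Child process that writes the UTF-8 bytes for "äiti\n" to its stdout.
child_code = "import sys; sys.stdout.buffer.write('\\u00e4iti\\n'.encode('utf-8'))"

# With encoding= set, check_output() decodes for us and returns str.
out = subprocess.check_output([sys.executable, "-c", child_code], encoding="utf-8")
print(type(out).__name__, repr(out))
```

With encoding set, the output attribute of a CalledProcessError should be decoded as well, so error handling code can concatenate it with str safely.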

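If the data contains only ASCII characters, you can of course use .decode('ascii'). It will raise an exception if the input contains any byte with the high bit set.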
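Here is a short sketch of that failure mode, together with the errors="replace" option of bytes.decode() as one possible way to avoid the exception when you only need a readable message. Whether replacing bytes is acceptable depends on your use case.

```python
raw = b"\xc3\xa4iti\n"  # contains two bytes with the high bit set

# Strict ASCII decoding fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
lenient = raw.decode("ascii", errors="replace")
print(repr(lenient))
```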

python

bioinformatics