Python: How to decode file names retrieved from 'dir' command using subprocess?

Question

I am trying to get a directory listing on a Windows 10 file system using the subprocess.Popen function and the dir command in Python 3.8.2. More specifically, I have this code:

import subprocess

process = subprocess.Popen(['dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-16'))
process.stdout.close()

When I run the above in a directory that contains file names with Unicode characters (for example "háčky a čárky.txt"), I get the following error:

Traceback (most recent call last):
  File "error.py", line 5, in <module>
    print(line.decode('utf-16'))
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 42: truncated data

Clearly the problem is the encoding. I tried 'utf-8' instead of 'utf-16', without success. When I remove the decode('utf-16') call and just use print(line), I get the following output:

b' Volume in drive C is OSDisk\r\n'
b' Volume Serial Number is 9E2B-67E3\r\n'
b'\r\n'
b' Directory of C:\Users\asamec\Dropbox\DIY\Python\AccessibleRunner\AccessibleRunner\r\n'
b'\r\n'
b'05/14/2021  09:19 AM    <DIR>          .\r\n'
b'05/14/2021  09:19 AM    <DIR>          ..\r\n'
b'05/13/2021  09:46 PM             5,697 AccessibleRunner.py\r\n'
b'05/14/2021  09:18 AM               214 error.py\r\n'
b'05/13/2021  05:48 PM             5,642 h\xa0cky a c\xa0rky.txt.py\r\n'
b'               3 File(s)         11,553 bytes\r\n'
b'               2 Dir(s)  230,706,778,112 bytes free\r\n'

When I remove the 'utf-16' argument and keep only print(line.decode()), I get the following error:

Traceback (most recent call last):
  File "error.py", line 5, in <module>
    print(line.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 40: invalid start byte

So the question is: how should I decode the standard output of the process so that the correct characters are printed?

Update

Running chcp 65001 in the Windows command line before running the Python script solves the problem. However, the following script gives the same error as above:

import subprocess

process = subprocess.Popen(['cmd', '/c', 'chcp 65001 & dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-16'))
process.stdout.close()

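However, when the same Python script is run a second time, it starts working, because the code page has already been set to 65001. So the question now is: how can I set the Windows console code page from within the Python script rather than before running it?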

Answer 1

Set the console to UTF-8 before running the script (using chcp 65001).

The script then runs without errors: .\SO524114.py

Active code page: 65001
HL~Real~Def.txt
html.txt
háčky a čárky.txt

I can reproduce the problem with the following invocation:

>NUL chcp 852
.\SO524114.py

Active code page: 852
HL~Real~Def.txt
html.txt
Traceback (most recent call last):
  File "D:\bat\SO524114.py", line 7, in <module>
    print(line.decode('utf-8').strip())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1: invalid start byte

The modified script used for testing:

import subprocess

process = subprocess.Popen(['cmd', '/c', 'chcp&dir /B h*.txt'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-8').strip())

process.stdout.close()
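
To see why the UTF-8 decode fails, it may help to look at the raw bytes from the question in isolation. The following is a minimal sketch: the byte string is copied from the question's output, and the choice of code pages 437 and 852 is an assumption (the question does not state which code page was active). Note that "č" already arrived as a plain "c", which suggests the active code page had no "č" and the character was replaced before Python ever saw it, so no decoding on the Python side can bring it back.

```python
# Bytes as printed in the question for the file named "háčky a čárky.txt".
raw = b"h\xa0cky a c\xa0rky.txt.py"

# 0xA0 is not a valid UTF-8 start byte, so decoding as UTF-8 raises.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In the OEM code pages 437 and 852 (assumed here), byte 0xA0 maps to "á".
decoded_437 = raw.decode("cp437")
decoded_852 = raw.decode("cp852")

print(utf8_ok)
print(decoded_437)
```

Decoding with the matching OEM code page avoids the exception, but only switching the console to code page 65001 before dir runs keeps characters that the OEM code page cannot represent.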

Answer 2

As was pointed out, the UTF-8 code page must be set in the Windows command line before running the dir command. Here is the complete solution to my problem:

import subprocess

subprocess.call(['chcp', '65001'], shell = True)
process = subprocess.Popen(['dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-8'))
process.stdout.close()
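
The same idea can be written so that Popen does the decoding itself, using the encoding parameter instead of calling decode() on every line. This is only a sketch: the helper name iter_output_lines is hypothetical, and errors="replace" is an added assumption so that an unexpected byte shows up as a replacement character rather than raising.

```python
import subprocess
import sys


def iter_output_lines(command, encoding="utf-8"):
    """Yield the decoded output lines of a shell command, one at a time."""
    with subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding=encoding,
        errors="replace",  # assumption: show U+FFFD instead of raising
    ) as process:
        for line in process.stdout:
            yield line.rstrip("\r\n")


if sys.platform == "win32":
    # Same order as in the answer: switch the console to UTF-8, then run dir.
    subprocess.call("chcp 65001", shell=True)
    for line in iter_output_lines("dir"):
        print(line)
```

As the question observes, the code page change outlives the script, so the console stays on 65001 afterwards.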

Answer 3

Since Python 3.6, subprocess.run() has an encoding parameter, so you can specify the encoding explicitly.

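So, if you do not want to change the encoding of CMD:

1. Type chcp in your CMD to get the active code page.
   For example, 936.
2. Look up the encoding in Code Page Identifiers.
   Identifier (936): .NET name (gb2312)
   In most cases the .NET name, here gb2312, is an encoding name that Python recognizes. You can check the Standard Encodings list in the Python 3.10 documentation to be sure (thanks to Mark Amery).
3. Add encoding='gb2312' to your subprocess.run() call.
   process_list = subprocess.run('dir', shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text=True, encoding='gb2312').stdout.split('\n')[:-1]

The subprocess.Popen constructor also has an encoding parameter, if you really want to stick with Popen, although the documentation says: "The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle."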
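
The manual lookup in the steps above could also be done at run time. The sketch below assumes that Python's "cp" + number codec names (cp437, cp852, cp936 and so on) cover the code page in use; the helper codec_from_chcp_output is hypothetical and not part of subprocess. Because the chcp message is localized, only the number is parsed.

```python
import codecs
import re
import subprocess
import sys


def codec_from_chcp_output(chcp_output):
    """Map chcp output such as 'Active code page: 936' to a Python codec name."""
    # The message text differs between Windows languages, so take the last number.
    numbers = re.findall(r"\d+", chcp_output)
    if not numbers:
        raise ValueError("no code page number found in %r" % chcp_output)
    name = "cp" + numbers[-1]
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name


if sys.platform == "win32":
    # Ask the shell for its active code page, then decode dir with that codec.
    chcp_bytes = subprocess.run("chcp", shell=True, stdout=subprocess.PIPE).stdout
    encoding = codec_from_chcp_output(chcp_bytes.decode("ascii", errors="replace"))
    listing = subprocess.run(
        "dir",
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding=encoding,
    ).stdout.splitlines()
    for line in listing:
        print(line)
```

This leaves the console code page untouched, at the cost of losing any characters that the active code page cannot represent.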

If you do want to change the encoding of CMD, refer to the answers above.
