How to pass a binary file as stdin to a Docker containerized Python script using argparse?
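Update, based on an answer below
I re-implemented that solution to simplify the problem. Let's leave Docker and Django out of it. The goal is to read an Excel file with pandas using either of the following two methods:
1. python example.py - < /path/to/file.xlsx
2. cat /path/to/file.xlsx | python example.py -
where example.py is reproduced below: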
import argparse
import contextlib
from typing import IO
import sys
import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield sys.stdin.buffer
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()
    with file_ctx(args.FILE) as input_file:
        print(input_file.read())
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()
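The problem is that pandas (see the traceback below) does not accept method 2, although it works fine with method 1.
Simply printing the text representation of the Excel file, on the other hand, works with both 1 and 2.
If you want to easily reproduce the Docker environment:
First, build a Docker image named pandas: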
docker build --pull -t pandas - <<EOF
FROM python:latest
RUN pip install pandas xlrd
EOF
Then run the script using the pandas Docker image:
docker run --rm -i -v /path/to/example.py:/example.py pandas python example.py - < /path/to/file.xlsx
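Note how it correctly prints out the plain-text representation of the Excel file, but pandas cannot read it.
A more concise traceback, similar to the one further below: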
Traceback (most recent call last):
  File "example.py", line 29, in <module>
    main()
  File "example.py", line 24, in main
    df = pd.read_excel(input_file)
  File "/usr/local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "/usr/local/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 356, in __init__
    filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.
To show that the code works when the Excel file is mounted into the container (i.e. not passed via stdin):
docker run --rm -i -v /path/to/example.py:/example.py -v /path/to/file.xlsx:/file.xlsx pandas python example.py file.xlsx
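The difference between the two invocations comes down to what kind of object sits behind stdin: with `< file` the process gets a descriptor on a regular file, while with `cat file |` it gets the read end of a pipe. As far as I understand, stdin inside `docker run -i` is also a pipe, whichever way the host side was fed. A minimal sketch for checking this, where `describe_fd` is a hypothetical helper name and not part of any library:

```python
import os
import stat
import tempfile


def describe_fd(fd: int) -> str:
    """Classify a file descriptor by the kind of object behind it."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


# Stand-in for `cat file | script`: the read end of a pipe.
read_fd, write_fd = os.pipe()
pipe_kind = describe_fd(read_fd)
os.close(read_fd)
os.close(write_fd)

# Stand-in for `script < file`: a descriptor opened on a real file.
with tempfile.TemporaryFile() as tmp:
    file_kind = describe_fd(tmp.fileno())

print(pipe_kind, file_kind)
```

Inside a script, calling `describe_fd(sys.stdin.fileno())` should report which of the two cases applies to the current invocation.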
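Original question description (additional context)
Suppose that on the host system you have a file at /tmp/test.txt that you want to use head on, but inside a Docker container (echo 'Hello World!' > /tmp/test.txt reproduces the sample data I have):
You can run: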
docker run -i busybox head -1 - < /tmp/test.txt
to print the first line to the screen,
or
cat /tmp/test.txt | docker run -i busybox head -1 -
The output is:
Hello World!
The above also works with a binary format like .xlsx instead of plain text, in which case you get some weird output similar to the following:
�Oxl/_rels/workbook.xml.rels���j�0
��}
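The point of the above is that head works with both binary and text formats, even through the abstraction of Docker.
But in my own argparse-based CLI (actually a custom Django management command, which I believe uses argparse), I get the following error when I try to use pandas' read_excel in a Docker context.
The error printed is as follows: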
Traceback (most recent call last):
  File "./manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/opt/conda/lib/python3.7/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/home/jovyan/sequence_databaseApp/management/commands/seq_db.py", line 54, in handle
    df_snapshot = pd.read_excel(options['FILE'].buffer, sheet_name='Snapshot', header=0, dtype=dtype)
  File "/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 310, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 819, in __init__
    self._reader = self._engines[engine](self._io)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_xlrd.py", line 21, in __init__
    super().__init__(filepath_or_buffer)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 356, in __init__
    filepath_or_buffer.seek(0)
io.UnsupportedOperation: File or stream is not seekable.
Specifically,
docker run -i <IMAGE> ./manage.py my_cli import - < /path/to/file.xlsx
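does not work,
but
./manage.py my_cli import - < /path/to/file.xlsx
does work!
So there is some kind of difference in the Docker context.
However, I also noticed that even with Docker taken out of the equation: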
cat /path/to/file.xlsx | ./manage.py my_cli import -
does not work,
while:
./manage.py my_cli import - < /path/to/file.xlsx
does work (as mentioned above).
Finally, here is the code I am using (you should be able to save it as my_cli.py under management/commands to get it working in a Django project):
import argparse
import sys
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE', type=argparse.FileType('r'), default=sys.stdin)

    def handle(self, *args, **options):
        import pandas as pd
        df = pd.read_excel(options['FILE'].buffer, header=0)
        print(df)
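It looks like you are reading the file in text mode (FileType('r') / sys.stdin).
According to this bpo issue, argparse does not support opening binary files directly.
I'd suggest handling the file type yourself with code similar to this (I'm not familiar with the django / pandas way of doing things, so I've simplified it to plain Python):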
import argparse
import contextlib
import io
import sys
from typing import IO


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        # read all of stdin into memory so the result is a seekable binary stream
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()
    with file_ctx(args.FILE) as input_file:
        # do whatever you need with that input file
        ...
    return 0
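The reason the in-memory copy helps is that a pipe cannot seek, whereas `io.BytesIO` can, and the traceback in the question fails exactly at `seek(0)`. A small sketch of that difference, using `os.pipe` as a stand-in for `cat file.xlsx | python example.py -` (the payload bytes are made up for illustration):

```python
import io
import os

# A pipe stands in for stdin when the script is fed through `cat ... |`.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"PK\x03\x04 pretend this is an xlsx payload")

with os.fdopen(read_fd, "rb") as piped:
    pipe_seekable = piped.seekable()      # a pipe cannot seek
    buffered = io.BytesIO(piped.read())   # copy everything into memory

buffer_seekable = buffered.seekable()     # the in-memory copy can
buffered.seek(0)
first_bytes = buffered.read(2)

print(pipe_seekable, buffer_seekable, first_bytes)
```

The trade-off is that the whole input is held in memory, which is usually acceptable for spreadsheets but worth keeping in mind for very large files.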
Mostly based on the answer above, but with slight modifications so that it fully solves the problem:
import argparse
import contextlib
import io
from typing import IO
import sys
import pandas as pd


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('FILE')
    args = parser.parse_args()
    with file_ctx(args.FILE) as input_file:
        print(input_file.read())
        df = pd.read_excel(input_file)
        print(df)


if __name__ == "__main__":
    main()
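One possible refinement, not part of the answer itself: the code above copies stdin into memory even when it was redirected from a regular file and is already seekable. A sketch that buffers only when needed, where `ensure_seekable` is a hypothetical helper name:

```python
import io
import os
from typing import IO


def ensure_seekable(stream: IO[bytes]) -> IO[bytes]:
    """Return the stream unchanged if it can seek, else an in-memory copy."""
    if stream.seekable():
        return stream
    return io.BytesIO(stream.read())


# Already seekable input (stand-in for `script < file`): returned as-is.
already_seekable = io.BytesIO(b"file on disk")
same_object = ensure_seekable(already_seekable) is already_seekable

# Non-seekable input (stand-in for `cat file | script`): copied into memory.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"piped bytes")
with os.fdopen(read_fd, "rb") as piped:
    copy = ensure_seekable(piped)

print(same_object, copy.seekable(), copy.read())
```

In `file_ctx` this would amount to `yield ensure_seekable(sys.stdin.buffer)`. Under Docker the copy will most likely always happen, since the container's stdin is a pipe.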
I got the idea after reading a related answer.
As for the Django-based context of the original question, here is how this looks:
import contextlib
import io
import sys
from typing import IO
import pandas as pd
from django.core.management.base import BaseCommand


@contextlib.contextmanager
def file_ctx(filename: str) -> IO[bytes]:
    if filename == '-':
        yield io.BytesIO(sys.stdin.buffer.read())
    else:
        with open(filename, 'rb') as f:
            yield f


class Command(BaseCommand):
    help = 'my_cli help'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(
            title='commands', dest='command', help='command help')
        subparsers.required = True
        parser_import = subparsers.add_parser('import', help='import help')
        parser_import.add_argument('FILE')

    def handle(self, *args, **options):
        with file_ctx(options['FILE']) as input_file:
            df = pd.read_excel(input_file)
            print(df)
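If the piped workbook might be too large to hold in memory comfortably, a variation on the same idea is to copy stdin to a temporary file on disk and hand that to pandas instead. This is only a sketch under that assumption; `copy_to_tempfile` is a hypothetical name, and I have not tested it against every pandas version:

```python
import contextlib
import os
import shutil
import tempfile
from typing import IO, Iterator


@contextlib.contextmanager
def copy_to_tempfile(stream: IO[bytes]) -> Iterator[IO[bytes]]:
    """Copy a non-seekable stream to a temporary file and yield it rewound."""
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(stream, tmp)  # copies in chunks, not all at once
        tmp.seek(0)
        yield tmp  # the file is deleted when the block exits


# A pipe stands in for stdin fed by `cat file.xlsx |` or `docker run -i`.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"x" * 1000)

with os.fdopen(read_fd, "rb") as piped, copy_to_tempfile(piped) as seekable_file:
    can_seek = seekable_file.seekable()
    size = len(seekable_file.read())

print(can_seek, size)
```

In the `'-'` branch of `file_ctx`, `with copy_to_tempfile(sys.stdin.buffer) as f: yield f` would take the place of the `io.BytesIO` line.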