Reading a tar.Z file as a pandas DataFrame in Python 3.7.4?

Question

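I would like to download a dataset from the UCI repository.

The dataset is in tar.Z format, and ideally I would like to read it in as a pandas DataFrame.

I have already checked "uncompressing tar.Z file with python?", which suggested the gzip library, so following https://docs.python.org/3/library/gzip.html I tried the code below, but I get an error message.

Thanks for your help!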

import gzip
with gzip.open('https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z', 'rb') as f:
    file_content = f.read()

ERROR MESSAGE:
OSError: [Errno 22] Invalid argument: 'https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z'

Answer 1

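I don't think any module that ships with Python can read .Z data; you could browse PyPI to see whether a package handles the .Z format. You can, however, process the data from the command line.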

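import subprocess
from io import StringIO

import pandas as pd

# Stream the archive from curl straight into tar: "-f -" makes tar read
# from stdin, "-z" decompresses through gzip (which also understands .Z),
# and "-O" writes the matching members to stdout instead of to disk.
data = subprocess.run(
    """curl -s https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z |
    tar -xzOf - --wildcards 'Diabetes-Data/data-*' """,
    shell=True,
    capture_output=True,
    text=True,
).stdout


df = pd.read_csv(StringIO(data), sep="\t", header=None)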

df.head()

            0      1   2    3
0  04-21-1991   9:09  58  100
1  04-21-1991   9:09  33  009
2  04-21-1991   9:09  34  013
3  04-21-1991  17:08  62  119
4  04-21-1991  17:08  33  007

You can read this ebook to learn more about the command-line options.
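If you would rather keep the tar handling in Python and shell out only for the decompression step, something along these lines might work. It is a rough sketch, not a tested recipe for this dataset: it assumes a `gzip` executable is on your PATH (its `-dc` mode can also decode `.Z` files), and the function names `fetch_and_uncompress` and `read_tar_rows` are made up for illustration.

```python
import io
import subprocess
import tarfile
import urllib.request
from fnmatch import fnmatch


def fetch_and_uncompress(url):
    """Download a .Z archive and return the decompressed tar bytes.

    Assumption: a `gzip` executable is available on PATH.
    """
    with urllib.request.urlopen(url) as response:
        compressed = response.read()
    # gzip -dc reads the compressed bytes on stdin and writes the result to stdout.
    result = subprocess.run(
        ["gzip", "-dc"], input=compressed, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


def read_tar_rows(tar_bytes, pattern, sep="\t"):
    """Collect the rows of every regular tar member whose name matches `pattern`.

    Each row is a list of string fields with the member name appended,
    so every record can be traced back to the file it came from.
    """
    rows = []
    # mode="r:" opens an uncompressed tar stream from the in-memory buffer.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive:
            if not member.isfile() or not fnmatch(member.name, pattern):
                continue
            raw = archive.extractfile(member).read()
            text = raw.decode("utf-8", errors="replace")
            for line in text.splitlines():
                if line.strip():  # skip blank lines
                    rows.append(line.split(sep) + [member.name])
    return rows
```

With pandas installed, `pd.DataFrame(read_tar_rows(fetch_and_uncompress(url), "Diabetes-Data/data-*"))` should then give a frame similar to the one above, with one extra column holding the source file name. All fields stay as strings, so you would convert the columns you need afterwards. The member-name pattern is taken from the tar command above; if the names in the archive carry a prefix such as `./`, the pattern would need adjusting.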
