How to concatenate multiple data frames with long index without error?

Question

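I have a directory ".../dados" containing several subdirectories whose names are a serial number followed by some irrelevant information, e.g. "17448_2017_Jul_2017_Oct", where the first number is the serial number. Each subdirectory holds four ".txt" files, named the same way in every subdirectory. Each line/row of a file contains a date and time plus a value for one type of attribute, say humidity, e.g. "2019-01-29 03:11:26 54.7".

I want to concatenate all of these to produce a single dataset with a date index.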

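import os

import pandas as pd

path = "/.../dados/"

df = pd.DataFrame()

for fld in os.listdir(path):
    subfld = path + fld
    if os.path.isdir(subfld):
        aux = pd.DataFrame()
        # The serial number is the first underscore-separated token of the folder name.
        sn = fld.split('_')[0]
        for file in os.listdir(subfld):
            filepath = os.path.join(subfld, file)
            if os.path.isfile(filepath):
                # Each file becomes one column, named after part of the file name.
                new_col = pd.read_fwf(filepath, colspecs=[(0, 19), (20, -1)], skiprows=8,
                                      names=[file.split('_')[2][:-4]], parse_dates=[0],
                                      nrows=9999999)
                # Place the new column side by side with the ones read so far.
                aux = pd.concat([aux, new_col], axis=1,  sort=False)
        aux['Machine'] = sn
        # Stack this machine's rows under the previous ones
        # (DataFrame.append was removed in pandas 2.0).
        df = pd.concat([df, aux], sort=False)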

This is the output of df.head(10):

                     HumTechRoom  TempTechRoom  TempExamRoom  HumExamRoom Machine
2018-03-04 00:45:11         82.6           NaN           NaN          NaN   22162
2018-03-04 00:45:47         80.0           NaN           NaN          NaN   22162
2018-03-04 00:45:53         78.0           NaN           NaN          NaN   22162
2018-03-04 00:46:04         75.9           NaN           NaN          NaN   22162
2018-03-04 00:46:20         73.7           NaN           NaN         51.3   22162
2018-03-04 00:46:58         71.7           NaN           NaN          NaN   22162
2018-03-04 00:47:40          NaN           NaN           NaN         53.4   22162
2018-03-04 00:47:41          NaN          14.5           NaN          NaN   22162
2018-03-04 00:47:54         74.3           NaN           NaN          NaN   22162
2018-03-04 00:47:59         76.6           NaN           NaN          NaN   22162

This is the error message I get:

...
line 31, in <module>
    aux = pd.concat([aux, new_col], axis=1,  sort=False)

  File ".../concat.py", line 226, in concat
    return op.get_result()

  File ".../concat.py", line 423, in get_result
    copy=self.copy)

  File ".../internals.py", line 5425, in concatenate_block_managers
    return BlockManager(blocks, axes)

  File ".../internals.py", line 3282, in __init__
    self._verify_integrity()

  File ".../internals.py", line 3493, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)

  File ".../internals.py", line 4843, in construction_error
    passed, implied))

ValueError: Shape of passed values is (2, 19687), indices imply (2, 19685)

Answer 1

The shapes of your DataFrames are incompatible:

ValueError: Shape of passed values is (2, 19687), indices imply (2, 19685)

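In other words, the problem is that 19687 != 19685. Any answer will depend on the specifics of your data, and given its size, sharing it is probably impractical. At least 2 rows need to be added or removed somewhere. You will have to investigate to find out which rows and where.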
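As a starting point for that investigation, here is a minimal sketch. One frequent source of a row-count mismatch when concatenating with axis=1 is an index that contains repeated labels, such as two readings logged in the same second. Whether that is what happened in your data is only an assumption. The frames `hum` and `temp` below are made-up stand-ins, not your files.

```python
import pandas as pd

# Hypothetical stand-ins for two of the per-file frames; the second one
# repeats a timestamp, as a sensor log might.
hum = pd.DataFrame(
    {"HumTechRoom": [82.6, 80.0]},
    index=pd.to_datetime(["2018-03-04 00:45:11", "2018-03-04 00:45:47"]),
)
temp = pd.DataFrame(
    {"TempTechRoom": [14.5, 14.6, 14.7]},
    index=pd.to_datetime(
        ["2018-03-04 00:45:11", "2018-03-04 00:47:41", "2018-03-04 00:47:41"]
    ),
)

# Count how many index labels are repeats of an earlier label in each frame.
dup_counts = {
    name: int(frame.index.duplicated().sum())
    for name, frame in {"hum": hum, "temp": temp}.items()
}
print(dup_counts)

# One possible cleanup (an assumption about what you want): keep only the
# first row recorded for each timestamp.
temp_unique = temp[~temp.index.duplicated(keep="first")]

# With unique labels on both sides, the column-wise concat aligns on the index.
combined = pd.concat([hum, temp_unique], axis=1, sort=False)
print(combined)
```

If the counts come back as zero for all of your files, repeated labels are not the cause and the mismatch lies elsewhere.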

Answer 2

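It looks like you are using pd.concat along the wrong axis. Remove axis=1 from the pd.concat line; axis=0 is the default, as you can see in the docs.

For convenience, to get a cleaner data frame, also pass ignore_index=True: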

aux = pd.concat([aux, new_col], ignore_index=True,  sort=False)

This returns the result with a reset index.
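To see what that call does, here is a small sketch with two made-up frames (`a` and `b` are illustrative names, not the asker's data). Note that ignore_index=True replaces the original labels with 0, 1, 2, ..., so timestamps stored in the index are dropped; leaving the argument out keeps them.

```python
import pandas as pd

# Two hypothetical single-column frames indexed by timestamp.
a = pd.DataFrame(
    {"HumTechRoom": [82.6, 80.0]},
    index=pd.to_datetime(["2018-03-04 00:45:11", "2018-03-04 00:45:47"]),
)
b = pd.DataFrame(
    {"TempTechRoom": [14.5]},
    index=pd.to_datetime(["2018-03-04 00:47:41"]),
)

# Row-wise concat (axis=0 is the default); the index is renumbered from 0.
stacked = pd.concat([a, b], ignore_index=True, sort=False)
print(stacked)

# Same concat without ignore_index: the timestamp labels are preserved.
kept = pd.concat([a, b], sort=False)
print(kept)
```

In both results the rows are stacked, so each row has a value in only one of the two columns and NaN in the other.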

Tags: python, concat, dataframe, pandas, valueerror