Loading dataset where each observation is split over 2 lines with python pandas

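I am trying to import a dataset in which each observation is split over two lines. I read the data in and named some of the columns, and the DataFrame looks like this: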

       crim     zn  indus  chas     nox      rm    age     dis   rad    tax  ptratio
20  0.00632  18.00  2.310     0  0.5380  6.5750  65.20  4.0900     1  296.0    15.30
21   396.90   4.98  24.00  None    None    None   None    None  None   None     None
22  0.02731   0.00  7.070     0  0.4690  6.4210  78.90  4.9671     2  242.0    17.80
23   396.90   9.14  21.60  None    None    None   None    None  None   None     None
...

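The three values in row 21 should go into three new columns of the previous row, and the same applies to the rest of the DataFrame. How can I do this? Thanks.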

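You can use pandas shift to move the series you want to relocate up by one row. The shifted columns can then be added to the DataFrame, after which you drop every second row.

For example, with the sample data built by hand: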

import pandas as pd

headers = ["crim", "zn", "indus", "chas",  "nox", "rm", "age", "dis", "rad", "tax", "ptratio"]
vals = [[0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30],
    [396.90, 4.98, 24.00, None, None, None, None, None, None, None, None],
    [0.02731, 0.00, 7.070, 0, 0.4690, 6.4210, 78.90, 4.9671, 2, 242.0, 17.80],
    [396.90,  9.14, 21.60, None, None, None, None, None, None, None, None]]

df = pd.DataFrame(columns=headers, data=vals)

# add shifted rows
df['new_crim'] = df['crim'].shift(-1)
df['new_zn'] = df['zn'].shift(-1)
df['new_indus'] = df['indus'].shift(-1)

# remove every second row
df_cleaned = df.iloc[::2].reset_index(drop=True)

print(df_cleaned)

Output:

      crim    zn  indus  chas    nox     rm   age     dis  rad    tax  ptratio  new_crim  new_zn  new_indus
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3     396.9    4.98       24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8     396.9    9.14       21.6
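
If many columns have to be moved, shifting each one by hand gets repetitive. As a minimal sketch of an alternative, assuming the rows strictly alternate between the first and second part of an observation and that the second part always fills the first three columns, the frame can be split into even and odd rows and joined side by side. The names first_half, second_half and combined are illustrative only.

```python
import pandas as pd

headers = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio"]
vals = [[0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30],
        [396.90, 4.98, 24.00, None, None, None, None, None, None, None, None],
        [0.02731, 0.00, 7.070, 0, 0.4690, 6.4210, 78.90, 4.9671, 2, 242.0, 17.80],
        [396.90, 9.14, 21.60, None, None, None, None, None, None, None, None]]
df = pd.DataFrame(columns=headers, data=vals)

# Even rows hold the first part of each observation.
first_half = df.iloc[::2].reset_index(drop=True)

# Odd rows hold the overflow values in their first three columns (assumption).
second_half = df.iloc[1::2, :3].reset_index(drop=True)
second_half.columns = ["new_" + name for name in second_half.columns]

# Join the two halves side by side, aligned on the fresh 0..n-1 index.
combined = pd.concat([first_half, second_half], axis=1)
print(combined)
```

Both halves are re-indexed before the concat because pandas aligns on the index; without reset_index the even and odd row labels would not match up.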

I would not read the file content with pandas in the first place. Instead, I would read the file "manually", row by row, and then build a DataFrame from the content. A suggestion:

import pandas as pd

with open("path/to/file.txt") as f:
    # Read the first line, i.e. the headers, and split it on whitespace.
    headers = f.readline().split()
    content = []
    while True:
        # Read two lines at a time, since each observation is spread across two lines.
        line = f.readline() + f.readline()
        # An empty string means the end of the file has been reached.
        if line == "":
            break
        # Split on any whitespace (spaces and newlines) to get the individual values.
        content.append(line.split())

# The result is a list of lists, which converts directly to a DataFrame.
# Note that content[1:] skips the first two-line record; pass content to keep it.
df = pd.DataFrame(content[1:])

The DataFrame has no headers, but they can easily be added from the headers variable if needed.
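
A minimal sketch of adding the names, assuming each combined record has three more values than the header line names. The extra names new_crim, new_zn and new_indus are placeholders, and content here is a small hand-written stand-in for the list built by the loop above. Because values read from a text file are strings, the sketch also converts them to numbers with pd.to_numeric.

```python
import pandas as pd

# Stand-ins for the variables produced by the file-reading loop (sample values).
headers = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio"]
content = [
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00".split(),
    "0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60".split(),
]

# Placeholder names for the three values taken from the second line of each record.
extra = ["new_crim", "new_zn", "new_indus"]

df = pd.DataFrame(content, columns=headers + extra)

# Every value is still a string at this point; convert each column to a numeric dtype.
df = df.apply(pd.to_numeric)
print(df.dtypes)
```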

The advantage of this approach is that it is flexible and, in my opinion, easy to understand and to extend to other use cases.