Loading a dataset where each observation is split over two lines with Python pandas
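I am trying to import a dataset in which each observation is split over two lines. I read the data and named some of the columns; the DataFrame looks like this: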
crim zn indus chas nox rm age dis rad tax ptratio
20 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30
21 396.90 4.98 24.00 None None None None None None None None
22 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80
23 396.90 9.14 21.60 None None None None None None None None
...
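The three values in row 21 should go into three new columns on the previous row, and the same applies to the rest of the DataFrame. How can I do this?
Thanks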
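You can use pandas shift to move the Series you want to relocate. The shifted columns can then be added to the DataFrame, after which you drop every second row.
The example below first rebuilds the sample data and then applies this: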
import pandas as pd
headers = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio"]
vals = [[0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30],
[396.90, 4.98, 24.00, None, None, None, None, None, None, None, None],
[0.02731, 0.00, 7.070, 0, 0.4690, 6.4210, 78.90, 4.9671, 2, 242.0, 17.80],
[396.90, 9.14, 21.60, None, None, None, None, None, None, None, None]]
df = pd.DataFrame(columns=headers, data=vals)
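# add shifted columns: shift(-1) moves each value up one row,
# so the continuation values land on the first line of their observation
df['new_crim'] = df['crim'].shift(-1)
df['new_zn'] = df['zn'].shift(-1)
df['new_indus'] = df['indus'].shift(-1)
# keep every second row (the first line of each observation) and renumber the index;
# chaining reset_index avoids modifying a slice in place
df_cleaned = df.iloc[::2].reset_index(drop=True)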
print(df_cleaned)
Output:
crim zn indus chas nox rm age dis rad tax ptratio new_crim new_zn new_indus
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.9 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.9 9.14 21.6
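If there are many continuation values, writing one shift line per column gets repetitive. The following is only a sketch of an alternative, not part of the answer above: it splits the frame into even and odd rows and joins them side by side. The helper name merge_row_pairs and its parameters n_extra and prefix are hypothetical, and it assumes every observation occupies exactly two rows with the continuation values in the first n_extra columns.

```python
import pandas as pd

headers = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio"]
vals = [
    [0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30],
    [396.90, 4.98, 24.00, None, None, None, None, None, None, None, None],
    [0.02731, 0.00, 7.070, 0, 0.4690, 6.4210, 78.90, 4.9671, 2, 242.0, 17.80],
    [396.90, 9.14, 21.60, None, None, None, None, None, None, None, None],
]
df = pd.DataFrame(vals, columns=headers)


def merge_row_pairs(frame, n_extra=3, prefix="new_"):
    """Hypothetical helper: join each odd row onto the even row before it."""
    # Even rows hold the first line of every observation.
    first = frame.iloc[::2].reset_index(drop=True)
    # Odd rows hold the continuation; only the first n_extra columns carry data.
    second = frame.iloc[1::2, :n_extra].reset_index(drop=True)
    # Rename the continuation columns so they do not clash with the originals.
    second = second.add_prefix(prefix)
    # Both halves now share the index 0..n-1, so they align row by row.
    return pd.concat([first, second], axis=1)


merged = merge_row_pairs(df)
print(merged)
```

The new columns are named after the columns they happened to be read into (new_crim, new_zn, new_indus), matching the naming used above; rename them once the real variable names are known.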
I would not use pandas to read the file content in the first place. Instead, I would read the file 'manually', row by row, and then build a DataFrame from the content. A suggestion:
import pandas as pd

with open("path/to/file.txt") as f:
    # Read the first line, i.e. the headers, and split it on whitespace.
    headers = f.readline().split()
    content = []
    while True:
        # Read two lines at a time, since each observation is spread across two lines.
        line = f.readline() + f.readline()
        # An empty string means the end of the file has been reached.
        if line == "":
            break
        # Split on any whitespace (spaces and newlines); the values are still strings.
        content.append(line.split())

# The result is a list of lists, from which a DataFrame is easily created.
df = pd.DataFrame(content)
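The DataFrame has no headers, but they can easily be added from the headers variable if needed.
The advantage of this approach is that it is flexible and, in my opinion, easy to understand and extend to other use cases.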
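As a rough sketch of how the headers could be attached afterwards: the header line in the question names 11 columns, while each merged observation has 14 values, so the remaining columns need names of their own. The placeholder names extra_1, extra_2, ... below are hypothetical, and the in-memory sample stands in for the real file, whose exact layout is an assumption here. Because the values are read as strings, the sketch also converts them with pd.to_numeric.

```python
import io

import pandas as pd

# Stand-in for the real file; this layout is an assumption made for illustration.
sample = io.StringIO(
    "crim zn indus chas nox rm age dis rad tax ptratio\n"
    " 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30\n"
    " 396.90 4.98 24.00\n"
    " 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80\n"
    " 396.90 9.14 21.60\n"
)

headers = sample.readline().split()
content = []
while True:
    # Each observation spans two physical lines.
    line = sample.readline() + sample.readline()
    if line == "":
        break
    content.append(line.split())

# Name the columns that the header line does not cover (hypothetical names).
n_extra = len(content[0]) - len(headers)
columns = headers + [f"extra_{i}" for i in range(1, n_extra + 1)]

# Build the frame and convert every column from strings to numbers.
df = pd.DataFrame(content, columns=columns).apply(pd.to_numeric)
print(df.dtypes)
```

If the real names of the trailing columns are known, pass them in place of the generated placeholders.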