在 Python 中转置人口普查平面文件
Transpose Census Flat Files in Python
我正在尝试转置此美国人口普查平面文件:http://www2.census.gov/govs/retire/2013indiv_unit_reported_data.txt 到 Python。
在第一列中,前14个字符代表一行,后三个字符代表一列。第二列是该列和行的值。似乎无法找到使用 Python 将其变成 table 的好方法。
旁注:我的最终目标是创建一个脚本,自动将这些文件导入 ArcGIS,这就是我在 Python 中尝试这样做的原因。
虽然您也可以在纯 Python 中执行此操作,但使用 pandas
会使这成为一个非常简单的问题,因为它是一个主元操作:
df = pd.read_csv("2013indiv_unit_reported_data.txt", delim_whitespace=True,
names=["rowcol", "data"])
df["row"] = df["rowcol"].str[:14]
df["col"] = df["rowcol"].str[14:]
df_new = df.pivot(index="row", columns="col", values="data")
df_new = df_new.fillna("")
df_new.to_csv("table.dat", index=False)
它产生一个左上角看起来像
的DataFrame
>>> df_new.iloc[:5,:5]
col V87 X01 X02 X04 X05
row
01000000003401 0131312 139748131312 82075131312 213456131312
01000000003402 01313NR 474241131312 01313NR 627892131312
01000000003403 01313NR 01313NR 3677131312 0131312
01000000003701 01313NR 578131312 01313NR 3309131312
01103703710000 122741313NR 119541313NR 27761313NR
和一个看起来像
的输出数据文件
>>> !head table.dat
V87,X01,X02,X04,X05,X06,X08,X11,X12,X21,X30,X33,X35,X42,X44,X46,X47,Z01,Z02,Z03,Z04,Z05,Z13,Z14,Z15,Z16,Z62,Z63,Z68,Z70,Z71,Z72,Z73,Z75,Z76,Z77,Z78,Z81,Z82,Z83,Z84,Z87,Z88,Z89,Z91,Z93,Z96,Z98,Z99
0131312,139748131312,82075131312,,213456131312,125363131312,1294714131312,895475131312,44837131312,393606131312,0131312,0131312,0131312,0131312,1309366131312,955067131312,3333131312,84169131312,10554131312,35773131312,3826131312,3498131312,780456131312,87838131312,27181131312,0131312,0131312,2266097131312,389145131312,1309366131312,172000131312,138000131312,0131312,53844131312,30325131312,2266097131312,5056820131312,9984289131312,958400131312,0131312,0131312,0131312,4461131312,0131312,01313NR,9767131312,984714131312,0131312,125363131312
01313NR,474241131312,01313NR,,627892131312,0131312,27384181313NR,1893321131312,55891131312,404296131312,932401131312,219743131312,01313NR,01313NR,29514461313NR,1963274131312,01313NR,133791131312,18568131312,69259131312,4990131312,4121131312,1720307131312,119270131312,53744131312,0131312,61902131312,3830519131312,378156131312,2951446131312,334155131312,304611131312,9006131312,01313NR,1337911313NR,38305191313NR,10514970131312,20596906131312,1963274131312,01313NR,01313NR,01313NR,26140131312,650756131312,01313NR,34803131312,2090646131312,01313NR,01313NR
如果你真的想手动完成,像这样的事情应该可行:
with open("2013indiv_unit_reported_data.txt") as fp:
all_data = {}
for line in fp:
rowcol, data = line.split()
row, col = rowcol[:14], rowcol[14:]
all_data[row, col] = data
import csv
rows, cols = [sorted({key[i] for key in all_data}) for i in range(2)]
with open("table2.dat", "wb") as fp: # python 2
writer = csv.writer(fp)
writer.writerow(cols)
for row in rows:
line = [all_data.get((row, col), '') for col in cols]
writer.writerow(line)
我正在尝试转置此美国人口普查平面文件:http://www2.census.gov/govs/retire/2013indiv_unit_reported_data.txt 到 Python。
在第一列中,前14个字符代表一行,后三个字符代表一列。第二列是该列和行的值。似乎无法找到使用 Python 将其变成 table 的好方法。
旁注:我的最终目标是创建一个脚本,自动将这些文件导入 ArcGIS,这就是我在 Python 中尝试这样做的原因。
虽然您也可以在纯 Python 中执行此操作,但使用 pandas
会使这成为一个非常简单的问题,因为它是一个主元操作:
df = pd.read_csv("2013indiv_unit_reported_data.txt", delim_whitespace=True,
names=["rowcol", "data"])
df["row"] = df["rowcol"].str[:14]
df["col"] = df["rowcol"].str[14:]
df_new = df.pivot(index="row", columns="col", values="data")
df_new = df_new.fillna("")
df_new.to_csv("table.dat", index=False)
它产生一个左上角看起来像
的DataFrame>>> df_new.iloc[:5,:5]
col V87 X01 X02 X04 X05
row
01000000003401 0131312 139748131312 82075131312 213456131312
01000000003402 01313NR 474241131312 01313NR 627892131312
01000000003403 01313NR 01313NR 3677131312 0131312
01000000003701 01313NR 578131312 01313NR 3309131312
01103703710000 122741313NR 119541313NR 27761313NR
和一个看起来像
的输出数据文件>>> !head table.dat
V87,X01,X02,X04,X05,X06,X08,X11,X12,X21,X30,X33,X35,X42,X44,X46,X47,Z01,Z02,Z03,Z04,Z05,Z13,Z14,Z15,Z16,Z62,Z63,Z68,Z70,Z71,Z72,Z73,Z75,Z76,Z77,Z78,Z81,Z82,Z83,Z84,Z87,Z88,Z89,Z91,Z93,Z96,Z98,Z99
0131312,139748131312,82075131312,,213456131312,125363131312,1294714131312,895475131312,44837131312,393606131312,0131312,0131312,0131312,0131312,1309366131312,955067131312,3333131312,84169131312,10554131312,35773131312,3826131312,3498131312,780456131312,87838131312,27181131312,0131312,0131312,2266097131312,389145131312,1309366131312,172000131312,138000131312,0131312,53844131312,30325131312,2266097131312,5056820131312,9984289131312,958400131312,0131312,0131312,0131312,4461131312,0131312,01313NR,9767131312,984714131312,0131312,125363131312
01313NR,474241131312,01313NR,,627892131312,0131312,27384181313NR,1893321131312,55891131312,404296131312,932401131312,219743131312,01313NR,01313NR,29514461313NR,1963274131312,01313NR,133791131312,18568131312,69259131312,4990131312,4121131312,1720307131312,119270131312,53744131312,0131312,61902131312,3830519131312,378156131312,2951446131312,334155131312,304611131312,9006131312,01313NR,1337911313NR,38305191313NR,10514970131312,20596906131312,1963274131312,01313NR,01313NR,01313NR,26140131312,650756131312,01313NR,34803131312,2090646131312,01313NR,01313NR
如果你真的想手动完成,像这样的事情应该可行:
with open("2013indiv_unit_reported_data.txt") as fp:
all_data = {}
for line in fp:
rowcol, data = line.split()
row, col = rowcol[:14], rowcol[14:]
all_data[row, col] = data
import csv
rows, cols = [sorted({key[i] for key in all_data}) for i in range(2)]
with open("table2.dat", "wb") as fp: # python 2
writer = csv.writer(fp)
writer.writerow(cols)
for row in rows:
line = [all_data.get((row, col), '') for col in cols]
writer.writerow(line)