Concat dataframes: give unique names to columns and drop duplicates
I am looping through monthly weather station data. I can concatenate the files as follows:
import glob
import pandas as pd
path = r"D:\NOAA\output\TEST"
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    print(filename)  # prints D:\NOAA\output\TEST9501.tave.conus.csv
df = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df, axis=1, join='inner')
This produces the following dataframe:
lat lon temp lat lon temp lat lon temp
0 24.5625 -81.8125 21.06 24.5625 -81.8125 17.08 24.5625 -81.8125 22.42
1 24.5625 -81.7708 21.06 24.5625 -81.7708 17.08 24.5625 -81.7708 22.47
2 24.5625 -81.7292 21.06 24.5625 -81.7292 17.08 24.5625 -81.7292 22.47
3 24.5625 -81.6875 21.05 24.5625 -81.6875 17.04 24.5625 -81.6875 22.47
4 24.6042 -81.6458 21.06 24.6042 -81.6458 17.08 24.6042 -81.6458 22.45
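The lat and lon columns are identical, so I would like to drop those duplicated columns. The temp column is unique to each monthly CSV file. I want to keep all of them, but also give them meaningful column names taken from the file names, i.e.: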
lat lon temp185901 temp185902 temp185903
0 24.5625 -81.8125 21.06 17.08 22.42
1 24.5625 -81.7708 21.06 17.08 22.47
2 24.5625 -81.7292 21.06 17.08 22.47
3 24.5625 -81.6875 21.05 17.04 22.47
4 24.6042 -81.6458 21.06 17.08 22.45
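I'm new to Pandas (it seems great, but there is a lot to take in), and I would appreciate any help. I think the solution lies in the parameters I pass to .concat(), .duplicate(), and .loc().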
Sample data: ftp://ftp.commissions.leg.state.mn.us/pub/gis/Temp/NOAA/
You can merge on the two key columns and set suffixes for the other columns:
temp = df1.merge(df2, on=['lat','lon'], suffixes=('185901','185902'))
lat lon temp185901 temp185902
0 24.5625 -81.8125 21.06 17.08
1 24.5625 -81.7708 21.06 17.08
2 24.5625 -81.7292 21.06 17.08
3 24.5625 -81.6875 21.05 17.04
4 24.6042 -81.6458 21.06 17.08
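One detail worth knowing here: merge only applies suffixes to column names that clash between the two frames. The sketch below uses small hand-built frames (df1, df2, df3 and their values are stand-ins for the CSV contents, not the real files) to show that a third merge leaves the new column as plain temp, which is why it needs an explicit rename.

```python
import pandas as pd

# Stand-ins for three monthly files (two rows each, values taken from the question).
df1 = pd.DataFrame({"lat": [24.5625, 24.5625], "lon": [-81.8125, -81.7708], "temp": [21.06, 21.06]})
df2 = pd.DataFrame({"lat": [24.5625, 24.5625], "lon": [-81.8125, -81.7708], "temp": [17.08, 17.08]})
df3 = pd.DataFrame({"lat": [24.5625, 24.5625], "lon": [-81.8125, -81.7708], "temp": [22.42, 22.47]})

# Both frames have a 'temp' column, so the suffixes are applied.
two = df1.merge(df2, on=["lat", "lon"], suffixes=("185901", "185902"))

# 'two' no longer has a column called 'temp', so nothing clashes with df3
# and the suffixes passed here are silently ignored.
three = two.merge(df3, on=["lat", "lon"], suffixes=("_x", "185903"))

print(list(two.columns))
print(list(three.columns))
```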
To add a third file, merge again and rename the new temp column:
temp.merge(df3, on=['lat','lon']).rename(columns={'temp':'temp185903'})
lat lon temp185901 temp185902 temp185903
0 24.5625 -81.8125 21.06 17.08 22.42
1 24.5625 -81.7708 21.06 17.08 22.47
2 24.5625 -81.7292 21.06 17.08 22.47
3 24.5625 -81.6875 21.05 17.04 22.47
4 24.6042 -81.6458 21.06 17.08 22.45
Or do it in a loop:
import os
df = None
for filename in all_files:
    df1 = pd.read_csv(filename)
    # Assumption: the label is the part of the file name before the first dot
    label = os.path.basename(filename).split('.')[0]
    df1 = df1.rename(columns={'temp': 'temp' + label})
    # first file: nothing to merge with yet
    df = df1 if df is None else df.merge(df1, on=['lat','lon'])
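If you prefer to avoid the first-iteration check, the same idea can be written with functools.reduce. This is only a sketch: merge_monthly is a hypothetical helper, and the frames dict is built by hand here where you would normally build it from the file names.

```python
from functools import reduce

import pandas as pd


def merge_monthly(frames):
    """Merge {label: DataFrame} on lat/lon, naming each temp column 'temp<label>'.

    Hypothetical helper for illustration; assumes every frame has
    exactly the columns lat, lon and temp.
    """
    renamed = [
        df.rename(columns={"temp": "temp" + label})
        for label, df in frames.items()
    ]
    # Inner-merge the frames pairwise, left to right.
    return reduce(lambda left, right: left.merge(right, on=["lat", "lon"]), renamed)


# Hand-built stand-ins for three monthly files.
keys = {"lat": [24.5625, 24.5625], "lon": [-81.8125, -81.7708]}
frames = {
    "185901": pd.DataFrame({**keys, "temp": [21.06, 21.06]}),
    "185902": pd.DataFrame({**keys, "temp": [17.08, 17.08]}),
    "185903": pd.DataFrame({**keys, "temp": [22.42, 22.47]}),
}

wide = merge_monthly(frames)
print(wide)
```

Renaming before merging means no two frames ever share a temp column, so suffixes never come into play.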
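This appends the new data as new rows in the concatenated dataframe, with a 'date' column indicating which file the data came from. Modify the logic to get the date from the file name.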
import pandas as pd
import glob
path = r'D:\NOAA\output\TEST'
all_files = glob.glob(path + '/*.csv')
frames = []
for file in all_files:
    df = pd.read_csv(file)
    df['date'] = file  # amend this to extract the date from your file names
    frames.append(df)
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
df_concat = pd.concat(frames)
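If you go the long-format route but still want one temp column per month, as in the question, you can pivot afterwards. A minimal sketch, assuming the 'date' column already holds a short label such as 185901 (the frames below are hand-built stand-ins, not the real files):

```python
import pandas as pd

# Stand-in for the long frame the loop produces: one block of rows per file.
keys = {"lat": [24.5625, 24.5625], "lon": [-81.8125, -81.7708]}
long_df = pd.concat(
    [
        pd.DataFrame({**keys, "temp": [21.06, 21.06]}).assign(date="185901"),
        pd.DataFrame({**keys, "temp": [17.08, 17.08]}).assign(date="185902"),
        pd.DataFrame({**keys, "temp": [22.42, 22.47]}).assign(date="185903"),
    ],
    ignore_index=True,
)

# One row per (lat, lon), one column per date.
# Note: pivot_table averages temp if a (lat, lon, date) combination repeats.
wide = (
    long_df.pivot_table(index=["lat", "lon"], columns="date", values="temp")
    .add_prefix("temp")
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'date' label on the column axis

print(wide)
```

Keeping the data long is often handier for grouping or plotting by month; the pivot is only needed when you want the wide layout.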