Pandas: Running a calculation on rows of a data table based on multiple columns and storing the output in a new column
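I am trying to calculate the distance between two locations, and I have been given the longitude and latitude of both points. My CSV has 4 columns (lat1, lon1, lat2, lon2). How can I apply the code below so that it creates a fifth column called 'Distance', with the distance calculated using that code?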
import math
from math import sin, cos, sqrt, atan2, radians
# approximate radius of earth in km
R = 6373.0
#Test
lat1 = radians(25.2296756)
lon1 = radians(36.0122287)
lat2 = radians(51.406374)
lon2 = radians(20.9251681)
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance = R * c
print("Result:", distance)
print("Should be:", 3181.11, "km")
The DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Normalised': {(0, 'London,', 'United', 'Kingdom'): '-',
(1, 'Johannesburg,', 'South', 'Africa'): '-',
(2, 'London,', 'United', 'Kingdom'): '-',
(3, 'Johannesburg,', 'South', 'Africa'): '-',
(4, 'London,', 'United', 'Kingdom'): '-'},
'City': {(0, 'London,', 'United', 'Kingdom'): 'New',
(1, 'Johannesburg,', 'South', 'Africa'): 'London,',
(2, 'London,', 'United', 'Kingdom'): 'New',
(3, 'Johannesburg,', 'South', 'Africa'): 'London,',
(4, 'London,', 'United', 'Kingdom'): 'Singapore,'},
'Pair': {(0, 'London,', 'United', 'Kingdom'): 'York,',
(1, 'Johannesburg,', 'South', 'Africa'): 'United',
(2, 'London,', 'United', 'Kingdom'): 'York,',
(3, 'Johannesburg,', 'South', 'Africa'): 'United',
(4, 'London,', 'United', 'Kingdom'): 'Singapore'},
'Departure': {(0, 'London,', 'United', 'Kingdom'): 'United',
(1, 'Johannesburg,', 'South', 'Africa'): 'Ki...',
(2, 'London,', 'United', 'Kingdom'): 'United',
(3, 'Johannesburg,', 'South', 'Africa'): 'Ki...',
(4, 'London,', 'United', 'Kingdom'): 'SIN'},
'Code': {(0, 'London,', 'United', 'Kingdom'): 'Stat.',
(1, 'Johannesburg,', 'South', 'Africa'): 'JNB',
(2, 'London,', 'United', 'Kingdom'): 'Stat',
(3, 'Johannesburg,', 'South', 'Africa'): 'JNB',
(4, 'London,', 'United', 'Kingdom'): 'LHR'},
'Arrival': {(0, 'London,', 'United', 'Kingdom'): 'LHR',
(1, 'Johannesburg,', 'South', 'Africa'): 'LHR',
(2, 'London,', 'United', 'Kingdom'): 'LHR',
(3, 'Johannesburg,', 'South', 'Africa'): 'LHR',
(4, 'London,', 'United', 'Kingdom'): '1.3'},
'Code.1': {(0, 'London,', 'United', 'Kingdom'): 'JFK',
(1, 'Johannesburg,', 'South', 'Africa'): '-26.1',
(2, 'London,', 'United', 'Kingdom'): 'JFK',
(3, 'Johannesburg,', 'South', 'Africa'): '-26.1',
(4, 'London,', 'United', 'Kingdom'): '103.98'},
'Departure_lat': {(0, 'London,', 'United', 'Kingdom'): 51.5,
(1, 'Johannesburg,', 'South', 'Africa'): 28.23,
(2, 'London,', 'United', 'Kingdom'): 51.5,
(3, 'Johannesburg,', 'South', 'Africa'): 28.23,
(4, 'London,', 'United', 'Kingdom'): 51.47},
'Departure_lon': {(0, 'London,', 'United', 'Kingdom'): -0.45,
(1, 'Johannesburg,', 'South', 'Africa'): 51.47,
(2, 'London,', 'United', 'Kingdom'): -0.45,
(3, 'Johannesburg,', 'South', 'Africa'): 51.47,
(4, 'London,', 'United', 'Kingdom'): -0.45},
'Arrival_lat': {(0, 'London,', 'United', 'Kingdom'): 40.64,
(1, 'Johannesburg,', 'South', 'Africa'): -0.45,
(2, 'London,', 'United', 'Kingdom'): 40.64,
(3, 'Johannesburg,', 'South', 'Africa'): -0.45,
(4, 'London,', 'United', 'Kingdom'): np.nan},
'Arrival_lon': {(0, 'London,', 'United', 'Kingdom'): -73.79,
(1, 'Johannesburg,', 'South', 'Africa'): np.nan,
(2, 'London,', 'United', 'Kingdom'): -73.79,
(3, 'Johannesburg,', 'South', 'Africa'): np.nan,
(4, 'London,', 'United', 'Kingdom'): np.nan}})
You didn't provide data, so I made some up based on your question; just use the numpy versions of these functions on your columns.
import pandas as pd
import numpy as np
row = pd.Series({
"lat1": 25.2296756,
"lon1": 36.0122287,
"lat2": 51.406374,
"lon2": 20.9251681
})
df = pd.concat([row]*5, axis=1).T.apply(np.radians)
df["dlon"] = df.lon2 - df.lon1
df["dlat"] = df.lat2 - df.lat1
R = 6373
a = np.sin(df.dlat / 2)**2 + np.cos(df.lat1) * np.cos(df.lat2) * np.sin(df.dlon / 2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
df["distance"] = R*c
The resulting DataFrame looks like this:
lat1 lon1 lat2 lon2 dlon dlat distance
0 0.440341 0.628532 0.89721 0.365213 -0.263319 0.45687 3181.11039
1 0.440341 0.628532 0.89721 0.365213 -0.263319 0.45687 3181.11039
2 0.440341 0.628532 0.89721 0.365213 -0.263319 0.45687 3181.11039
3 0.440341 0.628532 0.89721 0.365213 -0.263319 0.45687 3181.11039
4 0.440341 0.628532 0.89721 0.365213 -0.263319 0.45687 3181.11039
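The same NumPy approach can be wrapped in a reusable function and pointed at the column names from the question's DataFrame. This is only a minimal sketch: the function name `haversine_km` and the two-row sample frame are made up for illustration, and it assumes the `Departure_*` and `Arrival_*` columns hold degrees.

```python
import numpy as np
import pandas as pd

R = 6373.0  # approximate radius of earth in km, as in the question


def haversine_km(lat1, lon1, lat2, lon2):
    """Vectorised haversine distance in km; inputs are degrees (scalars or Series)."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Hypothetical frame: the question's test point plus a row with missing arrival data.
df = pd.DataFrame({
    "Departure_lat": [25.2296756, 51.47],
    "Departure_lon": [36.0122287, -0.45],
    "Arrival_lat": [51.406374, np.nan],
    "Arrival_lon": [20.9251681, np.nan],
})
df["Distance"] = haversine_km(
    df["Departure_lat"], df["Departure_lon"], df["Arrival_lat"], df["Arrival_lon"]
)
print(df)
```

Because every step is a NumPy operation on whole columns, rows with missing coordinates simply come out as NaN instead of raising an error.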
You can make dlon, dlat, a and c temporary columns and calculate from there (or merge it all into one hard-to-read line).
Something like:
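import numpy as np
R = 6373.0  # approximate radius of earth in km
# math.sin/cos do not accept a Series, so use the numpy versions, and convert degrees to radians first
lat1, lat2 = np.radians(df['Departure_lat']), np.radians(df['Arrival_lat'])
df['dlon'] = np.radians(df['Arrival_lon'] - df['Departure_lon'])
df['dlat'] = lat2 - lat1
df['a'] = np.sin(df['dlat'] / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(df['dlon'] / 2)**2
df['c'] = 2 * np.arctan2(np.sqrt(df['a']), np.sqrt(1 - df['a']))
df['distance'] = R * df['c']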
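You can then .drop() all of those extra columns if you like, but this should create df['distance'] as a new column calculated for each row.
I wouldn't be surprised if I have a typo in that code, but hopefully you get the idea. Each df[xxx] = line creates a new column.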
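To make the temporary-column idea and the final clean-up concrete, here is a small sketch. The one-row frame is hypothetical (it reuses the question's test coordinates), the extra `lat1`/`lat2` helper columns are an addition for readability, and the NumPy functions are used because they operate on whole columns.

```python
import numpy as np
import pandas as pd

R = 6373.0  # approximate radius of earth in km

# Hypothetical one-row frame in degrees, using the question's test coordinates.
df = pd.DataFrame({
    "Departure_lat": [25.2296756],
    "Departure_lon": [36.0122287],
    "Arrival_lat": [51.406374],
    "Arrival_lon": [20.9251681],
})

# Temporary columns, all in radians.
df["lat1"] = np.radians(df["Departure_lat"])
df["lat2"] = np.radians(df["Arrival_lat"])
df["dlon"] = np.radians(df["Arrival_lon"] - df["Departure_lon"])
df["dlat"] = df["lat2"] - df["lat1"]
df["a"] = (np.sin(df["dlat"] / 2) ** 2
           + np.cos(df["lat1"]) * np.cos(df["lat2"]) * np.sin(df["dlon"] / 2) ** 2)
df["c"] = 2 * np.arctan2(np.sqrt(df["a"]), np.sqrt(1 - df["a"]))
df["distance"] = R * df["c"]

# Remove the intermediate columns once the result exists.
df = df.drop(columns=["lat1", "lat2", "dlon", "dlat", "a", "c"])
print(df.columns.tolist())
```

After the `drop`, only the four original coordinate columns and `distance` remain.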
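You can define a custom function for the distance calculation. Then use .apply() to call that function on each row and get the distance for each row.
1. Define a custom function for the distance calculation, as follows: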
import math
from math import sin, cos, sqrt, atan2, radians
def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    # approximate radius of earth in km
    R = 6373.0
    lat1 = radians(in_lat1)
    lon1 = radians(in_lon1)
    lat2 = radians(in_lat2)
    lon2 = radians(in_lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance
2. Use .apply() to call the function on each row and get the distance for each row, as follows:
df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)
Demo
Input DataFrame
City Departure_lat Departure_lon Arrival_lat Arrival_lon
0 CityName1 25.229676 36.012229 51.406374 20.925168
Output
City Departure_lat Departure_lon Arrival_lat Arrival_lon Distance
0 CityName1 25.229676 36.012229 51.406374 20.925168 3181.11039
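The DataFrame in the question has missing arrival coordinates (np.nan) in some rows. The sketch below, run on a hypothetical two-row frame, suggests that the `math` functions propagate NaN through the row-wise `.apply()` call rather than raising, so incomplete rows can be located afterwards with `isna()`.

```python
from math import sin, cos, sqrt, atan2, radians

import numpy as np
import pandas as pd


def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    # approximate radius of earth in km
    R = 6373.0
    lat1, lon1 = radians(in_lat1), radians(in_lon1)
    lat2, lon2 = radians(in_lat2), radians(in_lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))


# Hypothetical data: one complete row and one row with missing arrival coordinates.
df = pd.DataFrame({
    "Departure_lat": [25.2296756, 51.47],
    "Departure_lon": [36.0122287, -0.45],
    "Arrival_lat": [51.406374, np.nan],
    "Arrival_lon": [20.9251681, np.nan],
})

df["Distance"] = df.apply(
    lambda x: get_distance(x["Departure_lat"], x["Departure_lon"],
                           x["Arrival_lat"], x["Arrival_lon"]),
    axis=1,
)

# Rows whose distance could not be computed because a coordinate was missing.
incomplete = df[df["Distance"].isna()]
print(incomplete)
```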
You can put your calculation code in a function:
def calculate_distance(lat1, lon1, lat2, lon2):
    # approximate radius of earth in km
    R = 6373.0
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance
Then apply it to each row using a list comprehension:
df['distance'] = [calculate_distance(row.lat1, row.lon1, row.lat2, row.lon2) for row in df.itertuples() ]
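Since the question starts from a CSV file, the full round trip might look like the sketch below. It assumes the file has a header row with the column names lat1, lon1, lat2, lon2; an in-memory string stands in for the real file so the example is self-contained.

```python
import io
from math import sin, cos, sqrt, atan2, radians

import pandas as pd


def calculate_distance(lat1, lon1, lat2, lon2):
    # approximate radius of earth in km
    R = 6373.0
    lat1, lon1, lat2, lon2 = (radians(v) for v in (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))


# Stand-in for the CSV file; with a real file, pass its path to pd.read_csv instead.
csv_text = "lat1,lon1,lat2,lon2\n25.2296756,36.0122287,51.406374,20.9251681\n"
df = pd.read_csv(io.StringIO(csv_text))

# One distance per row, in row order.
df["Distance"] = [
    calculate_distance(row.lat1, row.lon1, row.lat2, row.lon2)
    for row in df.itertuples(index=False)
]

# Write the result back out as CSV (to a string here; a path works the same way).
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```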
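Depending on the format of your CSV data file, something like the following could be used.
Essentially, you need to turn the calculation into a callable function and then call it on every row of the data file, which can be imported into Python using the csv library.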
import csv # Added import for importing csv into python.
from math import sin, cos, sqrt, atan2, radians
# Import the data from the csv file (assumes there is no header row).
with open('data.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
# Approximate radius of earth in km.
R = 6373.0
# Create a distance calculation function.
def calculate_distance(lat1_d, lon1_d, lat2_d, lon2_d):
    # csv.reader yields strings, so convert to float, then from degrees to radians.
    lat1 = radians(float(lat1_d))
    lon1 = radians(float(lon1_d))
    lat2 = radians(float(lat2_d))
    lon2 = radians(float(lon2_d))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance
# Use list comprehension to run function on every data row.
distances = [calculate_distance(row[0], row[1], row[2], row[3]) for row in data]
# Append distance column to original array to create output.
output = [row + [distances[index]] for index, row in enumerate(data)]
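Note that row[0], row[1], row[2], row[3] refers to the order of the columns in the data array/CSV file. These may need to be reordered to match the input order of the function declaration, i.e. lat1_d, lon1_d, lat2_d, lon2_d.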
# Import the data from the csv file.
with open('data.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
These import parameters also need to be adjusted to match the format and name of your CSV file.
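If the CSV file has a header row, one way to avoid depending on column order is `csv.DictReader`, which looks columns up by name. The sketch below assumes the header names lat1, lon1, lat2, lon2 and uses an in-memory string in place of `open('data.csv', newline='')` so it can run on its own.

```python
import csv
import io
from math import sin, cos, sqrt, atan2, radians

# Approximate radius of earth in km.
R = 6373.0


def calculate_distance(lat1_d, lon1_d, lat2_d, lon2_d):
    # CSV values arrive as strings: convert to float, then degrees to radians.
    lat1, lon1, lat2, lon2 = (
        radians(float(v)) for v in (lat1_d, lon1_d, lat2_d, lon2_d)
    )
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return R * 2 * atan2(sqrt(a), sqrt(1 - a))


# Stand-in for the real file; the header names are an assumption.
csvfile = io.StringIO(
    "lat1,lon1,lat2,lon2\n25.2296756,36.0122287,51.406374,20.9251681\n"
)
reader = csv.DictReader(csvfile)
rows = list(reader)

# Add the new column to every row, looking the inputs up by name.
for row in rows:
    row["Distance"] = calculate_distance(
        row["lat1"], row["lon1"], row["lat2"], row["lon2"]
    )

# Write the rows back out with the extra column appended to the header.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(reader.fieldnames) + ["Distance"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```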