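Applying a function to every observation in a dataframe
I have a large df of coordinates that I am feeding through a function (a reverse geocoder).
How can I run it over the entire df without iterating (which takes a very long time)?
Example df: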
Latitude Longitude
0 -25.66026 28.0914
1 -25.67923 28.10525
2 -30.68456 19.21694
3 -30.12345 22.34256
4 -15.12546 17.12365
After running the function (without a for loop...), I want a df like this:
City
0 HappyPlace
1 SadPlace
2 AveragePlace
3 CoolPlace
4 BadPlace
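Note: I do not need to know how to do reverse geocoding. This is a question about applying a function to an entire df without iterating.
Edit:
Using df.apply() might not work, because my code looks like this: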
for i in range(len(df)):
    results = g.reverse_geocode(df['LATITUDE'][i], df['LONGITUDE'][i])
    city.append(results.city)
The slower approach loops over the list of geo points and fetches the city for each one in turn:
import pandas as pd
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

# example method of g.reverse_geocode() -> geo_reverse
def geo_reverse(lat, long):
    # assuming that your reverse_geocode will take 2 seconds
    time.sleep(2)
    print(lat, long)

# each call blocks until the previous one has finished
for i in range(len(df)):
    results = geo_reverse(df['Latitude'][i], df['Longitude'][i])
Because of time.sleep(2), the program above takes at least 20 seconds to process all ten geo points.
A better approach than the one above:
import pandas as pd
import threading
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

def runnable_method(f, args):
    # result_info[0] signals completion, result_info[1] holds the return value
    result_info = [threading.Event(), None]
    def runit():
        result_info[1] = f(args)
        result_info[0].set()
    threading.Thread(target=runit).start()
    return result_info

def gather_results(result_infos):
    # wait for every thread in submission order, so results keep the input order
    results = []
    for i in range(len(result_infos)):
        result_infos[i][0].wait()
        results.append(result_infos[i][1])
    return results

def geo_reverse(args):
    # stand-in for g.reverse_geocode(); assumed to take 2 seconds
    time.sleep(2)
    return "City Name of ("+str(args[0])+","+str(args[1])+")"

geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)

result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
cities_result = gather_results(result_info)
print(cities_result)
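Note that geo_reverse takes 2 seconds to fetch the data for a geo point. In the second example the calls run concurrently in separate threads, so the code needs about 2 seconds for all ten points instead of 2 seconds per point.
Note: Try both approaches, assuming your geo_reverse takes approximately 2 seconds to fetch data. The first approach takes about 20+1 seconds, and its processing time grows with the number of inputs. The second approach takes roughly 2+1 seconds and stays close to that as you add geo points, as long as the work is mostly waiting (as with a network call) and the number of threads stays manageable.
Treat g.reverse_geocode() as the geo_reverse() method in the code above. Run the two programs above separately and see the difference for yourself.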
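To check the timing claim without waiting 20 seconds, the comparison can be sketched with a shorter delay. This is a minimal illustration, not the original code: DELAY, run_sequential and run_threaded are names introduced here, and time.sleep stands in for the real reverse geocoder.

```python
import threading
import time

DELAY = 0.2  # assumed stand-in for the 2-second lookup, shortened for the demo

def geo_reverse(args):
    # Placeholder for g.reverse_geocode(); only simulates waiting on a service.
    time.sleep(DELAY)
    return "City Name of (" + str(args[0]) + "," + str(args[1]) + ")"

geo_points = [
    (-25.66026, 28.0914),
    (-25.67923, 28.10525),
    (-30.68456, 19.21694),
    (-30.12345, 22.34256),
    (-15.12546, 17.12365),
]

def run_sequential(points):
    # One call after another: total time is about len(points) * DELAY.
    return [geo_reverse(p) for p in points]

def run_threaded(points):
    # One thread per point: the waits overlap, so total time is about DELAY.
    results = [None] * len(points)

    def worker(index, point):
        results[index] = geo_reverse(point)

    threads = [threading.Thread(target=worker, args=(i, p))
               for i, p in enumerate(points)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

start = time.perf_counter()
sequential_results = run_sequential(geo_points)
sequential_elapsed = time.perf_counter() - start

start = time.perf_counter()
threaded_results = run_threaded(geo_points)
threaded_elapsed = time.perf_counter() - start

print("sequential: %.2f s, threaded: %.2f s" % (sequential_elapsed, threaded_elapsed))
```

The gain comes from overlapping the waiting time. A function that spends its time computing in pure Python rather than waiting would not speed up this way with threads.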
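Explanation:
Look at the main part of the code above: it builds a list of tuples, and a list comprehension then passes each tuple to a dynamically created thread: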
# Convert the df of geo points into a list of tuples
geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)
# List comprehension that starts one thread per geo point
result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
# Gather the result from each thread
cities_result = gather_results(result_info)
print(cities_result)
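For a large df, starting one thread per row can become unwieldy, and a real geocoding service may limit how many requests it accepts at once. As an alternative to the hand-rolled helpers above, the same idea can be sketched with the standard library's concurrent.futures.ThreadPoolExecutor, which caps the number of threads. The function reverse_geocode_all and the max_workers value of 8 are assumptions made for this sketch, and geo_reverse is again a stand-in for g.reverse_geocode().

```python
from concurrent.futures import ThreadPoolExecutor
import time

def geo_reverse(lat, lon):
    # Hypothetical stand-in for g.reverse_geocode(lat, lon).city
    time.sleep(0.1)
    return "City Name of (" + str(lat) + "," + str(lon) + ")"

def reverse_geocode_all(latitudes, longitudes, max_workers=8):
    # executor.map calls geo_reverse(lat, lon) for each pair and returns
    # the results in input order, using at most max_workers threads.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(geo_reverse, latitudes, longitudes))

latitudes = [-25.66026, -25.67923, -30.68456, -30.12345, -15.12546]
longitudes = [28.0914, 28.10525, 19.21694, 22.34256, 17.12365]

cities = reverse_geocode_all(latitudes, longitudes)
print(cities)
```

Because the results come back in input order, they can be assigned straight to a new column. With the df from the examples above, that would look like df['City'] = reverse_geocode_all(df['Latitude'], df['Longitude']).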