I have got CSVs full of property transactions in the UK from 1995 to 2017, separated by year such as "RS2015.csv". I have a 2nd CSV with a list of wind turbines in the UK. Both have coordinates in WGS 1984.
The intention of my project is to find the nearest wind turbine to each property transaction. The output file should look like:
ID, pc_lat, pc_long, Ref. Number, Lat, Lng, NWindFarm_DistKmWhere the first 3 columns are from the Property Transactions Sample above. The 4 columns on the right should correspond to the nearest wind turbine to the property transaction coordinates.
NWindFarm_DistKm is my disorganised term for the distance to the nearest turbine.
How should I speed up my Python code?
proxAddDataWindPoI.py
import pandas as pdfrom proxWindPoI import *def addWindData(csv1, csv2): df1 = pd.read_csv(csv1) PoI = pd.read_csv("Wind/PoI_RUK.csv") df1['linenumber'] = df1.index total = len(df1) df1['completion'] = (df1.linenumber/total)*100 print("Working out nearest Wind Farm site") df1['Nearest Wind Turbine'] = df1.apply( lambda row: nearestWindFarm(PoI, row['pc_lat'], row['pc_long'],row['completion']), axis = 1) print("Merging with Wind Site Information") df2 = pd.merge(df1,PoI,left_on='Nearest Wind Turbine', right_on='Ref. Number',how='left') print("Working Out Nearest Distance") df2['NWindFarm_DistKm'] = df2.apply( lambda row: distanceBetweenCm(row['pc_lat'], row['pc_long'],row['Lat'], row['Lng']), axis = 1) df2.to_csv(csv2, index=None, encoding="utf-8")proxWindPoI.py
import mathimport pandas as pdimport numpy as npdef distanceBetweenCm(lat1, lon1, lat2, lon2): """ https://stackoverflow.com/questions/44910530/ how-to-find-the-distance-between-2-points-in-2-different-dataframes-in-pandas/44910693#44910693 Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula """ dLat = math.radians(lat2 - lat1) dLon = math.radians(lon2 - lon1) lat1 = math.radians(lat1) lat2 = math.radians(lat2) a = math.sin(dLat / 2) ** 2 + math.sin(dLon / 2) **2 * math.cos(lat1) * math.cos(lat2) c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)) return c * 6371 #multiply by 100k to get distance in cmdef haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371): """ slightly modified version: of http://stackoverflow.com/a/29546836/2901002 Calculate the great circle distance between two points on the earth (specified in decimal degrees or in radians) All (lat, lon) coordinates must have numeric dtypes and be of equal length. """ if to_radians: lat1, lon1, lat2, lon2 = pd.np.radians([lat1, lon1, lat2, lon2]) a = pd.np.sin((lat2-lat1)/2.0)**2 + \ pd.np.cos(lat1) * pd.np.cos(lat2) * pd.np.sin((lon2-lon1)/2.0)**2 return earth_radius * 2 * pd.np.arcsin(np.sqrt(a))def nearestWindFarm(wind, lat, long, pct): distances = wind.apply( lambda row: haversine(lat, long, row['Lat'], row['Lng']), axis = 1) print("Currently completed: "+str(pct)+"%") return wind.loc[distances.idxmin(),'Ref. Number']WindPoIEffect.py
from proxAddDataWindPoI import *import pandas as pddef nearestBiomass(startyr, endyr): """ This loop goes through csv's by year + adds data from nearest site NEED TO EDIT CSV NAME """ for i in range(startyr, endyr, 1): year = i print("Year: "+str(year)) csv1 = "RSRaw/RSPytAnyP2"+str(year)+".csv" csv2 = "Wind/RS_PoI/RSPoIP2"+str(year)+".csv" print("Input CSV: "+str(csv1)) addWindData(csv1, csv2)def withinXkm(csv1, csv2, buffer): """ Function which only keeps rows where the distance between site and property is less than a specified buffer """ df = pd.read_csv(csv1) df = df[df.NWindFarm_DistKm <= buffer] print(df.NWindFarm_DistKm) df.to_csv(csv2,index=None,encoding='utf-8')def treatmentGroup(year, buffer): """ The full process of determining treatment group for a certain year of Repeat Sales. Need to specify buffer """ print("Starting Treatment Group Identification for Year: "+str(year)) nearestBiomass(year, year+1) csv1 = "Wind/RS_PoI/RSPoIP2"+str(year)+".csv" csv2 = "Wind/TG_PoI/RSWindPoI_TGP2"+str(year)+".csv" print("Extracting Repeat Sales within a buffer of "+str(buffer)+" from "+str(csv1) + " to "+str(csv2)) withinXkm(csv1, csv2, buffer)treatmentGroup(2004, 2)Brief description of how it is supposed to work:
def nearestBiomasshas an input CSV"RSRaw/RSPytAnyP2"+str(year)+".csv"which you can see a sample of at the bottom of this question. It enacts the functionaddWindDatafromproxAddDataWindPoI.pyaddWindDatacreates a pandas dataframedf1from the input CSV + also reads in a 2nd CSV fromPoI = pd.read_csv("Wind/PoI_RUK.csv")as a 2nd dataframe.addWindDatacreates alinenumber+completioncolumn purely for showing me how far along I am in the functionnearestWindFarmaddWindDataapplies a function to createdf1['Nearest Wind Turbine']. This function (nearestWindFarm) is imported fromproxWindPoI.pynearestWindFarmapplies a haversine formula, to find the distances from all wind turbines in the wind dataframe as created in the 2. bullet point. The function returns the Site Name when the distance is at a minimum. This part of my code is the really slow bit - the rest is fine.
There are some things which I might change but I don't know if it's quicker or not:
- instead of
from proxWindPoI import *, havefrom proxWindPoI import nearestWindFarmsince importing all of a module is probably slower than importing specific function - I've heard than
df.applyis a slow way of applying a function inpandas.
I don't suppose anyone has any advice with regards to speeding up my code. Here are two links of subsets from my original CSVs.
1 Answer1
The code in the post computes the distance between every transaction and every wind farm. This means that if there are \$n\$ transactions and \$m\$ wind farms, then the overall runtime is \$\Omega(nm)\$.
The runtime can be reduced to \$O(n\log m)\$ by using aspatial lookup structure such as ak-d tree. The idea is to insert the locations of the wind farms into thek-d tree, and then for each transaction, to query the tree, taking \$O(\log m)\$.
For this problem I would try thesklearn.neighbors.KDTree class (from thescikit-learn package) and the'haversine' distance metric.
- \$\begingroup\$Hi, just tried this. Works well however the haversine distance metric isn't valid for KDTree method\$\endgroup\$CTaylor19– CTaylor192017-07-25 15:32:49 +00:00CommentedJul 25, 2017 at 15:32
- \$\begingroup\$Sorted it now! t\$\endgroup\$CTaylor19– CTaylor192017-07-25 15:39:19 +00:00CommentedJul 25, 2017 at 15:39
You mustlog in to answer this question.
Explore related questions
See similar questions with these tags.

