
I have CSVs of property transactions in the UK from 1995 to 2017, split by year (e.g. "RS2015.csv"). I have a second CSV with a list of wind turbines in the UK. Both have coordinates in WGS 1984.

The intention of my project is to find the nearest wind turbine to each property transaction. The output file should look like:

ID, pc_lat, pc_long, Ref. Number, Lat, Lng, NWindFarm_DistKm

where the first three columns come from the Property Transactions sample linked below, and the four columns on the right correspond to the wind turbine nearest to the property transaction's coordinates.

`NWindFarm_DistKm` is my (admittedly clunky) name for the distance to the nearest turbine.

How should I speed up my Python code?


proxAddDataWindPoI.py

```python
import pandas as pd
from proxWindPoI import *

def addWindData(csv1, csv2):
    df1 = pd.read_csv(csv1)
    PoI = pd.read_csv("Wind/PoI_RUK.csv")
    df1['linenumber'] = df1.index
    total = len(df1)
    df1['completion'] = (df1.linenumber / total) * 100
    print("Working out nearest Wind Farm site")
    df1['Nearest Wind Turbine'] = df1.apply(
        lambda row: nearestWindFarm(PoI, row['pc_lat'], row['pc_long'], row['completion']),
        axis=1)
    print("Merging with Wind Site Information")
    df2 = pd.merge(df1, PoI, left_on='Nearest Wind Turbine', right_on='Ref. Number', how='left')
    print("Working Out Nearest Distance")
    df2['NWindFarm_DistKm'] = df2.apply(
        lambda row: distanceBetweenCm(row['pc_lat'], row['pc_long'], row['Lat'], row['Lng']),
        axis=1)
    df2.to_csv(csv2, index=None, encoding="utf-8")
```

proxWindPoI.py

```python
import math
import pandas as pd
import numpy as np

def distanceBetweenCm(lat1, lon1, lat2, lon2):
    """
    https://stackoverflow.com/questions/44910530/
    how-to-find-the-distance-between-2-points-in-2-different-dataframes-in-pandas/44910693#44910693
    Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula
    """
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat / 2) ** 2 + math.sin(dLon / 2) ** 2 * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return c * 6371  # multiply by 100k to get distance in cm

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

def nearestWindFarm(wind, lat, long, pct):
    distances = wind.apply(
        lambda row: haversine(lat, long, row['Lat'], row['Lng']),
        axis=1)
    print("Currently completed: " + str(pct) + "%")
    return wind.loc[distances.idxmin(), 'Ref. Number']
```

WindPoIEffect.py

```python
from proxAddDataWindPoI import *
import pandas as pd

def nearestBiomass(startyr, endyr):
    """
    This loop goes through CSVs by year and adds data from the nearest site.
    NEED TO EDIT CSV NAME
    """
    for year in range(startyr, endyr):
        print("Year: " + str(year))
        csv1 = "RSRaw/RSPytAnyP2" + str(year) + ".csv"
        csv2 = "Wind/RS_PoI/RSPoIP2" + str(year) + ".csv"
        print("Input CSV: " + str(csv1))
        addWindData(csv1, csv2)

def withinXkm(csv1, csv2, buffer):
    """
    Keeps only rows where the distance between site and property
    is less than a specified buffer.
    """
    df = pd.read_csv(csv1)
    df = df[df.NWindFarm_DistKm <= buffer]
    print(df.NWindFarm_DistKm)
    df.to_csv(csv2, index=None, encoding='utf-8')

def treatmentGroup(year, buffer):
    """
    The full process of determining the treatment group for a given year of
    Repeat Sales. Need to specify a buffer.
    """
    print("Starting Treatment Group Identification for Year: " + str(year))
    nearestBiomass(year, year + 1)
    csv1 = "Wind/RS_PoI/RSPoIP2" + str(year) + ".csv"
    csv2 = "Wind/TG_PoI/RSWindPoI_TGP2" + str(year) + ".csv"
    print("Extracting Repeat Sales within a buffer of " + str(buffer) + " from " + str(csv1) + " to " + str(csv2))
    withinXkm(csv1, csv2, buffer)

treatmentGroup(2004, 2)
```

Brief description of how it is supposed to work:

  1. `nearestBiomass` takes an input CSV, `"RSRaw/RSPytAnyP2" + str(year) + ".csv"`, a sample of which is linked at the bottom of this question. It calls the function `addWindData` from `proxAddDataWindPoI.py`.
  2. `addWindData` creates a pandas dataframe `df1` from the input CSV, and also reads a second CSV via `PoI = pd.read_csv("Wind/PoI_RUK.csv")` into a second dataframe.
  3. `addWindData` creates `linenumber` and `completion` columns purely to show me how far along I am in the function `nearestWindFarm`.
  4. `addWindData` applies a function to create `df1['Nearest Wind Turbine']`. This function (`nearestWindFarm`) is imported from `proxWindPoI.py`.
  5. `nearestWindFarm` applies the haversine formula to find the distances to all wind turbines in the wind dataframe created in step 2, and returns the site name of the turbine at the minimum distance. This is the really slow part of my code; the rest is fine.

There are some things I might change, but I don't know whether they would be quicker:

  1. Instead of `from proxWindPoI import *`, use `from proxWindPoI import nearestWindFarm`, since importing all of a module is probably slower than importing a specific function.
  2. I've heard that `df.apply` is a slow way of applying a function in pandas.
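On point 2: the row-wise `apply` over the turbine dataframe can be replaced by a single NumPy pass per property, which avoids the per-row Python overhead. A minimal sketch of that idea, assuming the `Lat`, `Lng` and `Ref. Number` columns from the wind CSV described above:

```python
import numpy as np
import pandas as pd

def nearest_turbine_vectorised(wind, lat, lon, earth_radius=6371):
    """Return (Ref. Number, distance in km) of the turbine nearest to (lat, lon).

    Computes haversine distances to every turbine in one vectorised pass
    instead of a row-wise DataFrame.apply.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2 = np.radians(wind['Lat'].to_numpy())
    lon2 = np.radians(wind['Lng'].to_numpy())
    # Haversine formula, evaluated over all turbines at once
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    dist = earth_radius * 2 * np.arcsin(np.sqrt(a))
    return wind['Ref. Number'].iloc[dist.argmin()], dist.min()
```

This keeps the same inputs and output as `nearestWindFarm`, so it could drop in as a replacement; it still scans every turbine per property, so it improves the constant factor rather than the asymptotic cost.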

Does anyone have any advice on speeding up this code? Here are links to subsets of my original CSVs:

Wind Turbines Sample

Property Transactions Sample

asked Jul 23, 2017 at 11:48 by CTaylor19

1 Answer


The code in the post computes the distance between every transaction and every wind farm. This means that if there are \$n\$ transactions and \$m\$ wind farms, then the overall runtime is \$\Omega(nm)\$.

The runtime can be reduced to \$O(n\log m)\$ by using a spatial lookup structure such as a k-d tree. The idea is to insert the locations of the wind farms into the k-d tree, and then for each transaction, to query the tree, taking \$O(\log m)\$.

For this problem I would try the `sklearn.neighbors.KDTree` class (from the scikit-learn package) and the `'haversine'` distance metric.
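A sketch of this approach, assuming the column names from the question's CSVs. Note that scikit-learn's `KDTree` does not accept the haversine metric (as the comments below found), so this uses the closely related `sklearn.neighbors.BallTree`, which does:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371

def add_nearest_turbine(transactions, turbines):
    """Attach the nearest turbine's Ref. Number and distance (km) to each transaction.

    Column names (pc_lat, pc_long, Lat, Lng, 'Ref. Number') follow the
    question's CSVs.
    """
    # The haversine metric expects (lat, lon) pairs in radians.
    tree = BallTree(np.radians(turbines[['Lat', 'Lng']].to_numpy()),
                    metric='haversine')
    query = np.radians(transactions[['pc_lat', 'pc_long']].to_numpy())
    # One query per transaction: O(n log m) overall instead of O(nm).
    dist, idx = tree.query(query, k=1)

    result = transactions.copy()
    result['Ref. Number'] = turbines['Ref. Number'].to_numpy()[idx[:, 0]]
    # Distances come back in radians on the unit sphere; scale to km.
    result['NWindFarm_DistKm'] = dist[:, 0] * EARTH_RADIUS_KM
    return result
```

The tree is built once per turbine CSV, so for the looped per-year processing in `nearestBiomass` it could be constructed outside the loop and reused.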

answered Jul 23, 2017 at 16:16 by Gareth Rees
  • Hi, just tried this. It works well; however, the haversine distance metric isn't valid for the `KDTree` method. – Commented Jul 25, 2017 at 15:32
  • Sorted it now! – Commented Jul 25, 2017 at 15:39
