I'am writing on a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the.loc part takes most time for bigger datasets. I wonder if its possible to speed up thepoints.loc[idx, 'cluster'] = clusterNr somehow.
import numpy as npimport pandas as pdfrom sklearn.neighbors import NearestNeighborsdef getRandomCoordinates(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5): points = np.random.rand(samples, 2) * width #points = da.random.random(size=(samples, 2), chunks=(500, 500)) data = pd.DataFrame(points, columns=['lat', 'lng']) data.lat += offsetX data.lng += offsetY # set spatial properties data.columnX = 'lat' data.columnY = 'lng' return dataradius = 0.01points = getRandomCoordinates(25)samples = points.sample(10)tree = NearestNeighbors(n_neighbors=2, radius=0.1, leaf_size=30, algorithm="ball_tree", n_jobs=1).fit(points)nngrf = tree.radius_neighbors_graph(samples, radius, mode='connectivity').toarray().astype(np.bool)points['cluster'] = -1for clusterNr, idx in enumerate(nngrf): points.loc[idx, 'cluster'] = clusterNrInput data:
lng lat0 12.988426 52.3433611 13.055824 52.3964622 13.353571 52.3474573 12.980915 52.3390214 13.232137 52.3391555 12.877804 52.3859266 13.220915 52.3789517 13.479688 52.4244558 13.324399 52.6375309 13.052958 52.39808410 13.087653 52.41306411 13.330557 52.63788312 13.354927 52.38004013 13.163061 52.51444514 13.371755 52.52066515 13.698472 52.38939716 13.405825 52.50775717 13.239793 52.39134118 13.369102 52.52512219 13.322234 52.51145320 13.326276 52.51504521 13.318642 52.29628322 13.411129 52.47850923 13.207719 52.28384424 13.222899 52.381747and the result:
lng lat cluster0 12.988426 52.343361 91 13.055824 52.396462 62 13.353571 52.347457 -13 12.980915 52.339021 94 13.232137 52.339155 45 12.877804 52.385926 -16 13.220915 52.378951 77 13.479688 52.424455 -18 13.324399 52.637530 -19 13.052958 52.398084 610 13.087653 52.413064 511 13.330557 52.637883 -112 13.354927 52.380040 013 13.163061 52.514445 -114 13.371755 52.520665 215 13.698472 52.389397 -116 13.405825 52.507757 117 13.239793 52.391341 -118 13.369102 52.525122 219 13.322234 52.511453 820 13.326276 52.515045 821 13.318642 52.296283 -122 13.411129 52.478509 -123 13.207719 52.283844 324 13.222899 52.381747 7and the nearest neighbors graph:
[[False False False False False False False False False False False False False False True False False False True False False False False False False] [False False True False False False False False False False False False False False False False False False False False False False False False False] [False False False False False False False False False False False False False False False False False False False False False False True False False] [False False False False False False False False False False False False True False False False False False False False False False False False False] [False False False False False False True False False False False False False False False False False False False False False False False False True] [False False False False False False False False False False False False False False False False False False False True True False False False False] [False False False False False True False False False False False False False False False False False False False False False False False False False] [False False False False False False False False False False False False False False False True False False False False False False False False False] [False False False False False False False False False False False False False False False False False False False False False False False True False] [False False False False False False False False False False False False False False False False False False False True True False False False False]]1 Answer1
Assignments to Pandas dataframes always work better when doing entire columns at once, instead of filling a single row or a small number of rows at a time. In the code below I first completely define a NumPy array of cluster numbers and only after it is completely defined do I pass it into the Pandas DataFrame.
Even aside from the speed issue, your
forloop where you assign cluster numbers is very confusing because of a misuse ofenumerate: you haveclusterNr, idx in enumerate(nngrf)but the idiom is alwaysidx, value in enumerate(values). You have the position of the index and the value switched. In your particular case this works and is a nice trick but you should document how this works in your code if anyone besides you is ever meant to read it.I changed variable names to bePEP8 compliant and to be easier to read. I also added some comments,docstrings, and formatting. Particular changes of note were defining
n_pointsandn_samplesas variables, wrapping the code you wrote into a function (so that I could line profile it), makingRADIUSa variable defined in all-caps (which in Python suggests it is a constant), and defining the radius in the call to.radius_neighbors_graphto be in terms ofRADIUSrather than just be another magic number hard-coded into the code. I think most of these changes improve your code's readability and make it more in line with style guidelines for Python.Minor point: coercing to boolean with
.astype('bool')works on SciPy sparse CSR matrices, and so doing the coercionbefore converting to a non-sparse NumPy array with.toarray()should be slightly faster and use less memory than doing things the other way around -- no time is wasted on converting zeroes.import numpy as npimport pandas as pdfrom sklearn.neighbors import NearestNeighbors%load_ext line_profilerdef knn_clusters(n_points, n_samples): """ Finds the which of n_samples points is closest to each of n_points randomly defined points. """ RADIUS = 0.01 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5): """Generate random coordinates in a pandas dataframe""" points = np.random.rand(samples, 2) * width data = pd.DataFrame(points, columns=['latitude', 'longitude']) data.latitude += offsetX data.longitude += offsetY # set spatial properties data.columnX = 'latitude' data.columnY = 'longitude' return data # some random points points = get_random_coords(n_points) # a subset of those points samples = points.sample(n_samples) # KNN tree = NearestNeighbors(n_neighbors=2, radius=RADIUS, leaf_size=30, algorithm="ball_tree", n_jobs=1, ).fit(points) # nn_graph = tree.radius_neighbors_graph(samples, radius = 10 * RADIUS, mode = 'connectivity', ).astype('bool').toarray() # faster assigment to pandas dataframes: entire columns at once cluster_number, point_number = np.where(nn_graph) cluster_assignments = -np.ones(n_points) cluster_assignments[point_number] = cluster_number points.loc[:, 'cluster'] = cluster_assignments # for comparison: assignment by specific rows is slow points['cluster_alt'] = -1 for clusterNr, idx in enumerate(nn_graph): points.loc[idx, 'cluster_alt'] = clusterNr # ensure equality of both approaches assert np.all(np.equal(points.cluster, points.cluster_alt)) return points, nn_graph
The results of line profiling with%lprun -f knn_clusters knn_clusters(10000, 1000) were:
Timer unit: 1e-06 sTotal time: 0.999948 sFile: <ipython-input-76-7be6bc3ea686>Function: knn_clusters at line 6Line # Hits Time Per Hit % Time Line Contents============================================================== 6 def knn_clusters(n_points, n_samples): 7 """ 8 Finds the which of n_samples points is closest to each of n_points randomly defined points. 9 """ 10 1 2 2.0 0.0 RADIUS = 0.01 11 12 1 2 2.0 0.0 def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5): 13 """Generate random coordinates in a pandas dataframe""" 14 points = np.random.rand(samples, 2) * width 15 data = pd.DataFrame(points, columns=['latitude', 'longitude']) 16 data.latitude += offsetX 17 data.longitude += offsetY 18 19 # set spatial properties 20 data.columnX = 'latitude' 21 data.columnY = 'longitude' 22 23 return data 24 25 # some random points 26 1 3710 3710.0 0.4 points = get_random_coords(n_points) 27 28 # a subset of those points 29 1 7687 7687.0 0.8 samples = points.sample(n_samples) 30 31 # KNN 32 1 4 4.0 0.0 tree = NearestNeighbors(n_neighbors=2, 33 1 2 2.0 0.0 radius=RADIUS, 34 1 2 2.0 0.0 leaf_size=30, 35 1 2 2.0 0.0 algorithm="ball_tree", 36 1 275 275.0 0.0 n_jobs=1, 37 1 6029 6029.0 0.6 ).fit(points) 38 39 # 40 1 7 7.0 0.0 nn_graph = tree.radius_neighbors_graph(samples, 41 1 4 4.0 0.0 radius=10*RADIUS, 42 1 90594 90594.0 9.1 mode='connectivity', 43 1 36823 36823.0 3.7 ).astype('bool').toarray() 44 45 # faster assigment to pandas dataframes: entire columns at once 46 1 73476 73476.0 7.3 cluster_number, point_number = np.where(nn_graph) 47 1 74 74.0 0.0 cluster_assignments = -np.ones(n_points) 48 1 11675 11675.0 1.2 cluster_assignments[point_number] = cluster_number 49 1 2548 2548.0 0.3 points.loc[:, 'cluster'] = cluster_assignments 50 51 # for comparison: assignment by specific rows is slow 52 1 889 889.0 0.1 points['cluster_alt'] = -1 53 1001 2464 2.5 0.2 for clusterNr, idx in enumerate(nn_graph): 54 1000 763337 763.3 76.3 points.loc[idx, 'cluster_alt'] = clusterNr 55 56 # ensure equality of both approaches 57 1 341 341.0 0.0 assert np.all(np.equal(points.cluster, points.cluster_alt)) 58 59 1 1 1.0 0.0 return points, nn_graphThus assigning an entire column at once was just under 10x faster.
You mustlog in to answer this question.
Explore related questions
See similar questions with these tags.
