PANDAS spatial clustering

Question 1

I'am writing on a spatial clustering algorithm using pandas and scipy's kdtree. I profiled the code and the.loc part takes most time for bigger datasets. I wonder if its possible to speed up thepoints.loc[idx, 'cluster'] = clusterNr somehow.

import numpy as npimport pandas as pdfrom sklearn.neighbors import NearestNeighborsdef getRandomCoordinates(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):    points = np.random.rand(samples, 2) * width    #points = da.random.random(size=(samples, 2), chunks=(500, 500))    data = pd.DataFrame(points, columns=['lat', 'lng'])    data.lat += offsetX    data.lng += offsetY    # set spatial properties    data.columnX = 'lat'    data.columnY = 'lng'    return dataradius = 0.01points = getRandomCoordinates(25)samples = points.sample(10)tree = NearestNeighbors(n_neighbors=2, radius=0.1, leaf_size=30, algorithm="ball_tree", n_jobs=1).fit(points)nngrf = tree.radius_neighbors_graph(samples, radius, mode='connectivity').toarray().astype(np.bool)points['cluster'] = -1for clusterNr, idx in enumerate(nngrf):    points.loc[idx, 'cluster'] = clusterNr

Input data:

          lng        lat0   12.988426  52.3433611   13.055824  52.3964622   13.353571  52.3474573   12.980915  52.3390214   13.232137  52.3391555   12.877804  52.3859266   13.220915  52.3789517   13.479688  52.4244558   13.324399  52.6375309   13.052958  52.39808410  13.087653  52.41306411  13.330557  52.63788312  13.354927  52.38004013  13.163061  52.51444514  13.371755  52.52066515  13.698472  52.38939716  13.405825  52.50775717  13.239793  52.39134118  13.369102  52.52512219  13.322234  52.51145320  13.326276  52.51504521  13.318642  52.29628322  13.411129  52.47850923  13.207719  52.28384424  13.222899  52.381747

and the result:

          lng        lat  cluster0   12.988426  52.343361        91   13.055824  52.396462        62   13.353571  52.347457       -13   12.980915  52.339021        94   13.232137  52.339155        45   12.877804  52.385926       -16   13.220915  52.378951        77   13.479688  52.424455       -18   13.324399  52.637530       -19   13.052958  52.398084        610  13.087653  52.413064        511  13.330557  52.637883       -112  13.354927  52.380040        013  13.163061  52.514445       -114  13.371755  52.520665        215  13.698472  52.389397       -116  13.405825  52.507757        117  13.239793  52.391341       -118  13.369102  52.525122        219  13.322234  52.511453        820  13.326276  52.515045        821  13.318642  52.296283       -122  13.411129  52.478509       -123  13.207719  52.283844        324  13.222899  52.381747        7

and the nearest neighbors graph:

[[False False False False False False False False False False False False  False False  True False False False  True False False False False False  False] [False False  True False False False False False False False False False  False False False False False False False False False False False False  False] [False False False False False False False False False False False False  False False False False False False False False False False  True False  False] [False False False False False False False False False False False False   True False False False False False False False False False False False  False] [False False False False False False  True False False False False False  False False False False False False False False False False False False   True] [False False False False False False False False False False False False  False False False False False False False  True  True False False False  False] [False False False False False  True False False False False False False  False False False False False False False False False False False False  False] [False False False False False False False False False False False False  False False False  True False False False False False False False False  False] [False False False False False False False False False False False False  False False False False False False False False False False False  True  False] [False False False False False False False False False False False False  False False False False False False False  True  True False False False  False]]

Question 2

Assignments to Pandas dataframes always work better when doing entire columns at once, instead of filling a single row or a small number of rows at a time. In the code below I first completely define a NumPy array of cluster numbers and only after it is completely defined do I pass it into the Pandas DataFrame.
Even aside from the speed issue, yourfor loop where you assign cluster numbers is very confusing because of a misuse ofenumerate: you haveclusterNr, idx in enumerate(nngrf) but the idiom is alwaysidx, value in enumerate(values). You have the position of the index and the value switched. In your particular case this works and is a nice trick but you should document how this works in your code if anyone besides you is ever meant to read it.
I changed variable names to bePEP8 compliant and to be easier to read. I also added some comments,docstrings, and formatting. Particular changes of note were definingn_points andn_samples as variables, wrapping the code you wrote into a function (so that I could line profile it), makingRADIUS a variable defined in all-caps (which in Python suggests it is a constant), and defining the radius in the call to.radius_neighbors_graph to be in terms ofRADIUS rather than just be another magic number hard-coded into the code. I think most of these changes improve your code's readability and make it more in line with style guidelines for Python.

Minor point: coercing to boolean with.astype('bool') works on SciPy sparse CSR matrices, and so doing the coercionbefore converting to a non-sparse NumPy array with.toarray() should be slightly faster and use less memory than doing things the other way around -- no time is wasted on converting zeroes.

import numpy as npimport pandas as pdfrom sklearn.neighbors import NearestNeighbors%load_ext line_profilerdef knn_clusters(n_points, n_samples):    """    Finds the which of n_samples points is closest to each of n_points randomly defined points.    """    RADIUS = 0.01    def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):        """Generate random coordinates in a pandas dataframe"""        points = np.random.rand(samples, 2) * width        data = pd.DataFrame(points, columns=['latitude', 'longitude'])        data.latitude += offsetX        data.longitude += offsetY        # set spatial properties        data.columnX = 'latitude'        data.columnY = 'longitude'        return data    # some random points    points = get_random_coords(n_points)    # a subset of those points    samples = points.sample(n_samples)    # KNN    tree = NearestNeighbors(n_neighbors=2,                             radius=RADIUS,                             leaf_size=30,                             algorithm="ball_tree",                             n_jobs=1,                           ).fit(points)    #     nn_graph = tree.radius_neighbors_graph(samples,                                            radius = 10 * RADIUS,                                            mode = 'connectivity',                                          ).astype('bool').toarray()    # faster assigment to pandas dataframes: entire columns at once    cluster_number, point_number = np.where(nn_graph)    cluster_assignments = -np.ones(n_points)    cluster_assignments[point_number] = cluster_number    points.loc[:, 'cluster'] = cluster_assignments    # for comparison: assignment by specific rows is slow    points['cluster_alt'] = -1    for clusterNr, idx in enumerate(nn_graph):        points.loc[idx, 'cluster_alt'] = clusterNr    # ensure equality of both approaches    assert np.all(np.equal(points.cluster, points.cluster_alt))    return points, nn_graph

The results of line profiling with%lprun -f knn_clusters knn_clusters(10000, 1000) were:

Timer unit: 1e-06 sTotal time: 0.999948 sFile: <ipython-input-76-7be6bc3ea686>Function: knn_clusters at line 6Line #      Hits         Time  Per Hit   % Time  Line Contents==============================================================     6                                           def knn_clusters(n_points, n_samples):     7                                               """     8                                               Finds the which of n_samples points is closest to each of n_points randomly defined points.     9                                               """    10         1            2      2.0      0.0      RADIUS = 0.01    11                                                   12         1            2      2.0      0.0      def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):    13                                                   """Generate random coordinates in a pandas dataframe"""    14                                                   points = np.random.rand(samples, 2) * width    15                                                   data = pd.DataFrame(points, columns=['latitude', 'longitude'])    16                                                   data.latitude += offsetX    17                                                   data.longitude += offsetY    18                                               19                                                   # set spatial properties    20                                                   data.columnX = 'latitude'    21                                                   data.columnY = 'longitude'    22                                               23                                                   return data    24                                                   25                                               # some random points    26         1         3710   3710.0      0.4      points = get_random_coords(n_points)    27                                                   28                                               # a subset of those points    29         1         7687   7687.0      0.8      samples = points.sample(n_samples)    30                                                   31                                               # KNN    32         1            4      4.0      0.0      tree = NearestNeighbors(n_neighbors=2,     33         1            2      2.0      0.0                              radius=RADIUS,     34         1            2      2.0      0.0                              leaf_size=30,     35         1            2      2.0      0.0                              algorithm="ball_tree",     36         1          275    275.0      0.0                              n_jobs=1,    37         1         6029   6029.0      0.6                             ).fit(points)    38                                                   39                                               #     40         1            7      7.0      0.0      nn_graph = tree.radius_neighbors_graph(samples,     41         1            4      4.0      0.0                                             radius=10*RADIUS,     42         1        90594  90594.0      9.1                                             mode='connectivity',    43         1        36823  36823.0      3.7                                            ).astype('bool').toarray()    44                                                   45                                               # faster assigment to pandas dataframes: entire columns at once    46         1        73476  73476.0      7.3      cluster_number, point_number = np.where(nn_graph)    47         1           74     74.0      0.0      cluster_assignments = -np.ones(n_points)    48         1        11675  11675.0      1.2      cluster_assignments[point_number] = cluster_number    49         1         2548   2548.0      0.3      points.loc[:, 'cluster'] = cluster_assignments    50                                                   51                                               # for comparison: assignment by specific rows is slow    52         1          889    889.0      0.1      points['cluster_alt'] = -1    53      1001         2464      2.5      0.2      for clusterNr, idx in enumerate(nn_graph):    54      1000       763337    763.3     76.3          points.loc[idx, 'cluster_alt'] = clusterNr    55                                                       56                                               # ensure equality of both approaches    57         1          341    341.0      0.0      assert np.all(np.equal(points.cluster, points.cluster_alt))    58                                                   59         1            1      1.0      0.0      return points, nn_graph

Thus assigning an entire column at once was just under 10x faster.

Curt F. 1,65611 silver badges22 bronze badges · Answer 1 · 2016-02-01 16:24:28Z

Assignments to Pandas dataframes always work better when doing entire columns at once, instead of filling a single row or a small number of rows at a time. In the code below I first completely define a NumPy array of cluster numbers and only after it is completely defined do I pass it into the Pandas DataFrame.
Even aside from the speed issue, yourfor loop where you assign cluster numbers is very confusing because of a misuse ofenumerate: you haveclusterNr, idx in enumerate(nngrf) but the idiom is alwaysidx, value in enumerate(values). You have the position of the index and the value switched. In your particular case this works and is a nice trick but you should document how this works in your code if anyone besides you is ever meant to read it.
I changed variable names to bePEP8 compliant and to be easier to read. I also added some comments,docstrings, and formatting. Particular changes of note were definingn_points andn_samples as variables, wrapping the code you wrote into a function (so that I could line profile it), makingRADIUS a variable defined in all-caps (which in Python suggests it is a constant), and defining the radius in the call to.radius_neighbors_graph to be in terms ofRADIUS rather than just be another magic number hard-coded into the code. I think most of these changes improve your code's readability and make it more in line with style guidelines for Python.

Minor point: coercing to boolean with.astype('bool') works on SciPy sparse CSR matrices, and so doing the coercionbefore converting to a non-sparse NumPy array with.toarray() should be slightly faster and use less memory than doing things the other way around -- no time is wasted on converting zeroes.

import numpy as npimport pandas as pdfrom sklearn.neighbors import NearestNeighbors%load_ext line_profilerdef knn_clusters(n_points, n_samples):    """    Finds the which of n_samples points is closest to each of n_points randomly defined points.    """    RADIUS = 0.01    def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):        """Generate random coordinates in a pandas dataframe"""        points = np.random.rand(samples, 2) * width        data = pd.DataFrame(points, columns=['latitude', 'longitude'])        data.latitude += offsetX        data.longitude += offsetY        # set spatial properties        data.columnX = 'latitude'        data.columnY = 'longitude'        return data    # some random points    points = get_random_coords(n_points)    # a subset of those points    samples = points.sample(n_samples)    # KNN    tree = NearestNeighbors(n_neighbors=2,                             radius=RADIUS,                             leaf_size=30,                             algorithm="ball_tree",                             n_jobs=1,                           ).fit(points)    #     nn_graph = tree.radius_neighbors_graph(samples,                                            radius = 10 * RADIUS,                                            mode = 'connectivity',                                          ).astype('bool').toarray()    # faster assigment to pandas dataframes: entire columns at once    cluster_number, point_number = np.where(nn_graph)    cluster_assignments = -np.ones(n_points)    cluster_assignments[point_number] = cluster_number    points.loc[:, 'cluster'] = cluster_assignments    # for comparison: assignment by specific rows is slow    points['cluster_alt'] = -1    for clusterNr, idx in enumerate(nn_graph):        points.loc[idx, 'cluster_alt'] = clusterNr    # ensure equality of both approaches    assert np.all(np.equal(points.cluster, points.cluster_alt))    return points, nn_graph

The results of line profiling with%lprun -f knn_clusters knn_clusters(10000, 1000) were:

Timer unit: 1e-06 sTotal time: 0.999948 sFile: <ipython-input-76-7be6bc3ea686>Function: knn_clusters at line 6Line #      Hits         Time  Per Hit   % Time  Line Contents==============================================================     6                                           def knn_clusters(n_points, n_samples):     7                                               """     8                                               Finds the which of n_samples points is closest to each of n_points randomly defined points.     9                                               """    10         1            2      2.0      0.0      RADIUS = 0.01    11                                                   12         1            2      2.0      0.0      def get_random_coords(samples=1000, offsetX=52.2, offsetY=13.1, width=0.5):    13                                                   """Generate random coordinates in a pandas dataframe"""    14                                                   points = np.random.rand(samples, 2) * width    15                                                   data = pd.DataFrame(points, columns=['latitude', 'longitude'])    16                                                   data.latitude += offsetX    17                                                   data.longitude += offsetY    18                                               19                                                   # set spatial properties    20                                                   data.columnX = 'latitude'    21                                                   data.columnY = 'longitude'    22                                               23                                                   return data    24                                                   25                                               # some random points    26         1         3710   3710.0      0.4      points = get_random_coords(n_points)    27                                                   28                                               # a subset of those points    29         1         7687   7687.0      0.8      samples = points.sample(n_samples)    30                                                   31                                               # KNN    32         1            4      4.0      0.0      tree = NearestNeighbors(n_neighbors=2,     33         1            2      2.0      0.0                              radius=RADIUS,     34         1            2      2.0      0.0                              leaf_size=30,     35         1            2      2.0      0.0                              algorithm="ball_tree",     36         1          275    275.0      0.0                              n_jobs=1,    37         1         6029   6029.0      0.6                             ).fit(points)    38                                                   39                                               #     40         1            7      7.0      0.0      nn_graph = tree.radius_neighbors_graph(samples,     41         1            4      4.0      0.0                                             radius=10*RADIUS,     42         1        90594  90594.0      9.1                                             mode='connectivity',    43         1        36823  36823.0      3.7                                            ).astype('bool').toarray()    44                                                   45                                               # faster assigment to pandas dataframes: entire columns at once    46         1        73476  73476.0      7.3      cluster_number, point_number = np.where(nn_graph)    47         1           74     74.0      0.0      cluster_assignments = -np.ones(n_points)    48         1        11675  11675.0      1.2      cluster_assignments[point_number] = cluster_number    49         1         2548   2548.0      0.3      points.loc[:, 'cluster'] = cluster_assignments    50                                                   51                                               # for comparison: assignment by specific rows is slow    52         1          889    889.0      0.1      points['cluster_alt'] = -1    53      1001         2464      2.5      0.2      for clusterNr, idx in enumerate(nn_graph):    54      1000       763337    763.3     76.3          points.loc[idx, 'cluster_alt'] = clusterNr    55                                                       56                                               # ensure equality of both approaches    57         1          341    341.0      0.0      assert np.all(np.equal(points.cluster, points.cluster_alt))    58                                                   59         1            1      1.0      0.0      return points, nn_graph

Thus assigning an entire column at once was just under 10x faster.

Movatterモバイル変換

Stack Exchange Network

PANDAS spatial clustering

1 Answer1

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS