Sophia Parafina
Speeding up geodata processing

I've been using the excellent geopandas for working with largish geodata sets and CSV files. While geopandas has been great for working with data, it is slow to ingest geodata. I ran a simple test to time reading a 1.2GB line shapefile into a dataframe.

import geopandas as gpd
import time
import pickle

# read shapefile
read_start = time.process_time()
data = gpd.read_file("Streets.shp")
read_end = time.process_time()
read_time = read_end - read_start
print(str(read_time / 60) + " minutes")

25.43723483333333 minutes

Martin Fleischmann suggested using pyogrio instead of geopandas. The result was quite impressive.

# alternate package for reading data
from pyogrio import read_dataframe
import time
import pickle

# read shapefile
read_start = time.process_time()
data = read_dataframe("Streets.shp")
read_end = time.process_time()
read_time = read_end - read_start
print(str(read_time / 60) + " minutes")

2.9936875333333335 minutes
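As a side note, newer geopandas releases can use pyogrio under the hood while keeping the familiar read_file API. This is a minimal sketch, assuming geopandas 0.11 or later with pyogrio installed:

import geopandas as gpd

# assumes geopandas 0.11+; pyogrio does the reading, the read_file API stays the same
data = gpd.read_file("Streets.shp", engine="pyogrio")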

While going from 25 minutes to 3 minutes is quite an improvement, I'm building out a data processing pipeline and I want to reduce read time even more. My next experiment uses Python's pickle, which serializes the dataframe into a byte stream and writes it to a file. Here are the results from pickling the dataframe and reading the pickled data.

# create a file
picklefile = open('streets', 'wb')
# pickle the dataframe
pickle_write_start = time.process_time()
pickle.dump(data, picklefile)
pickle_write_end = time.process_time()
# close file
picklefile.close()
pickle_write = (pickle_write_end - pickle_write_start) / 60
print(str(pickle_write) + " minutes")

4.362236583333333 minutes

# read the pickle file
picklefile = open('streets', 'rb')
# unpickle the dataframe
pickle_read_start = time.process_time()
df = pickle.load(picklefile)
pickle_read_end = time.process_time()
# close file
picklefile.close()
pickle_read = (pickle_read_end - pickle_read_start) / 60
print(str(pickle_read) + " minutes")

0.9217719833333339 minutes
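If you'd rather not manage file handles by hand, the same round trip can be written with the pickle helpers that GeoDataFrame inherits from pandas. A sketch, which should be equivalent to the open/dump/load code above:

import pandas as pd

# write the dataframe to a pickle file in one call
data.to_pickle('streets')

# read it back; the object unpickles as a GeoDataFrame, geometry column intact
df = pd.read_pickle('streets')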

Wow! Reading the 1.2GB shapefile has gone from 25 minutes to less than a minute.

Finally, pickling the dataframe also shrinks the data on disk, from the shapefile's 1.2GB to 984MB. Better still, pickled data can be compressed efficiently.

import gzip
import shutil

with open('streets', 'rb') as f_in:
    with gzip.open('streets.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

The compressed file is 78.8MB, a bit more than a 10x reduction in file size.
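To use the compressed file later in the pipeline, the pickle can be loaded straight through gzip, with no need to decompress to disk first. A minimal sketch:

import gzip
import pickle

# stream the gzipped pickle back into a dataframe
with gzip.open('streets.gz', 'rb') as f:
    df = pickle.load(f)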

Conclusion

If you're working with geodata that remains static, you can improve geopandas' read times by using pyogrio and pickling the resulting dataframe. Additionally, pickle files can be compressed efficiently, which can lower costs for data egress when using cloud storage.
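Putting it all together, one way a pipeline step might look is a small loader that checks for a cached pickle before falling back to the slower shapefile read. This is just a sketch, and the streets.pkl cache path is a hypothetical name:

import os
import pickle

from pyogrio import read_dataframe

CACHE = 'streets.pkl'  # hypothetical cache location

def load_streets():
    # reuse the cached pickle when it exists
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)
    # otherwise read the shapefile with pyogrio and cache the result
    data = read_dataframe("Streets.shp")
    with open(CACHE, 'wb') as f:
        pickle.dump(data, f)
    return data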

Pickle image by Renee Comet
