Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Hackers And Slackers profile imageTodd Birchard
Todd Birchard forHackers And Slackers

Posted on • Originally published athackersandslackers.com on

     

Extract Nested Data From Complex JSON

Extract Nested Data From Complex JSON

We're all data people here, so you already know the scenario: it happens perhaps once a day, perhaps 5, or even more. There's an API you're working with, and it's great. It contains all the information you're looking for, but there's just one problem: the complexity of nested JSON objects is endless, and suddenly the job you love needs to be put on hold to painstakingly retrieve the data you actually want, and it's 5 levels deep in a nested JSON hell. Nobody feels like much of a "scientist" or an "engineer" when half their day becomes dealing with key value errors.

Luckily, we code inPython! (okay fine, language doesn't make much of a difference here. It felt like a rallying call at the time).

Using Google Maps API as an Example

To visualize the problem, let's take an example somebody might actually want to use. I think theGoogle Maps API is a good candidate to fit the bill here.

While Google Maps is actually a collection of APIs, theGoogle Maps Distance Matrix. The idea is that with a single API call, a user can calculate the distance and time traveled between an origin and an infinite number of destinations. It's a great full-featured API, but as you might imagine the resulting JSON for calculating commute time between where you stand andevery location in the conceivable universe makes an awfully complex JSON structure.

Getting a Taste of JSON Hell

Real quick, here's an example of the types of parameters this request accepts:

importrequestsimportAPI_KEYdefgoogle_api_matrix():"""Example Google Distance Matrix function."""endpoint="https://maps.googleapis.com/maps/api/distancematrix/json"params={'units':'imperial','key':API_KEY,'origins':'New York City, NY','destinations':'Philadelphia,PA','transit_mode':'car'}r=requests.get(endpoint,params=params)returnr.json
Enter fullscreen modeExit fullscreen mode

One origin, one destination. The JSON response for a request this straightforward is quite simple:

{"destination_addresses":["Philadelphia, PA, USA"],"origin_addresses":["New York, NY, USA"],"rows":[{"elements":[{"distance":{"text":"94.6 mi","value":152193},"duration":{"text":"1 hour 44 mins","value":6227},"status":"OK"}]}],"status":"OK"}
Enter fullscreen modeExit fullscreen mode

For each destination, we're getting two data points: thecommute distance, andestimated duration. If we hypothetically wanted to extract those values, typingresponse['rows'][0]['elements']['distance']['test'] isn'ttoo crazy. I mean, it's somewhat awful and brings on casual thoughts of suicide, but nothing out of the ordinary

Now let's make things interesting by adding a few more stops on our trip:

importrequestsimportAPI_KEYdefgoogle_api_matrix():"""Example Google Distance Matrix function."""endpoint="https://maps.googleapis.com/maps/api/distancematrix/json"params={'units':'imperial','key':API_KEY,'origins':'New York City, NY','destinations':'Washington,DC|Philadelphia,PA|Santa Barbara,CA|Miami,FL|Austin,TX|Napa County,CA','transit_mode':'car'}r=requests.get(endpoint,params=params)returnr.json
Enter fullscreen modeExit fullscreen mode

Oh fuuucckkkk:

{"destination_addresses":["Washington, DC, USA","Philadelphia, PA, USA","Santa Barbara, CA, USA","Miami, FL, USA","Austin, TX, USA","Napa County, CA, USA"],"origin_addresses":["New York, NY, USA"],"rows":[{"elements":[{"distance":{"text":"227 mi","value":365468},"duration":{"text":"3 hours 54 mins","value":14064},"status":"OK"},{"distance":{"text":"94.6 mi","value":152193},"duration":{"text":"1 hour 44 mins","value":6227},"status":"OK"},{"distance":{"text":"2,878 mi","value":4632197},"duration":{"text":"1 day 18 hours","value":151772},"status":"OK"},{"distance":{"text":"1,286 mi","value":2069031},"duration":{"text":"18 hours 43 mins","value":67405},"status":"OK"},{"distance":{"text":"1,742 mi","value":2802972},"duration":{"text":"1 day 2 hours","value":93070},"status":"OK"},{"distance":{"text":"2,871 mi","value":4620514},"duration":{"text":"1 day 18 hours","value":152913},"status":"OK"}]}],"status":"OK"}
Enter fullscreen modeExit fullscreen mode

A lot is happening here. There are objects. There are lists. There are lists of objects which are part of an object. The last thing I'd want to deal with is trying to parse this data only to accidentally get a useless key:value pair like"status": "OK".

Code Snippet To The Rescue

Let's say we only want the human-readable data from this JSON, which is labeled"text" for both distance and duration. We've created a function below dubbedextract_values() to help us resolve this very issue. The idea is thatextract_values() is flexible and agnostic, therefore can be imported as a module into any project you might need.

# recursivejson.pydefextract_values(obj,key):"""Pull all values of specified key from nested JSON."""arr=[]defextract(obj,arr,key):"""Recursively search for values of key in JSON tree."""ifisinstance(obj,dict):fork,vinobj.items():ifisinstance(v,(dict,list)):extract(v,arr,key)elifk==key:arr.append(v)elifisinstance(obj,list):foriteminobj:extract(item,arr,key)returnarrresults=extract(obj,arr,key)returnresults
Enter fullscreen modeExit fullscreen mode

We need to pass this function two values:

  • A JSON object, such asr.json() from an API request.
  • The name of thekey we're looking to extract values from.<!--kg-card-begin: markdown-->
names=extract_values('myjson.json','name')print(names)
Enter fullscreen modeExit fullscreen mode

Regardless of where the key"text" lives in the JSON, this function returns every value for the instance of"key." Here's our function in action:

importrequestsimportAPI_KEYfromrecursivejsonimportextract_valuesdefgoogle_api_matrix():"""Example Google Distance Matrix function."""endpoint="https://maps.googleapis.com/maps/api/distancematrix/json"params={'units':'imperial','key':API_KEY,'origins':"New York City,NY",'destinations':"Washington,DC|Philadelphia,PA|Santa Barbara,CA|Miami,FL|Austin,TX|Napa Valley,CA",'transit_mode':'car',}r=requests.get(endpoint,params=params)travel_values=extract_values(r.json(),'text')returntravel_values
Enter fullscreen modeExit fullscreen mode

Running this function will result in the following output:

['227 mi','3 hours 54 mins','94.6 mi','1 hour 44 mins','2,878 mi','1 day 18 hours','1,286 mi','18 hours 43 mins','1,742 mi','1 day 2 hours','2,871 mi','1 day 18 hours']
Enter fullscreen modeExit fullscreen mode

Ohfiddle me timbers! Because the Google API alternates betweendistance andtrip duration , every other value alternates between distance and time (can we pause to appreciate this horrible design? There are infinitely better ways to structure this response). Never fear, some simple Python can help us split this list into two lists:

my_values=extract_values(r.json(),'text')durations=my_values[1::2]distances=my_values[2::1]print('DURATIONS =',durations)print('DISTANCES =',distances)
Enter fullscreen modeExit fullscreen mode

This will take our one list and split it in totwo lists, alternating between even and odd:

DURATIONS=['3 hours 54 mins','1 hour 44 mins','1 day 18 hours','18 hours 43 mins','1 day 2 hours','1 day 18 hours']DISTANCES=['94.6 mi','1 hour 44 mins','2,878 mi','1 day 18 hours','1,286 mi','18 hours 43 mins','1,742 mi','1 day 2 hours','2,871 mi','1 day 18 hours']
Enter fullscreen modeExit fullscreen mode

Getting Creative With Lists

A common theme I run into while extracting lists of values from JSON objects like these is that the lists of values I extract are very much related. In the above example, for everyduration we have an accompanyingdistance, which is a one-to-one basis. Imagine if we wanted to associate these values somehow?

To use a better example, I recently I used thisexact_values() function to extract lists of column names and their data types from a database schema. As separate lists, the data looked something like this:

column_names=['index','first_name','last_name','join_date']column_datatypes=['integer','string','string','date']
Enter fullscreen modeExit fullscreen mode

Clearly these two lists are directly related; the latter is describing the former. How can this be useful? By using Python'szip method!

schema_dict=dict(zip(column_names,column_datatypes))print(schema_dict)
Enter fullscreen modeExit fullscreen mode

I like to think they call itzip because it's like zipping up a zipper, where each side of the zipper is a list. This output a dictionary where list 1 serves as the keys, and list 2 serves as values:

{'index':'integer','first_name':'string','last_name':'string','join_date':'date'}
Enter fullscreen modeExit fullscreen mode

And there you have it folks: a free code snippet to copy and secretly pretend you wrote forever. I've thrown the function up onGithub Gists, if such a thing pleases you.

In the meantime, zip it up and zip it out. Zippity-do-da, buh bye.

Top comments(0)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

Technology for Badasses.

A community of degenerates obsessed with data science, data engineering, and analysis. Tackling major issues with unconventional tutorials. Openly pushing a pro-robot agenda.

More fromHackers And Slackers

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp