Analyzing NYC Traffic Collision Data on the Linux Command Line with SoQL and Curl

New York City and the NYPD publish a dataset of traffic collisions, related fatalities, injuries and other detailshere.This was first published in July 2012, and is updated regularly to this date.

NYC Commuters, especially pedestrians and cyclists, have to endure many hazards just to get to and from work. Crime is one issue, but it’s not as treacherous as crossing Queens Boulevard during rush hour, or cycling in downtown Manhattan whensome clown driving an SUV the size of anSherman tank is taking up half the road.

Previously I did someanalysis using their downloadable CSV dataset.

Here I’m going to use thecurl utility along with theSODA, Socrata Query LanguageSoQL, to try and make some sense out of this publisheddata.

How many collisions in New York CIty since July 2012

curl --get --data-urlencode "\$\$app_token=uvwxyz" --data-urlencode "\$select=count(*)" https://data.cityofnewyork.us/resource/h9gi-nx95.json [{"count":"1977803"}]

Explanation

  • curl --get or-G
    • Use the GET verb as we are ‘getting’ data
    • v
      • Lots of verbose output as you can see from the above output.
    • d
      • Request data to pass to the API in ASCII format
    • --data-urlencode
      • URL-Encode the data. Safer than just using-d
  • $$app_token
    • Users personal authorization. Not really necessary for ad-hoc requests
    • Socrata open data API [App-Token]](https://dev.socrata.com/docs/app-tokens.html)
    • I replaced my actual token with ‘uvwxyz’ for fairly obvious reasons
  • `"$select=count(*)"
    • Similar to the SQLSELECT and SQLcount aggregate function
    • SoQL$select
    • SoQLcount

Observation

1,977,803 Collisions from July 2012 to March 2023 seems like a lot to me. You’d wonder what’s the point of driving tests if we still end up with this many collisions.

Getting all the collision records between two arbitrary dates, June 30th 2022 to December 31 2022.

This time I’ll use the-v switch for curl to get a much more verbose output.

> curl --get -v  --data-urlencode "\$\$app_token=xyz" --data-urlencode "\$select=*" --data-urlencode "\$where=crash_date between '2022-06-30T00:00:00.000' and '2022-12-31T00:00:00.000'"  https://data.cityofnewyork.us/resource/h9gi-nx95.json % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current                                 Dload  Upload   Total   Spent    Left  Speed  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.206.68.26:443...* Connected to data.cityofnewyork.us (52.206.68.26) port 443 (#0)...> GET /resource/h9gi-nx95.json?$$app_token=xyz&$select=%2A&$where=crash_date+between+%272022-06-30T00%3A00%3A00.000%27+and+%272022-12-31T00%3A00%3A00.000%27 HTTP/1.1> Host: data.cityofnewyork.us> User-Agent: curl/7.81.0...< HTTP/1.1 200 OK....< X-SODA2-Fields: ["crash_date","crash_time","borough","zip_code","latitude","longitude","location","on_street_name","off_street_name","cross_street_name","number_of_persons_injured","number_of_persons_killed","number_of_pedestrians_injured","number_of_pedestrians_killed","number_of_cyclist_injured","number_of_cyclist_killed","number_of_motorist_injured","number_of_motorist_killed","contributing_factor_vehicle_1","contributing_factor_vehicle_2","contributing_factor_vehicle_3","contributing_factor_vehicle_4","contributing_factor_vehicle_5","collision_id","vehicle_type_code1","vehicle_type_code2","vehicle_type_code_3","vehicle_type_code_4","vehicle_type_code_5"]< X-SODA2-Types: ["floating_timestamp","text","text","text","number","number","location","text","text","text","number","number","number","number","number","number","number","number","text","text","text","text","text","number","text","text","text","text","text"]< X-SODA2-Data-Out-Of-Date: false...{ [14733 bytes data]...[{"crash_date":"2022-06-30T00:00:00.000","crash_time":"14:01","cross_street_name":"101       EAST DRIVE","number_of_persons_injured":"1","number_of_persons_killed":"0","number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0","number_of_cyclist_injured":"1","number_of_cyclist_killed":"0","number_of_motorist_injured":"0","number_of_motorist_killed":"0","contributing_factor_vehicle_1":"Pedestrian/Bicyclist/Other Pedestrian Error/Confusion","collision_id":"4542318","vehicle_type_code1":"Bike"}...,{"crash_date":"2022-07-03T00:00:00.000","crash_time":"22:30","borough":"BRONX","zip_code":"10458","latitude":"40.866802","longitude":"-73.88444","location":{"latitude":"40.866802","longitude":"-73.88444","human_address":"{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}"},"on_street_name":"WEBSTER AVENUE","off_street_name":"EAST 199 STREET","number_of_persons_injured":"0","number_of_persons_killed":"0","number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0","number_of_cyclist_injured":"0","number_of_cyclist_killed":"0","number_of_motorist_injured":"0","number_of_motorist_killed":"0","contributing_factor_vehicle_1":"Driver Inattention/Distraction","contributing_factor_vehicle_2":"Unspecified","collision_id":"4543075","vehicle_type_code1":"Station Wagon/Sport Utility Vehicle","vehicle_type_code2":"Station Wagon/Sport Utility Vehicle"}]

Explanation

  • 1000 records

    • When no$limit is set, this is the default maximum rows returned
  • curl --get or-G

    • Use the GET verb as we are ‘getting’ data
    • -v
      • Lots of verbose output as you can see
    • -d
      • Request data to pass to the API in ASCII format
    • --data-urlencode
      • URL-Encode the data. Safer than just using-d
  • $$app_token

    • Users personal authorization. Not really necessary for ad-hoc requests
  • `"$select=*"

    • Similar to an SQLSELECT
    • Selecting all columns. This is the default and can be omitted
    • SoQL$select
  • $where

    • Similar to SQLWHERE to filter down data.
    • SoQL$where
  • between … and …

    • SoQLbetween
    • Narrow our results down to collisions between the two *inclusive ‘crash_date’ values

Observation

It turns out after piping this request to awc command, that the API only returns 1000 rows, which is the default maximum amount if the$limit clause isn’t specified. With the$limit clause, the maximum amount that can be returned with one call is 50,000 rows. To get more, you will need to order andpage through the data.One other thing to note here is that when using the-v, verbose switch, you get to see the column names and their data types.

The NYC dataset column names
"crash_date","crash_time","borough","zip_code","latitude","longitude","location","on_street_name","off_street_name","cross_street_name","number_of_persons_injured","number_of_persons_killed","number_of_pedestrians_injured","number_of_pedestrians_killed","number_of_cyclist_injured","number_of_cyclist_killed","number_of_motorist_injured","number_of_motorist_killed","contributing_factor_vehicle_1","contributing_factor_vehicle_2","contributing_factor_vehicle_3","contributing_factor_vehicle_4","contributing_factor_vehicle_5","collision_id","vehicle_type_code1","vehicle_type_code2","vehicle_type_code_3","vehicle_type_code_4","vehicle_type_code_5"

Corresponding Field Data Types

"floating_timestamp","text","text","text","number","number","location","text","text","text","number","number","number","number","number","number","number","number","text","text","text","text","text","number","text","text","text","text","text"

SoQLQuery Clauses from the Docs

ParameterDescriptionDefaultIn $query
$selectThe set of columns to be returned, similar to a SELECT in SQLAll columns, equivalent to $select=*SELECT
$whereFilters the rows to be returned, similar to WHERENo filterWHERE
$orderColumn to order results on, similar to ORDER BY in SQLUnspecified orderORDER BY
$groupColumn to group results on, similar to GROUP BY in SQLNo groupingGROUP BY
$havingFilters the rows that result from an aggregation, similar to HAVINGNo filterHAVING
$limitMaximum number of results to return1000 (2.0 endpoints: maximum of 50,000; 2.1: unlimited »)LIMIT
$offsetOffset count into the results to start at, used for paging0OFFSET
$qPerforms a full text search for a value.No searchN/A
$queryA full SoQL query string, all as one parameterN/AN/A
$$bomPrepends a UTF-8 Byte Order Mark to the beginning of CSV outputfalseN/A

Get all the collisions for zip code 10036, Times Square NYC, for Feb 2023

Save it into filetimes_square_july_2022.json

curl --get --data-urlencode "\$\$app_token=uvwxyz"  --data-urlencode "\$select=*" / --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" --data-urlencode "zip_code=10036"  https://data.cityofnewyork.us/resource/h9gi-nx95.json >    collisions_z10036_feb_2023.json

Explanation

  • "$where=crash_date between ‘2023-02-01T00:00:00.000’ and ‘2023-02-28T00:00:00.000’" –data-urlencode "zip_code=10036"
    • Specify dates between and including February 1st to the 28th.
    • zip_code=10036 to narrow down our results.

Count how many collisions using the Linuxwc command with our newly created file,times_square_july_2022.json.

wc -l collisions_z10036_feb_2023.json 25 collisions_z10036_feb_2023.json

Double check that count of 25 collisions, using the SoQlcount(*) function.

> curl --get --data-urlencode "\$\$app_token=uvwxyz"  --data-urlencode "\$select=count(*)"  --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" --data-urlencode "zip_code=10036"  https://data.cityofnewyork.us/resource/h9gi-nx95.json [{"count":"25"}]

Explanation

  • \$select=count(*)
    • Similar to the SQLcount function, this uses the SoQLcount function to count the number of rows that match our search criteria.
  • [{"count":"25"}], which matches the number of records in the collisions_z10036_feb_2023.json file

Observation

25 collisions in one midtown zip code for February is almost 1 collision a day. I’m sure that’s lower than many other zip codes.

Get the 10 worst zip codes for collisions in February 2023

> curl --get --silent  ‘$$app_token=uvwxyz’  --data-urlencode "\$select=count(*), zip_code"   --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'"  --data-urlencode '$group=zip_code'   https://data.cityofnewyork.us/resource/h9gi-nx95.json | jq -r '.[] | .zip_code + " " + .count' | sort  -k 2,2nr -k 1n | head -n1011207 10511212 8511208 7911226 7511234 7211236 7211101 7111203 6711368 6711211 62

Explanation

OK, I threw in a lot of commands here.

  • "$select=count(*), zip_code"
    • Selecting the count and zip_code
    • SoQLcount function to count the number of rows that match our search criteria.
  • $group=zip_code
    • Similar to the SQLGROUP BY
    • Returns aggregate rows grouped by thezip_code
  • jq -r '.[] | .zip_code + " " + .count'
    • Using the very usefuljq to do additional filtering
    • jq bills itself as, “a lightweight and flexible command-line JSON processor”
    • I extract the zip_code and collision count for each zip code and concatenate them using thebash +, concatenation operator
  • sort -k 2,2nr -k 1n
    • Using the bashsort command, we do a reverse numerical sort by the second field, which is the count. We also do a numerical sort on the zip_code for zip_codes with identical collision counts
  • head -n10
    • This gets the first 10, which are the 10 zip codes with the most collisions, starting with the very worst.

Observation

I could have used SoQL$sort and$limit to do some of this work, but I chose thebash sort, just because …Zip code11207, East New York, Brooklyn, emerges as the zip with the most collisions in February.This zip has a lot of issues with traffic safety, as you could also checkhere .105 collisions in one month. 3.75 a day? There’s something seriously wrong there. You’d probably need some kind of armor suit just to cross the street there.

As the queries get more complex, these one line commands start to get long and hard to manage.

Curl has an option to create a config file. On a Linux system the default config is usually~/.curlrc. You can specify a config file with the-K or--config switch.

I created the below config file for these requestsThe config file sets the NYC API URL, the $$app_token parameter, a GET request, as well as asking for verbose output

##### The ./.nyc_curlrc file contents# --- NYC Collision Data ---geturl = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"data-urlencode  =  "\$\$app_token=uvwxyz"verbose

The previous example can now be rewritten to use the.nyc_curlrc config file. I also broke up the commands into separate lines using the bash continuation ‘\’ . Enclosing some of the commands in single quotes also means that the ‘$’ doesn’t need to be escaped.

> curl -K ./.nyc_curlrc \ --data-urlencode '$select=count(*), zip_code' \ --data-urlencode '$where=crash_date between "2023-02-01T00:00:00.000" and "2023-02-28T00:00:00.000"' \ --data-urlencode '$group=zip_code'  \  | jq -r '.[] | .zip_code + " " + .count' | sort  -k 2,2nr -k 1n | head -n10

This is a little more concise than the previous version, and yields the same result.

Now to find how many cyclists and pedestrians were killed over the duration of this dataset

 curl -K ./.nyc_curlrc \  --data-urlencode "\$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed, SUM(number_of_cyclist_killed) AS tot_cyclist_killed"  \  --data-urlencode "\$group=year" \  --data-urlencode "\$order=tot_pedestrians_killed DESC"  | jq .  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current                                 Dload  Upload   Total   Spent    Left  Speed100   885    0   885    0     0   1616      0 --:--:-- --:--:-- --:--:--  1614[  {    "year": "2013",    "tot_pedestrians_killed": "176",    "tot_cyclist_killed": "11"  },  {    "year": "2016",    "tot_pedestrians_killed": "149",    "tot_cyclist_killed": "18"  },  {    "year": "2014",    "tot_pedestrians_killed": "133",    "tot_cyclist_killed": "20"  },  {    "year": "2015",    "tot_pedestrians_killed": "133",    "tot_cyclist_killed": "15"  },  {    "year": "2022",    "tot_pedestrians_killed": "132",    "tot_cyclist_killed": "18"  },  {    "year": "2019",    "tot_pedestrians_killed": "131",    "tot_cyclist_killed": "31"  },  {    "year": "2021",    "tot_pedestrians_killed": "129",    "tot_cyclist_killed": "19"  },  {    "year": "2017",    "tot_pedestrians_killed": "127",    "tot_cyclist_killed": "27"  },  {    "year": "2018",    "tot_pedestrians_killed": "123",    "tot_cyclist_killed": "10"  },  {    "year": "2020",    "tot_pedestrians_killed": "101",    "tot_cyclist_killed": "29"  },  {    "year": "2012",    "tot_pedestrians_killed": "72",    "tot_cyclist_killed": "6"  },  {    "year": "2023",    "tot_pedestrians_killed": "18",    "tot_cyclist_killed": "8"  }]

Explanation

  • date_extract_y(crash_date) AS year
    • Will extract ‘2023’ from ‘2023-02-03T00:00:00.000’
    • SoQLdate_extract_y
    • SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed
      • SUM
        • Similar to SQLSUM aggregate function.
      • AS
        • Give these aggregate results a meaningful label
    • $group=year and$order=tot_pedestrians_killed
      • Similar to the SQLGROUP BY andORDER BY
      • Returns aggregate rows grouped by the year they occurred.
      • Sorted having the year with most pedestrian fatalities first
    • jq .
      • This is the most basicjq command
      • It just prints the JSON output in it’s default “pretty” format
    • We could have added--silent to thecurl command or config file, to not print the curl download statistics.

Observation

    2012 and the current year, 2023,  can be omitted as both years have incomplete data.

Run the previous query minus years 2012 and 2023

curl -K ./.nyc_curlrc   --data-urlencode '$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed, SUM(number_of_cyclist_killed) AS tot_cyclists_killed' \  --data-urlencode '$where=year not in ("2012", "2023")' \  --data-urlencode '$group=year' \  --data-urlencode '$order=tot_pedestrians_killed DESC, tot_cyclists_killed'[{"year":"2013","tot_pedestrians_killed":"176","tot_cyclists_killed":"11"},{"year":"2016","tot_pedestrians_killed":"149","tot_cyclists_killed":"18"},{"year":"2015","tot_pedestrians_killed":"133","tot_cyclists_killed":"15"},{"year":"2014","tot_pedestrians_killed":"133","tot_cyclists_killed":"20"},{"year":"2022","tot_pedestrians_killed":"132","tot_cyclists_killed":"18"},{"year":"2019","tot_pedestrians_killed":"131","tot_cyclists_killed":"31"},{"year":"2021","tot_pedestrians_killed":"129","tot_cyclists_killed":"19"},{"year":"2017","tot_pedestrians_killed":"127","tot_cyclists_killed":"27"},{"year":"2018","tot_pedestrians_killed":"123","tot_cyclists_killed":"10"},{"year":"2020","tot_pedestrians_killed":"101","tot_cyclists_killed":"29"}]

Explanation

  • $where=year not in ("2012", "2023")
    • Added aWHERE clause to omit years 2012 and 2023 from the query
    • SoQLnot in (…)

Observation

Well, it’s not that safe being a pedestrian or cyclist in New York City. Checking the injury count would yield much higher numbers.

Run a query to get a yearly total of injured pedestrians and cyclists

Our query string was getting a little bit out of hand and difficult to manage.I created a dedicated config file,.nyc_ped_cyc_injured_yearly_curlrc for our next request.

The Config

> cat .nyc_ped_cyc_injured_yearly_curlrc # --- NYC Collision Data - Injured List  ---geturl = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"data-urlencode  = "\$\$app_token=uvwxyz"data-urlencode  = "\$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_injured) AS tot_pedestrians_injured, SUM(number_of_cyclist_injured) AS tot_cyclists_injured"data-urlencode  = "\$where=year not in ('2012','2023')"data-urlencode  = "\$group=year" data-urlencode  = "\$order=tot_pedestrians_injured DESC, tot_cyclists_injured DESC"silent

Query using the config file

>  curl --config ./.nyc_ped_cyc_injured_yearly_curlrc [{"year":"2013","tot_pedestrians_injured":"11988","tot_cyclists_injured":"4075"},{"year":"2017","tot_pedestrians_injured":"11151","tot_cyclists_injured":"4889"},{"year":"2018","tot_pedestrians_injured":"11123","tot_cyclists_injured":"4725"},{"year":"2016","tot_pedestrians_injured":"11090","tot_cyclists_injured":"4975"},{"year":"2014","tot_pedestrians_injured":"11036","tot_cyclists_injured":"4000"},{"year":"2019","tot_pedestrians_injured":"10568","tot_cyclists_injured":"4986"},{"year":"2015","tot_pedestrians_injured":"10084","tot_cyclists_injured":"4281"},{"year":"2022","tot_pedestrians_injured":"8963","tot_cyclists_injured":"5025"},{"year":"2021","tot_pedestrians_injured":"7503","tot_cyclists_injured":"4961"},{"year":"2020","tot_pedestrians_injured":"6691","tot_cyclists_injured":"5576"}]

Observation

Looks like the config file worked as expected. While the number of pedestrians injured is declining a little, the number of cyclists injured is going in the opposite direction.

Usingjq to do additional filtering

Similar to the previous query, extract the yearly totals of injured cyclists. This time we’ll usejq to filter the output.

> curl --config ./.nyc_ped_cyc_injured_yearly_curlrc \  | jq -r '.[] | .year + "," + .tot_cyclists_injured' | sort -k 1n \  | column -t -s, --table-columns=Year,CyclistsInjured Year  CyclistsInjured2013  40752014  40002015  42812016  49752017  48892018  47252019  49862020  55762021  49612022  5025

Explanation

This is similar to the previous query except I usedjq to extract the injured cyclist data only from the returned results.

  • sort -k 1n
    • Sort the Year, numerically
  • column -t -s, --table-columns=Year,CyclistsInjured
    • Add column headers for readability
    • Thejq command already created comma separated results

Observation

2020 and 2022 were the worst years for bicyclist injuries. 2020 was a year where cycling became more popular. The injuries dropped a little in 2021, maybe because cyclists got a little scared after the slaughter in 2020. The upward trend may be returning, based on the 2022 results.

Get the 10 worst zip codes for collisions in January 2023

Previously I got the [10 worst Zip codes for collisions in February][Get the 10 worst zip codes for collisions in February 2023]. I used some bash commands to fine tune results. Here I will use SoQL to do most of the heavy lifting.

Config file.nyc_jan_coll_curlrc

> cat .nyc_jan_coll_curlrc # --- NYC Collision Data - January Collisions  ---geturl = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"data-urlencode  = "\$\$app_token=uvwxyz"data-urlencode = "\$select=zip_code,count(zip_code) AS collision_count"data-urlencode = "\$where=crash_date between '2023-01-01' AND '2023-01-31' "data-urlencode = "\$group=zip_code"data-urlencode = "\$order=collision_count DESC, zip_code"data-urlencode = "\$limit=10"silent
> curl --config ./.nyc_jan_coll_curlrc \  | jq -r '.[] | .zip_code + ", " + .collision_count' \  | column -t -s, --table-columns=ZipCode,CollisionCountZipCode  CollisionCount11207     12411236     8311208     8211212     7711203     6911385     6711234     6611206     6410002     6311101     61

Explanation

Most of this is similar to our earlier request for February stats. This time we are using a new config file.nyc_jan_coll_curlrc.
Instead of sorting the results using the bash sort, we sort using the SoQL[$order](https://dev.socrata.com/docs/queries/order.html). We get the 10 worst using the$limit clause.

  • \$order=collision_count DESC, zip_code
    • Sort the collision count from worst to “least worst”.
    • zip_code ascending sort
  • $limit=10
    • Get the first 10 after the sort using$limit

Observation

Zip Code11207, again emerges as a collision prone area with 124 collisions in January. That’s 4 collisions a day. Every day is a regular demolition derby day in that part of Brooklyn.

Some Perl CLI Resources

Perldocs – perlrun

Peteris Krumins has some great e-books

Dave Cross – From one of his older posts on perl.com

Some NYC Street Resources

SODA Developers Guide

Some NYC Street Resources

StreetsBlog NYC

Hellgate NYC – Local NYC News

Liam Quigley – Local Reporter

More Liam Quigley – Twitter

These Stupid Trucks are Literally Killing Us – YouTube

Me

LinkedInblog