⅕ of NYC collision data wrong

New York has really stepped up its open data game in the last year. Data sets like ACRIS (the city property records) and MapPLUTO (GIS) that used to cost thousands of dollars are now free. The city now makes its official geocoder available too, although it is still in beta with limited access. This openness is throwing into sharp relief exactly how poor the quality of the city’s collision statistics is.

The NYPD has been releasing intersection-level collision reports every month since August 2011. I’ve been collecting and processing these reports from an obfuscated Excel (originally only PDF) format into machine-readable CSVs.

These reports are the basis for NYC Crashmapper, an interactive map of these collisions across the city. Unfortunately, the actual releases do not include geographical information; they include only the borough, police precinct, and street intersection. With some help from Tom Swanson, most of these intersections were geocoded using ArcGIS, while new intersections appearing on the monthly report were batched through Google’s public geocoder.

This process yielded longitude and latitude for about 97% of the almost 44,000 intersections with collisions on record for the last several years. However, there were clearly problems with the data. Sometimes Google would get it wrong, placing an intersection well outside the boundaries of New York, or simply placing it in the wrong police precinct. Occasionally I would clean the data to filter these out, but at heart Google’s geocoder is something of a black box. I knew some percentage of the intersections had the wrong longitude/latitude, but it was impossible to know precisely how many, or how wrong they were.

While NYC’s own geocoder has been floating around for a while, it was originally a Windows-only executable. I messed around with it, but booting up VirtualBox and clicking around in a GUI couldn’t be easily integrated into an automated workflow.

Now the same data is available as the Geoclient API, although it requires an app registration and a wait for approval (several days, for me). Six REST endpoints return longitude and latitude as JSON, alongside a host of other info (city council districts, police precincts, etc.), based on address, intersection, blockface, place name, BBL, or BIN input. I wrapped the API in Python bindings to make it a bit easier to work with.

Upon receiving email confirmation from DoITT that my app had been approved to use the API, I ran through the 44,000 intersections. There doesn’t appear to be any kind of rate limiting, and the process went relatively quickly. There’s no batch feature at the moment, so each request must be made individually.
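
Here is a rough sketch of what one such request might look like, using only the standard library rather than the bindings mentioned above. The base URL, the parameter names (`crossStreetOne`, `crossStreetTwo`, `borough`, `app_id`, `app_key`) and the `intersection` key in the response are assumptions based on my reading of the Geoclient v1 documentation, so check them against the current docs. The credentials are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for Geoclient v1; verify against the current docs.
BASE_URL = "https://api.cityofnewyork.us/geoclient/v1"


def intersection_url(street_one, street_two, borough, app_id, app_key):
    """Build the request URL for a single intersection lookup."""
    params = {
        "crossStreetOne": street_one,   # assumed parameter name
        "crossStreetTwo": street_two,   # assumed parameter name
        "borough": borough,
        "app_id": app_id,
        "app_key": app_key,
    }
    return f"{BASE_URL}/intersection.json?{urlencode(params)}"


def geocode_intersection(street_one, street_two, borough, app_id, app_key):
    """Fetch one intersection. There is no batch call, so one request each."""
    url = intersection_url(street_one, street_two, borough, app_id, app_key)
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: results nested under an "intersection" key.
    return payload.get("intersection", {})


# Building the URL needs neither network access nor real credentials.
print(intersection_url("Avenue V", "Ocean Parkway", "Brooklyn", "MY_ID", "MY_KEY"))
```

Running the full set is then just a loop over the intersections that calls `geocode_intersection` once per row and stores whatever comes back, error message included.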

Geoclient result	# of intersections	%
Successfully geocoded	36058	82.37
Successfully geocoded (precinct match)	33700	76.99
Successfully geocoded (precinct mismatch)	2358	5.39
Streets do not intersect	4139	9.46
Streets intersect twice	2651	6.06
Streets intersect more than twice	626	1.43
Invalid street name	149	0.34
Place name instead of street name, non-addressable	77	0.17
Misspelled street name	32	0.07
One street name was ‘UNKNOWN’	21	0.05
Place name instead of street name, addressable	19	0.04
Other (“not part of”)	2	0.00
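
As a sanity check on the table above, the counts can be re-added to confirm the total and the percentages. This is only a sketch of the arithmetic: the counts are copied from the table, the two precinct rows are left out because they are subsets of the successful row, and the variable names are mine.

```python
# Top-level outcome counts copied from the table above.
outcomes = {
    "Successfully geocoded": 36058,
    "Streets do not intersect": 4139,
    "Streets intersect twice": 2651,
    "Streets intersect more than twice": 626,
    "Invalid street name": 149,
    "Place name, non-addressable": 77,
    "Misspelled street name": 32,
    "One street name was 'UNKNOWN'": 21,
    "Place name, addressable": 19,
    "Other": 2,
}

total = sum(outcomes.values())


def pct(count):
    """Share of all intersections, as a percentage rounded to two places."""
    return round(100 * count / total, 2)


failed = total - outcomes["Successfully geocoded"]
multiple = (outcomes["Streets intersect twice"]
            + outcomes["Streets intersect more than twice"])

print("total intersections:", total)
print("not geocoded:", failed, pct(failed))
print("streets do not intersect:", pct(outcomes["Streets do not intersect"]))
print("streets intersect more than once:", pct(multiple))
```

The rows sum to 43,774 intersections, which is the "almost 44,000" figure quoted earlier.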

The success rate was much lower than that of the combined ArcGIS/Google approach: almost 18% of the intersections could not be associated with lon/lats. However, the Geoclient’s error messages are great. Instead of pulling the wrong lon/lat or nothing at all, it identifies exactly what’s going wrong: in almost 1 in 10 intersections in the reports, the identified streets don’t even intersect.

In many cases, this can be chalked up to a misspelling, generally a missed prefix or a switched-up Avenue, Street, or Place. Take the intersection of 32 Street and Flatbush Ave in Brooklyn: clearly East 32 Street must have been intended. In many other cases, the information must simply be wrong. One intersection, of Ocean Parkway and Seton Place in Brooklyn, is simply impossible.

The “streets intersect twice” phenomenon is interesting, and even approximately solvable. This happens in cases like the intersection of Brighton 3 Street and Neptune Avenue in Brooklyn, where the cross street jogs a short distance. It’s possible to ask the geocoder for the more northerly or southerly intersection; however, the NYPD data is still ambiguous as to which of the two was closer to the collision.
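
A sketch of how the two candidates might be requested. It assumes the intersection endpoint takes a `compassDirection` parameter with values such as `N` and `S`, which is how I remember the Geoclient documentation describing it. The other parameter names are the same assumptions, and `candidate_queries` is a helper made up for this illustration.

```python
from urllib.parse import urlencode


def candidate_queries(street_one, street_two, borough, directions=("N", "S")):
    """Return one query string per compass direction for a doubled intersection."""
    queries = {}
    for direction in directions:
        params = {
            "crossStreetOne": street_one,      # assumed parameter name
            "crossStreetTwo": street_two,      # assumed parameter name
            "borough": borough,
            "compassDirection": direction,     # assumed parameter name
        }
        queries[direction] = urlencode(params)
    return queries


queries = candidate_queries("Brighton 3 Street", "Neptune Avenue", "Brooklyn")
for direction, query in queries.items():
    print(direction, query)
```

Fetching both at least pins the collision down to one of two nearby points, even though the report alone cannot say which.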

The last significant chunk of missed geocoding comes from streets that intersect “more than twice”. This can happen at intersections like Avenue K and Kings Highway, where the layout of one of the streets results in multiple crossings.

Slightly more than 5% of intersections were geocoded but came back with a police precinct that did not match the report. Since this is all intersection-level data, this is to be expected in some cases: intersections can skirt precinct boundaries, and in such cases it would be unclear which precinct the “real” report would come from. However, there were odd cases where the NYPD filed an intersection under the wrong precinct, for example, filing the intersection of Avenue V and Ocean Parkway in Brooklyn in the 60th Precinct instead of the 61st, where it actually is.
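
The precinct check itself is simple. In this sketch the record layout and the `precinct_matches` helper are hypothetical, and it assumes the geocoder may return the precinct as a zero-padded string such as `"061"`, which is why both sides are compared as integers. The first record reflects the Avenue V example above; the second is invented to show a match.

```python
def precinct_matches(reported, geocoded):
    """Compare precincts numerically so '061' and 61 count as the same."""
    return int(str(reported).strip()) == int(str(geocoded).strip())


# Hypothetical record layout: precinct from the NYPD report vs. the geocoder.
records = [
    {"intersection": "Avenue V & Ocean Parkway", "reported": "60", "geocoded": "061"},
    {"intersection": "Example Street & Sample Avenue", "reported": 61, "geocoded": "061"},
]

mismatches = [
    r for r in records if not precinct_matches(r["reported"], r["geocoded"])
]

for r in mismatches:
    print(r["intersection"], "reported", r["reported"], "geocoded", r["geocoded"])
```

A mismatch on its own does not say which side is wrong, so intersections sitting on a precinct boundary still need a look by hand.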

Even if the data quality on the existing reports were perfect, the information would still be hobbled by locating collisions only by intersection. The data is assembled from the state’s MV-104 forms, which do not have to specify a location more precisely than an intersection. Officers fill out these forms on-site, and any mistakes or inaccuracies on the form will be entered into the database without any sort of further checking.

Google was far more liberal in dispensing coordinates. Its geocoder was able to provide coordinates for the vast majority of the intersections Geoclient couldn’t; however, these coordinates are very suspect.

Geoclient vs. Google	# of intersections	%
Geocoded in both	36055	82.37%
Geocoded in both (> 0.01 degree lat/lon difference)	180	0.41%
Geocoded in both (0.0003 to 0.01 degree lat/lon difference)	378	0.86%
Google only	6490	14.83%
Geoclient only	3	0.01%
Neither	1226	2.80%

In cases where both geocoders captured a lon/lat but there was a significant difference, the NYC geocoder was superior. For example, it was able to capture the intersection of East 27 St and 1 Avenue in Manhattan. Google was unable to understand that East 27 St continues even though it is closed off to cars, and erroneously identified the intersection of East 2 St and 1 Ave.

Even fairly minute differences in longitude or latitude signal problems with the Google geocoding — for example, a difference as small as 0.0003 degree indicates that Google mixed up Beach Terrace and Oak Terrace on their intersection across Beekman Ave in the Bronx.
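
The comparison between the two geocoders can be sketched as a simple bucketing of coordinate differences. The thresholds (0.01 and 0.0003 degrees) come from the table above, but taking the larger of the latitude and longitude differences is my assumption about how to measure "difference", and the coordinates below are made up for illustration. For scale, a degree of latitude is roughly 111 km, so 0.0003 degrees is only about 33 meters.

```python
def difference_bucket(geoclient, google, large=0.01, small=0.0003):
    """Classify how far apart two (lat, lon) pairs are, in degrees.

    Uses the larger of the absolute latitude and longitude differences
    (an assumption; other distance measures would work too).
    """
    lat_diff = abs(geoclient[0] - google[0])
    lon_diff = abs(geoclient[1] - google[1])
    worst = max(lat_diff, lon_diff)
    if worst > large:
        return "large"
    if worst > small:
        return "small"
    return "agree"


# Hypothetical coordinates, not taken from the real data set.
geoclient_point = (40.7400, -73.9800)
print(difference_bucket(geoclient_point, (40.7250, -73.9800)))  # far apart
print(difference_bucket(geoclient_point, (40.7405, -73.9800)))  # about a block off
print(difference_bucket(geoclient_point, (40.7401, -73.9800)))  # effectively the same
```

Anything landing in the "large" or "small" bucket is worth checking by hand, since even the small bucket can hide a wrong cross street.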

While the Geoclient geocoder cannot say anything definite about many of the intersections in the NYPD collision dataset, it does a far better job on those it does identify. The intersections that Google geocodes but Geoclient does not fall into two groups: those that are ambiguous because the streets intersect more than once (about 7.5%), and those where Geoclient finds no intersection at all (almost 10%), which generally indicates an NYPD data entry error.

What can be done now that it’s so clear how problematic these reports are?

The NYPD should be pressed to cooperate with DoITT to geocode intersections on the reports they file, and to double-check those that don’t pass through the geocoder. Even if the reports, being state-mandated, can’t be modified, the geocoder is the perfect tool to catch bad data at the time of entry.
If the Geoclient API identifies a problem, the officer who entered problematic information should be asked to clarify. The correction or comment would be entered into the monthly release.
The NYPD could switch to a daily feed of MV-104 reports, and such a check could be worked directly into the filing process.

Accursed Ware

blog of john krauss, hacker, mapper, etc. github.com/talos @recessionporn
