First, let’s recap:
- Chris Whong requests and receives the NYC Taxi & Limousine Commission (TLC) taxi trip data through a Freedom of Information request. (In New York State, this is called FOIL.)
- Chris, along with Andrés Monroy, makes this data available to the public via basic download as well as BitTorrent.
- Vijay Pandurangan looks at the data and notices that it contains two columns of hash values. After an exercise in rainbow tables, he publishes a blog post describing how the two hash value columns can be reversed to identify both the vehicle and the driver for each trip.
- Ars Technica publishes an article entitled “Poorly anonymized logs reveal NYC cab drivers’ detailed whereabouts”.
- Motherboard (Vice) publishes a similar, but perhaps more balanced, story.
- Mark Headd, former Chief Data Officer for Philadelphia, writes a blog post, suggesting that these types of FOIL requests should be more tightly integrated into the open data / open government pipeline.
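For readers unfamiliar with why hashed columns can be reversed, here is a minimal sketch of the general idea. It assumes the identifiers were hashed with unsalted MD5 and that they follow a short, fixed pattern (digit, letter, digit, digit); both the hash choice and the pattern are illustrative assumptions, not a description of the TLC's actual process. When the set of possible inputs is this small, every hash can be precomputed and looked up.

```python
import hashlib
import string
from itertools import product


def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of an ASCII string (assumed hash function)."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup() -> dict:
    """Precompute hash -> identifier for every value in a hypothetical
    digit-letter-digit-digit identifier space (10 * 26 * 10 * 10 values)."""
    table = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        ident = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(ident)] = ident
    return table


lookup = build_lookup()

# Simulate a "hashed" column value, then recover the original identifier.
hashed_value = md5_hex("5X55")  # "5X55" is a made-up example identifier
recovered = lookup[hashed_value]
print(len(lookup), recovered)
```

The point of the sketch is that hashing only hides a value when the space of possible inputs is too large to enumerate; a salted hash or a random surrogate key would not be reversible this way.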
I’m going to play devil’s advocate here for a moment. Let me be clear, the below is not necessarily the stance that I have, and it is absolutely not my opinion in any official capacity. Rather, I seek to push the conversation forward.
Let’s start by asking a basic question: is this data actually sensitive? Taxis in NYC are a highly regulated industry, and as a result, riders should get consistently safe and fair service. The TLC already publishes lists of authorized taxi drivers (with their license numbers) and authorized vehicles. This publication isn’t accidental; companies which operate fleets of taxis use it to determine whether a driver or a vehicle is authorized to be in service for customers. What’s really being highlighted in the attention so far is the mosaic effect.
Applying the mosaic effect with the above-mentioned additional TLC data allows you to know not only where the taxis have traveled, but also who owns the vehicle, who was driving it, and so on. This isn’t inherently a bad thing. For example, the New York Taxi Workers Alliance, the Greater New York Taxi Association, or even vehicle fleet owners could use it to identify drivers who are overworked - or other individual and systemic concerns.
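To make the mosaic effect concrete, here is a small sketch of what combining the two sources might look like. Everything in it is hypothetical: the license numbers, names, field names, and trips are invented for illustration and do not reflect the real layout of either dataset. It joins trip records to a public roster by license number and totals the paid trip time per named driver.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical public roster: license number -> driver name.
roster = {"L001": "Driver A", "L002": "Driver B"}

# Hypothetical trip records, keyed by an already-recovered license number.
trips = [
    {"license": "L001", "pickup": "2013-01-01 08:00", "dropoff": "2013-01-01 08:30"},
    {"license": "L001", "pickup": "2013-01-01 09:00", "dropoff": "2013-01-01 10:00"},
    {"license": "L002", "pickup": "2013-01-01 12:00", "dropoff": "2013-01-01 12:15"},
]


def hours_by_driver(trips, roster):
    """Join trips to the roster and sum paid trip hours per driver name."""
    fmt = "%Y-%m-%d %H:%M"
    totals = defaultdict(float)
    for trip in trips:
        name = roster.get(trip["license"], "unknown")
        start = datetime.strptime(trip["pickup"], fmt)
        end = datetime.strptime(trip["dropoff"], fmt)
        totals[name] += (end - start).total_seconds() / 3600
    return dict(totals)


totals = hours_by_driver(trips, roster)
print(totals)
```

Neither source says much on its own; the join is what turns anonymous trips into a per-person record, and the same join could serve a labor advocate or someone with worse intentions.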
One particular issue which has been identified is the ability to calculate how much a particular taxi driver has earned. I might counter that this, too, is not necessarily a bad thing. It is generally not too difficult to obtain income data, though you might have to pay to obtain it for non-government workers. Even so, there are plenty of other factors which would throw off that number - tip income, gas and other vehicle maintenance expenses, fleet rental charges, etc.
Another concern noted is the potential for identifying where a taxi driver lives. It is important to remember that many drivers operate vehicles out of depots; far fewer are owner-operators. Also, this data represents trips which are paid for by customers. Off-duty use of the vehicles (including trips to and from home, vehicle depots, waiting in line for a fare at the airport, rest stops, etc.) is not included. In an analysis, you might find a pattern of drivers only driving to a certain neighborhood at a certain time of the day. That neighborhood may be near their home or near the taxi depot, but the same pattern could also imply that the driver is breaking the rules - if an on-duty taxi picks you up, it has to take you wherever you want within the five boroughs. Is the risk of finding out a driver’s home neighborhood or depot location enough to justify losing the ability to determine if they might be improperly refusing trips?
But that’s just scratching the surface. There are actually some deeper questions which are worth asking:
- Let’s say a taxi driver decides their privacy has been violated and they have established grounds for a lawsuit. Who do they sue? The TLC? Chris Whong? Both? And who is actually liable? A court might decide the TLC made a reasonable effort to protect driver privacy (technical explanations and alternatives notwithstanding), and it’s really Chris who is liable because he made the data publicly accessible.
- Does this (or a similar situation) open the door to allow government to refuse FOIA/FOIL requests on the basis of the possibility of the mosaic effect? And if so, are there specific criteria that exist to help determine whether the data might contribute to the mosaic effect?
- Can or should the government have the ability to place terms and conditions or license restrictions on information obtained through FOIA/FOIL requests? For example, “you may publish works based upon this data but may not publish the source data”. And, by extension, does the government have the right to file a DMCA takedown request against Chris Whong?
- Are government staff who handle FOIA/FOIL requests properly equipped to handle these types of requests and their mosaic-effect implications? Should every request for raw data also be routed through an IT Security team and/or open data specialists?
- Should data that the government makes publicly available in single-record form be held back in raw, bulk form, on the basis that bulk access lowers the barrier to use and could serve nefarious purposes?
- Had the TLC simply left the data un-anonymized, would any of this conversation have taken place?
Calling attention to the technique used to anonymize the data is valuable. But this is not a straightforward issue. Quick reactions and missteps - made without thorough exploration of the consequences - can have a significant impact on the future of both proactive (open data) and reactive (FOIA/FOIL) government disclosure. Let’s keep the dialogue going and find the right balance.