First, let’s recap:
- Chris Whong asks for and receives the NYC Taxi & Limousine Commission (TLC) taxi trip data via a Freedom of Information request (in New York State, this is known as FOIL, after the state's Freedom of Information Law).
- Chris, along with Andrés Monroy, makes this data available to the public via direct download as well as BitTorrent.
- Vijay Pandurangan looks at the data, notices there are two columns of hash values. After an exercise in rainbow tables, he publishes a blog post describing how the two hash value columns can be reverse engineered to identify both the vehicle and the driver for each trip.
- Ars Technica publishes an article entitled "Poorly anonymized logs reveal NYC cab drivers' detailed whereabouts".
- Motherboard (Vice) publishes a similar, but perhaps more balanced, story.
- Mark Headd, former Chief Data Officer for Philadelphia, writes a blog post, suggesting that these types of FOIL requests should be more tightly integrated into the open data / open government pipeline.
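For context on the technique Vijay described: an unsalted hash over a small, structured identifier space can be inverted by simply enumerating every possible input. The sketch below is illustrative only - it assumes unsalted MD5 and one of the short medallion number patterns, and is not a reconstruction of his actual code.

```python
import hashlib
import string
from itertools import product

def build_lookup(pattern_chars):
    """Map hash -> plaintext for every string matching the pattern."""
    table = {}
    for combo in product(*pattern_chars):
        plaintext = "".join(combo)
        table[hashlib.md5(plaintext.encode()).hexdigest()] = plaintext
    return table

digits = string.digits
letters = string.ascii_uppercase

# One medallion pattern - digit, letter, digit, digit - yields only
# 10 * 26 * 10 * 10 = 26,000 possible values, trivial to enumerate.
lookup = build_lookup([digits, letters, digits, digits])

# "De-anonymizing" a hashed medallion is now a dictionary lookup:
hashed = hashlib.md5(b"5X55").hexdigest()
print(lookup[hashed])  # -> 5X55
```

The point is not the specific hash function: any deterministic, unsalted transform of a keyspace this small can be reversed by brute force in seconds on commodity hardware.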
I’m going to play devil’s advocate here for a moment. Let me be clear: what follows is not necessarily my own stance, and it is absolutely not my opinion in any official capacity. Rather, I seek to push the conversation forward.
Let’s start by asking a basic question: is this data actually sensitive? Taxis in NYC are a highly regulated industry, and as a result, riders should get consistently safe and fair service. The TLC already publishes lists of authorized taxi drivers (with their license numbers) and authorized vehicles. This publication isn’t accidental; companies which operate fleets of taxis use it to determine whether a driver or a vehicle is authorized to be in service for customers. What’s really being highlighted in the attention so far is the mosaic effect.
Applying the mosaic effect with the above mentioned additional TLC data allows you to know not only where the taxis have traveled, but who owns the vehicle, who was driving it, and so on. This isn’t inherently a bad thing. For example, the New York Taxi Workers Alliance or the Greater New York Taxi Association, or even vehicle fleet owners could use it to identify drivers who are overworked - or other individual and systemic concerns.
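To make the mosaic effect concrete, here is a toy sketch with entirely made-up values and field names: neither the trip records nor the already-public vehicle roster identifies a person by itself, but joining them on the medallion does.

```python
# Hypothetical trip records (medallion recovered from its hash):
trips = [
    {"medallion": "5X55", "pickup": "Midtown", "fare": 12.5},
    {"medallion": "7A12", "pickup": "JFK", "fare": 52.0},
]

# Hypothetical roster, standing in for the TLC's already-public
# list of authorized vehicles and their owners:
roster = {"5X55": "Fleet Co. A", "7A12": "J. Driver"}

# The "mosaic": a simple join links each trip to an owner.
linked = [{**trip, "owner": roster[trip["medallion"]]} for trip in trips]
```

Each dataset was releasable on its own; it is the join key surviving across releases that creates the new disclosure.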
One particular issue which has been identified is the ability to calculate how much a particular taxi driver has earned. I might counter that this, too, is not necessarily a bad thing. It is generally not too difficult to obtain income data, though you might have to pay to obtain it for non-government workers. Even so, there are plenty of other factors which would throw off that number - tip income, gas and other vehicle maintenance expenses, fleet rental charges, etc.
Another concern noted is the potential for identifying where a taxi driver lives. It is important to remember that many drivers operate vehicles out of depots; far fewer are owner-operators. Also, this data represents trips which are paid for by customers. Off-duty use of the vehicles (including trips to and from home, vehicle depots, waiting in line for a fare at the airport, rest stops, etc) is not included. In an analysis, you might find a pattern of drivers only driving to a certain neighborhood at a certain time of the day. That neighborhood may be near their home or near the taxi depot, but that same outcome also implies that this driver might be breaking the rules - if an on-duty taxi picks you up, it has to take you wherever you want within the five boroughs. Is the risk of finding out a driver’s home neighborhood or depot location enough to justify losing the ability to determine if they might be improperly refusing trips?
But that’s just scratching the surface. There are actually some deeper questions which are worth asking:
- Let’s say a taxi driver decides their privacy has been violated and that they have grounds for a lawsuit. Who do they sue? The TLC? Chris Whong? Both? And who is actually liable? A court might decide the TLC made a reasonable effort to protect driver privacy (technical explanations and alternatives notwithstanding), and that it’s really Chris who is liable because he made the data publicly accessible.
- Does this (or a similar situation) open the door to allow government to refuse FOIA/FOIL requests on the basis of the possibility of the mosaic effect? And if so, are there specific criteria that exist to help determine whether the data might contribute to the mosaic effect?
- Can or should the government have the ability to place terms and conditions or license restrictions on information obtained through FOIA/FOIL requests? For example, “you may publish works based upon this data but may not publish the source data”. And, by extension, does the government have the right to file a DMCA takedown request against Chris Whong?
- Are government staff who handle FOIA/FOIL requests properly equipped to handle these types of requests and their mosaic-effect implications? Should every request for raw data also be routed through an IT Security team and/or open data specialists?
- Should data that the government already makes publicly available in single-record form be withheld in raw, bulk form, on the basis that bulk access lowers the barrier to nefarious use?
- Had the TLC simply released the data without anonymizing it at all, would any of this conversation have taken place?
Calling attention to the technique used to anonymize the data is valuable. But this is not a straightforward issue. Quick reactions and missteps - made without thorough exploration of the consequences - can have a significant impact on the future of both proactive (open data) and reactive (FOIA/FOIL) government disclosure. Let’s keep the dialogue going and find the right balance.
(note: this is limited to my experience managing the NYC OpenData portal. Other municipalities might have different factors to consider.)
Yesterday’s Wall Street Journal featured a great article about civic communities formed around open government data in Chicago. And while Tom Schenk Jr. rightly clarifies the key takeaway, I took slight offense to the assertion that Chicago’s data portal has the most data of any US city. The conversation that ensued is the basis for this blog post.
The Displayed Numbers are Inaccurate
First: the number which Socrata displays at the bottom of the default list (usually on a data portal’s homepage) is not an accurate count; it represents the sum of all the items across all the view types.
(As a side note, don’t even bother with the public analytics page. Mine currently says I have 3,682 datasets, which doesn’t align with the default list count either.)
Second, let’s define what we mean by dataset. Some of Socrata’s view types are merely different representations of an underlying dataset, and they typically include user-generated content. You may have seen the term “unique representations” in our press releases, and that number does tend to align with the displayed count. But when we publicly report our actual number of datasets, we define datasets (using Socrata’s types) as “Datasets”, “Maps”, “External Datasets”, and “Files and Documents” that we (The City of New York) own and have specifically added to the portal. In theory, we could sum up these numbers, but unfortunately this theory doesn’t align with reality.
The culprit here is “Maps”. It really represents two different things: map data (KML, shapefiles, etc.) which we have published, AND user-created views of tabular data which can be geocoded (this example is currently in the top-ten list of maps on my site).
Our Current Approach
Quite some time ago, Socrata set up a special dataset which appears in our catalog - the dataset of datasets. (I’ll leave it to the reader to decide whether that should be officially counted as a dataset or not.) We export that dataset as a file, load it up in MS Excel, filter for the types we want (see the list above), and then we filter the owner column for users whom we have authorized to publish data to the portal. An additional bit of confusion (which frankly we are still trying to iron out) is that some of our datasets are owned by Socrata employees (for example, the dataset of datasets), who helped publish our initial data and have helped us over the past two years as we diagnose bugs, etc.
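The manual Excel workflow above could be sketched in a few lines of Python. To be clear, the column names (“type”, “owner”) and the authorized-owner values below are hypothetical placeholders, not the actual schema of the exported dataset of datasets.

```python
import csv

# Socrata view types we count as actual datasets (from the list above):
COUNTED_TYPES = {"Datasets", "Maps", "External Datasets", "Files and Documents"}

# Placeholder names for accounts authorized to publish to the portal:
AUTHORIZED_OWNERS = {"NYC OpenData", "DoITT Publisher"}

def count_official_datasets(path):
    """Count rows in the exported 'dataset of datasets' CSV that are
    both a counted view type and owned by an authorized publisher."""
    with open(path, newline="") as f:
        return sum(
            1 for row in csv.DictReader(f)
            if row["type"] in COUNTED_TYPES
            and row["owner"] in AUTHORIZED_OWNERS
        )
```

Even a small script like this would at least make the counting rules explicit and repeatable, instead of living in ad hoc Excel filters.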
How can this work better?
- The dataset of datasets should have a simple flag in it which we can use to filter city-owned vs user-generated.
- The visual presentation of the dataset catalog should distinguish first-party vs third-party items. User-created filters (which are a great community feature) should be distinguished from officially-created filters.
- The dataset catalog filtering mechanism should include the ability to select for official items vs unofficial items.
Recently a colleague from outside of government said to me:
If only [government agency x] had shown it to us before launching, we could have given some feedback to make it better.
I conceptually agree that such an exercise would have been valuable, but the challenge I see with a statement like this is that it’s - to a limited extent - the equivalent of insider trading. My colleague may not have intended his statement this way. As a government of/by/for the people, though, it’s important to avoid playing favorites (setting aside party politics and so on for a moment). Therefore, allowing one person or group a “sneak peek” while excluding others isn’t reasonable or acceptable.
The underlying premise of my colleague’s statement - get something into the hands of outside stakeholders so they can give feedback - is a very valid one. Sadly, it bumps up against the systemic aversion to risk which exists broadly across government. (There is a strong argument to be made that one of government’s basic functions is to reduce risk for all of society, hence organizations like the police, the SEC, and the FAA, but that’s another discussion.) This risk can be broken down into two types: reputation and liability.
Avoiding risks to reputation generally means avoiding the proverbial black eye - when newspaper articles argue that government should have done something differently, or when the public complains about poor use of the taxes they have paid. In my view, this is also known as the “get it done right the first time” factor.
Liability is a different beast - direct harm has come to a constituent as a result of an action or inaction for which the government theoretically bears some responsibility, and the repercussions can be financial restitution (more tax money spent), dramatic policy changes, invalidation of laws, criminal liability, and/or other things.
A lot of time - and therefore, money - is spent addressing risk, and a significant percentage of the bureaucracy is about risk management. (I had a recent example of this, where a small mistake resulted in the implementation of a governance process just in case it should ever happen again, despite there being only a tiny risk to reputation.) Aversion to risk isn’t just a government thing - it’s a human thing. We teach our children to look both ways before crossing the street; publicly-traded corporations rarely intentionally make decisions which will cause their stock to lose value. We value feeling safe.
The question, therefore, is how can a government explore new capacities in an intelligent way, without exposing itself to additional risk? How can it do so in a way which avoids bias? When it comes to public-facing technology, the answer - at least to me - seems relatively simple: set the correct expectations with the stakeholders. And of course, “stakeholders” doesn’t mean just the public. It also means the political leadership, the executive management, and so on.
I did exactly this when I helped launch the NYC Developer Portal beta a few months ago. Internally, we set expectations that we would launch the site without any fanfare (no press release, no big event, etc). Rather, we would do a small public launch where we could invite people to provide feedback about what they saw. When we did the quiet launch, we also carefully set public expectations: this is a work in progress (it’s not perfect and shouldn’t be heavily relied upon), we want your feedback (the public gets to help improve it), more things to come (some planned features aren’t ready yet). Both internally and externally, we stated that we planned to have a bigger “official” release at some undefined point in the future. We branded it as a “beta” effort, which it still remains today.
More important, though, is the need to establish a “beta” program - a systemic plan to set internal and public expectations correctly while engaging public stakeholders for feedback. It starts with well-defined criteria for what can enter the program. Clearly, some projects are poor candidates, because the risks are significant. But many projects, even ones we often think are critical, could be potential candidates for a “beta” release. Once the criteria are defined, it becomes a lot easier to convince internal stakeholders that the rewards outweigh the risks.
A formalized “beta” program allows us to avoid playing favorites to one individual or group over another. It allows us to get public stakeholder feedback in a constructive manner. Finally, it allows us to move more freely within a highly risk-averse culture.
This past Wednesday, I had the opportunity to share with NYC’s civic technology community a project I’ve been working on for quite a while. We launched the nyc.gov/developer beta platform, which we want to grow to be a community for software developers who work with City APIs to develop apps and other solutions. The slides from the presentation are embedded below.