It’s great to see an empirical analysis of U.S. state open data programs! And of course, great that “We are [tied for] #1!!!” can be chanted from the rooftops all across the Empire State.

nylovestech:

New York was among the six top-scoring states for Open Data, and the only state to establish an open data policy by means of an executive order. Learn more about how New York ranks against the rest of the country at http://bit.ly/1uMCYO7.

Contemplating the Mosaic Effect of Free Taxi Information

First, let’s recap:

I’m going to play devil’s advocate here for a moment. Let me be clear, the below is not necessarily the stance that I have, and it is absolutely not my opinion in any official capacity. Rather, I seek to push the conversation forward.

Let’s start by asking a basic question: is this data actually sensitive? Taxis in NYC are a highly regulated industry, and as a result, riders should get consistently safe and fair service. The TLC already publishes lists of authorized taxi drivers (with their license numbers) and authorized vehicles. This publication isn’t accidental; companies which operate fleets of taxis use it to determine whether a driver or a vehicle is authorized to be in service for customers. What’s really being highlighted in the attention so far is the mosaic effect.

Applying the mosaic effect with the above mentioned additional TLC data allows you to know not only where the taxis have traveled, but who owns the vehicle, who was driving it, and so on. This isn’t inherently a bad thing. For example, the New York Taxi Workers Alliance or the Greater New York Taxi Association, or even vehicle fleet owners could use it to identify drivers who are overworked - or other individual and systemic concerns.

One particular issue which has been identified is the ability to calculate how much a particular taxi driver has earned. I might counter that this, too, is not necessarily a bad thing. It is generally not too difficult to obtain income data, though you might have to pay to obtain it for non-government workers. Even so, there are plenty of other factors which would throw off that number - tip income, gas and other vehicle maintenance expenses, fleet rental charges, etc.

Another concern noted is the potential for identifying where a taxi driver lives. It is important to remember that many drivers operate vehicles out of depots; far fewer are owner-operators. Also, this data represents trips which are paid for by customers. Off-duty use of the vehicles (including trips to and from home, vehicle depots, waiting in line for a fare at the airport, rest stops, etc) is not included. In an analysis, you might find a pattern of drivers only driving to a certain neighborhood at a certain time of the day. That neighborhood may be near their home or near the taxi depot, but that same outcome also implies that this driver might be breaking the rules - if an on-duty taxi picks you up, it has to take you wherever you want within the five boroughs. Is the risk of finding out a driver’s home neighborhood or depot location enough to justify losing the ability to determine if they might be improperly refusing trips?

But that’s just scratching the surface. There are actually some deeper questions which are worth asking:

  • Let’s say a taxi driver decides their privacy has been violated and they have established grounds for a lawsuit. Who do they sue? The TLC? Chris Whong? Both? And who is actually liable? A court might decide the TLC made a reasonable effort to protect driver privacy (technical explanations and alternatives notwithstanding), and it’s really Chris who is liable because he made the data publicly accessible.
  • Does this (or a similar situation) open the door to allow government to refuse FOIA/FOIL requests on the basis of the possibility of the mosaic effect? And if so, are there specific criteria that exist to help determine whether the data might contribute to the mosaic effect?
  • Can or should the government have the ability to place terms and conditions or license restrictions on information obtained through FOIA/FOIL requests? For example, “you may publish works based upon this data but may not publish the source data”. And, by extension, does the government have the right to file a DMCA takedown request against Chris Whong?
  • Are government staff who handle FOIA/FOIL requests properly equipped to handle these types of requests and their mosaic-effect implications? Should every request for raw data also be routed through an IT Security team and/or open data specialists?
  • Should data that the government makes publicly available in single-record form be held back in raw, bulk form because it lowers the barrier for use, on the basis that it could be used for nefarious purposes?
  • Had the TLC simply left the data de-anonymized, would any of this conversation have taken place?

Calling attention to the technique used to anonymize the data is valuable. But this is not a straightforward issue. Quick reactions and missteps - without thorough exploration of the consequences - can have a significant impact on the future of both proactive (open data) and reactive (FOIA/FOIL) government disclosure. Let’s keep the dialogue going and find the right balance.
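
To see why the anonymization technique itself deserves scrutiny: when identifiers come from a small, known space, an unsalted hash is trivially reversible by exhaustive search. Here is a minimal sketch of that attack - the six-digit license numbers and the use of MD5 are purely hypothetical illustrations, not necessarily the TLC’s actual scheme:

```python
import hashlib

def build_rainbow_table(id_space):
    """Map each candidate identifier's hash back to the identifier itself."""
    return {hashlib.md5(i.encode()).hexdigest(): i for i in id_space}

# Hypothetical 6-digit license numbers: only 1,000,000 candidates,
# so hashing every one of them is trivial on commodity hardware.
candidates = (f"{n:06d}" for n in range(1000000))
table = build_rainbow_table(candidates)

# An "anonymized" value from a released dataset can now be reversed
# with a single dictionary lookup.
anonymized = hashlib.md5(b"004521").hexdigest()
print(table[anonymized])  # recovers "004521"
```

The takeaway is that hashing only hides an identifier when the space of possible inputs is too large to enumerate; a salt or keyed hash changes the picture considerably.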

Join me at the BetaNYC meetup on Wednesday 4/30 to help make an awesome visualization and get some new insights!

Also, see the source code for the above visualization here.

nylovestech:

Data.ny.gov has more than 100 transportation data items. Check out the NYS Thruway’s trip data and share your data visualizations with us using #OpenNY: http://bit.ly/1rBeilz

noneck:

Chris M. Whong, BetaNYC’s co-Captain, visualized the 1,100+ open datasets made available by New York City. This is a force-directed graph generated with the charting library d3.js. NYC’s open data portal runs on the Socrata platform*, and this visualization was created using the “dataset of datasets” and the Socrata Open Data API (SODA).

Chris writes, “Why? The point is to show the scale of the portal, and to illustrate which datasets have user-created views. In the future, it would be great to dynamically size the circles by the popularity of the datasets.”

* Chris is employed by Socrata
** GitHub Link
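
The data shaping behind a visualization like this is simple to sketch: each catalog entry becomes a node, and each user-created view links back to its parent dataset, producing the nodes-and-links structure that d3’s force layout consumes. The field names below are assumptions for illustration, not the actual SODA catalog columns:

```python
# Hypothetical catalog rows, shaped like a "dataset of datasets" export.
catalog = [
    {"id": "abcd-1234", "name": "311 Requests", "parent_id": None},
    {"id": "efgh-5678", "name": "311 Map View", "parent_id": "abcd-1234"},
    {"id": "ijkl-9012", "name": "Film Permits", "parent_id": None},
]

# d3's force layout expects {nodes, links}; build one where each
# user-created view is linked to the dataset it was derived from.
index = {row["id"]: i for i, row in enumerate(catalog)}
graph = {
    "nodes": [{"name": row["name"]} for row in catalog],
    "links": [
        {"source": index[row["parent_id"]], "target": index[row["id"]]}
        for row in catalog
        if row["parent_id"] is not None
    ],
}
print(graph["links"])  # [{'source': 0, 'target': 1}]
```

Serialized to JSON, a structure like this can be handed directly to d3’s force simulation on the client side.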

(via codeforamerica)

Accurately Counting Socrata Datasets

(note: this is limited to my experience managing the NYC OpenData portal. Other municipalities might have different factors to consider.)

Yesterday’s Wall Street Journal featured a great article about civic communities formed around open government data in Chicago. And while Tom Schenk Jr. rightly clarifies the key takeaway, I took slight offense to the assertion that Chicago’s data portal has the most data of any US city. The conversation that ensued is the basis for this blog post.

The Displayed Numbers are Inaccurate

First, the number which Socrata displays at the bottom of the default list (usually on a data portal’s homepage) is not an accurate count.

This number represents the sum of all the items in all the view types.

(As a side note, don’t even bother with the public analytics page. Mine currently says I have 3,682 datasets, which doesn’t even align with the default list count.)

Second, let’s define what we mean by dataset. Some of Socrata’s view types are merely different representations of an underlying dataset, and they typically include user-generated content. You may have seen the term “unique representations” in our press releases, and that number does tend to align with the displayed count. But when we publicly report our actual number of datasets, we define datasets (using Socrata’s types) as “Datasets”, “Maps”, “External Datasets”, and “Files and Documents” that we (The City of New York) own and have specifically added to the portal. In theory, we could sum up these numbers, but unfortunately this theory doesn’t align with reality.

The culprit here is “Maps”. It really represents two different things: map data (KML, shapefiles, etc.) which we have published, AND user-created views of tabular data which can be geocoded (this example is currently in the top-ten list of maps on my site).

Our Current Approach

Quite some time ago, Socrata set up a special dataset which appears in our catalog - the dataset of datasets. (I’ll leave it to the reader to decide whether that should be officially counted as a dataset or not.) We export that dataset as a file, load it up in MS Excel, filter for the types we want (see the list above), and then we filter the owner column for users whom we have authorized to publish data to the portal. An additional bit of confusion (which frankly we are still trying to iron out) is that some of our datasets are owned by Socrata employees (for example, the dataset of datasets), who helped publish our initial data and have helped us over the past two years as we diagnose bugs, etc.
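
The manual Excel workflow above could equally be scripted. A sketch of the same two filters - view type, then authorized owner - over a simplified export (the column names and owner names here are made up for illustration; the type list matches the one above):

```python
import csv
import io

# Socrata view types we count as real datasets (per the list above),
# and the users we have authorized to publish data to the portal.
COUNTED_TYPES = {"Datasets", "Maps", "External Datasets", "Files and Documents"}
AUTHORIZED_OWNERS = {"NYC OpenData", "DoITT Publisher"}  # hypothetical names

# A tiny stand-in for the exported "dataset of datasets" file.
sample_export = """type,owner,name
Datasets,NYC OpenData,311 Service Requests
Maps,community_user42,User-made heat map
Maps,DoITT Publisher,Street Centerlines
Charts,NYC OpenData,Popular chart view
"""

reader = csv.DictReader(io.StringIO(sample_export))
official = [
    row for row in reader
    if row["type"] in COUNTED_TYPES and row["owner"] in AUTHORIZED_OWNERS
]
print(len(official))  # 2: the user-made map and the chart view are excluded
```

The awkward part remains exactly what the prose describes: the owner whitelist has to account for Socrata employees who published data on our behalf, which no simple filter fully captures.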

The Future

How can this work better?

  1. The dataset of datasets should have a simple flag in it which we can use to filter city-owned vs user-generated.
  2. The visual presentation of the dataset catalog should distinguish first-party vs third-party items. User-created filters (which are a great community feature) should be distinguished from officially-created filters.
  3. The dataset catalog filtering mechanism should include the ability to select for official items vs unofficial items.

Beta Culture & Government

Recently a colleague from outside of government said to me:

If only [government agency x] had shown it to us before launching, we could have given some feedback to make it better.

I conceptually agree that such an exercise would have been valuable, but the challenge I see with a statement like this is that it’s - to a limited extent - the equivalent of insider trading. My colleague may not have intended his statement this way. As a government of/by/for the people, though, it’s important to avoid playing favorites (setting aside party politics and so on for a moment). Therefore, allowing one person or group a “sneak peek” while excluding others isn’t reasonable or acceptable.

The underlying premise of my colleague’s statement - get something into the hands of outside stakeholders so they can give feedback - is a very valid one. Sadly, it bumps up against the systemic aversion to risk which exists broadly across government. (There is a strong argument to be made that one of government’s basic functions is to reduce risk for all of society, hence organizations like the police, the SEC, and the FAA, but that’s another discussion.) This risk can be broken down into two types: reputation and liability.

Avoiding risks to reputation generally means avoiding the proverbial black eye - when newspaper articles argue that government should have done something differently, or when the public complains about poor use of the taxes they have paid. In my view, this is also known as the “get it done right the first time” factor.

Liability is a different beast - direct harm has come to a constituent as a result of an action or inaction for which the government theoretically bears some responsibility, and the repercussions can be financial restitution (more tax money spent), dramatic policy changes, invalidation of laws, criminal liability, and/or other things.

A lot of time - and therefore, money - is spent addressing risk, and a significant percentage of the bureaucracy is about risk management. (I had a recent example of this, where a small mistake resulted in the implementation of a governance process just in case it should ever happen again, despite there being only a tiny risk to reputation.) Aversion to risk isn’t just a government thing - it’s a human thing. We teach our children to look both ways before crossing the street; publicly-traded corporations rarely intentionally make decisions which will cause their stock to lose value. We value feeling safe.

The question, therefore, is how can a government explore new capacities in an intelligent way, without exposing itself to additional risk? How can it do so in a way which avoids bias? When it comes to public-facing technology, the answer - at least to me - seems relatively simple: set the correct expectations with the stakeholders. And of course, “stakeholders” doesn’t mean just the public. It also means the political leadership, the executive management, and so on.

I did exactly this when I helped launch the NYC Developer Portal beta a few months ago. Internally, we set expectations that we would launch the site without any fanfare (no press release, no big event, etc). Rather, we would do a small public launch where we could invite people to provide feedback about what they saw. When we did the quiet launch, we also carefully set public expectations: this is a work in progress (it’s not perfect and shouldn’t be heavily relied upon), we want your feedback (the public gets to help improve it), more things to come (some planned features aren’t ready yet). Both internally and externally, we stated that we planned to have a bigger “official” release at some undefined point in the future. We branded it as a “beta” effort, which it still remains today.

More important, though, is the need to establish a “beta” program - a systemic plan to set internal and public expectations correctly while engaging public stakeholders for feedback. It starts with well-defined criteria for what can enter the program. Clearly, some projects are poor candidates, because the risks are significant. But many projects, even ones we often think are critical, could be potential candidates for a “beta” release. Once the criteria are defined, it becomes a lot easier to convince internal stakeholders that the rewards outweigh the risks.

A formalized “beta” program allows us to avoid playing favorites to one individual or group over another. It allows us to get public stakeholder feedback in a constructive manner. Finally, it allows us to move more freely within a highly risk-averse culture.

Launching the NYC Developer Portal Beta

This past Wednesday, I had the opportunity to share with NYC’s civic technology community a project I’ve been working on for quite a while. We launched the nyc.gov/developer beta platform, which we want to grow to be a community for software developers who work with City APIs to develop apps and other solutions. The slides from the presentation are embedded below.

Read More

Via nycopendata:

Check out this visualization designed and created by Eric Schles and Thomas Levine. Using MTA turnstile open data and daily weather observations from NOAA, they show the impact that superstorms have on subway ridership.

What’s most curious about the bottom graph is the web of diagonal lines that travel from 1,000+ entries down to zero over several days after the subway system was shut down during superstorm Sandy. Is that caused by missing data, or is some other factor responsible?
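
One place to look for an answer: as I understand it, MTA turnstile data reports cumulative register values every few hours, so per-period entries have to be computed as differences between consecutive readings, and counter resets or gaps in reporting can produce strange artifacts in a derived chart. A sketch of that differencing step, with made-up readings:

```python
# Cumulative entry-register readings from one hypothetical turnstile,
# taken every 4 hours. The register only ever counts up -- unless it
# resets, which shows up as a negative difference.
readings = [105200, 105950, 106800, 106810, 240, 995]

entries = []
for prev, curr in zip(readings, readings[1:]):
    diff = curr - prev
    # A negative diff means the counter reset (or the feed has a gap);
    # naive differencing would otherwise produce a huge negative spike.
    entries.append(diff if diff >= 0 else None)

print(entries)  # [750, 850, 10, None, 755]
```

If the visualization interpolated across resets or missing reports instead of discarding them, that could plausibly draw the kind of diagonal lines seen in the graph - though that is speculation without seeing the source data.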

This is a good starting point for SODA2.

flatironschool:

The following is a guest post by Stephen Chen and originally appeared on his blog. Stephen is currently a student at The Flatiron School. You can learn more about him here, or follow him on Twitter here.

NYC provides a ton of data from a variety of sources to the public to use for…

Wish I’d thought of this.

(via windows95tips)