Chris M Whong, BetaNYC’s co-Captain, visualized the 1100+ open datasets made available by New York City. This is a force-directed graph generated with the charting library d3.js. NYC’s open data portal runs on the Socrata platform*, and this visualization was created using the “dataset of datasets” and the Socrata Open Data API (SODA).

Chris writes: “Why? The point is to show the scale of the portal, and to illustrate which datasets have user-created views. In the future, it would be great to dynamically size the circles by the popularity of the datasets.”

* Chris is employed by Socrata
** GitHub Link

(via codeforamerica)

Accurately Counting Socrata Datasets

(note: this is limited to my experience managing the NYC OpenData portal. Other municipalities might have different factors to consider.)

Yesterday’s Wall Street Journal featured a great article about civic communities formed around open government data in Chicago. And while Tom Schenk Jr. rightly clarifies the key takeaway, I took slight offense at the assertion that Chicago’s data portal has the most data of any US city. The conversation that ensued is the basis for this blog post.

The Displayed Numbers are Inaccurate

First: the number which Socrata displays at the bottom of the default list (usually on a data portal’s homepage) is not an accurate count.

This number represents the sum of all the items in all the view types:


(As a side note, don’t even bother with the public analytics page. Mine currently says I have 3,682 datasets, which doesn’t even align with the number on the default list count.)

Second, let’s define what we mean by dataset. Some of Socrata’s view types are merely different representations of an underlying dataset, and they typically include user-generated content. You may have seen the term “unique representations” in our press releases, and that number does tend to align with the displayed count. But when we publicly report our actual number of datasets, we define datasets (using Socrata’s types) as “Datasets”, “Maps”, “External Datasets”, and “Files and Documents” that we (The City of New York) own and have specifically added to the portal. In theory, we could sum up these numbers, but unfortunately this theory doesn’t align with reality.

The culprit here is “Maps”. It really represents two different things: map data (KML, shapefiles, etc.) which we have published, AND user-created views of tabular data which can be geocoded (this example is currently in the top-ten list of maps on my site).

Our Current Approach

Quite some time ago, Socrata set up a special dataset which appears in our catalog - the dataset of datasets. (I’ll leave it to the reader to decide whether that should be officially counted as a dataset or not.) We export that dataset as a file, load it up in MS Excel, filter for the types we want (see the list above), and then we filter the owner column for users whom we have authorized to publish data to the portal. An additional bit of confusion (which frankly we are still trying to iron out) is that some of our datasets are owned by Socrata employees (for example, the dataset of datasets), who helped publish our initial data and have helped us over the past two years as we diagnose bugs, etc.
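
As a rough sketch, the same filtering could be done in code instead of Excel. This assumes the export is a CSV with columns named Type and Owner; the real column names, type labels, and account names may differ, and every row in the sample below is made up for illustration.

```python
import csv
import io

# Hypothetical type labels; the actual export may spell these differently.
OFFICIAL_TYPES = {"Dataset", "Map", "External Dataset", "File or Document"}


def count_official(rows, authorized_owners, types=OFFICIAL_TYPES):
    """Count rows whose type is one we report and whose owner is authorized."""
    return sum(
        1
        for row in rows
        if row["Type"] in types and row["Owner"] in authorized_owners
    )


# Made-up sample standing in for the exported "dataset of datasets".
SAMPLE = """Name,Type,Owner
Street Trees,Dataset,parks_publisher
Borough Boundaries,Map,planning_publisher
My Tree Map,Map,some_resident
Trees Filtered,Filter,some_resident
Budget PDF,File or Document,omb_publisher
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Hypothetical list of accounts authorized to publish to the portal.
authorized = {"parks_publisher", "planning_publisher", "omb_publisher"}

official_count = count_official(rows, authorized)
print(official_count)
```

Note that the resident's map is excluded by the owner filter, not the type filter, which is exactly the ambiguity with “Maps” described above.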

The Future

How can this work better?

  1. The dataset of datasets should have a simple flag in it which we can use to filter city-owned vs user-generated.
  2. The visual presentation of the dataset catalog should distinguish first-party vs third-party items. User-created filters (which are a great community feature) should be distinguished from officially-created filters.
  3. The dataset catalog filtering mechanism should include the ability to select for official items vs unofficial items.

Beta Culture & Government

Recently a colleague from outside of government said to me:

If only [government agency x] had shown it to us before launching, we could have given some feedback to make it better.

I conceptually agree that such an exercise would have been valuable, but the challenge I see with a statement like this is that it’s - to a limited extent - the equivalent of insider trading. My colleague may not have intended his statement this way. As a government of/by/for the people, though, it’s important to avoid playing favorites (setting aside party politics and so on for a moment). Therefore, allowing one person or group a “sneak peek” while excluding others isn’t reasonable or acceptable.

The underlying premise of my colleague’s statement - get something into the hands of outside stakeholders so they can give feedback - is a very valid one. Sadly, it bumps up against the systemic aversion to risk which exists broadly across government. (There is a strong argument to be made that one of government’s basic functions is to reduce risk for all of society, hence organizations like the police, the SEC, and the FAA, but that’s another discussion.) This risk can be broken down into two types: reputation and liability.

Avoiding risks to reputation generally means avoiding the proverbial black eye - when newspaper articles argue that government should have done something differently, or when the public complains about poor use of the taxes they have paid. In my view, this is also known as the “get it done right the first time” factor.

Liability is a different beast - direct harm has come to a constituent as a result of an action or inaction for which the government theoretically bears some responsibility, and the repercussions can be financial restitution (more tax money spent), dramatic policy changes, invalidation of laws, criminal liability, and/or other things.

A lot of time - and therefore, money - is spent addressing risk, and a significant percentage of the bureaucracy is about risk management. (I had a recent example of this, where a small mistake resulted in the implementation of a governance process just in case it should ever happen again, despite there being only a tiny risk to reputation.) Aversion to risk isn’t just a government thing - it’s a human thing. We teach our children to look both ways before crossing the street; publicly-traded corporations rarely intentionally make decisions which will cause their stock to lose value. We value feeling safe.

The question, therefore, is how can a government explore new capacities in an intelligent way, without exposing itself to additional risk? How can it do so in a way which avoids bias? When it comes to public-facing technology, the answer - at least to me - seems relatively simple: set the correct expectations with the stakeholders. And of course, “stakeholders” doesn’t mean just the public. It also means the political leadership, the executive management, and so on.

I did exactly this when I helped launch the NYC Developer Portal beta a few months ago. Internally, we set expectations that we would launch the site without any fanfare (no press release, no big event, etc). Rather, we would do a small public launch where we could invite people to provide feedback about what they saw. When we did the quiet launch, we also carefully set public expectations: this is a work in progress (it’s not perfect and shouldn’t be heavily relied upon), we want your feedback (the public gets to help improve it), more things to come (some planned features aren’t ready yet). Both internally and externally, we stated that we planned to have a bigger “official” release at some undefined point in the future. We branded it as a “beta” effort, which it still remains today.

More important, though, is the need to establish a “beta” program - a systemic plan to set internal and public expectations correctly while engaging public stakeholders for feedback. It starts with well-defined criteria for what can enter the program. Clearly, some projects are poor candidates, because the risks are significant. But many projects, even ones we often think are critical, could be potential candidates for a “beta” release. Once the criteria are defined, it becomes a lot easier to convince internal stakeholders that the rewards outweigh the risks.

A formalized “beta” program allows us to avoid playing favorites to one individual or group over another. It allows us to get public stakeholder feedback in a constructive manner. Finally, it allows us to move more freely within a highly risk-averse culture.

Launching the NYC Developer Portal Beta

This past Wednesday, I had the opportunity to share with NYC’s civic technology community a project I’ve been working on for quite a while. We launched the beta platform, which we want to grow to be a community for software developers who work with City APIs to develop apps and other solutions. The slides from the presentation are embedded below.

Via nycopendata:

Check out this visualization designed and created by Eric Schles and Thomas Levine. Using MTA turnstile open data and daily weather observations from NOAA, they show the impact that super storms have on subway ridership.

What’s most curious about the bottom graph is the web of diagonal lines that travel from 1,000+ entries down to zero over several days after the subway system was shut down during superstorm Sandy. Is that caused by missing data, or is some other factor responsible?

This is a good starting point for SODA2.


The following is a guest post by Stephen Chen and originally appeared on his blog. Stephen is currently a student at The Flatiron School. You can learn more about him here, or follow him on twitter here.

NYC provides a ton of data from a variety of sources to the public to use for…

Wish I’d thought of this.

(via windows95tips)

Here I am at the Code for America Summit 2012.

This event is the first one where DataKind has partnered directly with a city government to run a data dive. Since it’s somewhat my fault, I feel obliged to reblog!


What: NYC Government DataDive

Where: School of Visual Arts

When: September 7-9

This DataDive allows civic hackers to work directly with NYC government agencies and open data to create innovative and exciting new projects. Additionally, there will be an opportunity to speak with…

Register here.

The Balanced Context of Platforms

Over a month ago at the #PDFApplied hackathon, and two weeks ago at the #ReinventGreen hackathon, I saw some real-world examples of why building solutions as platforms is important. But before I get into that, let me take a step back.

When I assumed responsibility for NYC’s OpenData program late last year, it wasn’t because I have a deep passion for the data. The program represents the city government’s leading effort to become a platform for our information and services, and that is what makes it fascinating to me. Why are platforms interesting? Consider the following diagram:

As we all know, applications, be they mobile, on the web, or on your computer, are the key to most people’s interaction with digital technology. This interaction is (mostly) successful, because an application provides a highly relevant, meaningful context to the person using it. It tells the person what it is, and how to use it. A simple two-column database table (a date and some text) becomes a to-do list when context is provided by an application. The very same database table can become a history of accomplishments, but unless the application is designed that way, the additional context is not available to anyone except the very imaginative. In the application, context is very narrow (so people can understand its purpose), and in the database, context is very broad.

Between the database and the application are the logic, the rules, the platform. Rules and logic help create the function of a system, providing some context, but (if well-designed) not too much. Following the example from above, the platform can impose a rule which only permits text with associated future dates to be added (making it a to-do list), or a rule which only permits those in the past (a history of accomplishments).
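
To make the to-do list example concrete, here is a minimal sketch of that idea. The class and function names (Platform, future_only, past_only) are invented for illustration and don't come from any real system; the point is only that the same two-column storage takes on a different meaning depending on which rule the platform enforces.

```python
from datetime import date


def future_only(when, today):
    """Rule for a to-do list: only dates after today are allowed."""
    return when > today


def past_only(when, today):
    """Rule for a history of accomplishments: only dates before today."""
    return when < today


class Platform:
    """Stores (date, text) rows and checks one rule before accepting a row."""

    def __init__(self, rule):
        self.rule = rule
        self.rows = []  # the "database": a plain two-column table

    def add(self, when, text, today):
        if not self.rule(when, today):
            raise ValueError(f"{when} is not allowed by this platform's rule")
        self.rows.append((when, text))


# A fixed "today" keeps the example deterministic.
TODAY = date(2012, 7, 1)

todo = Platform(future_only)
todo.add(date(2012, 8, 1), "Renew permit", today=TODAY)

history = Platform(past_only)
history.add(date(2012, 6, 1), "Launched portal", today=TODAY)

print(todo.rows)
print(history.rows)
```

An application built on top of either instance would then add the narrow context (labels, layout, reminders) without having to re-implement the rule itself.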

People who write software will easily recognize that this is multitier architecture. We’ve been building systems that way for many years. So what? Well, good platforms are the perfect balance, defining the functionality of a system just enough to be useful, but not so much that they become an imposition.

So, back to the hackathons. At #PDFApplied, one of the winning teams produced their first cut of Poll Watch USA, an application and text-messaging solution to crowdsource poll monitoring, enabling various interested parties to become aware of inappropriate activity at voting sites. A week prior, during the global Random Hacks of Kindness event, another similar application, Yo! Philly Votes, was being assembled. Yo! Philly Votes claims to be leveraging the Ushahidi crowdmap platform, a good foundation to start from. (It wasn’t clear to me what Poll Watch USA was building on, if anything.) So how could these two projects have benefitted from each other’s work (assuming the problem of simple awareness didn’t exist)?

First, Poll Watch USA could have gotten a huge jumpstart by leveraging Ushahidi - no need to design a database out of thin air, no need to build core functionality. Second, any context-specific customizations (e.g. poll sites/voting) applied to the Ushahidi implementation could easily be shared between both projects, since they largely aim to achieve the same goal. From that foundation (i.e. platform), both applications would be contextualized for the appropriate audiences. The benefits are obvious: development time and energy saved, but more interestingly, the underlying data is now shared and can be looked at in aggregate - also with little or no effort.

At #ReinventGreen, one team created ReBounty, a site which amplifies the reuse in “Reduce, Reuse, Recycle”. This has a very similar function to NYC Department of Sanitation (DSNY)’s Stuff Exchange (also an iPhone app). If DSNY had developed Stuff Exchange as a platform - i.e. a set of APIs on which developers could have built their own highly contextualized applications, how much time and effort could have been saved, while bringing greater value to the respective target audiences? A lot.

The Clean Founders League project and the Reinvent Lots project could have benefitted from a different platform, if it existed as such: Change By Us NYC (there are definitely others, but I chose this because it’s a NYC government-supported initiative). As a platform, Change By Us could have powered both #ReinventGreen projects - enabling them to provide narrow context to their intended audience, while leveraging the power of shared information on a shared platform to cross-connect existing communities with overlapping ideas on how to improve their neighborhoods - a win for everyone.

Good platforms represent just enough of a contextual common denominator to help dependent applications be effective, without inhibiting their ability to deliver a value-add. It’s a delicate art.