Cities Open Data Working Group

I’ve been scheming for a few months now with the goal of assembling a multi-national group of municipal government professionals who focus on open data. The Cities Open Data Working Group (or CODWoG - I’m open to better acronyms or initialisms) would take the form of a monthly virtual meeting, a mailing list, and a wiki. The initial goals would be:

  • Create a knowledge base from cities who have open data initiatives.
  • Provide a peer forum for the exchange of ideas, success stories, and lessons learned.
  • Develop a multi-city strategy which would align taxonomies and other shareable specifications in a platform-independent manner.
  • Offer a support guidance framework for cities which have not yet launched open data platforms, but are considering it.
During the monthly virtual meeting, we would potentially have two participating cities spend 15 minutes discussing their open data implementations and/or aspirations. This discussion would be transferred into the knowledge base. In the ideal universe, each city would talk about:
  • When their effort started (or will start), and what prompted the beginning.
  • How have they evolved from the start to the present?
  • What is their current delivery model:
    • technology stack
    • mechanisms for providing/updating data
    • number and roles of staff supporting
    • non-HR expenses
    • documentation: legislation, executive orders, policies, standards, guidelines
  • What audiences do they typically engage with?
  • What external programs exist that leverage their efforts?
  • What significant successes have come from it?
  • What is their thinking about the future (not planned commitments, but visionary)?

The remaining 30 minutes would be spent on strategic planning and general discussion, which would include an opportunity for cities that do not have open data implementations to ask questions.

The mailing list/online discussion board would help keep the conversation going in the weeks between calls, and the wiki would serve as the repository for knowledge. Once a reasonable framework is set up, cities could add their own information into the wiki and perhaps we could dispense with the 15-minute monthly presentations - although I think there’s a lot to be gained from the personal storytelling.

So, who’s in? If you are interested in participating in the CODWoG, either as a city or to provide logistical support, please send your contact info or contact me on Twitter.

Also, @anthonymobile pointed me to a great list he is putting together: Open Data Repositories for Top 100 US Municipalities By Population.

I’m surprised the article didn’t mention the upcoming NYC startup weekend event focused on games (and music).

nycdigital:

The days of 8-bit Pac-Man are no more.

Today’s generation of video game consoles, with high-definition resolutions of 1080p, boast motion sensing cameras that track your movement (Xbox 360 Kinect) and wireless controllers that serve as virtual ping pong paddles and tennis rackets (PlayStation Move and Wii-mote). Multiplayer online role-playing-games, such as World of Warcraft, provide players with opportunities to partner with others from all over the world in a fight to save their planets. And mobile games, like Angry Birds (with more than 700 million downloads across all platforms), are available anywhere, anytime, for anyone who uses a smartphone.

Experimenting with an NYC Open Data Fountain

Glossing over last week’s news that NYC Open Data is to become a mandatory function of City government, I’ve been spending a lot of my off-hours time on a specific scenario: making our data available in real-time.

It’s no secret that we leverage Socrata to power our open data site, as do many other government organizations including the federal government. However, Socrata’s platform does not support efficient access to realtime data. It is limited by two factors:

  1. Publishing data that includes updates to pre-existing records requires the creation of a draft copy of the entire dataset behind the scenes. This mechanism makes sense, because it enables a way to roll back changes in a transactional manner. However, for large datasets, creating this draft copy can take significant time - more than 20 minutes for our 3.7-million-record 311 Service Request data.  (Note: datasets which only require appends do not require this draft copy to be made.)
  2. Data consumers must query for new or changed records (assuming we have fields to support that); there isn’t a way to have Socrata “push” changes out. Datasets do have RSS feeds associated with them, but large batches of changes (such as our weekly 100,000-record updates to the 311 data) can’t be accurately reflected, as the RSS only returns the most recent 10 records.

I often refer to data as being very similar to water. Putting aside the notion of it as a raw material, it also exhibits properties similar to those of rivers, reservoirs, and so on. Socrata makes a great reservoir, but how can I connect data consumers to the streams of information that are constantly flowing behind our firewall? In the past few months, I have been experimenting with streaming APIs from Twitter and Bit.ly, so it seemed like an obvious leap to try to set up a similar mechanism.

I had a few basic requirements:

  1. It must support deployment within a DMZ or in the cloud, and shouldn’t require opening connections into the internal network. This means it has to accept payloads that are pushed to it from within an internal network.
  2. It must support multiple, continuous HTTP streams (channels), initiated by consumers, through which it can push individual payloads as needed.
  3. It must support a (theoretically) unlimited number of consumers on each channel, and all consumers must receive the same payloads. 
  4. It must be very lean on CPU, memory, and storage space.

In the past few days, I have implemented an alpha solution to do exactly that, and I have it running two simulated streams of data. (links forthcoming)

I’m running it on Rackspace, after getting a bill from Amazon for ten days’ worth of Amazon EC2 usage when I thought I was using resources within the free limits. I landed on nginx, hosted on Ubuntu Lucid Lynx (LTS). The ‘LTS’ stands for long-term support, meaning that Canonical has committed to a 5-year support term for that particular version. Originally I had tried to deploy it on Arch Linux, which is promoted as a very lean operating system - but it turned out to be lean enough that my limited expertise in Linux prevented me from compiling nginx. The problem was related to missing header files for the cryptography libraries, but I didn’t realize that I needed a developer-essentials-like package until it was too late. (I may go back and try again.)

But nginx alone isn’t enough; to support streaming, it needs a plugin module. I initially used nginx-http-push-module, but it hasn’t been updated since 2010. More importantly, I discovered that it doesn’t support streaming in the manner I was hoping for, namely a continuous open connection. Instead, once some data has been transmitted to the consumer, the connection is closed and the consumer has to reconnect. Although this mechanism makes sense in some contexts, for high-volume data streams, the chances of data being lost while the consumer reconnects are significantly higher.

After a bit more searching, I came across Wandenberg Peixoto’s nginx-push-stream-module, which seemed to fit well. It also included a script to build itself with nginx, making it much easier for Linux n00bs like me. It took a lot of time to get the build right, as I wanted to have SSL support, and fundamental components like PCRE were installed but missing source files. Thankfully, with helpful instructions and enough knowledge to tweak the nginx install configuration (with paths to the source for dependencies like PCRE, zlib, and OpenSSL), I got it working. I’ve learned you can do amazing things with wget, and curl is like a Swiss Army knife that really needs to be present by default, even on MS Windows operating systems.

Once everything was up and running, I started working on the nginx configuration, until I got to this:

  • A specific DNS hostname for streaming, and a separate hostname for publishing. (If deployed in a DMZ, nginx can even support binding each of these to different network interfaces.)
  • On the streaming server, the root location returns the list of current streams and the statistics for them:
    location = / {
        push_stream_channels_statistics;
        set $push_stream_channel_id "ALL";
    }
  • REST-like URLs, like http://river.example.com/stream1 and /stream2 (instead of the default http://www.example.com/sub?id=stream1):
    location ~ ^/(.*)$ {
        push_stream_subscriber;
        set $push_stream_channels_path $1;
        push_stream_ping_message_interval 30s;
        push_stream_header_template "\n";
    }
  • Consumers can, incidentally, subscribe to multiple channels in a single datastream by adding them to the URL, like so: http://river.example.com/stream1/stream2. Note to self: make sure all payloads include an indicator of their channel.

The push_stream_ping_message_interval is crucial to maintaining open streams, particularly when proxies (known or not) sit between the server and the consumer. Proxies have a habit of dropping connections (for many good reasons), so I set it for 30 seconds. Every half-minute the server sends a linefeed (\n) down to the client. I think there’s an undocumented variable which configures the actual ping content, but the default seems fine.

The push_stream_header_template value forces the connection to start immediately; otherwise the consumer will remain in a ‘connecting’ state until a payload or a ping message is transmitted (whichever comes first).
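To make the consumer side concrete, here is a minimal sketch of a client in Python. It assumes each payload arrives as a single newline-terminated line (that framing is my assumption, not something the module imposes), and the function names `iter_payloads` and `consume` are hypothetical. Blank lines are treated as the header template or a ping and skipped.

```python
import urllib.request
from typing import Iterable, Iterator


def iter_payloads(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield non-empty payload lines, skipping keep-alive linefeeds."""
    for raw in lines:
        text = raw.decode("utf-8").strip()
        if text:  # a blank line is the header template or a 30-second ping
            yield text


def consume(url: str) -> None:
    """Hold one long-lived HTTP connection open and print each payload."""
    # The response object can be iterated line by line as data arrives.
    with urllib.request.urlopen(url) as response:
        for payload in iter_payloads(response):
            print(payload)


# Example (placeholder hostname from the configuration above):
# consume("http://river.example.com/stream1")
```

Keeping the line filtering separate from the network call makes it easy to exercise the parsing against canned data before pointing it at a live stream.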

To push a message to a channel, all that is necessary is to send an HTTP POST to a location through the other hostname (e.g. POSTing to http://push.example.com/stream1 will send a message to all consumers of http://river.example.com/stream1).
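A publisher could be sketched along these lines, again in Python. The JSON envelope with a `channel` field is my own assumption for addressing the earlier note-to-self about tagging payloads with their channel; the module does not require any particular body format. The helper names `build_publish_request` and `publish` are hypothetical, and the hostname is the placeholder used above.

```python
import json
import urllib.request


def build_publish_request(base_url: str, channel: str, data: dict) -> urllib.request.Request:
    """Build a POST whose body names its own channel (assumed envelope format)."""
    envelope = {"channel": channel, "data": data}
    # One JSON document per line, so consumers can split on linefeeds.
    body = (json.dumps(envelope) + "\n").encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{channel}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def publish(base_url: str, channel: str, data: dict) -> int:
    """Send the payload to the publishing hostname and return the HTTP status."""
    request = build_publish_request(base_url, channel, data)
    with urllib.request.urlopen(request) as response:
        return response.status


# Example (placeholder hostname):
# publish("http://push.example.com", "stream1", {"id": 1, "status": "open"})
```

Because the envelope carries the channel name, a consumer subscribed to several channels on one connection can still tell the payloads apart.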

There’s more to do, however:

  • The nginx-push-stream-module supports the dynamic creation of channels, either when requested by a consumer or posted to by a publisher. Since it might not be desirable to have arbitrary channels created by consumers (thus taking up server resources), there is a push_stream_authorized_channels_only on option. However, when that option is set, the server seems to destroy the channel immediately after the first payload is sent to it, thus preventing consumers from connecting to it.
  • I want to use HTTPS for the publishing side of things.
  • I want to require credentials for the publishing side of things.
  • I want to mirror two existing streams: a combined Twitter feed of all the various NYC government entities and programs, and our bitly clickstream.

And with all that, I will hopefully have set up the NYC Open Data Fountain!

Tackling the long-term strategy of Open311

As Open311 adoption grows rapidly, with more endpoints on the way, it’s time to start developing a broader strategy to solve some new problems that will emerge. I think one of the first of those problems will be recognizing where someone is, and connecting them automatically to the right Open311 endpoints (yes, I do mean plural; more on this in a moment). In his Open311 Wish List, Philip Ashlock starts to tackle this:

As more cities stand up their endpoints, it becomes more of a challenge to know they all exist and make sure client applications can discover them. Several years ago we started thinking about an idea called GeoWebDNS that would essentially act as a geospatial lookup service for geographically bound web services. Ian Bicking, built a proof of concept and I later discovered that the FCC was evaluating a similar, albeit more robust, proposal called LoST (see reference implementation) to be used for the same purpose on Next Generation 911 services. So far, these are merely proposals, but we’re increasingly in need of one of these systems to be put to use as a real world pilot and eventually to act as a critical piece of civic infrastructure.

So with that goal in mind, here are a few issues that I think need to be tackled in order to have a sustainable future.

  1. API Keys. At present, 8 of the Open311 endpoints (Baltimore, Bloomington, Boston, Brookline, Grand Rapids, San Francisco, Toronto, Washington DC) have distinct API key request mechanisms and key management solutions. The rest of the endpoints leverage a common SeeClickFix solution. (SeeClickFix also offers a proprietary API). As more endpoints are added, having to manage an array of endpoint keys is going to become untenable for the developer community.
  2. Authentication/Authorization. Although there isn’t yet clear consensus from the members of the Open311 community about how authentication/authorization fits into the drafted specifications, one thing that is clear is that having to manage a customer identity at each endpoint will not make it easy for a customer to request services.
  3. Terms of Service. Along with requiring the independent provisioning of API keys and customer identities, each endpoint also has different terms of service which a developer must explicitly agree to, and a customer must implicitly agree to. Having to comply with multiple, differing terms of service is going to become untenable for the developer community.
  4. Geographic overlap. 311 as a telephone service is constrained to one call center/answering point per geographic region; this is imposed by the very one-to-one nature of a phone system. While following the same model has served the Open311 mission very well to-date, this isn’t the reality of how service providers work in the real world, and I think it will reduce the long-term sustainability of Open311 to stay with a one customer-to-one provider model. A customer can only be in one location when making a request, but they could be making a request to their local town/city, to their local county, their state/territory, their nation, or even to the world (e.g. the United Nations). Each of those tiers offers a distinct set of services, and in the ideal scenario, all of those services should be presented to a customer in a unified manner, regardless of who is offering them. At the moment, we all get around that problem by providing information which, technically speaking, is really the responsibility of others to maintain. For example, you can contact NYC 311 and ask how to obtain a driver’s license, and you’ll get an answer - but that’s a New York state-provided service, and they should own it.

    Incidentally, tackling this issue might eventually drive official 311 systems to leverage Open311 to make calls to other overlapping service provider systems.

    A final note on this: neither GeoWebDNS nor LoST (IETF RFC 5222) seems to support returning multiple services per query.
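To illustrate the geographic overlap point, here is a minimal sketch in Python of a lookup that returns every overlapping provider for a location rather than just one. Everything in it is hypothetical: the registry, the endpoint URLs, and the very rough bounding boxes are invented for illustration, and a real service would use proper boundary polygons rather than rectangles.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Jurisdiction:
    name: str
    tier: str       # e.g. "city", "county", "state"
    endpoint: str   # hypothetical Open311 endpoint URL
    bbox: Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

    def contains(self, lon: float, lat: float) -> bool:
        min_lon, min_lat, max_lon, max_lat = self.bbox
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Illustrative registry only; the boxes are rough approximations.
REGISTRY = [
    Jurisdiction("New York City", "city", "https://city.example.org/open311",
                 (-74.26, 40.49, -73.70, 40.92)),
    Jurisdiction("New York State", "state", "https://state.example.org/open311",
                 (-79.77, 40.49, -71.85, 45.02)),
    Jurisdiction("Boston", "city", "https://boston.example.org/open311",
                 (-71.19, 42.23, -70.92, 42.40)),
]


def find_endpoints(lon: float, lat: float,
                   registry: List[Jurisdiction] = REGISTRY) -> List[Jurisdiction]:
    """Return every jurisdiction whose area covers the point, at any tier."""
    return [j for j in registry if j.contains(lon, lat)]
```

A request from midtown Manhattan would match both the city and the state entries, which is exactly the one-customer-to-many-providers case described above.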

The good news is that I believe there are solutions to all of these challenges. Before we start digging into those, however, I invite the community to comment on these and, more importantly, identify other challenges which I have missed or am unable to see from my perspective.

[Update] - please don’t comment below, but give your feedback here or here. Phil was kind enough to post this on the Open311 Blog.

How SOPA/PIPA could affect NYC

The recent attention to the House of Representatives’ Stop Online Piracy Act (SOPA) and the parallel PROTECT-IP Act (PIPA) in the Senate prompted me to think about how NYC government might be affected. This is a complex issue with many facets (Mayor Bloomberg has publicly said as much), but read on for some thoughts from my point of view as a government technologist.

  1. .nyc Top-Level Domain (TLD). I suspect the legislation would increase the cost of managing the TLD, and as the sponsors of it, the City would potentially be liable for what happens within it. It might even make the idea of the .nyc TLD less desirable.
     
  2. .gov sites are not protected. PIPA calls for ISPs, search engines, and financial transaction providers to block sites, that is, to prevent end users from finding the sites by typing the address into a web browser, doing a web search, or issuing a payment. Furthermore, it calls for these entities to proactively block “in good faith” without a court order. No action would be required by the federal government’s General Services Administration (which manages the .gov TLD) in order to bury .gov sites. Google, Bing, Yahoo, et al. will more than likely try to solve this problem using automation and algorithms instead of human reviewers if they are encouraged/compelled by the law to proactively block.

    This is in direct opposition to our goal of making our web sites more interactive, and it will require active policing (and censoring) of content posted to our sites by our visitors. It’s not a big stretch to imagine a new kind of denial-of-service attack by which comment threads are populated with pirated content, triggering search engines to automatically stop presenting the site in search results, etc.

  3. Jobs and the City’s tech sector. With the majority of job growth supposedly coming from startups and small businesses, and with entrepreneurs saying the proposed legislation would hurt growth and innovation, there would probably be a significant impact, most clearly felt in the tech sector. It might cause NYC to fall short of its goal to be the best place for tech growth in the country. (I can only imagine what will happen to Silicon Valley.)
     
  4. The City is also a customer. We don’t just encourage tech sector growth, we leverage what it produces while continually striving to be a government which does more with less. SOPA/PIPA can impact services for which the City is a customer, particularly when intellectual property is interpreted to mean patents, code, and data structures - not just media/content. This is probably less of an issue with larger, well-established companies who have strong legal teams (although I’m not even sure that’s true given what I wrote in #2 above), but could easily and rapidly impact small to medium-sized organizations with whom we do plenty of business.

nycedc:

How do you like them apps? Check out our infographic to see how past NYC BigApps winners branched out to serve the different needs of New Yorkers.

BigApps by the numbers:

  • 60 agencies, commissions and bids
  • 750 city datasets tapped
  • 140+ apps generated
  • Nearly 14,000 voters and 5,000 followers
  • Over $100,000 awarded in prizes

The benefits of the BigApps challenge? Improved citizen and city life; publicity and a chance to meet Mayor Bloomberg; and access to leading tech entrepreneurs and investors.

Submit your entry for this year’s challenge by January 25, 2012 at 5 PM EST.

Crowd-sourcing your priorities

Recently we had the opportunity to use an online tool I’m a big fan of: All Our Ideas.

The challenge we faced was coming up with a list of requirements for an in-flight project, then prioritizing and weighting them. After researching multiple sources, looking at our own existing processes, and thinking about how we’d like things to work in the future, we produced a list of almost 130 requirements. With a list that large, assigning priorities was a challenge: it was hard to figure out what was important when everything seemed important! In addition, since we had a large team of people involved, there was likely to be considerable spirited debate - taking an unreasonable amount of time that we did not have.

To handle this challenge, we turned to All Our Ideas. All Our Ideas is the brainchild of Matt Salganik, an assistant professor of sociology at Princeton University. Inspired by kittenwar (an elegant proof that great ideas can come from the most unusual places), All Our Ideas allows crowd-sourced ordering (and soliciting additional input) by repeatedly presenting pairwise comparisons and producing a PAPRIKA-like result.

As you might guess, choosing which of two items was more important was dramatically easier than pruning through more than a hundred items. For most pairwise decisions, one item was often clearly more important than the other; for the ones which seemed balanced, our team had the opportunity to avoid voting by clicking the ‘I can’t decide’ button, and providing additional feedback about why it was difficult.
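For a feel of how pairwise votes can turn into an ordered list, here is a minimal sketch in Python. It simply ranks items by the fraction of their comparisons they won; this is a deliberately naive stand-in that I am assuming for illustration, not the scoring method All Our Ideas actually uses. The function name `rank_by_win_rate` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_by_win_rate(votes: Iterable[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Rank items by share of comparisons won; each vote is (winner, loser)."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    rates = {item: wins[item] / appearances[item] for item in appearances}
    # Highest win rate first; ties broken alphabetically for a stable order.
    return sorted(rates.items(), key=lambda pair: (-pair[1], pair[0]))
```

Skipped votes (‘I can’t decide’) would simply not appear in the input, so they neither help nor hurt either item.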

When sharing the link to the tool, we didn’t provide much in the way of instruction to our participants - we simply asked folks to dedicate a small block of time (30-60 minutes) to vote repeatedly, carefully considering each choice they were presented. To us this seemed more reasonable (and, as it turns out, more practical) than asking people to vote a certain number of times. The link was shared amongst a variety of stakeholders; around 15 people voted over a short period of time.

Following a two-week period which included both Christmas Day and New Year’s Day holidays, as well as various team members’ vacations, we had gathered about 2,500 votes - and upon reviewing the results, we felt the list had been sorted in a reasonable order (meaning, nothing seemingly unimportant ended up with the very important items, and vice-versa). Success!

In thinking about our success in using All Our Ideas, there were also some things which we realized might have helped it even more:

  1. Items for voting have a Twitter-esque character limit (at least when batch-loaded). Obviously it makes sense to limit the length given the way the site’s presentation works, but some of our requirements were 200+ characters, so it took some careful rewording - which ran the risk of reducing clarity and/or introducing confusion - to include them all.
  2. With almost 130 items to be voted upon, it would have been helpful to know how many votes would be needed for increasing levels of confidence. Was 2,500 votes enough, from a statistical standpoint? Or could/should we have aimed lower or higher? We were comfortable with the results we got, but comfortable is very different from confident.
  3. As the administrator/owner of the survey, it wasn’t entirely clear to me whether our participants were actually proposing new items (which we wanted to minimize and moderate - very possible through the tool) or explaining why they felt they couldn’t decide between two items. Alongside the 2,500 votes, I received one actual suggestion for a new item, and 3-4 suggestions which appeared to be feedback about the specific vote which had been presented to the participant. Unfortunately, because I couldn’t see the choices which had been presented (nor who had written the suggestion), I had no way of addressing the feedback constructively.
  4. Finally, it wasn’t possible with the tool to know the total number of unique participants who voted. Since we were (are?) all responsible adults, we accepted the say-so of each participant, and the total quantity of votes seemed satisfactory. The tool did show counts of voting sessions over time, but that was also unfortunately stymied by some participants using the iOS version of Safari with cookies turned off.
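On the question of whether 2,500 votes was enough, some back-of-the-envelope arithmetic helps frame it. The sketch below, in Python, assumes exactly 130 items and 2,500 votes (the real figures were "almost" and "about") and says nothing about statistical confidence; it only shows how thinly the votes were spread. The function name `vote_coverage` is hypothetical.

```python
from math import comb


def vote_coverage(items: int, votes: int) -> dict:
    """Summarize how far a number of pairwise votes stretches across all pairs."""
    pairs = comb(items, 2)  # number of distinct item pairs
    return {
        "distinct_pairs": pairs,
        # Upper bound: assumes no pair was ever shown twice.
        "max_pair_coverage": min(1.0, votes / pairs),
        # Every vote involves two items.
        "avg_appearances_per_item": 2 * votes / items,
    }


summary = vote_coverage(items=130, votes=2500)
print(summary)
```

With these assumed numbers there are 8,385 possible pairs, so at most about 30% of them could have been voted on even once, while each item appeared in roughly 38 comparisons on average.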

In conclusion, though, we were quite happy with the results we got using the tool. Much of the team was excited to try it out, and I’m sure we’ll find other opportunities to use it. Matt and his team have built something simple and useful, very much in the web 2.0 paradigm of ‘do one thing the best and leave everything else to the rest’. For this one step in our process, it was the best choice for the job. And the icing on the cake? The source code is available on GitHub.

I’ll be at this event to answer your questions about NYC’s Open Data platform.

nycedc:

NYCEDC is hosting Developer Day this Saturday, January 7th from 10 AM to 6 PM. Join this year’s NYC BigApps participants at Pivotal Labs to work on your app submission while staying well-fed and well-connected to fellow developers!

You can look forward to:

  • Dedicated time to work on your…

USA City TLDs

Recently a colleague suggested to me that most official city web sites in the United States might be at addresses which were not in the .gov top-level domain (TLD). My immediate reaction was one of doubt, but I didn’t have any hard evidence to back it up. Thankfully it took merely a few moments to gather this data. The results? More than 70% of the nation’s 25 most populous cities have sites in the .gov TLD. (A list of the sites can be found below.) It is possible that this sample is too small and unrepresentative - for example, smaller cities may find it easier to register a .com or .net domain name.

What is interesting to me is how inconsistent the naming conventions are, even within the .gov TLD. The .gov TLD guidelines say that the city or town’s state should be identified within the domain name, but plenty of cities (including my own) seem to have bypassed that. Some include the word “city” as part of the name, but many don’t. In most situations, web searches for the city name brought up the official web site within the first page of results, but often it wasn’t the first result; instead I frequently saw a blatantly unofficial site or one focused on tourism. With everyone doing things a little differently, are we making it just a bit more difficult to provide access to government information and services?

Generally speaking, I’m an advocate of city web sites using a .gov TLD, because (from a visitor’s perspective) it implies a certain amount of trust which does not exist in other TLDs. However, the policies around the .gov TLD have been somewhat restrictive (for example, public comments on content weren’t really viable), though they have been relaxed in the past couple of years. Additionally, having a .gov registration doesn’t cost anything, whereas even the most basic .com domain has at least a minimal annual fee - assuming there wasn’t already a domain squatter there to begin with.

So you don’t have to repeat the effort (all of ten minutes’ worth of internet searches), here’s the list of the top 25 most populous US city web sites (based upon this list):

  1. New York City, NY [.gov]
  2. Los Angeles, CA [.org]
  3. Chicago, IL [.org]
  4. Houston, TX [.gov]
  5. Philadelphia, PA [.gov]
  6. Phoenix, AZ [.gov]
  7. San Antonio, TX [.gov]
  8. San Diego, CA [.gov]
  9. Dallas, TX [.com]
  10. San Jose, CA [.gov]
  11. Jacksonville, FL [.net]
  12. Indianapolis, IN [.gov]
  13. San Francisco, CA [.org]
  14. Austin, TX [.us]
  15. Columbus, OH [.gov]
  16. Fort Worth, TX [.gov]
  17. Charlotte, NC [.org]
  18. Detroit, MI [.gov]
  19. El Paso, TX [.gov]
  20. Memphis, TN [.gov]
  21. Baltimore, MD [.gov]
  22. Boston, MA [.gov]
  23. Seattle, WA [.gov]
  24. Washington D.C. [.gov]
  25. Nashville, TN [.gov]
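
As a quick check on the "more than 70%" figure, the tally can be reproduced from the list above. This is a small sketch in Python; the data is transcribed directly from the 25 entries, and the names `CITY_TLDS` and `tld_share` are just illustrative.

```python
from collections import Counter

# (city, TLD) pairs transcribed from the list above.
CITY_TLDS = [
    ("New York City", "gov"), ("Los Angeles", "org"), ("Chicago", "org"),
    ("Houston", "gov"), ("Philadelphia", "gov"), ("Phoenix", "gov"),
    ("San Antonio", "gov"), ("San Diego", "gov"), ("Dallas", "com"),
    ("San Jose", "gov"), ("Jacksonville", "net"), ("Indianapolis", "gov"),
    ("San Francisco", "org"), ("Austin", "us"), ("Columbus", "gov"),
    ("Fort Worth", "gov"), ("Charlotte", "org"), ("Detroit", "gov"),
    ("El Paso", "gov"), ("Memphis", "gov"), ("Baltimore", "gov"),
    ("Boston", "gov"), ("Seattle", "gov"), ("Washington D.C.", "gov"),
    ("Nashville", "gov"),
]


def tld_share(pairs):
    """Return each TLD's share of the listed sites."""
    counts = Counter(tld for _city, tld in pairs)
    total = len(pairs)
    return {tld: count / total for tld, count in counts.items()}


shares = tld_share(CITY_TLDS)
print(f".gov share: {shares['gov']:.0%}")  # 18 of 25 sites
```

That works out to 18 of the 25 sites, or 72%, in the .gov TLD, with .org a distant second at four sites.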

nycdigital:

At the Foursquare Hackathon over the weekend, Andrew Nicklin (@technickle), the City’s Director of Enterprise Architecture, presented the NYC Platform API. Thanks to Foursquare for welcoming us! 

Pictured above is Accessible NYC, a web app incorporating NYC data.