Glossing over last week’s news that NYC Open Data is to become a mandatory function of City government, I’ve been spending a lot of my off-hours time on a specific scenario: making our data available in real-time.
It’s no secret that we leverage Socrata to power our open data site, as do many other government organizations including the federal government. However, Socrata’s platform does not support efficient access to realtime data. It is limited by two factors:
- Publishing data that includes updates to pre-existing records requires the creation of a draft copy of the entire dataset behind the scenes. This mechanism makes sense, because it enables a way to roll back changes in a transactional manner. However, for large datasets, creating this draft copy can take significant time - more than 20 minutes for our 3.7-million-record 311 Service Request data. (Note: datasets which only require appends do not require this draft copy to be made.)
- Data consumers must query for new or changed records (assuming we have fields to support that); there isn’t way to have Socrata “push” changes out. Datasets do have RSS feeds associated with them, but large batches of changes (such as our weekly 100,000-record updates to the 311 data) can’t be accurately reflected, as the RSS only returns the most recent 10 records.
I often refer to data as being very similar to water. Putting aside the notion of it as a raw material, it also exhibits similar properties as rivers, reservoirs, and so on. Socrata makes a great reservoir, but how can I connect data consumers to the streams of information that are constantly flowing behind our firewall? In the past few months, I have been experimenting with streaming APIs from Twitter and Bit.ly, so it seemed like an obvious leap to try and set up a similar mechanism.
I had a few basic requirements:
- It must support deployment within a DMZ or in the cloud, and shouldn’t require opening connections into the internal network. This means it has to accept payloads that are pushed to it from within an internal network.
- It must support multiple, continuous HTTP streams (channels), initiated by consumers, through which it can push individual payloads as needed.
- It must support a (theoretically) unlimited number of consumers on each channel, and all consumers must receive the same payloads.
- It must be very lean on CPU, memory, and storage space.
In the past few days, I have implemented an alpha solution to do exactly that, and I have it running two simulated streams of data. (links forthcoming)
I’m running it on Rackspace, after getting a bill from Amazon for ten days’ worth of Amazon EC2 usage when I thought I was using resources within the free limits. I landed on nginx, hosted on Ubuntu Lucid Lynx (LTS). The ‘LTS’ stands for long-term support, meaning that Canonical has committed to a 5-year support term for that particular version. Originally I had tried to deploy it on Arch Linux, which is promoted as a very lean operating system - but it turned out to be lean enough that my limited expertise in Linux prevented me from compiling nginx. The problem was related to missing header files for the cryptography libraries, but I didn’t realize that I needed a developer-essentials-like package until it was too late. (I may go back and try again.)
But nginx alone isn’t enough; to support streaming, it needs a plugin module. I initially used nxinx-http-push-module, but it hasn’t been updated since 2010. More importantly, I discovered that it doesn’t support streaming in the manner I was hoping for, namely a continuous open connection. Instead, once some data has been transmitted to the consumer, the connection is closed and the consumer has to reconnect. Although this mechanism makes sense in some contexts, for high-volume data streams, the chances of data being lost are significantly higher.
After a bit more searching, I came across Wandenberg Peixodo’s nginx-push-stream-module, which seemed to fit well. It also included a script to build itself with nginx, making it much easier for Linux n00bs like me. It took a lot of time to get the build right, as I wanted to have SSL support, and fundamental components like PCRE were installed but missing source files. Thankfully, with helpful instructions and enough knowledge to tweak the nginx install configuration (with paths to the source for dependencies like PCRE, zlib, and OpenSSL), I got it working. I’ve learned you can do amazing things with wget, and curl is like a swiss army knife that really needs to be present by default even on MS Windows operating systems.
Once everything was up and running, I started working on the nginx configuration, until I got to this:
- A specific DNS hostname for streaming, and a separate hostname for publishing. (If deployed in a DMZ, nginx can even support binding each of these to different network interfaces.)
- On the streaming server, the root location returns the list of current streams and the statistics for them:
location = / {
push_stream_channels_statistics;
set $push_stream_channel_id “ALL”;
}
- REST-like URLs, like http://river.example.com/stream1 and /stream2 (instead of the default http://www.example.com/sub?id=stream1):
location ~ ^/(.*)$ {
push_stream_subscriber;
set $push_stream_channels_path $1;
push_stream_ping_message_interval 30s;
push_stream_header_template “\n”;
}
- Consumers can, incidentally, subscribe to multiple channels in a single datastream by adding them to the URL, like so: http://river.example.com/stream1/stream2. Note to self: make sure all payloads include an indicator of their channel.
The push_stream_ping_message_interval is crucial to maintaining open streams, particularly when proxies (known or not) sit between the server and the consumer. Proxies have a habit of dropping connections (for many good reasons), so I set it for 30 seconds. Every half-minute the server sends a linefeed (\n) down to the client. I think there’s an undocumented variable which configures the actual ping content, but the default seems fine.
The push_stream_header_template value forces the connection to start immediately; otherwise the consumer will remain in a ‘connecting’ state until a payload or a ping message is transmitted (whichever comes first).
To push a message to a channel, all that is necessary is to send an HTTP POST to the a location through the other hostname. (e.g. POSTing to http://push.example.com/stream1 will send a message to all consumers of http://river.example.com/stream1).
There’s more to do, however:
- The nginx-push-stream-module supports the dynamic creation of channels, either when requested by a consumer or posted to by a publisher. Since it might not be desirable to have arbitrary channels created by consumers (thus taking up server resources), there is a push_stream_authorized_channels_only on option. However, when that option is set, the server seems to destroy the channel immediately after the first payload is sent to it, thus preventing consumers from connecting to it.
- I want to use HTTPS for the publishing side of things.
- I want to require credentials for the publishing side of things.
- I want to mirror two existing streams: a combined Twitter feed of all the various NYC government entities and programs, and our bitly clickstream.
And with all that, I will hopefully have set up the NYC Open Data Fountain!