The Iceberg of Knowledge: A Quick Parable on Learning

2014.06.04 – This was inspired by Dawn Casey’s article on learning programming. Personally, I like to think about knowledge on any particular topic as an iceberg. There are three types of knowledge, and learning is the process of moving around the iceberg of knowledge:

What you know you know – this is the part of the iceberg you can see from where you’re standing at this very moment.

What you know you don’t know – this is the other side of the iceberg, which you can’t see from where you’re standing, but which is obviously there, just over that peak in front of you. It may be slightly larger than you expect, but you’ve got a good idea of its shape from where you are. You may climb to a higher point on the iceberg to see more of it.

What you don’t know that you don’t know – this is the majority of the iceberg, underneath your feet. It’s large, and epic. It’s always bigger than you can reasonably imagine. It takes specialized equipment and knowledge to be able to cover every inch of it on the outside, and careful dissection to get to the center.




A Brief Summary

After a hackathon a few months back, we were joking about creating an easy way to take the data we’d painstakingly parsed from PDFs, Word documents, and XML files, and “translate” it back into a format that government agencies are used to. Many of us have been shell-shocked by dealing with PDFs from government agencies, which are often scanned documents, off-kilter and photocopied many times over. Fundamentally, they’re very difficult to pry information out of. For the OpenGov Foundation’s April Fools’ prank, we created Govify, a tool to convert plain text into truly ugly PDFs.


A quick, one-line ImageMagick command was the first version. We quickly produced a few sample documents, and decided that it would be fantastic if users could upload their own files and convert them. It soon became clear that the process might take a couple of seconds and a decent amount of CPU – so to deal with any sort of load, we’d need a modular, decentralized process, rather than a single webpage to do everything.

Hidden troll of @FoundOpenGov‘s Govify is that front end is PHP, deploy is Ruby, worker is Python, & conversion is a shell script. #nailedit

— Ben Balter (@BenBalter) April 1, 2014  

As Ben Balter points out, there are a lot of moving pieces to this relatively simple setup. Govify is actually a combination of PHP, Compass + SASS, Capistrano, Apache and Varnish, Rackspace Cloud services and their great API tools, Python and Supervisord, and ImageMagick with a bash script wrapper. Why in the world would you use such a hodgepodge of tools across so many languages? Or, as most people are asking these days, “why not just build the whole thing with Node.js?”

The short answer is that the top concern was time. We put the whole project together in a weekend, using very small pushes to build standalone, modular systems. We reused components wherever possible and tried to wholly avoid known pitfalls via the shortest route around them. A step-by-step breakdown of those challenges follows.

Rough Draft

We started with just a single ImageMagick command, which:

  • Takes text as an input

  • Fills the text into separate images

  • Adds noise to the images

  • Rotates the images randomly

  • And finally outputs all of the pages as a PDF.
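Those steps can be sketched by assembling the arguments for a single ImageMagick `convert` invocation. This is a hypothetical reconstruction, not the actual Govify script – the specific flags, density, and noise values here are assumptions:

```python
def build_convert_args(input_txt, output_pdf, rotation_deg=2):
    """Assemble an ImageMagick command line that turns plain text
    into a noisy, slightly rotated PDF (all values are guesses)."""
    return [
        "convert",
        "-density", "150",             # rasterize text at print resolution
        f"text:{input_txt}",           # fill the text into page images
        "-attenuate", "0.5",
        "+noise", "Multiplicative",    # add photocopier-style noise
        "-rotate", str(rotation_deg),  # knock each page slightly off kilter
        output_pdf,                    # write all pages out as one PDF
    ]

args = build_convert_args("input.txt", "govified.pdf")
```

Passing that list to `subprocess.run(args)` would perform the conversion, assuming ImageMagick is installed.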

Using that to create a few sample documents, we began putting together a rough website to show them off. Like everyone else who needs to build a website in zero time, we threw Bootstrap onto a really basic template (generated with HTML5 Boilerplate). We used a few SASS libraries – Compass, SASS Bootstrap, and Keanu – to get some nice helpers, and copied in our standard brand styles that we use everywhere else. A few minutes in Photoshop and some filler text later, and we had a full website.

    We needed a nice way to deploy the site as we make changes, and our preferred tool is Capistrano. There are other tools available, like Fabric for Python or Rocketeer for PHP, but Capistrano excels in being easy to use, easy to modify, and mostly standalone. It’s also been around for a very long time, and is the tool we’ve been using the longest.

    We’re using Rackspace for most of our hosting, so we stood up a box with Varnish in front of Apache and stuck the files on there. Website shipped!

    Web Uploading

    Once that was done, we made the decision to allow users to upload their own files. At OpenGov, we’re primarily a PHP shop, so we decided to use PHP. OK, OK – stop groaning already. PHP is not the most elegant language in the world, and never will be. It has lots of horns and warts, and people like to trash it as a result. That being said, there are a few things it’s great at.

    First and foremost, it’s incredibly easy to optimize. Tools like APC and HipHop VM allow you to take existing PHP scripts and make them run *very* well. The variety and diversity of optimization tools for PHP make it a very attractive language for dealing with high-performance apps, generally.

    Second, it’s a “web-first” language, rather than one that’s been repurposed for the web – and as a result, it’s very quick to build handlers for common web tasks without using a single additional library or package. (And most of those tasks are very well documented on the PHP website as well.) Handling file uploads in PHP is a very simple pattern.

    So in no time at all, we were able to create a basic form where users could upload a file, have that file processed on the server, and get our PDF back. Using the native PHP ImageMagick functions to translate the files seemed like a lot of extra work for very little benefit, so we kept that part as a tiny shell script.

    At this point, however, we realized that the file processing itself was slow enough that any significant load could slow the server considerably. Rather than spinning up a bunch of identical servers, a job queue seemed like an ideal solution.

    Creating a Job Queue

    A very common pattern for large websites that process data is the job queue: single items that need processing are added to a list somewhere by one application, and pulled off the list to be processed by another. (Visual explanation, from the Wikipedia Thread Queue article.) Since we’re already using Rackspace, we were able to use Rackspace Cloud Files to store our files for processing, and the Rackspace Queue to share the messages across the pieces of the application. The entire Rackspace Cloud stack is controllable via their API, and nice libraries are available for many languages.
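The shape of the pattern – one producer adding jobs, one worker pulling them off – can be sketched with nothing but Python’s standard library. The `queue.Queue` below stands in for the hosted Rackspace Queue, and all the names are illustrative rather than actual Govify code:

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the hosted message queue
results = []

def producer(filename, email):
    # The web frontend's role: record a job for later processing.
    jobs.put({"file": filename, "email": email})

def worker():
    # The backend's role: pull jobs off the list and process them.
    while True:
        job = jobs.get()
        if job is None:          # sentinel value: shut the worker down
            break
        results.append(job["file"].upper())  # stand-in for "govification"
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
producer("report.txt", "user@example.com")
jobs.put(None)
t.join()
```

The point of the indirection is that the producer and the worker never talk to each other directly, so either side can be scaled or replaced independently.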

    On our frontend, we were able to drop in the php-opencloud library to get access to the API. Instead of just storing the file locally, we push it up to Rackspace Cloud Files, and then insert a message into our queue listing the details of the job. We also now collect the user’s email address, so that we can email them to let them know that their file is ready.

    The backend processing, however, presented a different set of challenges. Generally, you want an always-running process that is constantly checking the queue for new files to process. For jobs that take a variable amount of time, you don’t want just a cron job, since processes can start stacking up and choke the server – instead, we have a single run loop that runs indefinitely: a daemon or service.
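The difference from cron is that the daemon is one long-lived loop that claims at most one job at a time, sleeping when the queue is empty. A minimal sketch of that loop – the function names here are illustrative, not taken from the actual worker:

```python
import time

def run_forever(fetch_job, process, poll_interval=5.0):
    """One long-lived loop: claim a job if there is one, otherwise sleep.
    Unlike stacked-up cron invocations, at most one job is in flight."""
    while True:
        job = fetch_job()       # e.g. pop a message off the queue
        if job is None:
            time.sleep(poll_interval)   # nothing to do; back off briefly
            continue
        process(job)            # convert the file, upload, email the user
```

Because the loop never exits on its own, something external (a daemonizer or Supervisor) has to manage its lifetime.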

    For all the things that PHP is good at, memory management is not on the list. Garbage collection is not done very well, so long-running processes can start eating memory rapidly. PHP also has a hard memory limit, and when it is hit, the process is killed in an uncatchable way.

    Python, on the other hand, does a rather admirable job of this. Creating a quick script to get the job back out of the Rackspace Queue, pull down the file to be manipulated, and push the result back up was a rather simple task using the Rackspace Pyrax library. After several failed attempts at using the python-daemon and daemonize packages as a runner for the script, we reverted to using Supervisor to keep the script going instead.
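Supervisor just needs a short program definition to keep the worker alive – something like the stanza below, where the file paths and program name are assumptions for illustration:

```ini
; /etc/supervisord.d/govify-worker.ini  (paths and names are hypothetical)
[program:govify-worker]
command=/usr/bin/python /opt/govify/worker.py
autostart=true
autorestart=true                 ; restart the script whenever it exits
stderr_logfile=/var/log/govify-worker.err.log
```

With that in place, `supervisorctl start govify-worker` keeps the run loop going and restarts it if it ever dies.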

    Final Thoughts

    Obviously, this isn’t the most elegant architecture ever created. It would have made far more sense to use a single language for the whole application – most likely Python, even though very little is shared across the different pieces aside from the API.

    That being said, this thing scales remarkably well. Everything is nicely decentralized, and would perform well under significant load. However, we didn’t really get very significant load from our little prank – most people were just viewing the site and example PDFs, and very few were uploading their own. Sometimes overengineering is its own reward.

    Not bad for three days of work, if I do say so myself.

    All of the pieces are available on Github and GPL2 licensed for examining, forking, and commenting on.

  • Govify Shell Script

  • Govify Web Site & Uploader

  • Govify Worker Script


    Code for America Summit 2013

    2013.11.13 – There were far too many great talks to summarize them all, but CFA has posted the full agenda online as well as videos of all of the talks. Here are a few of our favorites.

    Clay Shirky’s Keynote

  • Clay covered a lot of ground, and his whole speech is worth a listen.
  • One of my favorite takeaways:  “The output of a hackathon isn’t a solution to a problem, it’s a better understanding of the problem.”
  • He also covers that idea that Sir Winston Churchill summarized so well: “Success is the ability to go from one failure to another with no loss of enthusiasm.” 
    Tim O’Reilly, Seamus Kraft, & Gavin Newsom – Open Gov Crossing Party Lines
    • Tim O’Reilly led this discussion between OpenGov’s own Seamus Kraft (standing in for Congressman Darrell Issa) and California Lt. Governor Gavin Newsom, on how Open Government transcends party politics, and both sides of the aisle come together on this topic.

    There were several talks about the problems with procurement in government, and how fixing procurement is one of the fastest ways to reduce wasteful, inefficient government spending.

    Clay Johnson – Procurement
    • Clay has been featured in a lot of interviews lately talking about procurement and the new website. He points out that the vendors that win bids on RFPs are typically those with good legal departments, not necessarily those with the best team for the job. He showed off a few tools designed to fix that problem:
    •  Screendoor – a system for creating readable, usable RFPs
    •  WriteForHumans – a simple tool to gauge how difficult a piece of text is to read.
    Jeff Rubenstein – SmartProcure
    • Jeff showed us SmartProcure, a system designed for localities to find and share vendor information.

    There were several talks around using data and technology to improve the effectiveness of services.

    Nigel Jacobs, Joel Mahoney & Michael Evans – Boston
    • There were a few talks about the improvements being made in Boston. This clip features Michael Evans talking about City Hall To Go – a mobile van that brings City Hall services to underserved areas, allowing residents to pay taxes, get handicap parking spaces, register to vote, and do other things on the spot instead of making a trip all the way down to Boston City Hall.
    Cris Cristina & Karen Boyd – RecordTrac
    • Cris and Karen discussed the RecordTrac system, used in Oakland to respond to public records requests. This project is open source, so other cities can adopt it and start using it immediately.
    Saru Ramanan – Family Assessment Form
    • Historically, there have been few tools to actually tell how effective social services are in helping families.  The Family Assessment Form is a tool to record and investigate help to individual families, and track their progression over time.

    The second day featured many talks on user-friendly tools for local improvement. This also focused on user experience as a form of social justice.

    Tamara Manik-Perlman – CityVoice
    • CityVoice is a tool built on Twilio’s API to collect feedback from residents over the phone, using a very easy-to-use system. It was originally implemented in South Bend to address the housing situation, but could be used in any locality for a variety of topics where gathering feedback outside of a single forum is useful.
    Lou Huang and Marcin Wichary – StreetMix
    • Lou and Marcin gave one of the most entertaining talks of the conference on StreetMix, which allows citizens to propose layouts for streets, using simple drag and drop tools to create bike lanes, bus stops, medians, and more.
    Dana Chisnell – Redesigning Government
    • Dana and her team were set the task of designing a ballot for *everyone*: adults who have never used a computer, those with low literacy and low education, those with cognitive disabilities, and other often-ignored groups. This is a must-watch for anyone who spends time on accessibility issues, or is interested in Good UX For All.
    Cyd Harrell, Ginger Zielinskie, Mark Headd, and Dana Chisnell – Redesigning Government
    • Cyd led a panel discussion talking about a variety of topics around UX and accessibility in technology for cities.  This was my favorite talk of the conference, and one covering topics that are often overlooked.


    The State Decoded 0.8 Release

    2013.11.11 – Waldo just posted The State Decoded 0.8 Release. This is a *huge* update that we’ve spent the last few months working on. 577 changed files, 127,076 additions, 5,123 deletions. That’s a lot of code. There are a few pieces I would have liked to squeeze into this update, like abstracting the XML import to make JSON importing more user-friendly, and cleaning up the admin system – but those will come in the 1.0 release, which is pretty close on the horizon. In the meantime, check out the 0.8 release of State Decoded on Github!


    Upgrading Solr from 4.2 to 4.3+ on CentOS 6.4

    2013.11.01 – I ran into an issue with a Solr configuration that was working for me locally, but not on our CentOS 6.4 server. I’ve documented below all of the issues I encountered along the way, as the upgrade from Solr 4.2 to 4.3+ is a pretty nasty one, due to major changes in the logging system (LOG4J / SLF4J).

    Update: as of January 2015, I have been unable to get Solr 4.10 to run on Tomcat 6, so I’ve defaulted to just using the example Jetty server that’s bundled with it.  It’s less than ideal, but it works.  Here’s good documentation on that process.

    Lucene/Solr 4.3+ : The Beginning

    The awesome guys at Open Source Connections had created a custom tokenizer for Solr for The State Decoded, which was suddenly throwing this new error:
    java.lang.IllegalArgumentException: No enum const class org.apache.lucene.util.Version.LUCENE_43
    I took this to mean that they’d specified Lucene 4.3 (or in my case, Solr 4.3) in the solr_home/conf/solrconfig.xml file.
    Since CentOS 6.4 only comes with Solr 4.2 out of the box, I tried changing that to LUCENE_42, but received even more intimidating errors:
    org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "descendent_facet_hierarchical": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'com.o19s.RegexPathHierarchyTokenizerFactory'
    The full stack dump is here. We guessed that the package needed to be rebuilt from source.


    To do that, first I had to install Maven, which was missing from CentOS. Following the instructions here, I went to the Maven site and found the download link for the latest version (in my case, that was 3.1.1). I grabbed that file and stored it in my /usr/local directory (for ease of use), unzipping as usual:
    tar -xzf ./apache-maven-3.1.1-bin.tar.gz -C /usr/local
    Then I created a link to it:
    sudo ln -s apache-maven-3.1.1 maven
    And added a native profile loader:
    sudo vi /etc/profile.d/
    To add it to the path:
    export M2_HOME=/usr/local/maven
    export PATH=${M2_HOME}/bin:${PATH}


    Now, after all of that, apparently I still did not have the Java Development Kit (aka JDK) installed. You’d think that the OpenJDK package would provide this, by the name, but it turns out that it only contains the Java Virtual Machine (JVM). I discovered this when Maven complained about a missing file, tools.jar. So I checked yum for the latest version and installed it:
    sudo yum list jdk
    sudo yum install java-1.6.0-openjdk-devel
    And added another profile loader:
    sudo vi /etc/profile.d/
    for that path:
    export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-
    A quick log out and log in fixed up my paths to use those. All that being done, I tried rebuilding the solr tokenizer, but it still didn’t work. I needed to upgrade Solr first.

    Upgrading to Solr 4.3+

    Now, between Solr 4.2 and Solr 4.3, the native LOG4J support that Solr needs to run was removed. So there are a lot of extra steps here. Seriously, why they would do this, I have no idea. You install the Solr package, and it just won’t run. And there’s absolutely no documentation on exactly how to fix it. The developers just broke it and left everyone to figure it out on their own. I cannot begin to express my frustration or count the number of hours I lost on this step.

    A note on Tomcat logging

    As you’re going through and starting and restarting tomcat (sudo service tomcat6 restart), you’ll probably want to keep an eye on the log file here:
    less /var/log/tomcat6/localhost.[DATE].log
    Most frustratingly, I found that any errors encountered during the startup of Tomcat were not actually logged until you shut it down. This is generally not how most Apache projects handle logging. (Again, WTF.) So the only way to know if you still had an issue was to restart *twice*. Unfortunately, every time that Tomcat encountered a configuration error, it would delete my web app configuration file entirely. (Aside: Seriously why would anyone think this is a good idea?) So I had to create it multiple times. That file lives under the Catalina configuration directory here:
    And looks like this:
    <Context docBase="/opt/solr/solr.war" debug="0" crossContext="false" >
       <Environment name="solr/home" type="java.lang.String" value="/opt/solr" override="true"/>
    </Context>

    Now back to the good part

    Now, I was finally ready to install the latest version of Solr. I headed over to the Solr downloads page and found that 4.5.1 was the latest. I grabbed the download from that page (after several redirects in between), and extracted it to my home directory:
    tar -xzf ./solr-4.5.1.tgz
    Now, Solr does not actually need to be built, just copied over. The previous installation of Solr on the system was under /opt/solr, and just in case I broke everything, I wanted to keep a copy of that. If you’re not upgrading from a previous version, you don’t need to move the old one out of the way:
    sudo mv /opt/solr /opt/solr-4.2.1
    Then I copied the new version of solr’s example to the opt directory, and symlinked it into place.
    sudo cp -R ~/solr-4.5.1/example/solr /opt/solr-4.5.1
    cd /opt
    sudo ln -s solr-4.5.1 solr
    I needed the solr.war file as well, and also symlinked that.
    cd /opt/solr
    sudo cp ~/solr-4.5.1/dist/solr-4.5.1.war ./
    sudo ln -s solr-4.5.1.war solr.war
    Next, I received nasty errors in the localhost.log about missing SLF4J:
    SEVERE: Exception starting filter SolrRequestFilter
    org.apache.solr.common.SolrException: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see:
    I assumed the extra libraries needed to be provided to Tomcat in the obvious way:
    sudo cp ~/solr-4.5.1/example/lib/ext/* /var/lib/tomcat6/lib/
    sudo cp ~/solr-4.5.1/dist/solrj-lib/* /var/lib/tomcat6/lib/
    And others to solr:
    sudo cp -R ~/solr-4.5.1/dist/* /opt/solr/lib/
    But apparently the libraries needed to be somewhere else also, which I found absolutely no documentation on anywhere. This post finally got me to the right answer.
    sudo cp ~/solr-4.5.1/dist/solrj-lib/* /var/lib/tomcat6/webapps/solr/WEB-INF/lib/
    Last, I needed to tweak the default solr.xml file for the multicore support that we needed, allowing us to run multiple Solr indexes on one box.
    sudo vi /opt/solr/solr.xml
    and added a new line for the second core:
    <cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
        <core name="collection1" instanceDir="collection1" />
        <core name="baltimorecode_dev_www" instanceDir="baltimorecode/www/staging/statedecoded" />
    </cores>

    Finally Rebuilding the custom Solr tokenizer

    From there, I just needed to rebuild the custom tokenizer that we had.
    cd /opt/solr/baltimorecode/www/staging/statedecoded/src/
    sudo mvn package
    cp target/regex-path-tokenizer-0.0.1-SNAPSHOT.jar ../lib/regex-path-tokenizer-0.0.1-SNAPSHOT.jar
    A final restart of tomcat:
    sudo service tomcat6 restart
    And I could hit the localhost version of solr for that core!
    lynx http://localhost:8080/solr/baltimorecode_dev_www/admin/ping
    I created an SSH Tunnel to verify that this was indeed working as expected:
    ssh -L 8009:localhost:8080 -N
    And then fired up a browser to point at http://localhost:8009/solr/ and solr was running perfectly.

    Finishing Up

    The last little thing that caught me up in my sleep-deprived state was that the PHP Solarium package requires allow_url_fopen to be set to On. This seems rather obvious, but is not the default value on CentOS. That was a long day. Here are a few of the articles that I used to get this far:
    1. Patrick Reilly: How to install maven on CentOS
    2. Solr: MultiCore
    3. Apache Tomcat configuration: HTTP Connectors
    4. Andrew Jaimes: Installing Lucene/Solr on CentOS with Tomcat
    5. Hipsnip Cookbooks issues: Solr Fails to start missing SLF4j
    6. StackOverflow: Solr 4.3 Tomcat6 Ubuntu installation exception (More confusing than helpful.)
    7. Xrigher: Install Lucene Solr with Tomcat on Windows (This was the one that got me to the right answer!)

    A brief history of the web

    2013.10.21 – A few months back, there was a rather interesting discussion on Reddit about different internet technologies. I put the following comment together as a bit of historical perspective. In retrospect, it’s probably not 100% accurate, and a bit of a rant, but I figured other people might find it interesting.

    Your webserver back in the old days was probably NCSA HTTPd, if your host hadn’t switched over to Apache. Apache is still one of the most popular webservers today; NCSA HTTPd, not so much. Apache is written in C, but you don’t need to know any C to configure it. Webservers generally deliver files – most often HTML, which is just a marked-up text file, but also images and videos and audio and PDFs and anything else you want. A few years later, you also commonly see Nginx as a webserver, and Microsoft’s IIS on Windows servers.

    Programming on the web pretty much started with CGI scripts, and was purely server-side. Back in the early nineties, this is how we did things – generally a C program that was compiled, and set to receive input from the webserver and do something with it: your form emailers and counters and guestbooks and so forth. You really don’t see much of this any more.

    Most people switched over to Perl as an easy way to write scripts, rather than using compiled executables. Perl is notoriously hard to read and write, but a pretty good language for string processing (which is basically all the web is at the core). It has declined in popularity significantly over the last ten years or so.

    Java began to show up more for web use towards the end of the nineties. These days, it’s synonymous with “Enterprise”, and generally used for large business applications. It’s a very dense language, and lots of frameworks exist for it, including Spring, Hibernate, Struts, etc. Many, many other languages have grown up around Java, including Scala, Clojure, and Groovy.

    PHP showed up about then as well, and quickly became the most widely-used language on the web. This was primarily due to its friendliness, ease of use, and ease of deployment – you could just drop scripts on a server the same as with HTML files and run them. Since PHP is so easy to use, lots of hobbyists have picked it up, and a large amount of awful code and poor tutorials have been written in PHP. Even so, the changes since version 5 have been increasingly bringing PHP towards the modern era, and it’s actually quite good these days. (Haters gonna hate.) There are many frameworks for PHP, including Symfony, CodeIgniter, Laravel, Cake, Zend, and others. Several of the most popular Content Management Systems (CMSes) are written in PHP, including Drupal, WordPress, and Joomla.

    Ruby was gaining steam by the early 2000s. Based off of Smalltalk, it’s a very “smart” language that’s actually a lot of fun to write. It’s got a lot of niceties that do help you write code with less effort. The joke is that hipsters use Ruby, but it’s really a great language. It also has a fantastic package manager, RubyGems / gem. Rails is the most popular framework for it; most other web frameworks often tend to just be clones of Rails. It’s a very opinionated framework, though – with lots of, what we call in the programming world, “magic”. Another popular framework is Sinatra, which is a thinner framework.

    Python has technically been used on the web as long as PHP, but still doesn’t have very much marketshare. It is, however, frequently used for not-web things, especially among academics. In some schools, it has replaced Basic as the first language to teach children, due to its friendliness. The most popular framework is Django, though there is also Pyramid/Pylons and Flask.

    ASP also needs a mention, though it is not an open-source language like the others, and is owned by Microsoft. You often see it in conjunction with .Net, and it’s very popular for use in large corporations (pretty much anywhere you might otherwise see Java). It’s not very friendly, IMHO.

    In the last few years, some older “purely functional” languages have come to popularity as well, including Haskell and Clojure (an outgrowth of Lisp).

    Now, you still needed somewhere to store all of your data, so the first popular databases came about. Most of these were ANSI SQL variants, and so behave almost identically, with the main differences being under the hood. These include: MySQL, still one of the most popular databases in use on the web – there’s a popular fork known as MariaDB as well, which gives you a few new tricks (like server clusters!); PostgreSQL, probably the second most popular (I’m guessing); MSSQL, the obligatory Microsoft version; and SQLite, which actually just operates on a single file and is super lightweight. And last, there’s Oracle, which is a megalithic company all to itself, as well as a database and a Business Intelligence service provider. Large organizations, colleges, and universities give them lots of money to make their systems go.

    And then you have the NoSQL data stores that came to popularity in the last ten years or so, which generally are non-relational document stores. There are lots of these, but a few of note are: Memcache, one of the first and most popular, especially since Facebook picked it up; MongoDB, which gives you a lot of the abilities of a relational DB – Mongo has recently received a lot of negative criticism due to fundamental flaws in implementation [1] [2] [3]; CouchDB, another Apache project, which uses JSON to store data; Cassandra, yet another Apache project, designed for high availability of large amounts of data; and Tokyo Cabinet and Kyoto Cabinet, successors of dbm.

    Now, going back for a second: we eventually wanted browsers to be able to do some neat things, so JavaScript came about – a programming language that runs in your browser. It actually doesn’t have much in common with Java, regardless of the name, and is actually an outgrowth of ECMAScript, which also gave us ActionScript, used in Adobe’s Flash multimedia platform. There are many, many JavaScript frameworks. The old bunch, which mainly made writing code a bit easier and doing things like animations, included Prototype, Scriptaculous, and jQuery, and newer ones such as Underscore and Zepto. The new bunch is more similar to traditional web frameworks, giving Models and Views (and sometimes Controllers, or otherwise just Routers), including Backbone, Ember, Knockout, and Angular.

    In the last few years, Node has appeared as a way of using JavaScript on the server-side. It’s actually very performant for IO operations, but suffers from being relatively young, and people are still finding the best ways to use existing frameworks with it. Still, it provides an excellent package manager called NPM, which is very similar to Ruby’s.
