
Bash Basics

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box. If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want. Let me start with the conclusion: the next time you have to transform or manipulate your data, look around for what Unix tools already exist first. It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around. This is part 1 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

Assumptions

I’ll assume you know the very basics here – what a man page is, how to create an executable bash script, how to open a terminal window, and how to use basic utilities.  If you don’t know any of those, you should start with one of the many intros to the command line available.  This intro by Zed Shaw is a good place to start.

The Shell

Bash is the default shell on most systems these days, but what we’re covering here will mostly work for zsh or other shells – though some syntax elements will be different. First off, Bash is a powerful tool by itself. Even with no additional packages added, you get variables, loops, expansions & regular expressions, and much more. Here’s a good guide with more information on using bash. I’ll assume you know the basics from here on out, and show you what you can do with them.
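For instance, here’s a tiny illustrative snippet (not from any real project, just a sketch of those built-in features) that uses a variable, a loop, and brace expansion:
greeting="hello"
for name in alice bob carol; do
    echo "$greeting, $name";
done
echo report_{2013,2014,2015}.csv
> Prints report_2013.csv report_2014.csv report_2015.csv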

Advanced Paths

If you want to work with several directory paths in a row that are very similar, you can pass a list to the shell using curly braces {} and it’ll expand that list automagically. Let’s say I wanted to set up a few directories for a new project’s test suite. Rather than running a lot of duplicated commands, I could pass a few lists instead.
mkdir -p ./test/{unit,fixtures}
> Creates ./test/unit and ./test/fixtures
mkdir -p ./test/unit/{controllers,models}
> Creates ./test/unit/controllers and ./test/unit/models
Note that we’ve passed the -p flag to mkdir so that it’ll create all of the directories up the chain, even ./test here. We can also use pattern matching with brackets []. For instance, if you’ve got a lot of files that you want to separate alphabetically, you can use a letter pattern:
mv ./[A-H]* /Volumes/Library/A-H/
mv ./[I-O]* /Volumes/Library/I-O/
mv ./[P-Z]* /Volumes/Library/P-Z/
This breaks your library up into three sets. You can also use that matching later in the string:
mv ./A[a-k]* /Volumes/Library/Aa-Ak/
mv ./A[l-z]* /Volumes/Library/Al-Az/
Now, by default most systems will be case sensitive, so you will have left behind all of your files starting with a lowercase letter. This is less than ideal, so we can set a shell option to make file matching case insensitive. This type of matching is known as globbing, and to set the option, we run shopt -s nocaseglob. (In zsh this would be unsetopt CASE_GLOB.) If you just run that in your shell, it’ll stick for the current session until you unset it with shopt -u nocaseglob. You might even want to add that to your .bash_profile. Bash, however, also lets us scope the option to a single command by running it in a subshell – just wrap the commands in parentheses:
(shopt -s nocaseglob ; mv ./[A-H]* /Volumes/Library/A-H/)
This will only use case insensitive globbing for that single command; because the shopt runs in a subshell, the option never actually changes in your main shell session.
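You can confirm afterwards that nothing changed in your interactive shell (assuming you hadn’t already enabled the option yourself):
shopt nocaseglob
> Prints the option name and its current status – still off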

Loops

Bash allows you to make use of some rather powerful for loops. I frequently use loops to automate boring manual work, like converting a bunch of RAW image files into web-friendly JPEGs of appropriate size:
for i in *.CR2; do
    dcraw -c -a -w -v $i | convert - -resize '1000x1000' -colorspace gray $i.jpg;
done;
(You could run that as a one-liner as well; the line breaks are just here to make this readable.) Here, I’m taking all of the .CR2 files in the current directory, passing each one to dcraw to decode the RAW data, then piping the output to ImageMagick, which shrinks it to a web size of no more than 1000 pixels on a side, converts it to JPEG, and makes everything black and white, which is extra-artsy. I use a similar command in our legal docs repo to convert our source Markdown files into a variety of formats, using pandoc:
for myfile in $( ls ./markdown ); do
  echo Converting $myfile;
  for fmt in html docx pdf; do
    filetrim=${myfile%???};
    pandoc -o "./"$fmt"/"$filetrim"."$fmt -f markdown "./markdown/"$myfile;
  done;
done
This one is a little fancier, as we’re doing a bunch of things with nested loops, file name trimming, etc. Let’s break it down:
for myfile in $( ls ./markdown ); do
First off, grab a list of the files in the ./markdown folder. Use the variable $myfile to store the current file’s name.
for fmt in html docx pdf; do
Now we’ve got a loop within a loop. We’re creating a list of the formats we’ll be using (html, docx, and pdf) and storing the current format in the variable $fmt.
filetrim=${myfile%???};
Here’s a useful bit – we’re trimming the last three characters (using %???) from the string, which is the extension (.md). Another valid pattern would be:
filetrim=${myfile%.*};
which simply removes the entire extension, regardless of how long it is.
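For a dot-plus-two-character extension like .md the two patterns give identical results; they only differ when the extension has a different length. A quick illustration (the file name is just an example, not part of the script):
myfile="notes.markdown"
echo ${myfile%???}
> Prints notes.markd
echo ${myfile%.*}
> Prints notes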
pandoc -o "./"$fmt"/"$filetrim"."$fmt -f markdown "./markdown/"$myfile;
Here we’re passing all of our variables we’ve assembled back to pandoc. We’re quoting the strings we want hardcoded in there, so that they’re not misinterpreted as part of the variable name, which would cause this to throw errors.
done;
done
And then we’re closing out both of our for loops.

Wrapping Up

You can also use builtin utilities to do simple tasks, like appending content to files:
echo "name=core1" >> ./solr/core.properties
That’s it for now, continue on to the next part, Unix Tools.


Setting the Standards for Open Government Data

2014.06.17 – City and state governments across America are adopting Open Data policies at a fantastic clip. Technology hotbed San Francisco has one, but places far from Silicon Valley – like Pittsburgh, PA – are also joining the municipal open data movement. The increasing availability of free tools and enthusiastic volunteer software developers has opened the door for a vast amount of new government data to be made publicly available in digital form. But merely putting this data out there on the Internet is only the first step. Much of this city data is released under the assumption that a government agency must publish something – anything – and fast. In this rush to demonstrate progress, little thought is given to the how. But the citizens who care about this data – and are actually building websites and applications with it – need to access it in machine-readable, accessible, and standards-compliant formats, as the Sunlight Foundation explains here. This explains why most city open data sets aren’t seen or used. There is a vast difference between merely opening data, and publishing Open Data. By publishing data in good formats that adhere to modern, widely-accepted standards, users of the data may reuse existing software to manipulate and display the data in a variety of ways, without having to start from scratch. Moreover, it allows easy comparison between data from different sources. If every city and every organization chooses to adopt their own standard for data output, this task becomes absolutely insurmountable – the data will grow faster than anyone on the outside can possibly keep up with.

It’s a Mess: Most “Open Government Data” Is Virtually Useless

Take, for example, the mess that is data.gov. Lots of data is available – but most of these datasets are Windows-only self-extracting zip archives of Excel files without headings that are nearly useless. This is not what the community at large means by “Open Data” – there are lots of closed formats along the way. Similarly, data which is released with its own schema, rather than adopting a common standard, is just as problematic. If you’re forcing your users to learn an entirely new data schema – essentially, a brand new language – and to write entirely new parsing software – a brand new translator – just to interact with your data, you’ve added a considerable barrier to entry that undercuts openness and accessibility. A good standard lets anyone who wants to interact with the data do so easily, without having to learn anything new or build anything new. Standard programming libraries can be built, so that it’s as simple as opening a webpage for everyone. This means that in most programming languages, using a standards-based data source can be as simple as interacting with the web: import httplib and you’re done.

Evaluating Existing Standards

Every day at The OpenGov Foundation, I work with legal and legislative data. Laws, legal codes, legislation, regulations, and judicial opinions are a few examples. What standard do we use? Well, let’s look at the most common standards available for publishing legal data on the Internet:
  • Akoma Ntoso – a well-known XML-based format that is very verbose. The level of complexity presents a high barrier to entry for most users, and has prevented its wide adoption.
  • United States Legislative Markup (USLM) – another XML-based format used by the US House of Representatives. It has the advantage of being not very verbose, extensible, and easy to use.
  • State Decoded XML – the format used by The State Decoded project. Currently, this only supports law code data, and is not widely adopted outside of this project.
  • JSON – JSON is not actually a standard, but a general format well suited to relational and tabular data, and chunks of plain text. A variant is JSON-LD which has all of the same properties, but is better for relational data. It is commonly used for transferring data on the web, but it is not practical for annotated or marked-up data.
None of these are ideal. But if I had to pick a single option to move forward, the USLM standard is the most attractive for several reasons:
  • It is older, established, and has good documentation
  • It is easily implemented and used
  • It is extensible, but not especially verbose
  • It is designed to handle inline markup and annotation, such as tables, mathematical formulas, and images
It also acts as a very good “greatest common factor” as a primary format – it can be translated easily into common formats such as HTML, Microsoft Word, plain text, and even JSON – and it covers the most common needs (e.g., tables and annotations) without the superfluous complexity that other formats require.

Setting the Standard for Open Law & Legislative Data

Moving forward, the next step beyond simply exporting USLM data from existing data sources would be to have end-to-end solutions that speak USLM natively. Instead of editing Word or WordPerfect documents to craft legislation, lawmakers could write bills in new tools that look and feel like Word, but are actually crafting well-formatted USLM XML behind the scenes, instead of a closed-source, locked-in format. This is what we call “What You See Is What You Mean” – or WYSIWYM. Here at The OpenGov Foundation, we believe in a rich, standards-based data economy, and we are actively doing our part to contribute. Our open-source online policymaking platform – Madison – already consumes USLM, and we are actively working on a WYSIWYM editor to make it easier to create and modify policy documents in Madison. We are also investigating USLM support for The State Decoded – both as an input and an output format. Hopefully, other software projects will follow suit – creating an interoperable ecosystem of legal data for everyone in the United States.


The Iceberg of Knowledge: A Quick Parable on Learning

2014.06.04 – This was inspired by Dawn Casey’s article on learning programming. Personally, I like to think about knowledge on any particular topic as an iceberg. There are three types of knowledge, and learning is the process of moving around the iceberg of knowledge:
  • What you know you know – this is the part of the iceberg you can see from where you’re standing at this very moment.
  • What you know you don’t know – this is the other side of the iceberg, which you can’t see from where you’re standing but is obviously there, just over that peak in front of you. It may be slightly larger than you expect, but you’ve got a good idea of the shape of it from where you are. You may climb to a higher point on the iceberg to see more of it.
  • What you don’t know that you don’t know – this is the majority of the iceberg, underneath your feet. It’s large, and epic. It’s always bigger than you can reasonably imagine. It takes specialized equipment and knowledge to be able to cover every inch of it on the outside, and careful dissection to get to the center.


Building Govify.org

2014.04.10

A Brief Summary

After a hackathon a few months back, we were joking about creating an easy way to take the data we’d painstakingly parsed from PDFs, Word documents, and XML files, and “translate” it back into a format that government agencies are used to. Many of us have been shell-shocked from dealing with PDFs from government agencies, which are often scanned documents, off-kilter and photocopied many times over. Fundamentally, they’re very difficult to pry information out of. For the OpenGov Foundation’s April Fools’ prank, we created Govify.org, a tool to convert plain text into truly ugly PDFs.

Implementation

A quick [one-line ImageMagick command](https://gist.github.com/krues8dr/9437567) was the first version. We quickly produced a few sample documents, and decided that it would be fantastic if users could upload their own files and convert them. Very quickly it became clear that the process might take a couple of seconds, and a decent amount of CPU – so to deal with any sort of load, we’d need a modular, decentralized process, rather than a single webpage to do everything.

Hidden troll of @FoundOpenGov‘s Govify is that front end is PHP, deploy is Ruby, worker is Python, & conversion is a shell script. #nailedit

— Ben Balter (@BenBalter) April 1, 2014  

As Ben Balter points out, there are a lot of moving pieces to this relatively simple setup. Govify.org is actually a combination of PHP, Compass + SASS, Capistrano, Apache and Varnish, Rackspace Cloud services and their great API tools, Python and Supervisord, and ImageMagick with a bash script wrapper. Why in the world would you use such a hodgepodge of tools across so many languages? Or, as most people are asking these days, "why not just build the whole thing with Node.js?"

The short answer is, the top concern was time. We put the whole project together in a weekend, using very small pushes to build standalone, modular systems. We reused components wherever possible and tried to wholly avoid known pitfalls via the shortest route around them. A step-by-step breakdown of those challenges follows.

Rough Draft

We started with just a single ImageMagick command, which:

  • Takes text as an input

  • Fills the text into separate images

  • Adds noise to the images

  • Rotates the images randomly

  • And finally outputs all of the pages as a PDF.

Using that to create a few sample documents, we began putting together a rough website to show them off. Like everyone else who needs to build a website in zero time, we threw Bootstrap onto a really basic template (generated with HTML5 Boilerplate). We used a few SASS libraries – Compass, SASS Bootstrap, and Keanu – to get some nice helpers, and copied in our standard brand styles that we use everywhere else. A few minutes in Photoshop and some filler text later, and we had a full website.

We needed a nice way to deploy the site as we make changes, and our preferred tool is Capistrano. There are other tools available, like Fabric for Python or Rocketeer for PHP, but Capistrano excels in being easy to use, easy to modify, and mostly standalone. It’s also been around for a very long time and is the one that we’ve been using the longest.

We’re using Rackspace for most of our hosting, so we stood up a box with Varnish in front of Apache and stuck the files on there. Website shipped! Ship it!

Web Uploading

Once that was done, we made the decision to allow users to upload their own files. At OpenGov, we’re primarily a PHP shop, so we decided to use PHP. OK, OK – stop groaning already. PHP is not the most elegant language in the world, and never will be. It has lots of horns and warts, and people like to trash it as a result. That being said, there are a few things it’s great at.

First and foremost, it’s incredibly easy to optimize. Tools like APC and HipHop VM allow you to take existing PHP scripts and make them run *very* well. The variety and diversity of optimization tools for PHP make it a very attractive language for dealing with high-performance apps, generally.

Second, it’s a "web-first" language, rather than one that’s been repurposed for the web – and as a result, it’s very quick to build handlers for common web tasks without using a single additional library or package. (And most of those tasks are very well documented on the PHP website as well.) Handling file uploads in PHP is a very simple pattern.

So in no time at all, we were able to create a basic form where users could input a file to upload, have that file processed on the server, and output back our PDF. Using the native PHP ImageMagick functions to translate the files seemed like a lot of extra work for very little benefit, so we kept that part as a tiny shell script.

At this point, however, we realized that the file processing itself was slow enough that any significant load could slow the server considerably. Rather than spinning up a bunch of identical servers, a job queue seemed like an ideal solution.

Creating a Job Queue

A very common pattern for large websites that do processing of data is the job queue, where single items that need processing are added to a list somewhere by one application, and pulled off the list to be processed by another. (Visual explanation, from the Wikipedia Thread Queue article.) Since we’re using Rackspace already, we were able to use Rackspace Cloud Files to store our files for processing, and the Rackspace Queue to share the messages across the pieces of the application. The entire Rackspace Cloud stack is controllable via their API, and there are nice libraries for many languages available.

On our frontend, we were able to drop in the php-opencloud library to get access to the API. Instead of just storing the file locally, we push it up to Rackspace Cloud Files, and then insert a message into our queue, listing the details of the job. We also now collect the user’s email address, so that we can email them to let them know that their file is ready.

The backend processing, however, presented a different set of challenges. Generally, you want an always-running process that is constantly checking the queue for new files to process. For processes that take a variable amount of time, you don’t want just a cron job, since the processes can start stacking up and choke the server – instead, we have a single loop that runs indefinitely: a daemon or service.

For all the things that PHP is good at, memory management is not on the list. Garbage collection is not done very well, so large processes can start eating memory rapidly. PHP also has a hard memory limit, which will kill the process in an uncatchable way once it’s hit.

Python, on the other hand, does a rather admirable job of this. Creating a quick script to get the job back out of the Rackspace Queue, pull down the file to be manipulated, and push that file back up was a rather simple task using the Rackspace Pyrax library. After several failed attempts at using both the python-daemon and daemonize packages as a runner for the script, we reverted to using Supervisor to keep the script going instead.

Final Thoughts

Obviously, this isn’t the most elegant architecture ever created. It would have made far more sense to use a single language for the whole application – most likely Python, even though very little is shared across the different pieces aside from the API.

That being said, this thing scales remarkably well. Everything is nicely decentralized, and would perform well under significant load. However, we didn’t really get very significant load from our little prank – most people were just viewing the site and example PDFs, and very few were uploading their own. Sometimes overengineering is its own reward.

Not bad for three days of work, if I do say so myself.

All of the pieces are available on Github and GPL2 licensed for examining, forking, and commenting on.

  • Govify Shell Script

  • Govify Web Site & Uploader

  • Govify Worker Script


Code for America Summit 2013

2013.11.13 – There were far too many great talks to summarize them all, but CFA has posted the full agenda online as well as videos of all of the talks. Here are a few of our favorites.

Clay Shirky’s Keynote
  • Clay covered a lot of ground, and his whole speech is worth a listen.
  • One of my favorite takeaways: “The output of a hackathon isn’t a solution to a problem, it’s a better understanding of the problem.”
  • He also covers the idea that Sir Winston Churchill summarized so well: “Success is the ability to go from one failure to another with no loss of enthusiasm.”

Tim O’Reilly, Seamus Kraft, & Gavin Newsom – Open Gov Crossing Party Lines
  • Tim O’Reilly led this discussion between OpenGov’s own Seamus Kraft (standing in for Congressman Darrell Issa) and California Lt. Governor Gavin Newsom, on how Open Government transcends party politics, and both sides of the aisle come together on this topic.

There were several talks about the problems with procurement in government, and how it is one of the fastest ways to reduce wasteful, inefficient government spending.

Clay Johnson – Procurement
  • Clay has been featured in a lot of interviews lately talking about procurement and the new Healthcare.gov website. He points out that the vendors that win bids on RFPs typically are those that have good legal departments, not necessarily the ones that have the best team for the job. He showed off a few tools designed to fix that problem:
  • Screendoor – a system for creating readable, usable RFPs
  • WriteForHumans – a simple tool to gauge how difficult a piece of text is to read.

Jeff Rubenstein – SmartProcure
  • Jeff showed us SmartProcure, a system designed for localities to find and share vendor information.

There were several talks around using data and technology to improve the effectiveness of services.

Nigel Jacobs, Joel Mahoney & Michael Evans – Boston
  • There were a few talks about the improvements being made in Boston. This clip features Michael Evans talking about City Hall To Go – a mobile van that brings City Hall services to underserved areas, allowing residents to pay taxes, get handicap parking spaces, register to vote, and do other things on the spot instead of making a trip all the way down to Boston City Hall.

Cris Cristina & Karen Boyd – RecordTrac
  • Cris and Karen discussed the RecordTrac system, used in Oakland to respond to public records requests. This project is open source, so other cities can adopt it and start using it immediately.

Saru Ramanan – Family Assessment Form
  • Historically, there have been few tools to actually tell how effective social services are in helping families. The Family Assessment Form is a tool to record and investigate help to individual families, and track their progression over time.

The second day featured many talks on user-friendly tools for local improvement. This also focused on user experience as a form of social justice.

Tamara Manik-Perlman – CityVoice
  • CityVoice is a tool built on Twilio’s API to collect feedback from residents over the phone, using a very easy-to-use system. It was originally implemented in South Bend to address the housing situation, but could be used in any locality for a variety of topics where gathering feedback outside of a single forum is useful.

Lou Huang and Marcin Wichary – StreetMix
  • Lou and Marcin gave one of the most entertaining talks of the conference on StreetMix, which allows citizens to propose layouts for streets, using simple drag-and-drop tools to create bike lanes, bus stops, medians, and more.

Dana Chisnell – Redesigning Government
  • Dana and her team were set with the task of designing a ballot for *everyone*, from adults who have never used a computer, to those with low literacy and low education, those with cognitive disabilities, and other often-ignored groups. This is a must-watch for anyone who spends time on accessibility issues, or is interested in Good UX For All.

Cyd Harrell, Ginger Zielinskie, Mark Headd, and Dana Chisnell – Redesigning Government
  • Cyd led a panel discussion on a variety of topics around UX and accessibility in technology for cities. This was my favorite talk of the conference, and one covering topics that are often overlooked.
