
The End of the Second Act of Civic Tech

2016.09.27

The End of the Second Act

On September 20th, the Sunlight Foundation announced that it is officially closing down Sunlight Labs, downsizing, and considering how best to further its mission – potentially by merging with another organization – leaving the future of dozens of technology projects uncertain. This follows on the heels of the OpenGov Foundation discontinuing most of its work on America Decoded, its open-law project, and Code for America ceasing direct funding of its Brigades after previously having shut down its technology incubator program.

The major funders of many of these and other organizations in the space have also been changing how they fund. Frustration with projects that are not completed on time and within grant budgets has made many funders wary of new opportunities. Others simply no longer wish to fund transparency efforts, especially at the federal level, believing that transparency is no longer an achievable – or even good – goal. Government, at all levels from city to federal, has also stepped up significantly in recent years, hiring more and more technologists in-house to solve problems from the inside, seemingly bypassing the need for civic tech entirely; of course, many of these hires came from the civic tech world outside of government.

The days of million-dollar checks for “new” or “cool” ideas are effectively over. In some cases, the criteria these funders have chosen for evaluating projects’ success – such as the number of “likes” an organization has on Facebook or Twitter – seem woefully out of touch with the work of advocacy organizations. Moreover, the funding cycles of these organizations are relatively short – a year or two at most – while it may take five or ten years at least to have the long-lasting impact many nonprofits aim for. All of this has contributed to a lack of a clear path toward relevant goals and sustainability for many organizations – and even small projects.
But the mere merit of having a project labeled “civic tech” simply isn’t enough to justify piles of cash and free rein to use it. If anything, the failure of much of the civic tech movement can be attributed to a profound lack of experience, oversight, and planning. The larger nonprofit world, especially on the service and advocacy side, has been using logic models for program evaluation for decades – but this work has gone largely ignored in the civic tech community. Even within the few for-profits that have emerged, impact analysis frequently takes a back seat to promotion. Instead, most projects are started without clear goals, realistic evaluation criteria, or even a firm concept of real impact. “Let’s put some things online” is, in many cases, as sophisticated a model as many projects ever reach; the actual people being served are, more often than not, an afterthought at best.

Publishing tools and data for their own sake is a fine goal, especially when they support journalistic, educational, and research pursuits – all of which are critical to transparency efforts. But we should also stop to consider whether our tools will help those who need them most, or whether they will merely help the privileged, educated, and well-off – especially for tools intended to improve equity and access. Is more transparency having the impact on people’s lives that we want it to? Are we really engaging with people and improving policy, or are we merely talking to our friends? How do we know we’re making things better for everyone, working toward equity and not just paying lip service? Simply put, we cannot know if we don’t talk to people directly, and if we don’t have a way to study our impact. There are some fantastic organizations following through, meeting people where they are – in churches, schools, libraries – but they are few and far between.
A few more enthusiastic people with backgrounds in social work and degrees in nonprofit management would be hugely beneficial in our space. At the very least, more time spent on outreach would improve just about every project.

Similarly, the technical limitations of many project teams hamper their long-term success. The “anybody can do this!” attitude that pervades the civic tech community, while extremely democratic and empowering and wonderful for making the tent bigger, often results in projects that are under-designed and lack foresight into technical sustainability. Meanwhile, just as many of our projects are the products of more experienced developers wanting to try out new, trending technologies – leading us to over-engineer solutions for niche problems, written with ephemeral technologies. In both cases, any likelihood of reusability or sustainability is dramatically reduced by poor planning and decision-making. Many problems being solved today by custom applications could just as easily be solved with a git repository, a WordPress site, or even just a spreadsheet on Google Drive. There are many fine projects that do not fall into these traps, but they’re few and far between. And there is a staggering amount of reinventing the wheel in our space, despite efforts to the contrary. As developers, we must escape our own hubris and find the simplest solutions whenever possible, if our projects are to thrive. Engineering is hard; we don’t need to make it harder.

We have largely forgotten that the purpose of all of this technology is not technology itself, but culture change. Culture changes much more slowly than fashion, which is (sadly) our most common unit of time measurement these days. It takes years to build the relationships necessary to change culture. But this is the work. It’s very attractive to work in silos, programming without doing the hard part of talking to people, but this isn’t a good way to change the world.
This is not to say that there is no value in a proof of concept. For many of the changes that the civic tech world would like to see in government or society, simply proving that something can be done is a very powerful tool. But tools should generally solve an immediate need for real people, not imagined “users”. And at the very least they should be used to start conversations, not to spin off new product companies. (There are certainly other very good and valid reasons to hack as well.) After you’ve done the work, though, definitely consider finding ways to be paid for it – without money, nothing is sustainable. And though civic tech will always rely on volunteerism, it should not simply become a form of government spec work.

But not everything needs to last forever. In many, if not most, cases, projects should have a firm timeline, with waypoints for evaluation, and a date to cease operation. Having ways to measure real success and failure in terms of impact, on a real schedule, is absolutely critical. Trying to make something last forever that was only designed to last a few months – or that simply isn’t working – is a fool’s errand. Meanwhile, finding funding for a short-term project is vastly easier than for one that will run indefinitely without clear goals.

It is important to note that there have been some incredible successes during the first and second acts of civic tech. The passage of the DATA Act, through a bipartisan effort and countless hours of work from the Sunlight Foundation and other groups, was a huge milestone for openness and transparency. Carl Malamud’s tireless work has led to the opening of the SEC EDGAR database, the IRS’ Form 990 data, and much more. And most recently, the hundreds of civic technologists from Sunlight, the brigades, and elsewhere who have gone to work for government agencies have brought their passion to new innovation inside of government.

Setting the Stage for The Third Act

Given all of this, the future of civic tech is very uncertain. Local brigades and volunteer groups have in many ways become the core of the movement. These groups have increasingly been working with government directly to introduce new tools and policies that increase transparency and equity. These spaces will continue to foster new ideas for their communities, and provide a ready source of civically knowledgeable technologists for government and nonprofits – this is critical for supporting the infrastructure of transparency. And rather than being fueled by grant funding, this work is largely driven by the passion of thousands of volunteers all over the country (and world!).

It is also, in part, sustained by contributions from the private sector. In the same way that for-profit companies have sustained the open source movement for decades, so too are they stepping up to support the civic tech movement. Some companies provide direct financial support to brigades and community groups – paying for food and event spaces. Moreover, companies like Esri and GovDelivery are making open data, open source, and civic tech central parts of their platforms. This will only continue to grow as other funding sources become increasingly scarce and the interest in open data and transparency expands.

Government is also changing to become more [civic] tech-savvy. The work of government units such as 18F in the General Services Administration and the U.S. Digital Service in the White House has impacted the federal government in substantial ways. Moreover, it has created new enthusiasm in cities and states – causing an unprecedented rise in new CTO and CDO positions all over the country, and policies reflecting the changing landscape of data needs. (As an aside, it’s noteworthy that many of these positions are being filled by former GIS data team members – a field where open data and civic tech have particularly long and deep roots!)
Tools and data are more and more often being produced in-house rather than outsourced. Another place I would love to see civic tech find a home outside of government is within libraries. I must admit that this is idealistic at best. But in many places, libraries have become – and really, have always been – a bastion of innovation and civic focus. It’s a very natural fit for both parties, and – most importantly – the people being served are physically present! There’s no way to avoid the work of meeting people where they are if you start where they are. Increased investment in civic technology from libraries would be a huge boon.

What You Can Do

Civic Organizations, Project Owners, and Dreamers

For any project just starting out – or years along its path – I would strongly recommend answering a few key questions before spending another minute on your work. You may have answered many of these already, but writing them down and posting them publicly is to everyone’s benefit. They also provide a solid starting point for an honest conversation with funders.
  • Identify your outcomes – what do you really want to change as a result of your project?
  • Evaluate your impact – what measurements will you use to judge that your work is causing the outcomes you intend?
  • Who is the audience? Where are they? How can you meet them where they are? Be specific: think about who is in need.
  • Talk to your audience – what do they think they need, and how well does your idea match up?
  • How long will the project run? What are the milestones for evaluation along the way? Set dates to evaluate your actions.
  • What is the real level of effort and cost for your project? Be realistic: think about the hours needed not just to write code, but to go to meetings, do user testing, and everything else.
  • How can you use the work of others to help you with your work? What tools, data, and research already exist in your problem space? Who else is already actively trying to solve this problem, and how can you work with them?
  • How can you make your work reproducible? What can you put online and publish so other people can make an impact using what you’ve learned? How will you license your work so others can use it?
Just to reiterate that last item – civic tech works best when we collaborate. Collaboration works best if we embrace radical transparency: sharing all of our research, methodology, and findings with the world. Even our failures can become victories if they help someone else do things better the next time. And this is true in the for-profit world as well! A single vendor in a space will struggle, but fostering an ecosystem will drive more interest, demand, and need. Open data and open source help make more open data and open source!

Board Members, Founders, and Executive Directors

Listen to your team. They’re there because they believe in what you’re doing! For an organization to grow and thrive, it must continue to change and evolve – while still staying true to its principles. Stop and look at the shape your organization is taking; you may find that it has outgrown your old vision and become something radical and new – and that’s ok! Don’t get stuck on one idea; be flexible. Listen to your team. You can’t do everything, so find people you trust and let them do what they’re good at. When they give you advice, take it. If most of your advisors are telling you one thing and you believe another, it’s probably time to step out of the way and let them do their jobs.

Government Officials at All Levels

If you’re a government official, you’re probably already aware of what you must do. First, make sure you understand the open data and policy push that’s happening today. And then, get excited about it – find a local civic tech group or brigade and chat with them about your job, and hear how passionate they are! If you put a problem in front of a technologist, no matter how big or small, we will immediately start coming up with ways of solving it – that’s where our passion comes from. Get excited with us. Some of the most passionate people I know in the civic tech movement started in government, not as technologists! Second, start viewing your job through a lens of transparency. Ask yourself, in everything that you do, how can I expose this work? How can I share with the community the problems and solutions I encounter? What data can I share with the world – whether I personally find it interesting or not? How can I work with other departments to improve communication and share knowledge outside of my silo? Try to find an issue, something that you can push on and make more transparent, and follow through with it from start to finish. And make sure to tell the community about it! No matter how mundane, we love this stuff.

Funders and Charitable Givers

If you happen to be a funder of nonprofit projects and organizations, first and foremost I ask you to renew your faith in the transparency and civic technology movement. We still have a lot of work left to do! And government can’t do it alone; small nonprofits can’t do it alone. Innovation must be fostered from a variety of sources. And please keep in mind that this process takes a very long time! One- and two-year cycles are only long enough for a proof of concept – not nearly enough to effect real change. Organizations desperately need funding for general support in addition to individual projects – the day-to-day stuff like meetings, promotion, calls, and outreach is just as important to running an organization as the fancy new releases.

However, I encourage you to hold projects accountable to the criteria for organizations and projects listed above! There are real-world metrics of impact that can effectively track the progress of our projects – make sure you know what they are, so that we can better judge the efficacy of programs. Hint: it’s not likes on social media – find real, meaningful ways of evaluating the work. If you’re a board member, your task above and beyond everything else is to understand real impact, and not just numbers.

And – in many ways the most important ask I have of you – be aware of your own personal blind spots, and work to overcome them. There are thousands of good ideas that could improve the world today and just need someone to take a chance on them. Just don’t get too distracted by every new and shiny thing – again, older projects still need long-term investment and growth. It is only by working together that we can make the world a better place than we found it.


What I Wish I Had Known About Decoding The Law

2016.08.01 – After a little over three years of working on The State Decoded (and a bit less than that on Madison), I’ve learned quite a few things about the law. The process of translating legal code into machine-readable data is not an easy one, but after thousands of hours working on this problem we’ve made some solid progress in automating it. What follows are a few lessons about the law that I wish I’d known before starting, which may help other developers make good decisions in open law and legislation projects. Update: In 2020, the Supreme Court ruled that legal codes, including annotations, are public domain works. The text below has been updated to strike through portions relevant to these changes.

Every Place is the Same

All of the legal code I’ve encountered so far is somewhat similar. The law in most places consists of many sections (usually denoted by the § symbol), grouped together by subject area under a hierarchy of structures. These structures are usually named things like article, chapter, title, appendix, or subchapter; occasionally one finds things like subsection or subcode. Sections are generally referred to by their identifier number, both in the code and by external references.

In many places, there are entirely separate codes for different legal concerns – the charter is frequently a separate body of code from the other public laws, the administrative code may be broken out, and so forth. In San Francisco, the law is separated into over a dozen individual codes.

Most structures and sections have some sort of short identifier – usually numeric – as well as a title. Structures may have additional introductory or following text, or just be a list of sections. Sections usually have a body consisting of paragraphs, sub-paragraphs, sub-sub-paragraphs, and so on – a sort of internal hierarchy. These may be numbered with numerals, letters, or roman numerals, so be careful when parsing to determine whether i means 1 or 9. Referring to a particular paragraph is common, so having permalinks to these is useful.

Sections may also contain tables, images, drawings, maps, and other non-textual information. These can be used to display zoning areas, show costs over time, or explain the makeup of the city’s flag or seal.

Many structures will begin with a list of legal definitions that are used throughout the sections under that structure; occasionally these will apply to the entire legal code. It is possible to provide inline definitions of these terms as they are defined by the law code, but you must take into account the scope given for these definitions – the scope is usually stated at the beginning of the definitions section.
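The i-means-1-or-9 ambiguity is worth making concrete. One workable heuristic – a sketch of my own, not code from The State Decoded – is to look at the label that preceded it in the same list:

```python
# A rough sketch (helper names are my own): decide whether a sub-paragraph
# label like "i" continues a roman-numeral list or a letter list, by
# looking at the label that preceded it.

ROMANS = ["i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x",
          "xi", "xii", "xiii", "xiv", "xv", "xvi", "xvii", "xviii", "xix", "xx"]

def is_roman(label, previous=None):
    """Guess whether `label` is a roman numeral rather than a letter."""
    if label not in ROMANS:
        return False               # e.g. "j" can only be a letter
    if previous is None:
        return label == "i"        # roman lists start at "i"; letter lists at "a"
    if previous in ROMANS:
        i = ROMANS.index(previous)
        if i + 1 < len(ROMANS) and ROMANS[i + 1] == label:
            return True            # e.g. "ii" directly after "i" is roman
    return False                   # e.g. "i" directly after "h" is the letter i

print(is_roman("i", "h"))    # False – the ninth letter
print(is_roman("ii", "i"))   # True – roman numeral two
```

This only looks one sibling back, so it will still be fooled by genuinely pathological numbering – but in practice, checking the predecessor resolves the vast majority of cases.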

Every Place is Different

From city to state to federal, almost every government agency and body has a different system of law. This means that every new city or state will have a different system of naming, numbering, and structuring its laws.

One city may use hyphens to segment numbering structures – such as Chicago (e.g. §1-8-010) – another may use periods, and some places, like San Francisco, use both (§1.13-5). There may be numbers, or numbers and letters, or – in the case of appendices – letters only. Numbers may be sequential within a section, or there may be numbers missing entirely. Sometimes numbers are sequential for the entire legal code; more often they are not.

Due to complexities in both the legislation and codification processes, a legal code may change format in the middle. For instance, in San Francisco’s Charter, Article XVI, §16.123 covers Civilian Positions with the Police Department, but all of the §16.123-* codes are for The Arts, Music, Sports, and Pre-school for Every Child Amendment of 2003. In cases where codes are reused in whole or in part from other sources, such as building codes, the numbering may be entirely different in the middle of a structure. (We’ll come back to other problems with building codes in a minute.)

Data Consistency & Quality Issues

In many legal codes, a section identifier may be used multiple times, especially in those codes that have multiple bodies broken out. It’s not unusual to have two or more sections numbered §1-1 in a legal code, with the article, chapter, and/or title needed to differentiate the actual section the number refers to. This means it’s often not possible to use a single identifier to uniquely identify a law. With The State Decoded, we solve this by providing an option to use permalinks that reflect the full structural hierarchy.

In a few places, sections of law do not have titles (a.k.a. “catch lines”), only a numeric identifier. Since titles are extremely useful, the legal publishers in a given place may add their own – but as of the time of this writing, they are able to claim copyright over these titles and not provide them as part of the open legal data itself. When we encountered this in Maryland, we used Mechanical Turk to pay workers to create new, free, public domain titles for the entire legal code. It didn’t cost us very much money to have the nearly 32,000 titles added, and now the entire code is much more usable.

Frequently, sections of the code will be removed via legislation. These may still appear in the published law code, but labelled Removed, sometimes with a short explanation. In some cases, particular numbered sections may be listed as Reserved, where the legislative body intends to put new code in the future but hasn’t done so yet. The effect of this is that structures may end up having no actual sections, such as this one in San Francisco. Update: Harlan Yu points out, “Another fun example: 26 USC 401(e) & (j) were repealed but subsections not renumbered b/c everyone refers to 401(k).”

Since legal terms and section numbers will appear repeatedly throughout a section, this can wreak havoc on weighted-term search engines, such as Solr/Elasticsearch/Lucene, which end up miscalculating averages.
This is especially problematic if you’re storing multiple historical versions of the law code (more on this below).

Although in general the law is considered to be public domain – freely usable by anyone, without cost or restriction – there are ongoing legal battles attempting to restrain it with copyright and usage agreements. If you want to avoid costly legal fees, it’s safest to make sure you have the official blessing of the place whose law you’re republishing before attempting to do so. Georgia recently sued Carl Malamud for republishing its laws.

Furthermore, many places cannot afford (or choose not to spend the money) to have all of their laws written from scratch. Most notably, building codes are routinely reused from the International Code Council’s (a.k.a. ICC) publications, which are protected by copyright. Although the Fifth Circuit has ruled that these codes, when included as part of the law, are not protected, many places would rather not involve themselves in legal battles, and will not publish their own building codes! They may also take an alternate route and simply make references to other building codes instead of including these texts directly – commonly called incorporation by reference. San Francisco makes reference to California’s building codes, which themselves use the ICC’s codes.

Many legal publishers have also added additional commentary and annotations to the existing law codes. Since this supplementary information is not part of the actual legal code itself, it is not in the public domain, and they are under no obligation to provide it to anyone. In fact, many companies exist primarily to sell access to this data. Update: Kristin Hodgins says that Canada is similar to the US in most of these, except copyright: “Unlike in the US, s 12 of Canada’s Copyright Act provides for crown copyright, which applies to all govt publications. Some Cdn jurisdictions provide licence to freely reproduce legislation.”
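Coming back to the duplicated-identifier problem at the top of this section: the full structural hierarchy is enough to build unambiguous permalinks. A minimal sketch (the slug format here is illustrative, not The State Decoded’s actual URL scheme):

```python
# Build a unique permalink from the full structural hierarchy, since a
# bare section number like "1-1" may appear in several codes at once.

def permalink(parts):
    """parts: ordered (structure-type, identifier) pairs down to the section."""
    return "/" + "/".join(f"{kind}-{ident}".lower() for kind, ident in parts) + "/"

# Two sections both numbered 1-1 get distinct, unambiguous URLs:
print(permalink([("code", "charter"), ("article", "I"), ("section", "1-1")]))
print(permalink([("code", "admin"), ("chapter", "2"), ("section", "1-1")]))
```

The key design choice is that every ancestor structure participates in the URL, so the link stays unique even when section numbers collide across codes.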

The Law Changes Over Time

As with most things, the law is constantly growing and changing. Converting the law into a digital format is not a one-time event – it is an ongoing process that requires time and effort. Some places update their legal code every week or two, some only once or twice a year.

This is often because the process of codifying the law is usually separate from the legislative process. In most places, a body of elected officials will vote on bills to determine changes to the law, but the bills do not always say precisely how the new law will read. This means that someone else will have to interpret the bill to create a new wording. This may then be handed off to yet another group – frequently an outside vendor – to actually incorporate into the law, with numbering and so on.

All of this means that digitizing the law once is not enough to stay current and relevant; it must be updated as often as the official code is. You will probably also want to keep multiple versions of the code over time, to be able to show what the law was at any point in history.
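Keeping multiple versions over time boils down to storing each codified version of a section with its effective date, then answering “what did the law say on day X?” A sketch of that lookup (the data model is illustrative, not any project’s actual schema):

```python
import bisect
from datetime import date

class SectionHistory:
    """Every codified version of one section, keyed by effective date."""

    def __init__(self):
        self.dates = []    # sorted effective dates
        self.texts = []    # text effective as of the matching date

    def add_version(self, effective, text):
        i = bisect.bisect(self.dates, effective)
        self.dates.insert(i, effective)
        self.texts.insert(i, text)

    def as_of(self, when):
        """Return the text in force on `when`, or None if not yet enacted."""
        i = bisect.bisect_right(self.dates, when) - 1
        return self.texts[i] if i >= 0 else None

h = SectionHistory()
h.add_version(date(2014, 1, 1), "original wording")
h.add_version(date(2015, 7, 1), "amended wording")
print(h.as_of(date(2015, 1, 1)))   # original wording
```

Binary search over effective dates keeps point-in-time lookups cheap even for sections that have been amended dozens of times.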

The Law as Written Is Not Always The Law

The nature of our democracy in the United States allows for judicial review, which can overturn or interpret particular aspects of the law.  This means that even if you have the official text of the code of law, you might not know the actual law or how it is applied.  To determine the full law, you must refer to other sources. With The State Decoded, for states we are able to include opinions and cases via Court Listener’s API.  However, we don’t have a reliable way of showing these for cities.  And decisions in courts can frequently impact potential future decisions in other courts on similar topics, even if they’re not covering the same jurisdiction. Pending legislation may also change the law in the future, and that’s a good thing for people to know about – as the law as written today may change tomorrow.  A bill that’s already been enacted may not take effect until a particular date in the future.

The Law Does Not Mean The Same Thing in Different Places

Since the definitions of particular terms are specific to a given place – or even a specific section of code for a particular place – these terms are not universal, which makes it hard to compare laws directly. In one city, a “month” may be 30 days, or 31, or tied to whichever calendar month is being referenced.

Update: Eric Mill and Jacob Kaplan-Moss pointed out that some legislative bodies have an even stranger interpretation of time. In cases where a law or mandate requires the legislature to perform an action by a given date, but they can’t complete the action in that time, they will legally extend that day past 24 hours. As a result, you may see a motion in legislative data along the lines of for the purposes of this legislative session, July 31st has 250 hours. This can play havoc with processing the data, as you may see timestamps such as July 31 107:24, which will have to be saved in some non-native date format. The U.S. Congress uses something similar called a “legislative day”, which continues until the body next adjourns – which may be days or months later!

There have been some attempts at creating an ontology of law to provide a way of universally comparing similar ideas. However, since the law is written in Word or WordPerfect in most places, this metadata has to be added downstream, and is not easily automated.
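One way to cope with those overflowing “legislative day” timestamps is to normalize them into real datetimes before storage. A sketch, assuming an input shaped like July 31 107:24 (the exact format varies by legislature, so treat this as a template):

```python
from datetime import datetime, timedelta

def normalize(text, year):
    """Turn 'July 31 107:24' (hours past midnight, possibly > 24)
    into a real datetime in the given year."""
    month_day, clock = text.rsplit(" ", 1)
    hours, minutes = (int(part) for part in clock.split(":"))
    midnight = datetime.strptime(f"{month_day} {year}", "%B %d %Y")
    # timedelta happily absorbs the overflow: 107 hours = 4 days, 11 hours.
    return midnight + timedelta(hours=hours, minutes=minutes)

print(normalize("July 31 107:24", 2015))   # 2015-08-04 11:24:00
```

Note that this deliberately discards the legislative fiction – if you need to reproduce the official record, keep the original string alongside the normalized value.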

Standards, Formats, and Standard Formats

Standards in data are critical for making tools interoperable. It’s important to use existing standards whenever possible to make sure that, in the future, everyone’s tools will be able to talk to each other and share data. Whenever possible, you should try to leverage popular existing standards for your data interfaces, and should almost never invent your own!

When work began on The State Decoded, there were no obvious standard formats for legal code. I wrote in more detail about standards for the law a few years back. Since that time, Akoma Ntoso has become the most popular standard format for distributing legal code internationally. It’s an XML schema which provides everything you need to break up the law into usable data. A similar format, USLM, is used by the US House of Representatives, and we used to focus on that for compatibility. However, USLM lacks the flexibility of Akoma Ntoso for different types of documents, and Akoma Ntoso has become much simpler to implement. It also allows for additional microformats within the data, which helps with the ontology problem.

In general, you’ll need a way to store highly structured data to properly represent the law. XML is ideal because it can handle the nesting and inline markup associated with legal code. JSON is not a good choice, since it’s designed for strict hierarchical structures and is awful at inline markup. For database storage, many groups use eXist-db, which stores XML documents natively – however, since both MySQL and PostgreSQL now have native support for XML and XPath similar to eXist-db, they are fantastic choices for this. I strongly recommend breaking up the law code into sections for each record, rather than keeping structures together or breaking things into paragraphs, as this makes it much easier to work with the data.
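To see why XML wins here, consider a simplified, Akoma Ntoso-flavored fragment – this is not the real schema, just an illustration of the two things JSON handles badly: nested hierarchy plus inline markup mixed directly into the running text.

```python
import xml.etree.ElementTree as ET

# Element names loosely echo Akoma Ntoso conventions; the ids are invented.
doc = """
<section id="sec_1-1">
  <num>1-1</num>
  <heading>Definitions</heading>
  <paragraph id="sec_1-1__para_a">
    <num>(a)</num>
    <content>In this chapter, <term>month</term> means thirty days.</content>
  </paragraph>
</section>
"""

section = ET.fromstring(doc)
print(section.find("heading").text)                    # Definitions
print(section.find(".//term").text)                    # month
print("".join(section.find(".//content").itertext()))  # full mixed-content text
```

The `<term>` element sits in the middle of a sentence without breaking it apart – the kind of inline annotation that a JSON tree would force you to shred into awkward fragments.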

And that’s just to start

There will undoubtedly be more issues that come up that I haven’t run into yet, but above are a lot of the ones I’ve run into over the last three years.  If you encounter others or know something I don’t, please get in touch and let me know!


Managing Data with Unix Tools

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box. If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want.

Let me start with the conclusion: the next time you have to transform or manipulate your data, look around first for what Unix tools already exist. It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around.

This is part 3 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

awk, a tool for spreadsheets & logs

awk is a tool to work with spreadsheets, logs, and other column-based data. Given some generic CSV data, we can manipulate the columns to get what we want out. For instance, if we want to remove some of the personally identifying information, we can drop the name and relationship columns:
awk -F ',' 'BEGIN { OFS=","} {print $2,$4,$5}' data.csv
> Returns the columns Age,Job,Favorite Thing from the csv.
Here, we’re telling awk that the input column separator is a comma with the -F flag, and we’re also telling it to use a comma to separate the output with { OFS="," } in the expression. Then we’re telling it to output only columns 2, 4, and 5 ({print $2,$4,$5}). We can also use it to parse log files. Given a standard Apache combined log, we can get a list of all IP addresses easily:
awk '{print $1}' access_log
And then we can pass that through the sort utility to put them in order, and the uniq utility to collapse the duplicates – uniq only removes adjacent duplicate lines, which is why we sort first – or even get a count of how many times each visitor has hit our site (with -c for count).
awk '{print $1}' ./access.log | sort | uniq -c
> Outputs each IP address with a count of how many visits.
You can do a whole lot more with awk too, including scripting inside the expression.
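That in-expression scripting can take you surprisingly far. As a minimal sketch (the /tmp file and its contents are made up for the demo), here's awk computing the average of a numeric column, skipping the header row:

```shell
# Build a tiny CSV to work with (hypothetical data, just for illustration).
printf 'Name,Age\nAlice,30\nBob,40\n' > /tmp/people.csv

# NR > 1 skips the header row; the END block runs once after the last line.
awk -F ',' 'NR > 1 { total += $2; count++ } END { print total / count }' /tmp/people.csv
```

This prints 35, the average of the Age column – no Python script required.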

Filtering and Analysis with grep and sed

If you’re using more sophisticated tools like goaccess to analyze your logs, you can preprocess the logs with the tools we’ve covered in the previous articles in this series (Bash Basics and Unix Tools). To get just a couple of days from the logs:
sed -n '/04\/May\/2015/,/06\/May\/2015/ p'  /var/log/httpd/access.log | goaccess -a
> Passes log entries for May 4th and 5th only to goaccess
Or if you need to process multiple log files:
cat /var/log/httpd/*-access.log | grep "04\/May\/2015" | goaccess -a
> Parses only log rows that have the 4th of May in them, from all of our Apache log files.

Parsing Files with Bash

Going back to our initial post, you can actually do a lot with just bash alone. We can even use for loops to iterate over spreadsheets, instead of sed or awk. I sent a pull request the other day to remove a python script that fed into a bash script, so the bash script would do all the work. The original is designed to take a spreadsheet, and for each row it pulls the listed git repo and zips it up. Here’s the updated script. Keep in mind that though this is a bash script, this could also be run directly from the command line.
#!/bin/bash
INPUT=respondents.csv
[ ! -f "$INPUT" ] && { echo "$INPUT file not found"; exit 99; }

(tail -n +2 "$INPUT" ; echo ) | while IFS=, read -r RESPONDENT REPOSITORY
do
	echo "Fetching $REPOSITORY"
	REPO="$(basename "$REPOSITORY")"

	REPONAME="${REPO%.git}"

	git clone "$REPOSITORY" && tar -czf "responses/$RESPPONDENT.tar.gz" "$REPONAME" && rm -rf "$REPONAME"
done
Let’s go through this a piece at a time:
INPUT=respondents.csv
[ ! -f "$INPUT" ] && { echo "$INPUT file not found"; exit 99; }
First, we hard-code our input file; if that file doesn’t exist in the current directory, we print an error and exit with a non-zero code. (Quoting "$INPUT" keeps the test working even if the filename ever contains spaces.)
tail -n +2 "$INPUT"
We take the input file and skip the first line using tail by passing it -n +2, so that we don’t try to process the headers. The result might not have a trailing newline, but we need one for bash to process the last line of the file, so we append an extra echo to emit one. We then pipe all of this to the while loop, which reads it line by line.
while IFS=, read -r RESPONDENT REPOSITORY
Now we loop over each line of the file, using IFS to tell read to use , as the column separator. read takes the two columns and puts them into the two variables, RESPONDENT and REPOSITORY; the -r flag stops read from treating backslashes as escape characters.
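To see the read loop in isolation, here’s a minimal sketch fed by an inline two-column CSV (the names and URLs are made up):

```shell
# Each line is split on the comma; -r keeps backslashes literal.
printf 'alice,https://example.com/a.git\nbob,https://example.com/b.git\n' |
while IFS=, read -r RESPONDENT REPOSITORY; do
    echo "$RESPONDENT -> $REPOSITORY"
done
```

Each iteration sees one row, already split into the two variables.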
REPO="$(basename "$REPOSITORY")"
REPONAME="${REPO%.git}"
Here we’re doing some string manipulation: basename extracts just the name of the repo from the full repo path, and ${REPO%.git} drops the .git suffix from that name and stores the result in REPONAME.
git clone "$REPOSITORY" && tar -czf "responses/$RESPONDENT.tar.gz" "$REPONAME" && rm -rf "$REPONAME"
Finally, we use all of the variables we’ve created to assemble our commands: clone the repo, tar the results, and remove the cloned repo directory. You can do even more with bash and Unix tools; hopefully this is enough to get you started working with the many tools your system comes with! Mark Headd also wrote a great article on command line data science, and recommended this Sysadmin Casts episode on command line tools.

Unix Tools

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box.  If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want. Let me start with the conclusion first. The next time you have to transform or manipulate your data, look around for what Unix tools already exist first.  It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around. This is part 2 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

find, a better ls

Now, savvy readers will get their hackles up over that last example, because we’re using ls to list our files before processing. ls is a great utility for listing files, but the results it outputs are potentially dangerous, as it doesn’t do any escaping. It’s also a rather single-purpose tool. Instead, we can use the find utility in our advanced commands, which is safer. By default, `find .` will give you a recursive directory listing. Adding the `-type` flag will allow you to filter to directories or files only. (Note that this has to come after the directory path!)
find ./test -type f
> Lists all files in the ./test directory, all the way down.
find . -type d
> Lists all directories in the current directory and below.
That last example is a bit problematic, because find will include the current directory (.) in the list, which is usually undesirable. Handily, find can also do both regular-expression and simple pattern matching. To match the name of something using standard wildcards, use the -name flag; to use a regular expression, use the -regex flag. You can even tell it what flavor of regular expression you want with -regextype – I usually prefer posix-extended, myself.
find ./temp -name "App*"
> Returns all files & directories from the ./temp directory down that start with "App"
find ./temp -type f -regextype posix-extended -regex '.*-[0-9]{1,6}\.xml'
> Finds all files whose names end in 1 to 6 digits followed by a .xml extension (note the escaped dot – an unescaped . would match any character)
Now, once you have your files, you’ll probably want to manipulate them. find provides you a few different ways of manipulating the results it finds. The -exec flag allows you to run commands on the output:
find  ./* -type d -name "Temp*" -exec rm -R {} \;
> Removes all folders that start with Temp.
Note that you need that trailing \; to tell -exec that you’re done writing your command. If you just want to remove matching files, you can use -delete instead – but be aware that -delete only removes files and empty directories, so non-empty folders still need -exec rm -R:
find ./* -type f -name "Temp*" -delete
> Removes all files that start with Temp.
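Note that -delete only removes files and empty directories; non-empty folders still need -exec rm -R. A quick sketch in a scratch directory (the /tmp paths are made up for the demo) shows the distinction:

```shell
# Set up a scratch tree: one matching file, one matching non-empty directory.
mkdir -p /tmp/demo/TempA /tmp/demo/keep
touch /tmp/demo/TempA/file.txt /tmp/demo/Temp.txt

# -delete handles the plain file...
find /tmp/demo -type f -name "Temp*" -delete

# ...but the non-empty directory still needs -exec rm -R.
# -depth visits a directory's contents before the directory itself,
# so find never tries to descend into a folder it has already removed.
find /tmp/demo -depth -type d -name "Temp*" -exec rm -R {} \;
```

Afterwards both Temp.txt and TempA are gone, while the keep directory survives.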

grep, the most important command

This is just a brief break to make sure you know about grep. Grep searches for matching text within the contents of files. It’s a fantastic first-pass tool to narrow down your results. For instance, if I wanted to find all of the config files in my current directory that had port 8080 set:
grep 8080 *
> Apple.cfg:8080
To make this more useful, there are a handful of flags you want to use. Most of the time, you probably want this search to be recursive, so you’ll add -R. You’ll also probably want to pass the output of this command to some other command to process the list, in which case the matched text that is returned after the : is actually a problem – so use -l (that’s lowercase-L) to only show the files matched, not the match text. -i will give you case-insensitive matches. And most importantly, -e pattern allows you to supply a regular expression pattern to match, or -E uses “extended” regular expressions.
grep -RilE "Ap{2}.*" .
> Returns all files that contain "App", upper- or lowercase.
grep can also be used as a simple filter, to return only entries that match a given pattern:
cat /var/log/httpd/*-access.log | grep "04\/May\/2015"
> Returns only log rows that have the 4th of May in them, from all of our Apache log files.
You can also tell grep to negative-match with -v, to remove matching entries from the results. Will points out that there’s also fgrep, which is faster for fixed patterns, but cannot handle regular expressions.
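For instance, here’s a minimal sketch of -v as a noise filter, using a made-up access log in /tmp:

```shell
# Two real page hits and one health-check probe.
printf '1.2.3.4 GET /index\n9.9.9.9 GET /healthz\n1.2.3.4 GET /about\n' > /tmp/access.log

# -v inverts the match: keep every line that does NOT mention /healthz.
grep -v "/healthz" /tmp/access.log
```

Only the two real page hits survive, which keeps monitoring noise out of any visitor counts you compute downstream.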

xargs, the list handler

Now, once you’ve got your files from find or grep, you’ll want to manipulate them. xargs has a single purpose: it takes a list of files and performs a command (or several commands) on them. xargs passes each file in turn to whatever command you give it, with the filename as the input. For instance, you can output the contents of all the files in a directory:
find ./* -type f | xargs cat
You can also construct more complicated commands by using the filename as a variable, which we assign with -I variable. It’s usually best to use something uncommon so it doesn’t cause other problems – {} is what is usually used in examples:
find ./* -type f | xargs -I {} mv {} {}.bak
> Renames all files to add .bak to the end of the name
xargs can also fire off these commands in parallel, so if you have multiple processors (or multicore processors), you can spread the work across separate processes. We use the -P flag to tell it to run processes in parallel, giving it a number for how many to run at once. Here’s an article on using xargs as a hadoop replacement. xargs is one of my most used tools, as you can construct very long and complicated commands using the output of other commands.
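As a sketch, here’s a batch of log files compressed with up to four gzip processes at once (the /tmp paths and filenames are made up for the demo):

```shell
# Create a couple of throwaway log files.
mkdir -p /tmp/logs
printf 'first\n'  > /tmp/logs/a.log
printf 'second\n' > /tmp/logs/b.log

# -P 4 runs up to four gzip processes in parallel, one file each;
# gzip replaces each .log with a .log.gz.
find /tmp/logs -type f -name '*.log' | xargs -P 4 -I {} gzip -f {}
```

On a two-file demo the parallelism is invisible, but with thousands of large logs the wall-clock difference is substantial.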

sed, the search-and-replace powertool

sed is an amazingly powerful utility. Its main use is to find and replace text across files. For instance, if I want to change the port of a service in a config file, I can do so easily:
sed -i '' 's/8080/8983/g' ./config.yml
No need to open the file and edit it – sed edits it in place. (The empty '' after -i is the BSD/macOS way of saying “no backup file”; with GNU sed you’d write just -i on its own.) You can also combine this with find to edit multiple files:
sed -i '' 's/8080/8983/g' $(find ./config/ -type f -name '*yml' )
Here, we’re capturing the results of find and passing it to sed with $( ). You could also use xargs or find’s -exec flag as discussed above. It can also find values in a given file. One thing I often use it for is filtering down to a particular date range in a large log file. For instance, if I just care about a few days in an Apache log file, I can tell sed to get just the rows from the start date to the end date:
sed -n '/04\/May\/2015/,/06\/May\/2015/ p'  /var/log/httpd/access.log
> Returns lines from the file that start on 04/May/2015, and stops at the first instance of 06/May/2015
You’d need about 20 lines of Python to do the same thing. This is just a taste of what sed can do – it’s very useful. Ozzy also points out that there’s jq, which is like sed for JSON. That’s it for now; continue on to the next part, Managing Data with Unix Tools.

Bash Basics

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box.  If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want. Let me start with the conclusion first. The next time you have to transform or manipulate your data, look around for what Unix tools already exist first.  It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around. This is part 1 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

Assumptions

I’ll assume you know the very basics here – what a man page is, how to create an executable bash script, how to open a terminal window, and how to use basic utilities.  If you don’t know any of those, you should start with one of the many intros to the command line available.  This intro by Zed Shaw is a good place to start.

The Shell

Bash is the default shell on most systems these days, but what we’re covering here will mostly work for zsh or other shells – though some syntax elements will be different. First off, Bash is a powerful tool by itself. Even with no additional packages added, you get variables, loops, expansions & regular expressions, and much more. Here’s a good guide with more information on using bash. I’ll assume you know the basics from here on out, and show you what you can do with them.

Advanced Paths

If you want to work with several directory paths in a row that are very similar, you can pass a list to the shell using curly braces {} and it’ll expand that list automagically. Let’s say I wanted to set up a few directories for a new project’s test suite. Rather than running a lot of duplicated commands, I could pass a few lists instead.
mkdir -p ./test/{unit,fixtures}
> Creates ./test/unit and ./test/fixtures
mkdir -p ./test/unit/{controllers,models}
> Creates ./test/unit/controllers and ./test/unit/models
Note that we’ve passed the -p flag to mkdir so that it’ll create all of the directories up the chain, even ./test here. We can also use pattern matching with brackets []. For instance, if you’ve got a lot of files that you want to separate alphabetically, you can use a letter pattern:
mv ./[A-H]* /Volumes/Library/A-H/
mv ./[I-O]* /Volumes/Library/I-O/
mv ./[P-Z]* /Volumes/Library/P-Z/
This will have broken your library up into three sets. You can also use that matching later in the string:
mv ./A[a-k]* /Volumes/Library/Aa-Ak/
mv ./A[l-z]* /Volumes/Library/Al-Az/
Now, by default most systems will be case sensitive, so you will have left behind all of your files starting with a lowercase letter. This is less than ideal, so we can set a shell option to make file matching case insensitive. This type of matching is known as globbing, and to set the option, we run shopt -s nocaseglob. (In zsh this would be unsetopt CASE_GLOB.) If you just run that in your shell, it’ll stick for the current session until you unset it with shopt -u nocaseglob. You might even want to add that to your .bash_profile. Bash, however, also lets us scope the option to a single command by wrapping the commands in parentheses, which runs them in a subshell:
(shopt -s nocaseglob ; mv ./[A-H]* /Volumes/Library/A-H/)
This uses case-insensitive globbing only for that single command; the option never escapes the subshell, so your main session is left untouched.

Loops

Bash allows you to make use of some rather powerful for loops. I frequently use loops to automate boring manual work, like converting a bunch of RAW image files into web-friendly JPEGs of appropriate size:
for i in *.CR2; do
    dcraw -c -a -w -v $i | convert - -resize '1000x1000' -colorspace gray $i.jpg;
done;
(You could run that as a one-liner as well; the line breaks are just here to make this readable.) Here, I’m taking all of the .CR2 files in the current directory, passing those to dcraw to decode the RAW data, then piping the output to ImageMagick’s convert, which shrinks it to a web size of no more than 1000 pixels on a side and makes everything black and white, which is extra-artsy. I use a similar command in our legal docs repo to convert our source Markdown files into a variety of formats, using pandoc:
for myfile in $( ls ./markdown ); do
  echo Converting $myfile;
  for fmt in html docx pdf; do
    filetrim=${myfile%???};
    pandoc -o "./"$fmt"/"$filetrim"."$fmt -f markdown "./markdown/"$myfile;
  done;
done
This one is a little fancier, as we’re doing a bunch of things with nested loops, file name trimming, etc. Let’s break it down:
for myfile in $( ls ./markdown ); do
First off, grab a list of the files in the ./markdown folder. Use the variable $myfile to store the current file’s name.
for fmt in html docx pdf; do
Now we’ve got a loop within a loop. We’re creating a list of the formats we’ll be using (html, docx, and pdf) and storing the current format in the variable $fmt.
filetrim=${myfile%???};
Here’s a useful bit – we’re trimming the last three characters (using %???) from the string, which is the extension (.md). Another valid pattern would be:
filetrim=${myfile%.*};
which simply removes the entire extension, regardless of how long it is.
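The % family has a few siblings worth knowing. A quick sketch with a made-up filename (the filename is just for illustration):

```shell
myfile="report.final.md"
echo "${myfile%.*}"     # shortest match trimmed from the end:   report.final
echo "${myfile%%.*}"    # longest match trimmed from the end:    report
echo "${myfile#*.}"     # shortest match trimmed from the front: final.md
```

% and %% trim from the end, # and ## from the front; doubling the character switches from shortest to longest match.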
pandoc -o "./"$fmt"/"$filetrim"."$fmt -f markdown "./markdown/"$myfile;
Here we’re passing all of our variables we’ve assembled back to pandoc. We’re quoting the strings we want hardcoded in there, so that they’re not misinterpreted as part of the variable name, which would cause this to throw errors.
done;
done
And then we’re closing out both of our for loops.

Wrapping Up

You can also use builtin utilities to do simple tasks, like appending content to files:
echo "name=core1" >> ./solr/core.properties
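The number of arrows matters: a single > truncates the file first, while >> appends. A quick sketch against a scratch file (the /tmp path is made up for the demo):

```shell
echo "name=core1"         >  /tmp/core.properties  # creates or overwrites the file
echo "loadOnStartup=true" >> /tmp/core.properties  # appends a second line
cat /tmp/core.properties
```

Running the first line again would wipe the file back down to one line, so reach for >> whenever you mean “add to the end.”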
That’s it for now, continue on to the next part, Unix Tools.
