
What I Wish I Had Known About Decoding The Law

2016.08.01 – After a little over three years of working on The State Decoded (and a bit less than that on Madison), I’ve learned quite a few things about the law.  The process of translating legal code into machine-readable data is not an easy one, but after thousands of hours working on this problem we’ve made some solid progress in automating it. What follows are a few lessons about the law that I wish I’d known before starting, which may help other developers to make good decisions in open law and legislation projects. Update: In 2020, the Supreme Court ruled that legal codes, including annotations, are public domain works. The text below has been updated to strikethrough portions relevant to these changes.

Every Place is the Same

All of the legal code I’ve encountered so far is somewhat similar. The law in most places consists of many sections (usually denoted by the § symbol), grouped together by subject area under a hierarchy of structures. These structures are usually named things like article, chapter, title, appendix, or subchapter. Occasionally one finds things like subsection or subcode. Sections are generally referred to by their identifier number, both elsewhere in the code and in external references.

In many places, there are entirely separate codes for different legal concerns – the charter is frequently a separate body of code from the other public laws, the administrative code may be broken out, and so forth. In San Francisco, the law is separated into over a dozen individual codes.

Most structures and sections have some sort of short identifier – usually numeric – as well as a title. Structures may have additional introductory or following text, or just be a list of sections. Sections usually have a body consisting of paragraphs, sub-paragraphs, sub-sub-paragraphs, and so on – a sort of internal hierarchy. These may be numbered with numerals, letters, or roman numerals, so be careful when parsing to determine whether i means 1 or 9. Referring to a particular paragraph is common, so having permalinks to these is useful.

Sections may also contain tables, images, drawings, maps, and other non-textual information. These can be used to display zoning areas, show costs over time, or explain the makeup of the city’s flag or seal.

Many structures will begin with a list of legal definitions that are used throughout the sections under that structure; occasionally these apply to the entire legal code. It is possible to provide inline definitions of these terms as they are defined by the law code, but you must take into account the scope given for these definitions – the scope is usually stated at the beginning of the definitions section.

Every Place is Different

From city to state to federal, almost every government agency and body has a different system of law. This means that every new city or state will have a different system of naming, numbering, and structuring its laws.

One city may use hyphens to segment numbering structures – such as in Chicago (e.g. §1-8-010) – another may use periods, and some places, like San Francisco, use both (§1.13-5). There may be numbers, or numbers and letters, or, in the case of appendices, letters only. Numbers may be sequential within a section, or there may be numbers missing entirely. Sometimes numbers are sequential for the entire legal code; more often they are not.

Due to complexities in both the legislative and codification processes, a legal code may change format in the middle. For instance, in San Francisco’s Charter, Article XVI, §16.123 covers Civilian Positions with the Police Department, but all of the §16.123-* codes are for The Arts, Music, Sports, and Pre-school for Every Child Amendment of 2003. In cases where codes are reused in whole or in part from other sources, such as building codes, the numbering may be entirely different in the middle of a structure. (We’ll come back to other problems with building codes in a minute.)

Data Consistency & Quality Issues

In many legal codes, a section identifier may be used multiple times, especially in those codes that have multiple bodies broken out. It’s not unusual to have two or more sections numbered §1-1 in a legal code, with the article, chapter, and/or title needed to differentiate the actual section the number refers to. This means it’s often not possible to use a single identifier to uniquely identify a law. With The State Decoded, we solve this by providing an option to use permalinks that reflect the full structural hierarchy.

In a few places, sections of law do not have titles (a.k.a. “catch lines”), only a numeric identifier. Since titles are extremely useful, the legal publishers in a given place may add their own – but as of the time of writing this, they are able to claim copyright over these titles and not provide them as part of the open legal data itself. When we encountered this in Maryland, we used Mechanical Turk to pay volunteers to create new, free, public domain titles for the entire legal code. It didn’t cost us very much money to have the nearly 32,000 titles added, and now the entire code is much more usable.

Frequently, sections of the code will be removed via legislation. These may still appear in the published law code, but labelled Removed, sometimes with a short explanation. In some cases, particular numbered sections may be listed as Reserved, where the legislative body intends to put new code there in the future but hasn’t done so yet. The effect of this is that structures may end up having no actual sections, such as this one in San Francisco. Update: Harlan Yu points out, “Another fun example: 26 USC 401(e) & (j) were repealed but subsections not renumbered b/c everyone refers to 401(k).”

Since legal terms and section numbers appear repeatedly throughout a section, this can wreak havoc on weighted-term search engines, such as Solr/Elasticsearch/Lucene, which end up with skewed term weights. This is especially problematic if you’re storing multiple historical versions of the law code (more on this below).

Although in general the law is considered to be public domain – freely usable by anyone, without cost or restriction – there are ongoing legal battles attempting to restrain it with copyright and usage agreements. If you want to avoid costly legal fees, it’s safest to make sure you have the official blessing of the place whose law you’re republishing before attempting to do so. Georgia recently sued Carl Malamud for republishing their laws.

Furthermore, many places cannot afford (or choose not to spend the money) to have all of their laws written from scratch. Most notably, building codes are routinely re-used from the International Code Council’s (a.k.a. ICC) publications, which are protected by copyright. Although the 5th Circuit has ruled that these codes, once included as part of the law, are not protected, many places would rather not involve themselves in legal battles, and will not publish their own building codes! They may also take an alternate route and simply make references to other building codes instead of including those texts directly – commonly called incorporation by reference. San Francisco makes reference to California’s building codes, which themselves use the ICC’s codes.

Many legal publishers have also added commentary and annotations to the existing law codes. Since this supplementary information is not part of the actual legal code itself, it is not in the public domain, and they are under no obligation to provide it to anyone. In fact, many companies exist primarily to sell access to this data. Update: Kristin Hodgins says that Canada is similar to the US on most of these points, except copyright: “Unlike in the US, s 12 of Canada’s Copyright Act provides for crown copyright, which applies to all govt publications. Some Cdn jurisdictions provide licence to freely reproduce legislation.”

The Law Changes Over Time

As with most things, the law is constantly growing and changing. Converting the law into a digital format is not a one-time event – it is an ongoing process that requires time and effort. Some places update their legal code every week or two, some only once or twice a year.

This is often because the process of codifying the law is usually separate from the legislative process. In most places, a body of elected officials will vote on bills to determine changes to the law, but the bills do not always say exactly how the new law will read. This means that someone else has to interpret the bill to create the new wording. That may then be handed off to yet another group – frequently an outside vendor – to actually incorporate into the law, with numbering, etc.

All of this means that digitizing the law once is not enough to stay current and relevant; it must be updated as often as the official code is. You will probably also want to keep multiple versions of the code over time, to be able to show what the law was at any point in history.

The Law as Written Is Not Always The Law

The nature of our democracy in the United States allows for judicial review, which can overturn or interpret particular aspects of the law. This means that even if you have the official text of the code of law, you might not know the actual law or how it is applied. To determine the full law, you must refer to other sources. With The State Decoded, for states we are able to include opinions and cases via Court Listener’s API (a rough sketch of such a lookup is at the end of this section). However, we don’t have a reliable way of showing these for cities. And court decisions can frequently affect future decisions in other courts on similar topics, even if they don’t cover the same jurisdiction.

Pending legislation may also change the law in the future, and that’s a good thing for people to know about – the law as written today may change tomorrow. A bill that’s already been enacted may not take effect until a particular date in the future.
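To give a flavor of the Court Listener lookup mentioned above, here’s a rough sketch against CourtListener’s public REST API. The /api/rest/v3/search/ path and the type=o (opinions) parameter are assumptions about that API rather than anything specific to The State Decoded, so check the current API documentation; the citation in the query is just an example:
# Assumed endpoint and parameters; verify against the current CourtListener API docs.
curl -sG "https://www.courtlistener.com/api/rest/v3/search/" \
  --data-urlencode "type=o" \
  --data-urlencode 'q=Va. Code 18.2-96'
> Returns JSON search results for opinions matching the query.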

The Law Does Not Mean The Same Thing in Different Places

Since the definitions of particular terms are specific to a given place – or even to a specific section of code for a particular place – these terms are not universal, which makes it hard to compare laws directly. In one city a month may be 30 days, in another 31, and in another its meaning may depend on the specific section referencing it.

Update: Eric Mill and Jacob Kaplan-Moss pointed out that some legislative bodies have an even stranger interpretation of time. In cases where a law or mandate requires the legislature to perform an action by a given date, but they can’t complete the action in that time, they will legally extend that day past 24 hours. As a result, you may see a motion in legislative data along the lines of for the purposes of this legislative session, July 31st has 250 hours. This can play havoc with processing the data, as you may see timestamps such as July 31 107:24, which will have to be saved in some non-native date format (one way to normalize these is sketched at the end of this section). The U.S. Congress uses something similar called a “legislative day”, which continues until the body next adjourns – which may be days or months later!

There have been some attempts at creating an ontology of law to provide a way of universally comparing similar ideas. However, since the law is written in Word or WordPerfect in most places, this metadata has to be added downstream, and is not easily automated.
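Picking up the July 31 107:24 example above: one option is to keep the raw string and also derive a real calendar time by adding the overflowed hours and minutes to midnight of the stated day. This is only a sketch, and it assumes GNU date (the BSD date that ships with OS X uses different options):
# Sketch only: assumes GNU date; treats 107:24 as an offset from midnight on July 31st.
date -u -d "2015-07-31 + 107 hours + 24 minutes" "+%Y-%m-%dT%H:%MZ"
> 2015-08-04T11:24Z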

Standards, Formats, and Standard Formats

Standards in data are critical for making tools interoperable. It’s important to use existing standards whenever possible, so that in the future everyone’s tools will be able to talk to each other and share data. You should try to leverage popular existing standards for your data interfaces, and should almost never invent your own!

When work began on The State Decoded, there were no obvious standard formats for legal code. I wrote in more detail about standards for the law a few years back. Since that time, Akoma Ntoso has become the most popular standard format for distributing legal code internationally. It’s an XML schema which provides everything you need to break up the law into usable data. A similar format, USLM, is used by the US House of Representatives, and we used to focus on that for compatibility. However, USLM lacks the flexibility of Akoma Ntoso for different types of documents, and Akoma Ntoso has become much simpler to implement. It also allows for additional microformats within the data, which helps with the ontology problem.

In general, you’ll need a way to store highly structured data to properly represent the law. XML is ideal because it can handle the nesting and inline markup associated with legal code. JSON is not a good choice, since it’s designed for strict hierarchical structures and is awful at inline markup. For database storage, many groups use eXistdb, which stores XML documents natively – however, since both MySQL and PostgreSQL now have native support for XML and XPath similar to eXistdb, they are fantastic choices for this. I strongly recommend breaking up the law code into one section per record, rather than keeping structures together or breaking things into paragraphs, as this makes it much easier to work with the data.
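To make the one-section-per-record idea concrete, here’s a rough sketch of slicing an XML law file apart with xmllint before loading it into a database. The section element and id attribute are simplified, made-up names; a real Akoma Ntoso document uses namespaced elements, so the XPath would need namespace handling:
# Sketch with made-up element names; real Akoma Ntoso documents are namespaced.
xmllint --xpath 'count(//section)' code.xml
xmllint --xpath '//section[@id="1-8-010"]' code.xml
> The first command prints how many sections the file contains; the second prints just the section with that identifier, ready to store as its own record.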

And that’s just to start

There will undoubtedly be more issues that I haven’t run into yet, but the ones above cover a lot of what I’ve encountered over the last three years. If you encounter others or know something I don’t, please get in touch and let me know!


Managing Data with Unix Tools

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box.  If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want. Let me start with the conclusion first. The next time you have to transform or manipulate your data, look around for what Unix tools already exist first.  It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around. This is part 3 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

awk, a tool for spreadsheets & logs

awk is a tool to work with spreadsheets, logs, and other column-based data. Given some generic CSV data, we can manipulate the columns to get what we want out. For instance, if we want to remove some of the personally identifying information, we can drop the name and relationship columns:
awk -F ',' 'BEGIN { OFS=","} {print $2,$4,$5}' data.csv
> Returns the columns Age,Job,Favorite Thing from the csv.
Here, we’re telling awk that the input column separator is , with the -F flag, and we’re also telling it to use a comma to separate the output in the actual expression with { OFS=","}. Then we’re telling it to only output columns 2, 4, and 5 ({print $2,$4,$5}). We can also use it to parse log files. Given a standard Apache combined log, we can get a list of all IP addresses easily:
awk '{print $1}' access_log
And then we can pass that through the sort utility to put them in order, and the uniq utility to collapse the duplicates, or even get a count of how many times each visitor has hit our site (with -c for count).
awk '{print $1}' ./access.log | sort | uniq -c
> Outputs each IP address with a count of how many visits.
You can do a whole lot more with awk too, including scripting inside the expression.
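For example, sticking with the same data.csv, here’s a quick sketch that prints the Job column only for rows where the Age column is at least 30, and then reports how many rows matched:
awk -F ',' 'NR > 1 && $2 >= 30 { print $4; count++ } END { print count " rows matched" }' data.csv
> Skips the header row (NR > 1), filters on the Age column, and prints a summary line at the end.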

Filtering and Analysis with grep and sed

If you’re using more sophisticated tools like goaccess to analyze your logs, you can preprocess the logs with the tools we’ve covered in the previous articles in this series, Bash Basics and Unix Tools. To just get a couple of days from the logs:
sed -n '/04\/May\/2015/,/06\/May\/2015/ p'  /var/log/httpd/access.log | goaccess -a
> Passes log entries for May 4th and 5th only to goaccess
Or if you need to process multiple log files:
cat /var/log/httpd/*-access.log | grep "04\/May\/2015" | goaccess -a
> Parses only log rows that have the 4th of May in them, from all of our Apache log files.

Parsing Files with Bash

Going back to our initial post, you can actually do a lot with bash alone. We can even use for loops to iterate over spreadsheets, instead of sed or awk. I sent a pull request the other day to remove a Python script that fed into a bash script, so the bash script would do all the work. The original is designed to take a spreadsheet and, for each row, pull the listed git repo and zip it up. Here’s the updated script. Keep in mind that though this is written as a bash script, it could also be run directly from the command line.
#!/bin/bash
INPUT=respondents.csv
[ ! -f $INPUT ] && { echo "$INPUT file not found"; exit 99; }

(tail -n +2 $INPUT ; echo )| while IFS=, read RESPONDENT REPOSITORY
do
	echo "Fetching $REPOSITORY"
	REPO="$(basename $REPOSITORY)"

	REPONAME="${REPO%.git}"

	git clone $REPOSITORY && tar -czf responses/$RESPONDENT.tar.gz $REPONAME && rm -rf $REPONAME
done
Let’s go through this a piece at a time:
INPUT=respondents.csv
[ ! -f $INPUT ] && { echo "$INPUT file not found"; exit 99; }
First, we hard code our input file, and if that file doesn’t exist in the current directory, we exit with an error code.
tail -n +2 $INPUT
We take the input file and skip the first line using tail by passing it -n +2, so that we don’t try to process the headers. The results of that might not have a trailing newline, but we need one for bash to process the last line in the file. We append an extra echo to output a blank newline. We then pipe this to while, which reads in the results of this operation.
while IFS=, read RESPONDENT REPOSITORY
Now we loop over each line of the file, and use IFS to tell the parser to use , as the column separator. read here takes the two columns and puts them into the next two variables, RESPONDENT and REPOSITORY.
REPO="$(basename $REPOSITORY)"
REPONAME="${REPO%.git}"
Here we’re doing some string manipulation: basename gets just the name of the repo from the full repo path, and ${REPO%.git} drops the .git from the name and stores the result in REPONAME.
git clone $REPOSITORY && tar -czf responses/$RESPONDENT.tar.gz $REPONAME && rm -rf $REPONAME
Finally, we’re using all of the variables we’ve created to assemble our commands: clone the repo, tar the results, and remove the cloned repo directory. You can do even more with bash and Unix tools; hopefully this is enough to get you started working with the many tools your system comes installed with! Mark Headd also wrote a great article on command line data science, and recommended this Sysadmin Casts episode on command line tools.


Unix Tools

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box.  If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want. Let me start with the conclusion first. The next time you have to transform or manipulate your data, look around for what Unix tools already exist first.  It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around. This is part 2 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

find, a better ls

Now, savvy readers will get their hackles up over that last example, because we’re using ls to list our files before processing. ls is a great utility for listing files, but the results it outputs are potentially dangerous, as it doesn’t do any escaping. It’s also a rather single-purpose tool. Instead, we can use the find utility in our advanced commands, which is safer. By default, `find .` will give you a recursive directory listing. Adding the `-type` flag will allow you to filter to directories or files only. (Note that this has to come after the directory path!)
find ./test -type f
> Lists all files in the ./test directory, all the way down.
find . -type d
> Lists all directories in the current directory and below.
That last example is a bit problematic, because find will include the current directory (.) in the list, which is usually undesirable. Smartly, find can also do both regular expression and simple pattern matching. To just match the name of something, using standard wildcards, use the -name flag; to use a regular expression, use the -regex flag. You can even tell it what flavor of regular expression you want to use with -regextype – I usually prefer posix-extended, myself.
find ./temp -name "App*"
> Returns all files & directories from the ./temp directory down that start with "App"
find ./temp -type f -regextype posix-extended -regex '.*-[0-9]{1,6}.xml'
> Find all files that end in 1 to 6 digits and have a .xml extension
Now, once you have your files, you’ll probably want to manipulate them. find provides you a few different ways of manipulating the results it finds. The -exec flag allows you to run commands on the output:
find  ./* -type d -name "Temp*" -exec rm -R {} \;
> Removes all folders that start with Temp.
Note that you need that trailing \; to tell -exec that you’re done writing your command. If you just want to delete the matches, you can use -delete instead:
find  ./* -type d -name "Temp*" -delete
> Removes matching folders, but only if they’re empty – unlike the rm -R version above, find’s -delete won’t remove non-empty directories.

grep, the most important command

This is just a brief break to make sure you know about grep. Grep searches for matching text within the contents of files. It’s a fantastic first-pass tool to narrow down your results. For instance, if I wanted to find all of the config files in my current directory that had port 8080 set:
grep 8080 *
> Apple.cfg:8080
To make this more useful, there are a handful of flags you want to use. Most of the time, you probably want this search to be recursive, so you’ll add -R. You’ll also probably want to pass the output of this command to some other command to process the list, in which case the matched text that is returned after the : is actually a problem – so use -l (that’s lowercase-L) to only show the files matched, not the match text. -i will give you case-insensitive matches. And most importantly, -e pattern allows you to supply a regular expression pattern to match, or -E uses “extended” regular expressions.
grep -RilE "Ap{2}.*" .
> Returns all files that contain "App", in upper or lower case.
grep can also be used as a simple filter, to return only entries that match a given pattern:
cat /var/log/httpd/*-access.log | grep "04\/May\/2015"
> Returns only log rows that have the 4th of May in them, from all of our Apache log files.
You can also tell grep to negative-match with -v, to remove matching entries from the results. Will points out that there’s also fgrep, which is faster for fixed patterns, but cannot handle regular expressions.
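For instance, to drop requests for static assets from the Apache logs used earlier before doing any further analysis:
cat /var/log/httpd/*-access.log | grep -vE '\.(css|js|png|gif)'
> Returns only log rows that aren’t requests for stylesheets, scripts, or images.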

xargs, the list handler

Now, once you’ve got your files from find or grep, you’ll want to manipulate them. xargs has a single purpose – it takes a list of files and performs a command (or several commands) on them. xargs passes each file in turn to whatever command you give it, with the filename as the input. For instance, you can output the contents of all the files in a directory:
find ./* -type f | xargs cat
You can also construct more complicated commands by using the filename as a variable, which we assign with -I variable. It’s usually best to use something uncommon so it doesn’t cause other problems – {} is what is usually used in examples:
find ./* -type f | xargs -I {} mv {} {}.bak
> Renames all files to add .bak to the end of the name
xargs can also fire off these commands in parallel, so if you have multiple processors (or a multicore processor), you can run several of them at once. We use the -P flag to tell it to run processes in parallel, and give it a number for how many processes to run. Here’s an article on using xargs as a hadoop replacement. xargs is one of my most used tools, as you can construct very long and complicated commands using the output of other commands.
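As a small sketch (the ./logs directory and its files are just an example), this compresses every log file, four at a time:
find ./logs -type f -name '*.log' | xargs -P 4 -I {} gzip {}
> Runs up to four gzip processes at once, one per log file.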

sed, the search-and-replace powertool

sed is an amazingly powerful utility. Its main use is to find and replace text across files. For instance, if I want to change the port of a service in a config file, I can do so easily:
sed -i '' 's/8080/8983/g' ./config.yml
No need to open the file and edit it; sed edits it in place. (The empty '' after -i is the OS X/BSD form of the flag; GNU sed takes -i on its own.) You can also combine this with find to edit multiple files:
sed -i '' 's/8080/8983/g' $(find ./config/ -type f -name '*yml' )
Here, we’re capturing the results of find and passing them to sed with $( ). You could also use xargs or find’s -exec flag as discussed above. sed can also pull specific lines out of a file. One thing I often use it for is filtering down to a particular date range in a large log file. For instance, if I just care about a few days in an Apache log file, I can tell sed to print just the rows from the start date to the end date:
sed -n '/04\/May\/2015/,/06\/May\/2015/ p'  /var/log/httpd/access.log
> Returns lines from the file that start on 04/May/2015, and stops at the first instance of 06/May/2015
You’d need about 20 lines of Python to do the same thing. This is just a taste of what sed can do; it’s very useful. Ozzy also points out that there’s jq, which is like sed for JSON. That’s it for now; continue on to the next part, Managing Data with Unix Tools.


Bash Basics

2015.05.10 – Most Unix systems (including OS X) provide a large number of fantastic tools for manipulating data right out of the box.  If you have been writing small Python/Ruby/Node scripts to perform transformations or just to manage your data, you’ll probably find that there are already tools written to do what you want. Let me start with the conclusion first. The next time you have to transform or manipulate your data, look around for what Unix tools already exist first.  It might take you a little longer to figure out all of the flags and parameters you need, and you’ll have to dig through some unfriendly documentation, but you’ll have a new, far more flexible tool in your toolbox the next time around. This is part 1 of a series on Unix tools. Read the other parts:

  1. Bash Basics
  2. Unix Tools
  3. Managing Data with Unix Tools

Assumptions

I’ll assume you know the very basics here – what a man page is, how to create an executable bash script, how to open a terminal window, and how to use basic utilities.  If you don’t know any of those, you should start with one of the many intros to the command line available.  This intro by Zed Shaw is a good place to start.

The Shell

Bash is the default shell on most systems these days, but what we’re covering here will mostly work for zsh or other shells – though some syntax elements will be different. First off, Bash is a powerful tool by itself. Even with no additional packages added, you get variables, loops, expansions & regular expressions, and much more. Here’s a good guide with more information on using bash. I’ll assume you know the basics from here on out, and show you what you can do with them.
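As a tiny taste (the filename is just an example), plain bash gives you variables and regular expression matching with no external tools at all:
file="report-2015.csv"
if [[ $file =~ ([0-9]{4}) ]]; then echo "Year: ${BASH_REMATCH[1]}"; fi
> Year: 2015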

Advanced Paths

If you want to work with several directory paths in a row that are very similar, you can pass a list to the shell using curly braces {} and it’ll expand that list automagically. Let’s say I wanted to set up a few directories for a new project’s test suite. Rather than running a lot of duplicated commands, I could pass a few lists instead.
mkdir -p ./test/{unit,fixtures}
> Creates ./test/unit and ./test/fixtures
mkdir -p ./test/unit/{controllers,models}
> Creates ./test/unit/controllers and ./test/unit/models
Note that we’ve passed the -p flag to mkdir so that it’ll create all of the directories up the chain, even ./test here. We can also use pattern matching with brackets []. For instance, if you’ve got a lot of files that you want to separate alphabetically, you use a letter pattern:
mv ./[A-H]* /Volumes/Library/A-H/
mv ./[I-O]* /Volumes/Library/I-O/
mv ./[P-Z]* /Volumes/Library/P-Z/
This will have broken your library up into three sets. You can also use that matching later in the string:
mv ./A[a-k]* /Volumes/Library/Aa-Ak/
mv ./A[l-z]* /Volumes/Library/Al-Az/
Now, by default most systems will be case sensitive, so you will have left behind all of your files starting with a lowercase letter. This is less than ideal, so we can set a shell option to make file matching case insensitive. This type of matching is known as globbing, and to set the option we run shopt -s nocaseglob. (In zsh this would be unsetopt CASE_GLOB.) If you just run that in your shell, it’ll stick for the rest of the session until you unset it with shopt -u nocaseglob. You might even want to add it to your .bash_profile. Bash, however, also lets us limit the option to a single command by wrapping the commands in parentheses, which runs them in a subshell:
(shopt -s nocaseglob ; mv ./[A-H]* /Volumes/Library/A-H/)
Because the parentheses run those commands in a subshell, the case insensitive globbing applies only to that one command; your main shell’s settings are left untouched.

Loops

Bash allows you to make use of some rather powerful for loops. I frequently use loops to automate boring manual work, like converting a bunch of RAW image files into web-friendly JPEGs of appropriate size:
for i in *.CR2; do
    dcraw -c -a -w -v $i | convert - -resize '1000x1000' -colorspace gray $i.jpg;
done;
(You could run that as a one-liner as well; the line breaks are just here to make it readable.) Here, I’m taking all of the .CR2 files in the current directory, passing each one to dcraw to decode the RAW data, then piping the output to ImageMagick, which shrinks it to a web-friendly size of no more than 1000 pixels on a side and makes everything black and white, which is extra-artsy. I use a similar command in our legal docs repo to convert our source Markdown files into a variety of formats, using pandoc:
for myfile in $( ls ./markdown ); do
  echo Converting $myfile;
  for fmt in html docx pdf; do
    filetrim=${myfile%???};
    pandoc -o "./"$fmt"/"$filetrim"."$fmt -f markdown "./markdown/"$myfile;
  done;
done
This one is a little fancier, as we’re doing a bunch of things with nested loops, file name trimming, etc. Let’s break it down:
for myfile in $( ls ./markdown ); do
First off, grab a list of the files in the ./markdown folder. Use the variable $myfile to store the current file’s name.
for fmt in html docx pdf; do
Now we’ve got a loop within a loop. We’re creating a list of the formats we’ll be using (html, docx, and pdf) and storing the current format in the variable $fmt.
filetrim=${myfile%???};
Here’s a useful bit – we’re trimming the last three characters (using %???) from the string, which is the extension (.md). Another valid pattern would be:
filetrim=${myfile%.*};
which simply removes the entire extension, regardless of how long it is.
pandoc -o "./"$fmt"/"$filetrim"."$fmt -f markdown "./markdown/"$myfile;
Here we’re passing all of our variables we’ve assembled back to pandoc. We’re quoting the strings we want hardcoded in there, so that they’re not misinterpreted as part of the variable name, which would cause this to throw errors.
done;
done
And then we’re closing out both of our for loops.

Wrapping Up

You can also use builtin utilities to do simple tasks, like appending content to files:
echo "name=core1" >> ./solr/core.properties
That’s it for now, continue on to the next part, Unix Tools.


Setting the Standards for Open Government Data

2014.06.17 – City and state governments across America are adopting Open Data policies at a fantastic clip. Technology hotbed San Francisco has one, but places far from Silicon Valley – like Pittsburgh, PA – are also joining the municipal open data movement. The increasing availability of free tools and enthusiastic volunteer software developers has opened the door for a vast amount of new government data to be made publicly available in digital form.

But merely putting this data out there on the Internet is only the first step. Much of this city data is released under the assumption that a government agency must publish something – anything – and fast. In this rush to demonstrate progress, little thought is given to the how. But the citizens who care about this data – and are actually building websites and applications with it – need to access it in machine-readable, accessible, and standards-compliant formats, as the Sunlight Foundation explains here. This explains why most city open data sets aren’t seen or used.

There is a vast difference between merely opening data and publishing Open Data. By publishing data in good formats that adhere to modern, widely-accepted standards, users of the data may reuse existing software to manipulate and display the data in a variety of ways, without having to start from scratch. Moreover, it allows easy comparison between data from different sources. If every city and every organization chooses to adopt its own standard for data output, this task becomes absolutely insurmountable – the data will grow faster than anyone on the outside can possibly keep up.

It’s a Mess: Most “Open Government Data” Is Virtually Useless

Take, for example, the mess that is data.gov. Lots of data is available – but most of these datasets are Windows-only self-extracting zip archives of Excel files without headings, which are nearly useless. This is not what the community at large means by “Open Data” – there are lots of closed formats along the way. Similarly, data which is released with its own schema, rather than adopting a common standard, is just as problematic. If you’re forcing your users to learn an entirely new data schema – essentially, a brand new language – and to write entirely new parsing software – a brand new translator – just to interact with your data, you’ve added a considerable barrier to entry that undercuts openness and accessibility.

A good standard lets anyone who wants to interact with the data do so easily, without having to learn anything new or build anything new. Standard programming libraries can be built, so that it’s as simple as opening a webpage for everyone. This means that in most programming languages, using a standards-based data source can be as simple as interacting with the web: import httplib and you’re done.
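For instance, many open data portals run CKAN, which exposes a standard catalog API, so the same generic request (or any existing CKAN client library) works against all of them without writing a custom parser. The hostname below is a placeholder, not a real portal:
# Placeholder hostname; package_search is the standard CKAN catalog action.
curl -s "https://opendata.example.gov/api/3/action/package_search?q=zoning"
> Returns a JSON list of matching datasets from any CKAN-based portal.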

Evaluating Existing Standards

Every day at The OpenGov Foundation, I work with legal and legislative data. Laws, legal codes, legislation, regulations, and judicial opinions are a few examples. What standard do we use? Well, let’s look at the most common standards available for publishing legal data on the Internet:
  • Akoma Ntoso – a well-known XML-based format that is very verbose. The level of complexity presents a high barrier to entry for most users, and has prevented its wide adoption.
  • United States Legislative Markup (USLM) – another XML-based format used by the US House of Representatives. It has the advantage of being not very verbose, extensible, and easy to use.
  • State Decoded XML – the format used by The State Decoded project. Currently, it only supports law code data, and it is not widely adopted outside of this project.
  • JSON – JSON is not actually a standard, but a general format well suited to relational and tabular data, and chunks of plain text. A variant is JSON-LD, which has all of the same properties but is better for linked data. It is commonly used for transferring data on the web, but it is not practical for annotated or marked-up data.
None of these are ideal. But if I had to pick a single option to move forward, the USLM standard is the most attractive for several reasons:
  • It is older, established, and has good documentation
  • It is easily implemented and used
  • It is extensible, but not especially verbose
  • It is designed to handle inline markup and annotation, such as tables, mathematical formulas, and images
It also acts as a very good “greatest common factor” as a primary format – it can be translated easily into common formats such as HTML, Microsoft Word, plain text, and even JSON – while still covering the most common needs (e.g., tables or annotations) without the superfluous complexity that other formats bring along.

Setting the Standard for Open Law & Legislative Data

Moving forward, the next step beyond simply exporting USLM data from existing data sources would be to have end-to-end solutions that speak USLM natively. Instead of editing Word or WordPerfect documents to craft legislation, lawmakers could write bills in new tools that look and feel like Word, but are actually producing well-formatted USLM XML behind the scenes, instead of a closed-source, locked-in format. This is what we call “What You See Is What You Mean” – or WYSIWYM. Here at The OpenGov Foundation, we believe in a rich, standards-based data economy, and we are actively doing our part to contribute. Our open-source online policymaking platform – Madison – already consumes USLM, and we are actively working on a WYSIWYM editor to make it easier to create and modify policy documents in Madison. We are also investigating USLM support for The State Decoded – both as an input and an output format. Hopefully, other software projects will follow suit – creating an interoperable ecosystem of legal data for everyone in the United States.
