
Upgrading Solr from 4.2 to 4.3+ on CentOS 6.4

2013.11.01 – I ran into an issue with a Solr configuration that was working for me locally, but not on our CentOS 6.4 server. I’ve documented below all of the issues I encountered along the way, as the upgrade from Solr 4.2 to 4.3+ is a pretty nasty one, due to major changes in how the logging libraries (Log4j / SLF4J) are packaged.

Update: as of January 2015, I have been unable to get Solr 4.10 to run on Tomcat 6, so I’ve defaulted to just using the example Jetty server that’s bundled with it.  It’s less than ideal, but it works.  Here’s good documentation on that process.

Lucene/Solr 4.3+ : The Beginning

The awesome guys at Open Source Connections had created a custom tokenizer for Solr for The State Decoded, which was suddenly throwing this new error:
java.lang.IllegalArgumentException: No enum const class org.apache.lucene.util.Version.LUCENE_43
I took that to mean that they’d asked for Lucene 4.3 (or in my case, Solr 4.3) in the solr_home/conf/solrconfig.xml file:
<luceneMatchVersion>LUCENE_43</luceneMatchVersion>
Since CentOS 6.4 only comes with Solr 4.2 out of the box, I tried changing that to LUCENE_42 but received even more intimidating errors:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "descendent_facet_hierarchical": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'com.o19s.RegexPathHierarchyTokenizerFactory'
The full stack dump is here. We guessed that the package needed to be rebuilt from source.
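If you want to see which Lucene version a config is asking for before Solr starts complaining, a one-line extraction is enough. This is just a sketch: the function name is mine, and it assumes the element sits on a single line the way it does in the stock solrconfig.xml.

```shell
# Print the value of <luceneMatchVersion> from a solrconfig.xml file.
# Assumes the whole element is on one line (true for the stock config).
lucene_match_version() {
    sed -n 's:.*<luceneMatchVersion>\([^<]*\)</luceneMatchVersion>.*:\1:p' "$1"
}
```

Run it as lucene_match_version solr_home/conf/solrconfig.xml and compare the result against the Solr version that is actually installed.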

Maven

To do that, first I had to install Maven, which was missing from CentOS. Following the instructions here, I went to the Maven site and found the download link for the latest version (in my case, that was 3.1.1). I grabbed that file and stored it in my /usr/local directory (for ease of use), unzipping as usual:
cd
wget http://apache.tradebit.com/pub/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
sudo tar -xzf ./apache-maven-3.1.1-bin.tar.gz -C /usr/local
Then I created a link to it (using full paths, so it works from any directory):
sudo ln -s /usr/local/apache-maven-3.1.1 /usr/local/maven
And added a native profile loader:
sudo vi /etc/profile.d/maven.sh
To add it to the path:
export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}
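After logging back in, it’s worth confirming that the new directory really made it onto the path before blaming Maven for anything. A small sketch (the helper name is my own invention):

```shell
# Succeed if directory $1 appears as an entry in the colon-separated list $2.
path_has_dir() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}
# Example: path_has_dir "${M2_HOME}/bin" "$PATH" && mvn -version
```

Wrapping both sides in colons means /usr/local/maven/bin only matches a whole path entry, not a longer directory that happens to start the same way.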

JDK

Now, after all of that, apparently I still did not have the Java Development Kit (aka JDK) installed. You’d think from the name that the OpenJDK package would provide this, but it turns out that it only contains the Java runtime (the JVM); the development tools live in a separate -devel package. I discovered this when Maven complained about a missing tools.jar file. So I checked yum for the available versions and installed the latest:
yum list 'java-*-openjdk*'
sudo yum install java-1.6.0-openjdk-devel
And added another profile loader:
sudo vi /etc/profile.d/java.sh
for that path:
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
A quick log out and log in fixed up my paths to use those. All that being done, I tried rebuilding the solr tokenizer, but it still didn’t work. I needed to upgrade Solr first.
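Since the missing tools.jar was the giveaway, you can test for it directly instead of waiting for Maven to fail. A minimal sketch, assuming the classic JDK 6/7/8 layout where tools.jar lives under lib/ (the function name is mine):

```shell
# Succeed if $1 looks like a full JDK, i.e. it contains lib/tools.jar.
# A JRE-only install does not ship this file.
has_tools_jar() {
    [ -f "$1/lib/tools.jar" ]
}
# Example: has_tools_jar "$JAVA_HOME" || echo "JDK missing, install the -devel package"
```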

Upgrading to Solr 4.3+

Now, between Solr 4.2 and Solr 4.3, they removed the logging jars (SLF4J and Log4j) from solr.war, and Solr will not start without them. So there are a lot of extra steps here. Seriously, why they would do this, I have no idea. You install the Solr package, and it just won’t run. And I could not find any clear documentation on exactly how to fix it under Tomcat. The developers just broke it and left everyone to figure it out on their own. I cannot begin to express my frustration or count the number of hours I lost on this step.

A note on Tomcat logging

As you’re going through and starting and restarting tomcat (sudo service tomcat6 restart), you’ll probably want to keep an eye on the log file here:
less /var/log/tomcat6/localhost.[DATE].log
Most frustratingly, I found that any errors encountered during the startup of Tomcat were not actually logged until I shut it down. This is generally not how most Apache projects handle logging. (Again, WTF.) So the only way to know if I still had an issue was to restart *twice*. Unfortunately, every time that Tomcat encountered a configuration error, it would delete my web app configuration file entirely. (Aside: Seriously, why would anyone think this is a good idea?) So I had to create it multiple times. That file lives under the Catalina configuration directory here:
/etc/tomcat6/Catalina/localhost/solr.xml
And looks like this:
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="false" >
   <Environment name="solr/home" type="java.lang.String" value="/opt/solr" override="true"/>
</Context>
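Because Tomcat kept deleting that file on me, it would have been less painful to regenerate it from a script than to retype it. Here’s a rough sketch of that idea; the function name and argument order are my own, and the XML is just the file shown above with the two paths filled in.

```shell
# Write a Tomcat context file for Solr.
#   $1 = destination file, $2 = path to solr.war, $3 = Solr home directory
write_solr_context() {
    cat > "$1" <<EOF
<Context docBase="$2" debug="0" crossContext="false" >
   <Environment name="solr/home" type="java.lang.String" value="$3" override="true"/>
</Context>
EOF
}
```

Note that sudo can’t call a shell function, so one approach is to write to a temporary file first (write_solr_context /tmp/solr.xml /opt/solr/solr.war /opt/solr) and then sudo cp it into /etc/tomcat6/Catalina/localhost/.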

Now back to the good part

Now, I was finally ready to install the latest version of Solr. I headed over to the Solr downloads page and found that 4.5.1 was the latest. I grabbed the download from that page (after several redirects in between), and extracted it to my home directory:
wget http://mirrors.gigenet.com/apache/lucene/solr/4.5.1/solr-4.5.1.tgz
tar -xzf ./solr-4.5.1.tgz
Now, Solr does not actually need to be built, just copied over. The previous installation of Solr on the system was under /opt/solr, and just in case I broke everything, I wanted to keep a copy of that. (If you’re not upgrading from a previous version, you can skip this step.) I moved the old one out of the way:
sudo mv /opt/solr /opt/solr-4.2.1
Then I copied the new version of solr’s example to the opt directory, and symlinked it into place.
sudo cp -R ~/solr-4.5.1/example/solr /opt/solr-4.5.1
cd /opt
sudo ln -s solr-4.5.1 solr
I needed the solr.war file as well, and also symlinked that.
cd /opt/solr
sudo cp ~/solr-4.5.1/dist/solr-4.5.1.war ./
sudo ln -s solr-4.5.1.war solr.war
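The nice thing about the versioned-directory-plus-symlink layout is that the next upgrade (or a rollback) is just a matter of re-pointing the link. A sketch of that, with a helper name I made up; it assumes the path being replaced is already a symlink, because ln will not replace a real directory.

```shell
# Point the symlink $2 at the target $1, replacing an existing link.
# -n stops ln from following the old link into the directory it points at.
switch_link() {
    ln -sfn "$1" "$2"
}
# Example (as root): cd /opt && switch_link solr-4.2.1 solr   # roll back
```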
Next, I received nasty errors in the localhost.log about missing SLF4J:
SEVERE: Exception starting filter SolrRequestFilter
org.apache.solr.common.SolrException: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see: http://wiki.apache.org/solr/SolrLogging
I assumed you needed to provide the extra libraries to tomcat in the obvious way:
sudo cp ~/solr-4.5.1/example/lib/ext/* /var/lib/tomcat6/lib/
sudo cp ~/solr-4.5.1/dist/solrj-lib/* /var/lib/tomcat6/lib/
And others to solr:
sudo cp -R ~/solr-4.5.1/dist/* /opt/solr/lib/
But apparently the libraries needed to be somewhere else also, which I found absolutely no documentation on anywhere. This post finally got me to the right answer.
sudo cp ~/solr-4.5.1/dist/solrj-lib/* /var/lib/tomcat6/webapps/solr/WEB-INF/lib/
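With jars being copied into three different places, it helps to be able to ask a directory which logging jars it is still missing. The sketch below does that by file-name prefix. The prefix list is my assumption, based on what ships in example/lib/ext in the Solr 4.x download, so check it against your own release.

```shell
# Print the logging jars (by name prefix) that are missing from directory $1.
# The prefix list is an assumption taken from example/lib/ext in Solr 4.x.
missing_logging_jars() {
    for prefix in slf4j-api slf4j-log4j12 jcl-over-slf4j jul-to-slf4j log4j; do
        # ls fails when the glob matches nothing, which marks the jar as missing
        ls "$1/$prefix"-[0-9]*.jar >/dev/null 2>&1 || echo "$prefix"
    done
}
# Example: missing_logging_jars /var/lib/tomcat6/lib
```

No output means every expected jar was found in that directory.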
Last, I needed to tweak the default solr file for the multicore support that we needed, allowing us to run multiple Solr indexes on one box.
sudo vi /opt/solr/solr.xml
and added a new line for the second core:
<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1" />
    <core name="baltimorecode_dev_www" instanceDir="baltimorecode/www/staging/statedecoded" />
</cores>
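To double-check which cores the file declares after editing it, you can pull the names out with sed. This is only a sketch (the function name is mine), and it assumes the old-style solr.xml shown above with one core element per line.

```shell
# Print the name attribute of every <core> element in a legacy-style solr.xml.
# Assumes one <core .../> per line with name= as a double-quoted attribute.
list_cores() {
    sed -n 's:.*<core [^>]*name="\([^"]*\)".*:\1:p' "$1"
}
# Example: list_cores /opt/solr/solr.xml
```

The space after "core" in the pattern keeps the enclosing cores element, with its defaultCoreName attribute, from being picked up by mistake.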

Finally Rebuilding the custom Solr tokenizer

From there, I just needed to rebuild the custom tokenizer that we had.
cd /opt/solr/baltimorecode/www/staging/statedecoded/src/

sudo mvn package

cp target/regex-path-tokenizer-0.0.1-SNAPSHOT.jar ../lib/regex-path-tokenizer-0.0.1-SNAPSHOT.jar
A final restart of tomcat:
sudo service tomcat6 restart
And I could hit the localhost version of solr for that core!
lynx http://localhost:8080/solr/baltimorecode_dev_www/admin/ping
I created an SSH Tunnel to verify that this was indeed working as expected:
ssh -L 8009:localhost:8080 user@remotehost.org -N
And then fired up a browser to point at http://localhost:8009/solr/ and solr was running perfectly.

Finishing Up

The last little thing that caught me up in my sleep-deprived state was that the PHP Solarium package requires allow_url_fopen to be set to On. This seems rather obvious, but is not the default value on CentOS. That was a long day. Here are a few of the articles that I used to get this far:
  1. Patrick Reilly: How to install maven on CentOS
  2. Solr: MultiCore
  3. Apache Tomcat configuration: HTTP Connectors
  4. Andrew Jaimes: Installing Lucene/Solr on CentOS with Tomcat
  5. Hipsnip Cookbooks issues: Solr Fails to start missing SLF4j
  6. StackOverflow: Solr 4.3 Tomcat6 Ubuntu installation exception (More confusing than helpful.)
  7. Xrigher: Install Lucene Solar with Tomcat on Windows (This was the one that got me to the right answer!)

    A brief history of the web

    2013.10.21 – A few months back, there was a rather interesting discussion on Reddit about different internet technologies. I put the following comment together as a bit of historical perspective. In retrospect, it’s probably not 100% accurate, and a bit of a rant, but I figured other people might find it interesting.

    Your webserver back in the old days was probably NCSA HTTPd if your host hadn’t switched over to Apache. Apache is still one of the most popular webservers today. NCSA HTTPd, not so much. Apache is written in C, but you don’t need to know any C to configure it. Webservers generally deliver files, most often HTML which is just a marked up text file, but also images and videos and audio and PDFs and anything else you want. A few years later and you also commonly see Nginx as a webserver and Microsoft’s IIS on Windows servers.

    Programming on the web pretty much started with CGI scripts and was purely server-side. Back in the early nineties, this is how we did things – generally a C program that was compiled, and set to receive input from the webserver and do something with it – your form emailers and counters and guestbooks and so forth. You really don’t see much of this any more. Most people switched over to Perl as an easy way to write scripts, rather than using compiled executables. Perl is notoriously hard to read and write, but a pretty good language for string processing (which is basically all the web is at the core). It has declined in popularity significantly over the last ten years or so.

    Java began to show up more for web use towards the end of the nineties. These days, it’s synonymous with “Enterprise”, and generally used for large business applications. It’s a very dense language, and lots of frameworks exist for it, including ones like Spring, Hibernate, Struts, etc. Many, many other languages have grown up around Java, including Scala, Clojure, and Groovy.

    PHP showed up about then as well, and quickly became the most widely-used language on the web. This was primarily due to its friendliness, ease of use, and ease of deployment – you could just drop scripts on a server the same as with html files and run them. Since PHP is so easy to use, lots of hobbyists have picked it up, and a large amount of awful code and poor tutorials have been written with PHP. Even so, the changes since version 5 have been increasingly bringing PHP towards the modern era, and it’s actually quite good these days. (Haters gonna hate.) There are many frameworks for PHP, including Symfony, CodeIgniter, Laravel, Cake, Zend, and others. Several of the most popular Content Management Systems (or CMS) are written in PHP, including Drupal, WordPress, and Joomla.

    Ruby was gaining steam by the early 2000s. Heavily influenced by Smalltalk, it’s a very “smart” language that’s actually a lot of fun to write. It’s got a lot of niceties that do help you write code with less effort. The joke is that hipsters use Ruby, but it’s really a great language. It also has a fantastic package manager, RubyGems / gem. Rails is the most popular framework for it; many other web frameworks are more or less clones of Rails. It’s a very opinionated framework, though – with lots of, what we call in the programming world, “magic”. Another popular framework is Sinatra, which is a thinner framework.

    Python has technically been used on the web as long as PHP, but still doesn’t have very much marketshare. It is, however, used for not-web things frequently, especially among academics. In some schools, it has replaced Basic as the first language to teach children, due to its friendliness. The most popular framework is Django, though there is also Pyramid/Pylons and Flask.

    ASP also needs a mention, though it is not an OpenSource language like the others, and is owned by Microsoft. You often see it in conjunction with .Net, and it’s very popular for use in large corporations (pretty much anywhere you might otherwise see Java). It’s not very friendly, IMHO.

    In the last few years, some “functional” languages have come to popularity as well, including Haskell (which is purely functional) and Clojure (a modern dialect of Lisp).

    Now, you still needed somewhere to store all of your data, so the first popular databases came about. Most of these were ANSI SQL variants, and so behave almost identically, with the main differences being under the hood. These include:
    • MySQL: Still one of the most popular databases in use on the web. There’s a popular fork known as MariaDB as well, which gives you a few new tricks (like server clusters!).
    • PostgreSQL: probably the second most popular. (I’m guessing.)
    • MSSQL: the obligatory Microsoft version.
    • SQLite: keeps the whole database in a single ordinary file and is super lightweight.
    • And last, there’s Oracle, which is a megalithic company all to itself, as well as a database, and a Business Intelligence service provider. Large organizations, colleges, and universities give them lots of money to make their systems go.

    And then you have the no-SQL data stores that came to popularity in the last ten years or so, which generally are non-relational key-value or document stores. There are lots of these, but a few of note are:
    • Memcache: a key-value cache that was one of the first and most popular, especially since Facebook picked it up.
    • MongoDB: gives you a lot of the abilities of a relational DB. Mongo has recently received a lot of negative criticism due to fundamental flaws in implementation. [1] [2] [3]
    • CouchDB: another Apache project that uses JSON to store data.
    • Cassandra: yet another Apache project, designed for high-availability of large amounts of data.
    • Tokyo Cabinet and Kyoto Cabinet: successors of dbm.

    Now, going back for a second, we eventually wanted browsers to be able to do some neat things, so JavaScript came about – a programming language that runs in your browser. It actually doesn’t have much in common with Java, regardless of the name. It was later standardized as ECMAScript, which is also the basis for ActionScript, used in Adobe’s Flash multimedia platform. There are many, many JavaScript frameworks. The old bunch, which mainly made writing code and doing things like animations a bit easier, included Prototype, Scriptaculous, and jQuery, along with newer ones such as Underscore and Zepto. The new bunch are more similar to traditional web frameworks, giving Models and Views (and sometimes Controllers or otherwise just Routers), including Backbone, Ember, Knockout, and Angular.

    In the last few years, Node has appeared, as a way of using JavaScript on the server-side. It’s actually very performant for IO operations, but suffers from being relatively young, and people are still finding the best ways to use existing frameworks with it. Still, it provides an excellent package manager called NPM which is very similar to Ruby’s.


    Site Launch: Legal Aid Justice Center

    2013.09.23 – I’ve been working with our local Legal Aid office for most of my career – their website was one of the first ones I ever worked on, in fact. Over the last few months, I’ve been working with Charmed Designworks to create a new site for LAJC. Behind the scenes, we’ve switched over to WordPress and are using some really nice SASS (http://sass-lang.com) + Compass to make styling a breeze.


    Apache JMeter: Part 2 – Remote Testing Configuration

    2011.03.20 – Let’s say you’ve already gotten through the basics of JMeter and you’re ready to start setting up your testing. If you’re doing any sort of remote testing, you’ll inevitably need to know how to set up your client/server relationships. The vast majority of JMeter’s configuration is done through a single file, the jmeter.properties file (which lives inside of the JMeter bin/ directory). Any of the properties in this file can be overridden by options on the command line – but since this is Java we’re talking about, the method is ridiculous (code for Linux/Mac):
    jmeter -Jpropertyname1=value1 -Jpropertyname2=value2
    Of course, that doesn’t work for every option. For instance, to tell the server that you want it to listen for the initial request on a port that’s not random, you have to do so as an environment variable set on the same line before you make the call to the server. Truly and utterly bizarre:
    SERVER_PORT=1660 jmeter-server
    As noted below, the server_port only sets the port used for the initial request from the client to the server to begin testing; the response from the server to the client will be sent on a totally random port. I also found that the setting above did not work for setting the listening port at all!

    Server (Slave) Configuration

    The server configuration is ridiculously easy. The server just does what it’s told, so you only have to tell it what port to listen on. In the jmeter.properties file, you only need to set the server.rmi.localport value to whatever port you want. (You can also set this from the command line using -J, see the example above.) Note that you MUST set this option to keep the server from listening on a random port! The JMeter “default” is 1099.

    Also note that if you’re having trouble with the Client not being able to talk to the Server, you need to restart the Client after you’ve made any changes to the Server. This is very counter-intuitive!

    Now, you may encounter a case where the server is trying to respond to the client (master) but looking up the wrong IP address for the hostname. The fix is easy: simply add a record in your hosts file (often /etc/hosts) for the correct address.

    Last, you should be aware that the server responds to the client’s request on a different port than the server_port specified – it will return test results on a random high-numbered port. If you haven’t opened up your firewall to account for this, it may cause the server to throw a nasty connection error. Note that there is no way to set the response port by default in JMeter. However, if you feel like getting your hands dirty, you can hack the source to add this option.
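If you’re setting this up on more than a couple of machines, editing jmeter.properties by hand gets old. Here’s a rough sketch of a helper that sets one property in place. The function name is mine, and it’s deliberately naive: dots in the key are treated as regex wildcards, and values containing | or & would need escaping.

```shell
# Set key $2 to value $3 in properties file $1.
# Replaces an existing (possibly commented-out) "key=" line, else appends one.
set_property() {
    if grep -q "^#\{0,1\}$2=" "$1"; then
        sed "s|^#\{0,1\}$2=.*|$2=$3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    else
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}
# Example: set_property bin/jmeter.properties server.rmi.localport 1660
```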

    Client (Master) Configuration

    The client configuration is only a little more difficult. You’ll need to set the server_port to the same value you used for the server, again in the jmeter.properties file. You’ll also need to set the remote_hosts – this should be a comma-separated list of all of the servers’ hostnames or IP addresses. For example:
    remote_hosts=ec2-50-17-92-85.compute-1.amazonaws.com,ec2-50-17-94-90.compute-1.amazonaws.com
    That’s it! Once you’ve got that all set up, you should be able to start jmeter and have the list of servers appear under the Run > Remote Start menu. Note that this does not start up the servers – you’ll still need to start them manually on each machine – it merely runs the current test on the already-running servers. You can also use Run > Remote Start All to run the current test on all servers at once. As a final note, you might need to take a look at jmeter.log, also under the JMeter bin/ directory, to debug anything that’s not working correctly. There are often a few useful messages in there to help you along the way.
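When the server list changes often (EC2 hostnames are not exactly memorable), you can build the remote_hosts line from a list of hostnames instead of pasting it together by hand. A small sketch, with a function name of my own choosing:

```shell
# Join the hostnames given as arguments into a comma-separated
# remote_hosts line for jmeter.properties.
remote_hosts_line() {
    hosts=$(printf '%s,' "$@")
    # strip the trailing comma left by printf
    printf 'remote_hosts=%s\n' "${hosts%,}"
}
# Example: remote_hosts_line host1.example.com host2.example.com
```

The example hostnames are placeholders; substitute your own servers.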


    Apache JMeter: Part 1 – The Basics

    2011.03.19 – Recently, I’ve been doing a bit of load testing on Amazon AWS to determine how much abuse our web application can take without killing the server. I’ve been attempting to use Apache JMeter to do the hard part, but came up against a slew of problems. The documentation provided seems targeted at dyed-in-the-wool Java developers (that “J” at the beginning is clearly a warning shot), and makes pretty big assumptions about the knowledge of the audience. Here are the basic concepts of how to get started using it, targeted for us LAMP developers.

    The first thing to understand is that there are two main ways to go about using JMeter. By default, JMeter runs as a free-standing (GUI) application on which you run Tests directly from the machine it’s running on. You do this with the Run > Start option. You can also, however, set it up to run on other machines, reporting the results back to the original GUI. In JMeter terms, the Master from which you send the tests is called the Client, and the Slaves that run the tests are the Servers (configured as “remote hosts”). You have to configure which hosts to run on – afterward you can use Run > Remote Start > slave address to run the test on a single machine, or Run > Remote Start All to run on all slaves.

    To get started, try running JMeter on your local machine, and writing a basic test. If you’re working locally, just running the bin/jmeter (or bin/jmeter.sh on Mac, bin/jmeter.bat on Windows) script should start up a Java session and run the program on your machine. If you’re working remotely, JMeter runs in an X-Windows environment – so if you’re on a Mac you’ll need to have X11 installed and running.

    Writing Tests

    There’s actually very good documentation on setting up a test by recording your actions clicking through a site, but here’s the short version. To create a new test, you really only need a few elements. Note: adding elements (Nodes) to a test is tricky because the elements are all context-sensitive. If you haven’t selected the right parent in the list, you won’t be able to add certain children. I’ve listed the correct element to click on as the parent below.
    • A Thread Group. Used to set the number of virtual users (Threads) and number of Loops (iterations, obv) for *each* slave (or the local machine, if you’re running locally) to perform. Click on the Test Plan and then choose Add > Threads (Users) > Thread Group from the Edit menu or the right-click contextual menu. For the trial run, you might just give it 1 User, 1 for Ramp Up Period and 1 Loop.
    • A Listener of some type. This is what shows you the results of the test, either by a chart, table, or other medium. The simplest one is probably the Summary Report – to add it, click on the Thread Group and choose Add > Listener > Summary Report from the menu. No additional configuration is necessary for this type of Listener.
    • A Sampler – an actual test element. For instance, if you want to just grab one page off of a site to see if it’s working, you’ll add an HTTP Request Sampler. Click on the Thread Group again, then choose Add > Sampler > HTTP Request. We’ll want to test with a site we know is working first, so enter google.com for the Server Name or IP.
    Once all that is entered, you can choose to Start your test locally through the Run > Start option. If you have not saved your test, JMeter will prompt you to do so. After that, if you click on the Summary Report you should see that there has been 1 Sample, with (hopefully) an Error % of 0. If you’re getting an Error % greater than 0, you should probably check that you’re properly connected to the internet and you’ve followed all the steps correctly. Now that the system is working, you can try entering your own domain, and maybe enter a Path of a particular page that you want to test. Note: be very careful when testing against your live site. Setting your threads or loops too high can cause the server to stop responding. (Which is what we want to test in the first place!) It’s best to perform your load testing against a non-production machine. Like, say, one set up on Amazon AWS. From here, this would be a good time to read that above article on setting up a test by recording your actions with a proxy server, to create more complicated and thorough tests.
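Before cranking the numbers up, it’s worth doing the arithmetic on how many requests you’re about to send. Since the Thread Group settings apply to each machine separately, the total for a single Sampler is threads times loops times machines. A trivial sketch (the function name is mine):

```shell
# Expected sample count for one Sampler: threads x loops x machines.
# Each machine runs the whole Thread Group; pass 1 for a local run.
expected_samples() {
    echo $(( $1 * $2 * $3 ))
}
# Example: expected_samples 10 5 3   # 10 users, 5 loops, 3 slaves
```

The trial run above (1 user, 1 loop, local machine) works out to the single Sample you see in the Summary Report.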


    Defensive Programming

    2010.04.09 – As a web developer, the greater part of my job is not creating new apps, but hacking together disparate software packages into Frankensteinian amalgamations that (supposedly) work together seamlessly.  This is universally a headache, as the original authors tend to write code thinking that their app is the only one that will be installed.  WordPress, Vanilla, and Interspire’s Email Marketer are some of the worst offenders that I struggle with regularly. When coding your own brilliant application, there are a few simple things you can do to avoid potential collisions and headaches later, especially if anyone else will be using your code.  Here are a few areas to pay attention to.

    Namespace

    First and foremost, you need to watch out for collisions. If you’re not using a language with built-in namespacing (e.g. PHP <5.3), you’ll need to manage this manually. Some areas that you need to watch out for are:
    • Class Names
    • Session Variables
    • Local Variables
    • Constants & Globals
    • Database Tables
    Most databases already have a “users” table somewhere, and an app of any size is likely to have a variable named “args” or “params” (or two, or ten…). For most cases, it’s usually enough to prefix your names with your application name.  Keeping your names verbose helps, too.

    Program Flow

    When writing code, it’s always a good idea to keep everything to small, reusable functions. This is especially true of published apps, because your users are very likely to be using your code in ways that your original app is not. Try to break things down to the smallest possible chunks, even if it looks pedantic in your application. For instance, break up your createUser() function into separate functions to add the user to the user table, subscribe the user to an email list, add the user to the default group, etc.

    Assume that your code will be executing inside of someone else’s code. Try not to use print and echo statements when you can simply pass returned values – only print as the final step. (An easy way to fake this is to use output buffering.) You never know where your output is going anyway – so don’t assume that it will be a particular format – it may end up as HTML, an email body, or in the error log depending on how it’s implemented.

    Pay special attention to any implicit defaults or rules that your code expects. Don’t force the code to expect a complicated series of objects or parameters that anyone else wouldn’t immediately understand. Don’t rely on database restrictions to impose your business rules.

    Assume that your code will be glued into an existing user system at some point – make your user system as user-friendly as possible. Create big wide hooks that anyone can use later to interact with your user system. Actually, do this for everything. (Ok, WordPress, you got that one right at least.)

    Keep things wrapped up in nice containers. Don’t just leave large procedural chunks lying around for others to trip over. Don’t forget to give some attention to your configuration files on this front.

    Don’t use globals. Don’t use globals. For the love of God, don’t use globals. I don’t care how clever you think you are (I’m looking at you, WordPress), you’re going to screw everyone else up if you use globals.

    Don’t Step On Toes

    When you’re managing your resources, it’s a good idea to be courteous of others.  If you’re using a database connection or local file, keep a copy of the handle around instead of relying on the implicit “last opened” scheme many languages offer.  And since you’re all such brilliant developers, I don’t have to remind you to make sure to clean up after yourself – closing any connections that you opened, deleting any temporary files you’ve created, and cleaning up any objects you’re done using.  Even better, write error handling into all of your code so these things are done automatically even if something fails! One particularly abused area is Session/Cookie management.  I cannot begin to list the number of applications that hijack the session and fill/clear it wantonly.  In general, you should never be destroying a session, or blanking out the entire cookie.  Always sandbox off your content into a hash (using namespaces again), so that you can remove only the content you added.  Also, don’t ever set the session name directly – just use the default.  (At least, in PHP you can’t use two different sessions simultaneously – setting the session name removes the ability to use any other session).
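The “clean up even if something fails” idea is easy to show in shell, where a trap on EXIT plays the role of a finally block. This is just a sketch of the pattern, not anything specific to the apps named above; the function name is made up.

```shell
# Run a command with a scratch file that is removed when the subshell exits,
# whether the command succeeds or fails.
#   $1 = file to create, remaining arguments = command to run
with_scratch_file() (
    scratch=$1
    shift
    trap 'rm -f "$scratch"' EXIT
    : > "$scratch"
    "$@"
    exit $?
)
# Example: with_scratch_file /tmp/myapp.lock some_command --flag
```

The function body runs in a subshell, so the trap and the scratch variable stay out of the caller’s way, which is the same courtesy the namespacing advice asks for.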

    In conclusion…

    Do be a considerate programmer.  Do keep good fences (as they make for good neighbors).  Don’t build giant monoliths (as they attract groups of violent monkeys).  Stick to these rules and you’ll save everyone a lot of trouble in the long run.
