Apache Solr powers enterprise search on sites from eBay to Zappos. It also powers Carsabi, but when we reached 1.8M listings per month (passing Autotrader & Cars.com) our basic installation began to run about as fast as an octogenarian in congealing cement. I’d like to share the basics of Solr optimization, as well as some data on real-world gains.
Very briefly, our stack has gone through a few iterations, any of which may be sufficient for your corpus volume – no sense in over-engineering. Postgres tables had to be denormalized at 100k vehicles, and we switched to WebSolr’s extremely convenient hosted Solr at 300k – their Heroku plugin will create an installation in minutes for just $20/month. This worked very well until about 1M listings, at which point even their beefiest plan was returning results with >800ms latency.
Hardware: Bigger Is Better. A Lot Better
Our previous Solr-as-a-Service had been hosted on an Amazon EC2 Large instance and returned results in 800ms. Fortunately, we had spare capacity on an EC2 Cluster Compute Eight Extra Large, which we use for our web crawler, and just moving to this machine dropped our query time to 282ms – a speed increase of 2.84x (800ms / 282ms). Notice this corresponds to the processor speed increase of 2.75x between ...