Geographical Based Searching Solution Using Lucene

by Patrick O'Leary (pjaol at pjaol.com)

05/2007

 

Introduction

Within a few years of the internet being opened to the general public, it was quickly identified there was a need to create a service which could locate web pages based upon topics or keyword, and the first web search engine was created. Over the years many search engines evolved and extended usage to search for specific niches, such as images, videos, news / sports / weather articles and so on. A need was also discovered to provide a web based search engine to provide results of information beyond the web, such as a geographical based search service, and so MapQuest was born.

 

MapQuest was the original geographical based search service providing routing information between two locations. Subsequently it also provided places of interest based upon the address a user submitted.

 

After that, a new need was identified to provide businesses, venues and events within specific catchments. Another product called Digital City sprung up to fill this need, and later AOL Local Search was a spin off that allowed users to search within a radius of their specified location.

 

Google maps has since revolutionized geographical based search, allowing results to be retrieved purely upon user input, rather than specifying a location.

 

The purpose of this document is to describe how to implement geographical based searching using available technologies on the web.

 

The technologies we choose are:

 

Lucene is a search engine framework, but shouldn't be confused with a full-blown search engine per se. Lucene offers a rich indexing and a formal search API, all based around a core object structure offering high performance text based searching.

 

Geotools-2 is an implementation of the OpenGIS standards, providing precise calculations.

 

Fulcrum API is a simple set of functions offering less precise geographical calculations based upon avionics.

 

Local Lucene is an extension of Lucene providing an implementation of geographical searching using the above technologies.

 

 

Theory Behind Geographical Searching

 

The Ask:

Given a series of items at specific latitudes and longitudes, find all items within 5 miles of a current location sorted upon distance from that location.

 

The Assumptions:

 

The Methodology:

 

In order to get optimal performance, there are 3 basic steps that must be performed in the following sequence.

  1. Search within a boundary box centered upon a user's location with a width and height twice that of the asking radius.
  2. A second pass based on the results of step 1, calculates the precise distance using WGS84; anything greater than the requested radius is removed.
  3. Results are sorted based upon the WGS84 calculations of step 2.

 


Step 1

Using the aviation formulary, create a boundary box 5 x 2 miles wide and 5 x 2 miles high. Centered up the users location, so 5 miles due north/south, and 5 miles due east/west of a center point.

 

Figure 1

 

Set lat1, long1 to the position of the top left hand corner of the box respectively.

Set lat2, long2 to the position of the bottom right corner of the box respectively.

 

Perform a range search where items latitude is between lat1 and lat2, and where items longitude is between long1 and long2.

 

This basic step, called a minimal boundary search, reduces the set of results to a manageable set of items allowing more intensive calculations to be performed on the smaller set of results.

 


Step 2

Using the formula for orthodromic based upon the WGS84 ellipsoid calculate the distance between each item and the center point.

Figure 2

 

Any result outside of the radius specified is dropped from the result set.

e.g. Such as i1, i7 and i9.

 

All distances are stored along with the identity of the result item for sorting and display.

 

Step 3

Using the distances provided by step 2, perform a sort on the distances and return result items ordered by distance.

 

This provides results in the following order: i5, i4, i2, i8, i3, and i6.

 


Conclusion

Based upon the assumption that more items exist outside of the search area than inside it, one can see the boundary box step reduces the set of results to a manageable number of items.  That allows a more intensive orthodromic calculation to be performed, providing a much more accurate set of results.

 

A second conclusion is that boundary box alone is not accurate enough.  With an even distribution of data, any number of results can exist outside of the radial distance but within the boundary box range. This changes with the longitude position -- the closer to either pole that the center exists, the less of an issue this is.

 

An example can be taken from the local Lucene implementation:

"com.pjaol.search.test.UnitTests.TestDistance.java"

20 venues with latitudes and longitudes are provided in the example, all based near Reston, VA.

 

Performing a search of 5 miles from Reston, VA:

  1. Boundary box returns 12 results, based within the minimal boundary box defined by the aviation algorithm.
  2. Orthodromic distance filtering reduces that to 5 results.

 

The results show that boundary box solutions while acceptable for most web applications falls down when it comes to precision. Boundary box also fails when there is a requirement to perform a sort by distance function.

 

The avionics calculations used for the boundary box step assume the earth is spherical, resulting in deviations of close to 3 miles per 25 mile radius for points which are not at a 90 degree norm to a central, and below the 30th parallel, which  covers 30% of the united states.

 

Before the document ends a comparison with other technologies has to be made, in particular how does geographical based searching in standard Lucene, Local Lucene and MySQL stand up against each other.

 

Taking 3 data sets:

  1. 20 venues provided with the local lucene extension
  2. 3,000+ venues through out the US
  3. 300,000+ venues through out the US

 

The following search was performed, show all venues within 5 miles of a point in Reston, VA.

 

The MySQL comparison was done using a sub-query Euclidian formula, for ease of comparison, it's faster than an orthodromic calculation but less exact. However it meets our needs for the comparison.

 

The results are as follows

 

 

Documents

20

3756

313909

MySql

0.01

0.1

0.59

Lucene

0.01

0.189

8.7

local Lucene

0.008

0.014

0.21

 

 

 

 

 

So as the comparison shows, while standard Lucene implementation of geographical based searches can be tempered with Local Lucene, MySQL is another option, local lucene can out perform it, but MySQL does offer advantages in terms of data management.

Appendix

 

  1. Software referenced

 

Lucene

Apache released search libraries

http://lucene.apache.org/

Geotools 2

Implementation of OpenGIS standard

http://geotools.codehaus.org/

Fulcrum API

A simple GIS set of functions

http://www.traversetechnologies.com/fulcrum/

Local Lucene

A lucene extension providing geographical based searching

http://www.nsshutdown.com/viewcvs/viewcvs.cgi/locallucene/

MySQL

Open source Database

http://www.mysql.com/

 

  1. Algorithms referenced

 

Boundary Box

performed by Fulcrum API

 

Based upon Edward Williams aviation v1.30 algorithms published http://williams.best.vwh.net/avform.htm

Provides the top left and bottom right corner positions of a box surrounding a circle of X miles radius.

 

Orthodromic Distance

performed by Geotools 2

 

The distance between 2 points on the globe based upon the ellipsoid formula using the major and minor axis of the WGS84 ellipsoid.

 

WGS84

performed by Geotools 2

 

The global standard measurement of the earth used in GPS navigation, expected to be accurate at sea level to within 10 feet at a precision of 8 decimal places.

The earth is measured as an ellipsoid rather than a sphere due to centripetal force caused by the earths rotation.

 

MySQL Euclidian Function

performed by MySQL 5.1

 

The following sub-query was performed against a table with event / venue details

 

SELECT eventname,

ACOS(SIN(@lat_A) * SIN(lat)  + COS(@lat_A) * COS(lat) * COS(lng - @long_A)) * 3956

AS distance

from ( select eventname, lat, lng

                            from events

                            where lat >= 38.88055075

                            and lat <= 39.05434124999999

                            and lng >= -77.47226092

                            and lng <= -77.24873708)

as boundary order by distance;