Geographical Based Searching Solution Using Lucene
by Patrick O'Leary (pjaol at pjaol.com)
05/2007
Within a few years of the internet being opened to the general public, it was quickly identified there was a need to create a service which could locate web pages based upon topics or keyword, and the first web search engine was created. Over the years many search engines evolved and extended usage to search for specific niches, such as images, videos, news / sports / weather articles and so on. A need was also discovered to provide a web based search engine to provide results of information beyond the web, such as a geographical based search service, and so MapQuest was born.
MapQuest was the original geographical based search service providing routing information between two locations. Subsequently it also provided places of interest based upon the address a user submitted.
After that, a new need was identified to provide businesses, venues and events within specific catchments. Another product called Digital City sprung up to fill this need, and later AOL Local Search was a spin off that allowed users to search within a radius of their specified location.
Google maps has since revolutionized geographical based search, allowing results to be retrieved purely upon user input, rather than specifying a location.
The purpose of this document is to describe how to implement geographical based searching using available technologies on the web.
The technologies we choose are:
Lucene is a search engine framework, but shouldn't be confused with a full-blown search engine per se. Lucene offers a rich indexing and a formal search API, all based around a core object structure offering high performance text based searching.
Geotools-2 is an implementation of the OpenGIS standards, providing precise calculations.
Fulcrum API is a simple set of functions offering less precise geographical calculations based upon avionics.
Local Lucene is an extension of Lucene providing an implementation of geographical searching using the above technologies.
Given a series of items at specific latitudes and longitudes, find all items within 5 miles of a current location sorted upon distance from that location.
In order to get optimal performance, there are 3 basic steps that must be performed in the following sequence.
Using the aviation formulary, create a boundary box 5 x 2 miles wide and 5 x 2 miles high. Centered up the users location, so 5 miles due north/south, and 5 miles due east/west of a center point.

Figure 1
Set lat1, long1 to the position of the top left hand corner of the box respectively.
Set lat2, long2 to the position of the bottom right corner of the box respectively.
Perform a range search where items latitude is between lat1 and lat2, and where items longitude is between long1 and long2.
This basic step, called a minimal boundary search, reduces the set of results to a manageable set of items allowing more intensive calculations to be performed on the smaller set of results.
Using the formula for orthodromic based upon the WGS84 ellipsoid calculate the distance between each item and the center point.

Figure 2
Any result outside of the radius specified is dropped from the result set.
e.g. Such as i1, i7 and i9.
All distances are stored along with the identity of the result item for sorting and display.
Using the distances provided by step 2, perform a sort on the distances and return result items ordered by distance.
This provides results in the following order: i5, i4, i2, i8, i3, and i6.
Based upon the assumption that more items exist outside of the search area than inside it, one can see the boundary box step reduces the set of results to a manageable number of items. That allows a more intensive orthodromic calculation to be performed, providing a much more accurate set of results.
A second conclusion is that boundary box alone is not accurate enough. With an even distribution of data, any number of results can exist outside of the radial distance but within the boundary box range. This changes with the longitude position -- the closer to either pole that the center exists, the less of an issue this is.
An example can be taken from the local Lucene implementation:
"com.pjaol.search.test.UnitTests.TestDistance.java"
20 venues with latitudes and longitudes are provided in the example, all based near Reston, VA.
Performing a search of 5 miles from Reston, VA:
The results show that boundary box solutions while acceptable for most web applications falls down when it comes to precision. Boundary box also fails when there is a requirement to perform a sort by distance function.
The avionics calculations used for the boundary box step assume the earth is spherical, resulting in deviations of close to 3 miles per 25 mile radius for points which are not at a 90 degree norm to a central, and below the 30th parallel, which covers 30% of the united states.
Before the document ends a comparison with other technologies has to be made, in particular how does geographical based searching in standard Lucene, Local Lucene and MySQL stand up against each other.
Taking 3 data sets:
The following search was performed, show all venues within 5 miles of a point in Reston, VA.
The MySQL comparison was done using a sub-query Euclidian formula, for ease of comparison, it's faster than an orthodromic calculation but less exact. However it meets our needs for the comparison.
The results are as follows
|
Documents |
20 |
3756 |
313909 |
|
MySql |
0.01 |
0.1 |
0.59 |
|
Lucene |
0.01 |
0.189 |
8.7 |
|
local Lucene |
0.008 |
0.014 |
0.21 |

So as the comparison shows, while standard Lucene implementation of geographical based searches can be tempered with Local Lucene, MySQL is another option, local lucene can out perform it, but MySQL does offer advantages in terms of data management.
|
Lucene |
Apache released search libraries |
|
|
Geotools 2 |
Implementation of OpenGIS standard |
http://geotools.codehaus.org/ |
|
Fulcrum API |
A simple GIS set of functions |
http://www.traversetechnologies.com/fulcrum/ |
|
Local Lucene |
A lucene extension providing geographical based searching |
|
|
MySQL |
Open source Database |
http://www.mysql.com/ |
performed by Fulcrum API
Based upon Edward Williams aviation v1.30 algorithms published http://williams.best.vwh.net/avform.htm
Provides the top left and bottom right corner positions of a box surrounding a circle of X miles radius.
performed by Geotools 2
The distance between 2 points on the globe based upon the ellipsoid formula using the major and minor axis of the WGS84 ellipsoid.
performed by Geotools 2
The global standard measurement of the earth used in GPS navigation, expected to be accurate at sea level to within 10 feet at a precision of 8 decimal places.
The earth is measured as an ellipsoid rather than a sphere due to centripetal force caused by the earths rotation.
performed by MySQL 5.1
The following sub-query was performed against a table with event / venue details
SELECT eventname,
ACOS(SIN(@lat_A) * SIN(lat) + COS(@lat_A) * COS(lat) * COS(lng - @long_A)) * 3956
AS distance
from ( select eventname, lat, lng
from events
where lat >= 38.88055075
and lat <= 39.05434124999999
and lng >= -77.47226092
and lng <= -77.24873708)
as boundary order by distance;