Dealing with Geospatial data

Dealing with Geospatial data -- pitfalls:
Our application has a number of geospatial tools built into it. As we've built these up, we've learned a number of things that seem common-sense in retrospect, but don't seem to be well documented anywhere.

Common-sense rules:
1. Never let anything cross the anti-meridian (180° longitude, which the International Date Line, or IDL, roughly follows)
2. Never let any bounding box be more than 180° longitude or 90° latitude
3. Put everything into the same projection
4. Pick a consistent order and position for your bounding box points (lat1 <= lat2, long1 <= long2; bottom left -> top right)
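
As a rough illustration of the last rule, the sketch below normalizes two corners given in any order so that the smaller latitude and longitude always come first. The class name `OrderedBox` and its methods are hypothetical, not taken from our codebase, and the range check is only basic input validation. Sorting the longitudes like this only makes sense for a box that does not cross the anti-meridian, so it assumes the first rule has already been applied.

```java
/** Hypothetical value class: always stores (minLat, minLon) first and (maxLat, maxLon) second. */
final class OrderedBox {
    final double minLat, minLon, maxLat, maxLon;

    private OrderedBox(double minLat, double minLon, double maxLat, double maxLon) {
        this.minLat = minLat;
        this.minLon = minLon;
        this.maxLat = maxLat;
        this.maxLon = maxLon;
    }

    /** Builds a box from two opposite corners supplied in any order. */
    static OrderedBox ofCorners(double lat1, double lon1, double lat2, double lon2) {
        requireRange(lat1, 90.0, "latitude");
        requireRange(lat2, 90.0, "latitude");
        requireRange(lon1, 180.0, "longitude");
        requireRange(lon2, 180.0, "longitude");
        // Assumption: the box does not cross the anti-meridian, so min/max is safe.
        return new OrderedBox(Math.min(lat1, lat2), Math.min(lon1, lon2),
                              Math.max(lat1, lat2), Math.max(lon1, lon2));
    }

    /** Rejects NaN and any coordinate whose absolute value exceeds the limit. */
    private static void requireRange(double value, double limit, String name) {
        if (Double.isNaN(value) || Math.abs(value) > limit) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
    }
}
```

Doing the ordering once, at construction time, means every later comparison can assume the same corner layout instead of re-checking it.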

A bit about our stack:
1. Postgres and PostGIS
2. Lucene (via Hibernate Search)
3. Java

Our initial goals covered two different types of queries:
1. Bounding box queries that let us search for documents whose boxes overlap. This could be done in PostGIS, but is best done in Lucene because we want to combine the bounding box query with other criteria such as title or status.
2. Using the geospatial data for keyword searching -- e.g. if you draw a box around the US, then this document is in the US. This is best done in the PostGIS database, as it requires us to look at the overlap of boxes and PostGIS is optimized for that sort of thing.

For the most part, all of our queries are okay, but as things build up, a bunch of the results really start to look funny or wrong. One example is when things in Alaska start showing up in results for Europe or Asia. These sorts of problems are symptoms of a larger issue -- dealing with where the chart goes from negative longitude to positive or vice versa. Once you understand these problems, they're not hard to fix with a bit of adjustment.

Following the basic rule that nothing can ever cross the anti-meridian, take any shape that crosses it and break it into two (one for either side); this solves half the problem. For example, both Alaska and Russia cross the IDL. If you separate each shape into two, one for the negative-longitude section and one for the positive, any overlap query will be more accurate. However, this is not enough: you must do the same with the bounding box itself. If you have a bounding box that crosses the IDL, you must break it into two as well. These rules go for Lucene as well.
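
Splitting a bounding box can be sketched roughly as follows. The names `AntimeridianSplit` and `Box` are hypothetical, and the sketch assumes the common convention that a box whose west longitude is greater than its east longitude wraps across the anti-meridian; your own input format may signal this differently.

```java
import java.util.ArrayList;
import java.util.List;

final class AntimeridianSplit {
    /** Hypothetical box type: south/north latitudes and west/east longitudes, in degrees. */
    static final class Box {
        final double south, west, north, east;

        Box(double south, double west, double north, double east) {
            this.south = south;
            this.west = west;
            this.north = north;
            this.east = east;
        }
    }

    /**
     * Returns the box unchanged if it stays on one side of the anti-meridian,
     * otherwise two boxes, one per side.
     * Assumption: west > east means the box wraps across the anti-meridian.
     */
    static List<Box> split(Box box) {
        List<Box> parts = new ArrayList<>();
        if (box.west <= box.east) {
            parts.add(box);
        } else {
            // Positive-longitude piece: from the western edge up to +180.
            parts.add(new Box(box.south, box.west, box.north, 180.0));
            // Negative-longitude piece: from -180 up to the eastern edge.
            parts.add(new Box(box.south, -180.0, box.north, box.east));
        }
        return parts;
    }
}
```

For example, a box running from 170 to -160 degrees longitude comes back as one piece covering 170 to 180 and another covering -180 to -160, and each piece can then be indexed or queried like any ordinary box.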

So, now you've complicated matters: any query may end up being the overlap of four boxes instead of two, but this is okay. It will start to give you accurate results, at least for basic overlap queries. But what if you want the overlap to be a bit more intelligent? E.g. if I draw a bounding box around Washington DC, it will likely overlap some of Virginia and Maryland, but do you really want to bring back those shapes too?

Options:
* you could simply look at overlaps
* you could calculate percent overlap of the bounding box to the shape
* you could calculate percent overlap from the shape to the bounding box
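
The two percentage options differ only in which area you divide by. The sketch below is a simplified illustration, not our implementation: it treats the shape as its own bounding box, measures area in plain square degrees rather than true surface area, and uses hypothetical names (`OverlapRatio`, `Box`, `coveredFraction`). The list-based method covers the case where both sides have been split at the anti-meridian, which gives up to four pairs of pieces to compare.

```java
import java.util.List;

final class OverlapRatio {
    /** Hypothetical box type that never crosses the anti-meridian (west <= east). */
    static final class Box {
        final double south, west, north, east;

        Box(double south, double west, double north, double east) {
            this.south = south;
            this.west = west;
            this.north = north;
            this.east = east;
        }

        /** Planar area in square degrees; a simplification, not real surface area. */
        double area() {
            return (north - south) * (east - west);
        }
    }

    /** Area shared by two boxes, or 0 if they do not overlap. */
    static double intersectionArea(Box a, Box b) {
        double width = Math.min(a.east, b.east) - Math.max(a.west, b.west);
        double height = Math.min(a.north, b.north) - Math.max(a.south, b.south);
        return (width > 0 && height > 0) ? width * height : 0.0;
    }

    /** Sums the shared area over every pair of pieces (up to four pairs after splitting). */
    static double intersectionArea(List<Box> a, List<Box> b) {
        double total = 0.0;
        for (Box pieceA : a) {
            for (Box pieceB : b) {
                total += intersectionArea(pieceA, pieceB);
            }
        }
        return total;
    }

    /** Fraction (0..1) of `target` that is covered by `other`. */
    static double coveredFraction(Box target, Box other) {
        double area = target.area();
        return area > 0 ? intersectionArea(target, other) / area : 0.0;
    }
}
```

Calling `coveredFraction(queryBox, shapeBox)` answers "how much of what I drew falls inside this shape", while `coveredFraction(shapeBox, queryBox)` answers "how much of this shape did I draw around". In the Washington DC case the second number would be small for a large neighbouring state, so a threshold on it is one possible way to filter such shapes out.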
