Big data has many applications, but none perhaps as ubiquitous and overlooked as geolocation. All of the mapping and navigation systems we use daily depend on geo data that is constantly being refreshed. The range of things we map is mind-boggling, from submarine cables to Napoleon’s retreat from Moscow, and you can find many of them at ESRI, a private company founded in 1969 that works in the geographic information systems (GIS) space.
ESRI has a popular solution called ArcGIS for managing big data that enables companies to visualize and analyze information at terabyte scale, revealing patterns, trends, and relationships in a way that reports don’t, or can’t. Since big data is stored in many different places, a common challenge in creating GIS applications is getting fast access to disparate data. Developers have to constantly change the application to accommodate different data stores.
I recently spoke with Mansour Raad, senior software architect at ESRI and a regular speaker at big data conferences, to see how he works with diverse datasets when he creates GIS applications.
Geo data and apps
TechRepublic: What kind of apps are you building?
Mansour Raad: Our GIS software is used by 70% of the Fortune 500. We develop computer systems for capturing, storing, checking, and displaying data related to positions on the planet. GIS is big data by definition: it displays a lot of different kinds of data on one map. That helps people see, analyze, and understand patterns and relationships.
ArcGIS, the platform we use for big data apps, has a unique set of capabilities for applying location-based analysis to your business practices. You can easily share insights and collaborate with others via maps, apps, and reports. Specifically, you can perform spatial analytics, mapping and visualization, 3D GIS, real-time GIS, and imagery and remote sensing. It all relies on massive amounts of data.
GIS data gets gnarly
TechRepublic: Why is GIS data particularly hard?
Raad: There is a plethora of backend distributed data stores. I am always using S3, Apache Hadoop HDFS, or OpenStack Swift with my GIS applications, either to read geospatial data from these backends or to save my data into them. Some of these distributed data stores are not natively supported by ESRI’s ArcGIS platform (what we use to create GIS apps using big data). You can extend the platform with ArcPy to handle these situations.
But depending on the data store, I have to use a different API, mostly Python based. It’s not optimal. Accessing and storing data in unsupported data stores requires developers to constantly change their program for each data store. This slows development cycles and makes customers wait much longer to get insights from the data.
Dealing with GIS data
TechRepublic: How did you get around this problem of constantly changing APIs and apps to accommodate different data stores?
Raad: We found an interesting open source project out of UC Berkeley’s AMPLab that is now developed and supported by a commercial company in Silicon Valley called Alluxio. Alluxio provides a memory-speed distributed system that virtualizes data across disparate storage systems and, most important, provides a unified global namespace that enables new workflows across data in any storage system.
This means that, at the application level, the code that accesses the data does not have to change when the underlying data store does. As a bonus, with its REST endpoint, Alluxio simplifies the integration with ArcGIS to write, read, and visualize GIS data. With Alluxio in the data architecture, accessing data from data stores not natively supported by ArcGIS becomes easier.
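To make the idea of a unified namespace concrete, here is a minimal Python sketch. It is a toy illustration of the general technique, not Alluxio's API or ESRI's code: the `InMemoryStore`, `UnifiedNamespace`, and `load_points` names are hypothetical, and the in-memory stores merely stand in for backends such as S3 or HDFS. It shows how a mount table can route a single path scheme to different stores, so that the application function reading the data stays the same.

```python
from typing import Dict, List, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a real backend (S3, HDFS, Swift)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def read(self, key: str) -> bytes:
        return self._objects[key]


class UnifiedNamespace:
    """Toy mount table: one path scheme in front of many stores."""

    def __init__(self) -> None:
        self._mounts: Dict[str, InMemoryStore] = {}

    def mount(self, prefix: str, store: InMemoryStore) -> None:
        self._mounts[prefix.rstrip("/")] = store

    def _resolve(self, path: str) -> Tuple[InMemoryStore, str]:
        # Longest matching mount prefix wins.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self._mounts[prefix], path[len(prefix):].lstrip("/")
        raise FileNotFoundError(path)

    def write(self, path: str, data: bytes) -> None:
        store, key = self._resolve(path)
        store.write(key, data)

    def read(self, path: str) -> bytes:
        store, key = self._resolve(path)
        return store.read(key)


def load_points(ns: UnifiedNamespace, path: str) -> List[Tuple[float, float]]:
    """Application code: parses 'lon,lat' lines, unaware of the backend."""
    points = []
    for line in ns.read(path).decode("utf-8").splitlines():
        if line.strip():
            lon, lat = line.split(",")
            points.append((float(lon), float(lat)))
    return points
```

In this sketch, mounting one store at `/s3` and another at `/hdfs` lets `load_points` read from either by changing only the path string. Moving a dataset between backends then becomes a mount-table change rather than a code change, which is the property Raad describes.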