Everyone seems to be talking about big data now (except at the birthday parties I go to, where they are still discussing Teenage Mutant Ninja Turtles - an acceptable topic when you have a five-year-old). Big data is a subject that reaches across all avenues of IT, from system admins trying to store it, to network admins trying to provide access to it, to business analysts trying to make sense of it.
Like the rumbling of a nearby plane starting to take off, I wasn’t really aware of the concept of big data until the noise surrounding it became so loud that it was hard to hear anything else.
Wikipedia states: “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.” What defines “large and complex” can be subjective, but at the upper end this would be information to the tune of exabytes. I think it’s safe to say many of us probably own or have owned a 1 terabyte drive. Picture a million of these drives (one exabyte in total) hooked up to a system that is trying to analyze what’s on them and provide meaningful results in a short amount of time. Right now, that’s big data.
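If you want to check the "million drives" math yourself, here is a minimal sketch. It assumes decimal (SI) units, where 1 terabyte is 10^12 bytes and 1 exabyte is 10^18 bytes; the variable names are mine, for illustration only.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; binary units differ slightly).
TERABYTE = 10**12
EXABYTE = 10**18

drives = 1_000_000            # one million drives
drive_size = 1 * TERABYTE     # each drive holds 1 TB

total_bytes = drives * drive_size
total_exabytes = total_bytes / EXABYTE

print(f"{drives:,} drives x 1 TB = {total_exabytes:g} EB")
```

In other words, a million 1 terabyte drives add up to one exabyte.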
What’s being analyzed?
When I first started reading about big data I assumed it meant large databases full of marketing details or customer records. This may apply, but big data can also refer to email, office documents, logs, picture files - anything that constitutes valuable information to your organization.
What options does Google offer?
On the Google front, that’s where BigQuery comes in.
Google states that “BigQuery is a web service that lets you do interactive analysis of massive datasets - up to billions of rows. Scalable and easy to use, BigQuery lets developers and businesses tap into powerful data analytics on demand. BigQuery works best for interactive analysis of very large datasets, typically using a small number of very large, append-only tables. For more traditional relational database scenarios, you might consider using Google Cloud SQL instead.”
That last sentence might be tricky. Google offers a comparison to help clarify the differences between the two services.
Because BigQuery is a closer fit to the concept of “Big Data” I’ll analyze it further for the purpose of this article.
What can BigQuery specifically do?
Let’s face it: data, and the handling thereof, can be a dry topic without concrete examples. Google offers several case studies to outline the benefits of BigQuery. One such study discusses how a company named Crystalloids built a BigQuery-based application for a client (PDF). This client, Center Parcs Europe, ran a resort network and wanted to figure out which marketing techniques would work best to reach prospective guests before busy vacation periods.
Crystalloids set up the new web application for Center Parcs Europe to allow them to “focus on specific data by simply clicking a button - for example, to see booking information for a particular country or to isolate a certain time frame.” The query results were then turned into charts and graphs via the Google Visualization API.
BigQuery chugs through millions of records within a few seconds. As a point of comparison, other options might take eight minutes. As a result, Center Parcs Europe could “access booking information, set pricing and maximize income.” They saved $150K per year in operational costs and - best of all - since the application is cloud-based, they did not need to spend almost $800K to run it locally.
Other specific examples from the case studies regarding the uses of BigQuery:
- “spot emerging trends in data.”
- “assess the effectiveness of user acquisition campaigns.”
- “look at user reach, retention and revenue metrics to see if a game or app is gaining traction before investing too heavily in it.”
- “identify the most active users and entice them to introduce new players to games.”
- “learn how many times customers searched for seats and found none or very few available, indicating more seats should be added to a route.”
- “investigate decreases in bookings and notify engineers if a technical problem is the cause.”
- “identify server problems by quickly analyzing data related to server activity.”
- “discover how many unique visitors are viewing blogs and websites.”
How do I feed my data to BigQuery?
You can upload your data in CSV or JSON format. Multiple files are allowed (up to 500) with a maximum job size of 1 terabyte at a time. You can import this information from your computer or by using Google Cloud Storage (which Google recommends for more efficient data upload). Google also has a quota policy for queries and import/export jobs, which states that you can run up to 20 concurrent queries (or queries totaling 200GB), plus one query of unlimited size.
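Before kicking off an upload, it can help to sanity-check a batch of files against the limits quoted above (CSV or JSON only, up to 500 files, 1 terabyte per job). The following is a rough sketch of such a pre-flight check. It does not call BigQuery at all; the `SourceFile` class and `check_load_job` function are hypothetical names of my own, and I am assuming 1 terabyte means 10^12 bytes.

```python
from dataclasses import dataclass

# Limits as quoted in the article; treat them as a snapshot, not current policy.
MAX_FILES_PER_JOB = 500
MAX_JOB_BYTES = 10**12                  # 1 terabyte, assumed to be 10**12 bytes
ALLOWED_EXTENSIONS = (".csv", ".json")


@dataclass
class SourceFile:
    """Hypothetical description of one file you plan to upload."""
    name: str
    size_bytes: int


def check_load_job(files):
    """Return a list of problems with the batch; an empty list means it fits."""
    problems = []
    if len(files) > MAX_FILES_PER_JOB:
        problems.append(
            f"{len(files)} files exceeds the {MAX_FILES_PER_JOB}-file limit"
        )
    total = sum(f.size_bytes for f in files)
    if total > MAX_JOB_BYTES:
        problems.append(
            f"{total} bytes exceeds the {MAX_JOB_BYTES}-byte job limit"
        )
    for f in files:
        # str.endswith accepts a tuple of suffixes.
        if not f.name.lower().endswith(ALLOWED_EXTENSIONS):
            problems.append(f"{f.name}: not a CSV or JSON file")
    return problems


batch = [SourceFile("bookings_2012.csv", 4 * 10**9),
         SourceFile("bookings_2013.json", 6 * 10**9)]
print(check_load_job(batch) or "batch fits within the quoted limits")
```

A batch that is too large would need to be split into several jobs before upload.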
Google provides the full details on uploading your data via a page called “Big Query Data Ingestion Best Practices and Cookbook.” It sounds like something that might be on Jabba the Hutt’s reading shelf, but it is actually a very comprehensive guide. Google also offers a best practices page for BigQuery that is worth checking out; new updates may be added later.
How do I use BigQuery to access my data?
How much does BigQuery cost?
Pricing for BigQuery is as follows. Note that the first 100GB of data you process each month is free:
In addition, when you run queries, you are charged only for the columns your query processes, rather than for the entire table.
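To see why per-column billing matters, here is a small sketch of the arithmetic. The table, its column sizes, and the price per gigabyte are all made-up placeholders of mine, not Google's actual rates; the only figure taken from the article is the free 100GB per month, which I am assuming is measured in decimal gigabytes.

```python
GIGABYTE = 10**9                          # decimal GB (an assumption)
FREE_BYTES_PER_MONTH = 100 * GIGABYTE     # first 100GB each month is free


def bytes_processed(column_sizes, selected_columns):
    """Sum only the columns a query touches, not the whole table."""
    return sum(column_sizes[name] for name in selected_columns)


def billable_bytes(processed_this_month):
    """Bytes left to pay for once the monthly free allowance is used up."""
    return max(0, processed_this_month - FREE_BYTES_PER_MONTH)


def estimate_cost(processed_this_month, price_per_gb):
    """price_per_gb is a placeholder; look up the real rate before relying on this."""
    return billable_bytes(processed_this_month) / GIGABYTE * price_per_gb


# Hypothetical table: one wide free-text column dominates the storage.
bookings = {
    "booking_id": 40 * GIGABYTE,
    "country": 20 * GIGABYTE,
    "notes": 400 * GIGABYTE,
}

narrow_query = bytes_processed(bookings, ["booking_id", "country"])
full_scan = bytes_processed(bookings, list(bookings))

print("narrow query GB:", narrow_query // GIGABYTE)
print("full scan GB:", full_scan // GIGABYTE)
```

In this invented example, a query that reads only two columns stays under the free allowance, while a query that touches every column does not.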
As a system administrator, it’s been a goal of mine for many years to change the perception that IT merely exists to “fix broken computers” or “change passwords.” That’s like keeping a 1968 Ford Mustang in the garage for the sole purpose of a quarterly trip to the dentist. In reality, the purpose of information technology and its workers should be to enrich businesses; to help them get the most out of their data and increase profits, promote growth, and cut costs. BigQuery is a perfect example of this concept.
Now if it could just help me analyze my 12 years of vacation photos to figure out where the heck we were, I’d have one more project crossed off my to-do list.