Big data isn’t new. We had fairly sophisticated data infrastructure long before Hadoop, Spark, and the like came into being. No, the big difference in big data today is that all this fantastic data infrastructure is open source software running on commodity servers.

Over a decade ago, entrepreneur Joe Kraus declared that “There’s never been a better time to be an entrepreneur because it’s never been cheaper to be one,” and he was right, though he couldn’t have foreseen how much so. Though Kraus extolled the virtues of Linux, Tomcat, Apache HTTP server, and MySQL, today’s startups have access to a dazzling array of the best big data infrastructure that money doesn’t need to buy.

In this way, startups are able to put a target on the backs of much better-funded enterprise rivals.

Arbitraging eyeballs

Take Bidtellect, for example, an adtech startup. The Bidtellect platform helps advertisers, agencies, and media companies deliver targeted native ads across all devices, in any format. In practice, this means that Bidtellect must track and analyze the potential inventory of ad placements–which number in the millions daily–to see how each is affected by numerous variables. Once ads start running, it’s essential to track their performance against client KPIs.

As Jeremy Kayne, Bidtellect’s CTO, told me in an interview, Bidtellect is engaged in “a kind of arbitrage,” whereby the company buys inventory on a per-impression (per-display) basis, but then sells ads on a per-click basis. To build a viable business rather than a candidate for bankruptcy protection, he said, “it’s essential that we’re able to predict how many clicks an ad will generate on a given site, on a certain device type, at a certain time of day, and across scores of other variables–so we can price it right and make a fair profit.”
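The arithmetic behind that arbitrage can be sketched in a few lines of Python. This is an illustration only, not Bidtellect’s pricing model: the function names, the cost per thousand impressions, the predicted click-through rate, and the margin are all hypothetical values chosen to show why the click prediction matters.

```python
def break_even_cpc(cost_per_thousand: float, predicted_ctr: float) -> float:
    """Lowest per-click price that covers the cost of the impressions bought.

    cost_per_thousand: assumed price paid for 1,000 impressions.
    predicted_ctr: assumed fraction of impressions that produce a click.
    """
    if not 0 < predicted_ctr <= 1:
        raise ValueError("predicted_ctr must be in the range (0, 1]")
    cost_per_impression = cost_per_thousand / 1000
    return cost_per_impression / predicted_ctr


def price_per_click(cost_per_thousand: float, predicted_ctr: float, margin: float) -> float:
    """Break-even per-click price marked up by a target margin (0.2 means 20%)."""
    return break_even_cpc(cost_per_thousand, predicted_ctr) * (1 + margin)


if __name__ == "__main__":
    # Hypothetical inputs: $2.00 per 1,000 impressions, 0.4% predicted CTR.
    print(f"break-even per click: ${break_even_cpc(2.00, 0.004):.2f}")
    # If the real CTR turns out to be half the prediction, break-even doubles.
    print(f"break-even at half the CTR: ${break_even_cpc(2.00, 0.002):.2f}")
    print(f"price with 20% margin: ${price_per_click(2.00, 0.004, 0.20):.2f}")
```

Because the break-even price is inversely proportional to the click-through rate, overestimating clicks by a factor of two means each click actually costs twice what was budgeted, which is why the prediction has to hold across site, device, and time of day.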

This is where big data comes in.

“To accurately make these predictions, identify viable advertising opportunities, and negotiate workable rates and pricing, we had to find a practical way to collect, manage, and understand the billions of transactions and data points involved,” Kayne said.

The system that collects and tracks all of this information holds petabytes of data. This is big, but it’s about to get bigger. As Kayne detailed, Bidtellect is currently ramping up its daily data capture from one billion to five billion transactions, with the goal of soon reaching 15 billion transactions a day.
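To put those daily volumes in per-second terms, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that transactions arrive evenly across the day; real ad traffic is likely to have peaks well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(transactions_per_day: int) -> float:
    """Average transactions per second, assuming an even spread over 24 hours."""
    return transactions_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Daily volumes quoted in the article.
    for daily in (1_000_000_000, 5_000_000_000, 15_000_000_000):
        print(f"{daily:,} per day is about {per_second(daily):,.0f} per second")
```

On that assumption, one billion transactions a day averages out to roughly 11,600 per second, and 15 billion to roughly 174,000 per second.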

Scaling to 15 billion

Founded in 2014, Bidtellect didn’t initially focus on scale, preferring to optimize for performance. This allowed the company to outsource its analytics to a third-party service provider, Metamarkets. Metamarkets replicated data from Bidtellect’s advertising ecosystem, managing it in separate databases used to run queries and produce reports. Bidtellect analysts requiring new insights had to request them from Metamarkets, then wait as queries were developed and executed.

While this worked in the company’s early days, the arrangement began to pose three problems. The first was cost, with Bidtellect spending over $300,000 a year, and projections pegged at over $1 million a year as Bidtellect looked to scale. Making matters worse, “A lot of this cost was simply storing the same data twice and keeping it in sync,” Kayne said.

The next problem was inaccuracy. Data in Bidtellect’s system and in Metamarkets’ copy would frequently become inconsistent, which undermined confidence in analyses. Finally, accessibility was “a real pain” as Bidtellect “couldn’t easily access raw data to query the source directly.”

Something had to give.

That something was Bidtellect’s relationship with Metamarkets. The company scrapped the increasingly expensive arrangement in favor of a modern architecture built on Cloudera and Zoomdata, which can not only scale to meet its volumes but is already saving Bidtellect nearly a million dollars a year. Importantly, the choice to go with Cloudera and Zoomdata meant that Bidtellect was also embracing some incredibly powerful open source software.

To support the ingestion of 50 million records per hour, Bidtellect is relying on popular open source big data frameworks, including:

  • Apache Kafka to create a consistent and dependable data flow for distributed messaging.
  • Apache Spark to perform fast, large-scale data aggregation.
  • Apache Hadoop (HDFS) for distributed processing and storage.
  • Apache Impala as a query engine that uses massively parallel processing to provide high-performance, high-scale analytic access directly from HDFS (Hadoop) data stores.
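To make the aggregation step concrete, here is a plain-Python sketch of the kind of roll-up such a pipeline might compute: click-through rate per site, device type, and hour. It is not Spark, Kafka, or Impala code and not Bidtellect’s implementation; the event layout and the function name are hypothetical, and a real deployment would run the equivalent logic as a distributed job over far larger volumes.

```python
from collections import defaultdict


def click_through_rates(events):
    """Return CTR keyed by (site, device, hour).

    events: iterable of (site, device, hour, clicked) tuples, a hypothetical
    record layout where each tuple is one ad impression.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for site, device, hour, clicked in events:
        key = (site, device, hour)
        impressions[key] += 1          # every record is one impression
        clicks[key] += int(clicked)    # count it as a click only if clicked
    return {key: clicks[key] / impressions[key] for key in impressions}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("news.example", "mobile", 9, True),
        ("news.example", "mobile", 9, False),
        ("news.example", "desktop", 9, False),
        ("blog.example", "mobile", 21, True),
    ]
    for key, ctr in sorted(click_through_rates(sample).items()):
        print(key, f"{ctr:.2f}")
```

Per-segment rates like these are the sort of input the per-click pricing decision described earlier would depend on.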

This is the world we live in, one fueled by increasingly powerful open source technology, made more easily consumable by vendors like Zoomdata and Cloudera. It means that startups like Bidtellect can punch well above their weight, reshaping industries just as open source-powered Uber has done to the taxi and car rental industries.