Big data and analytics are moving into more mature stages of deployment.
This is good news, especially for small to mid-sized companies that are deploying the technology and have been struggling to define a big data architecture.
Uncertainty about how to define an overarching architecture for big data and analytics is one of the reasons why mid- and small-sized companies have lagged in their big data and analytics deployments. In many cases, they have chosen to wait on the sidelines to see how trends like hybrid computing, data marts and master databases, and control over security and governance were going to play out.
At last, there seems to be an emerging best practice data architecture that everyone can follow. In this architecture:
Cloud services are being used to store and process big data; and
On-premise computing is being used to develop local data marts throughout companies, where business units perform their own analytics.
Let’s take a closer look at the reasoning behind this big data and analytics architecture:
The role of cloud
If your company is small or mid-sized, it is cost-prohibitive to start buying clusters of servers that parallel process big data in your data center, not to mention hiring or cross-training the very expensive professionals who know how to optimize, upgrade and maintain a parallel processing environment. Companies that opt to process and store their data onsite also face considerable investments in hardware, software and storage. These economics point to outsourcing your big data hardware, software, processing and storage to the cloud.
Governance (e.g., security and compliance concerns) is one reason why companies remain reluctant to consign all of their mission-critical data to the cloud, where it is more difficult to oversee the stewardship of this data. Consequently, many companies opt to move data into their own on-premise data centers once the data has been processed in the cloud.
There is also a second reason why many companies opt to go on-premise with their processed data: concern about the proprietary applications and algorithms developed to mine this data. Many cloud providers have a policy that any applications their customers develop in the cloud may be shared with other customers.
By keeping their apps in-house, and developing an on-premise master dataset that smaller data marts can be splintered from, companies maintain direct control over their data and apps.
What are the takeaways for analytics managers?
1. You should understand and agree with how your cloud provider is going to process and protect your data.
For instance, if your organization is required to anonymize data, the process for doing this should be documented and agreed to with your cloud provider, since the cloud provider will be doing the anonymization. If you want your data cleaned, the cleansing process should also be spelled out in writing with your cloud provider. For example, do you only want all state abbreviations to be uniform (e.g., “Tenn” and “Tennessee” = “TN”), or do you want other edits done to your data so that it is uniform and easy to process? Finally, whether you are in a dedicated tenant or a multi-tenant environment at the cloud provider, the provider should be able to guarantee that your data will never be shared with other clients.
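To make a cleansing rule like the state abbreviation example concrete enough to put in writing, it can help to express it as a small lookup. The sketch below is only an illustration: the `STATE_CODES` table and the `normalize_state` function are hypothetical names, and a real rule set would cover every state and whatever variants you and your provider agree on.

```python
import re

# Hypothetical lookup table for illustration; extend it with the full set
# of states and spelling variants agreed on with the cloud provider.
STATE_CODES = {
    "tn": "TN",
    "tenn": "TN",
    "tennessee": "TN",
    "ky": "KY",
    "kentucky": "KY",
}


def normalize_state(raw):
    """Return the two-letter code for a state value, or None if unrecognized."""
    # Lowercase, drop punctuation such as the period in "Tenn.", collapse spaces.
    key = re.sub(r"[^a-z ]", "", raw.strip().lower())
    key = " ".join(key.split())
    return STATE_CODES.get(key)


if __name__ == "__main__":
    for value in ["Tenn", "Tennessee", " tenn. ", "Tennesee"]:
        print(repr(value), "->", normalize_state(value))
```

Returning `None` for unrecognized values, rather than guessing, leaves it to the written agreement to say what happens to records that fail the rule.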
2. Your on-premises big data and analytics architecture should be documented with new policies and procedures that fit the needs of big data.
Many IT departments miss this task altogether. They just get going on their big data projects and forget that existing policies and procedures for application development come from the transactional application world. Don’t make this mistake at your shop. Instead, revise policies and procedures in the areas that are highly likely to interact with big data, like storage, database administration, and applications.
3. Disaster recovery plans should be updated and tested for big data both on-premise and in the cloud.
In the case of cloud-based disaster recovery (DR) testing, you should include a provision for documenting and performing DR with the vendor in your contract. DR plans, which focus on transactional data and systems, should also be updated to include test scripts for big data and analytics recovery and restoration.
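One simple check that a big data DR test script might include is confirming that a restored file is identical to the original. The sketch below assumes both copies are reachable as local file paths, which may not match your environment, and the function names `file_digest` and `restore_matches` are hypothetical. A real test plan would cover far more than a single file comparison.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """Return True when the restored file is byte-for-byte identical."""
    return file_digest(original_path) == file_digest(restored_path)
```

A script along these lines could be run against a sample of restored files after each DR test, with the results recorded alongside the rest of the test documentation.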