The Underlying Big Data Problem: Unnecessary Growth

Our proficiency in collecting data in pursuit of knowledge is actually making it harder to derive value from that data, thanks to data silos. Consider this: data silos store copies of captured data in separate, disparate units that lack basic backend integration, so insights cannot be gathered from them quickly or effectively. This has created the need for expensive, custom and time-consuming ETL/BI solutions that tie the siloed data together so knowledge can be gained from it. Unfortunately, this process often creates additional copies of the data. So, in essence, the very process of managing and leveraging big data has made the big data problem bigger!

Existing approaches in this space have proven costly and ineffective at an operational level, and the value gained from insights into enterprise data shrinks as that cost grows, especially at scale. Data silo platforms are large, complex and expensive, and grow in lockstep with production system data. As these platforms and their data grow, so do the difficulties of integrating the data. No matter how well an ETL solution is designed, the time required to copy and transform the data is significant. ETL models limit the size and complexity of the data that can be analyzed and inherently devalue the data by increasing the total time to analysis.

Another inherent issue with this model is “copy data.” As big data grows, the problem is exacerbated by the requirement to copy data, in most cases several times, into and out of various data silos and analytics platforms. Even with deduplication technologies deployed within these silos, the time and money required to handle the copies is enormous and will continue to grow.

A New Species Evolves: Effective Data Integration and Data Availability

The panacea businesses are searching for is Data Availability: a simplified, unified, integrated view of business data in real time or near real time, for the purpose of quickly and efficiently gaining knowledge and insight from all these data sets. Data virtualization solutions deliver data availability by integrating data from disparate sources, locations and formats without replicating the source data. The result is a single virtual data layer delivering unified data services: faster access to all your data, fewer copies and lower cost, and more agility to adapt as data sources and business needs change. In this model, a business’s data analysis needs can be met, refined and modified in near real time, because changes to backend data models and sources are abstracted away by the virtual data layer.
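To make the model concrete, here is a minimal sketch in Python of the idea behind a virtual data layer: two disparate backends (an in-memory SQLite table and a plain list standing in for an API feed) are exposed through one unified query interface, and rows are fetched from each source at access time rather than copied into a central store. The class and source names are illustrative assumptions, not any vendor’s actual API.

```python
# Minimal sketch of a virtual data layer: disparate sources behind one
# interface, with queries delegated at access time -- no copies made.
import sqlite3
from typing import Any, Dict, Iterable


class SqliteSource:
    """Adapter that reads rows from a relational source on demand."""
    def __init__(self, conn: sqlite3.Connection, query: str):
        self.conn, self.query = conn, query

    def rows(self) -> Iterable[Dict[str, Any]]:
        cur = self.conn.execute(self.query)
        cols = [c[0] for c in cur.description]
        for row in cur:
            yield dict(zip(cols, row))


class ListSource:
    """Adapter standing in for a REST/JSON feed."""
    def __init__(self, records):
        self.records = records

    def rows(self):
        yield from self.records


class VirtualLayer:
    """Unified view over many sources; consumers never see the backends."""
    def __init__(self, sources):
        self.sources = sources

    def select(self, predicate=lambda r: True):
        for source in self.sources:
            for row in source.rows():
                if predicate(row):
                    yield row


# --- demo: one query spans two very different backends ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("acme", 120.0), ("globex", 75.5)])

layer = VirtualLayer([
    SqliteSource(conn, "SELECT customer, total FROM orders"),
    ListSource([{"customer": "initech", "total": 42.0}]),
])

print([r for r in layer.select(lambda r: r["total"] > 50)])
```

The design point is that swapping or adding a backend only means writing another adapter; consumers of the layer, and the queries they issue, are unaffected.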

There are obvious benefits to implementing a data virtualization solution that provides data availability:

Data services provisioning

Data virtualization promotes big data access in the API economy. Any data source can be made accessible in a different format or protocol than the original, with controlled access in a matter of minutes. This enables businesses to more easily and securely extend themselves further into what is being referred to as the “data economy.”
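As an illustration of provisioning, the hypothetical virtual layer from the earlier sketch can be republished as a JSON-over-HTTP data service in a few lines, giving consumers a format and protocol the backends never spoke natively. A production service would of course add authentication and access controls.

```python
# Illustrative only: expose the (hypothetical) virtual layer from the
# sketch above as a JSON-over-HTTP data service.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataServiceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Delegate to the virtual layer at request time -- no copy is made.
        payload = json.dumps(list(layer.select())).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


# HTTPServer(("localhost", 8080), DataServiceHandler).serve_forever()
```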

Unified data governance and security

A single virtual data layer more easily exposes redundancy and quality issues. Data virtualization can also impose data model governance and security, providing consistency in integration and data quality rules.
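As a rough sketch of what governance at the virtual layer can look like, the hypothetical example below applies a shared quality rule (drop incomplete rows) and a security rule (mask customer identifiers) in one place, so every consumer sees consistently governed data regardless of which silo it came from. The rules and field names are invented for illustration.

```python
# Governance applied once at the virtual layer instead of per silo.
def governed(rows, required=("customer", "total")):
    for row in rows:
        # Quality rule: drop rows missing required fields.
        if any(row.get(field) is None for field in required):
            continue
        # Security rule: mask the customer identifier for consumers.
        yield {**row, "customer": row["customer"][:2] + "***"}


print(list(governed(layer.select())))
```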

Decreased cost/improved ROI

With a unified interface that does not require expanses of infrastructure to house multiple copies of an ever-growing data set, IT costs are dramatically reduced. An interface built on standard, vendor-neutral skills means more of your organization can effectively leverage this wealth of information and extract benefit from it with minimal specialized IT staff or knowledge.

As has been the case with so many successful core enterprise technologies, the end state is a simplified, standards-based interface and platform, reached either by design or organically through open source development and use. At Gemini Data, our platform exists because big data is at a tipping point: businesses and technology providers alike would greatly benefit from a virtual, standards-based layer that simplifies access to myriad data sources, exposing larger sets of data to larger audiences through data availability.