
Data Virtualization


Data Virtualization is a software approach that allows organizations to access and manipulate data from multiple sources as if it were a single, unified data source. It enables users to work with data in real time without physically moving or replicating it, which reduces data redundancy and improves data consistency. A virtual layer integrates data from the various sources and exposes it through a unified interface, giving users seamless access to the data they need. Data virtualization thereby streamlines data management and analysis, which ultimately leads to better decision-making and improved business outcomes.

Data virtualization presents a modern approach to data integration. Unlike ETL solutions, which replicate data, data virtualization leaves the data in the source systems and simply exposes an integrated view of all the data to data consumers. As business users drill down into reports, data virtualization fetches the data in real time from the underlying source systems. For many use cases, connecting to data in place proves more practical than collecting and copying it.


== Capabilities Needed in a Data Virtualization System[1] ==

Four components are needed to meet urgent business needs with data virtualization:

  • Agile design and development: You need to be able to introspect available data, discover hidden relationships, model individual views/services, validate views/services, and modify as required. These capabilities automate difficult work, improve time to solution, and increase object reuse.
  • High-performance runtime: When the application invokes a request, an optimized query executes as a single statement and the result is delivered in the proper form. This capability allows for up-to-the-minute data, optimized performance, and less replication.
  • Use of caching when appropriate: With essential data cached, the application invokes a request, an optimized query (leveraging the cached data) executes, and the result is delivered in the proper form; a minimal caching sketch follows this list. This capability boosts performance, avoids network constraints, and allows 24x7 availability.
  • Business directory/catalog to make data easy to find: This capability includes features for search and data categorization, browsing all available data, selecting from a directory of views, and collaborating with IT to improve data quality and usefulness. This capability empowers business users with more data, improves IT/business user effectiveness, and enables data virtualization to be more widely adopted.
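
The caching capability described above can be illustrated with a minimal sketch in Python. The CachedQuery class, the fetch_from_source stand-in, and the 60-second TTL are illustrative assumptions rather than the API of any particular data virtualization product.

<syntaxhighlight lang="python">
import time

class CachedQuery:
    """Minimal sketch of result caching in a virtual data layer."""

    def __init__(self, fetch_fn, ttl_seconds=60):
        self.fetch_fn = fetch_fn   # callable that queries the underlying source system
        self.ttl = ttl_seconds     # how long a cached result stays valid
        self._cache = {}           # query text -> (timestamp, result)

    def run(self, query):
        now = time.time()
        hit = self._cache.get(query)
        if hit and now - hit[0] < self.ttl:
            return hit[1]                  # served from cache: no round trip to the source
        result = self.fetch_fn(query)      # cache miss: go to the underlying source
        self._cache[query] = (now, result)
        return result

# Hypothetical source query function used only for demonstration.
def fetch_from_source(query):
    return [("row-1",), ("row-2",)]

cached = CachedQuery(fetch_from_source, ttl_seconds=60)
print(cached.run("SELECT * FROM sales"))  # first call hits the source
print(cached.run("SELECT * FROM sales"))  # second call is answered from the cache
</syntaxhighlight>

In a real platform the cache would typically be managed by the virtualization engine itself, with refresh policies configured per view rather than hard-coded in application code.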


== The Uses of Data Virtualization[2] ==

With real-time access to holistic information, businesses across many industries can efficiently execute complex processes.

  • Analyze current business performance against previous years.
  • Comply with regulations that require the traceability of historical data.
  • Search for and discover interrelated data.
  • Modernize business applications while replacing legacy systems.
  • Migrate from on-premises applications to cloud applications.
  • Monetize data by delivering it as a service.


== How Data Virtualization Works: Main Architecture Layers ==

In a nutshell, data virtualization happens via middleware: a unified, virtual data access layer built on top of many data sources. This layer presents information, regardless of its type and model, as a single virtual (or logical) view, on demand and in real time. Let's elaborate on the data virtualization architecture to get the complete picture of how things work. Typically, three building blocks comprise the virtualization structure:

  • Connection layer — a set of connectors to data sources in real time;
  • Abstraction layer — services to present, manage, and use logical views of the data; and
  • Consumption layer — a range of consuming tools and applications requesting abstract data.
  • Connection layer: This layer is responsible for accessing the information scattered across multiple source systems, containing both structured and unstructured data, with the help of connectors and communication protocols. Data virtualization platforms can link to different data sources including:
    • SQL and NoSQL databases like MySQL, Oracle, and MongoDB;
    • CRMs and ERPs;
    • cloud data warehouses like Amazon Redshift or Google BigQuery;
    • data lakes and enterprise data warehouses;
    • streaming sources like IoT and IoMT devices;
    • SaaS (Software-as-a-Service) applications like Salesforce;
    • social media platforms and websites;
    • spreadsheets and flat files like CSV, JSON, and XML;
    • big data systems like Hadoop, and many more.

When connecting, data virtualization loads metadata (details of the source data) and physical views if available. It maps metadata and semantically similar data assets from different autonomous databases to a common virtual data model or schema of the abstraction layer. Mappings define how the information from each source system should be converted and reformatted for integration needs.
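
As a rough, self-contained illustration of what the connection layer does, the sketch below uses two in-memory SQLite databases as stand-ins for independent source systems, introspects their metadata, and maps the discovered tables and columns onto a common virtual schema. The source names (crm, erp) and table definitions are hypothetical examples, not part of any particular product.

<syntaxhighlight lang="python">
import sqlite3

# Two in-memory SQLite databases stand in for independent source systems;
# a real deployment would use connectors for MySQL, Oracle, MongoDB, etc.
crm = sqlite3.connect(":memory:")
erp = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id INTEGER, full_name TEXT)")
erp.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL)")

# Introspect each source and map its tables and columns onto a common
# virtual schema, mimicking the metadata-loading step of the connection layer.
virtual_schema = {}
for source_name, conn in {"crm": crm, "erp": erp}.items():
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        columns = [col[1] for col in conn.execute(f"PRAGMA table_info({table})")]
        virtual_schema[f"{source_name}.{table}"] = columns

print(virtual_schema)
# {'crm.customers': ['customer_id', 'full_name'],
#  'erp.orders': ['order_id', 'cust_id', 'amount']}
</syntaxhighlight>

A production connector would additionally capture data types, keys, and physical views, and keep this metadata synchronized as the source systems evolve.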

  • Abstraction layer: The cornerstone of the whole virtualization framework is the abstraction layer (sometimes referred to as the virtual or semantic layer), which acts as the bridge between all data sources on one side and all business users on the other. This tier itself doesn't store any data: it only contains the logical views and metadata needed to access the sources. With the abstraction layer, end users see only the schematic data models, while the complexities of the underlying data structures are hidden from them. Once the data attributes are pulled in, the abstraction layer lets you apply joins, business rules, and other data transformations to create logical views on top of the physical views and metadata delivered by the connection layer (a minimal logical-view sketch follows this list). Depending on the data virtualization tool, these integration processes can be modeled with a drag-and-drop interface or a query language such as SQL, and there are usually prebuilt templates and components for the modeling, matching, converting, and integration jobs. The essential components of the virtual layer are:
    • Metadata management — to import, document, and maintain metadata attributes such as column names, table structure, tags, etc.
    • Dynamic data catalog — to organize data by profiling, tagging, classifying, and mapping it to business definitions so that end-users can easily find what they need.
    • Query optimization — to improve query processing performance by caching virtual entities, enabling automatic joins, and supporting push-down querying (pushing down request operation to the source database).
    • Data quality control — to ensure that all information is correct by applying data validation logic.
    • Data security and governance — to provide different security levels to admins, developers, and consumer groups as well as define clear data governance rules, removing barriers for information sharing.
  • Consumption layer: The final tier of the data virtualization architecture provides a single point of access to the data kept in the underlying sources. The delivery of abstracted data views happens through various protocols and connectors depending on the type of consumer. Consumers may communicate with the virtual layer via SQL and all sorts of APIs, including access standards like JDBC and ODBC, REST and SOAP APIs, and many others. Most data virtualization software enables access for a wide range of business users, tools, and applications, including such popular solutions as Tableau, Cognos, and Power BI.
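
To make the abstraction layer more concrete (the logical-view sketch referenced above), here is a minimal, self-contained example of a logical view that joins customer and order data held in two separate sources. The country filter is pushed down to the source that owns that column, and only the final join happens in the virtual layer. The CustomerOrdersView class, table names, and sample data are illustrative assumptions.

<syntaxhighlight lang="python">
import sqlite3

# Two stand-in source systems with sample data.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id INTEGER, full_name TEXT, country TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Ada Lovelace", "UK"), (2, "Alan Turing", "UK"), (3, "Grace Hopper", "US")])

erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL)")
erp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 120.0), (11, 3, 75.5), (12, 1, 30.0)])

class CustomerOrdersView:
    """Illustrative logical view joining two sources; stores no data itself."""

    def query(self, country=None):
        # Push the country filter down to the CRM source instead of
        # fetching everything and filtering in the virtual layer.
        if country is not None:
            customers = crm.execute(
                "SELECT customer_id, full_name FROM customers WHERE country = ?",
                (country,)).fetchall()
        else:
            customers = crm.execute(
                "SELECT customer_id, full_name FROM customers").fetchall()

        # Fetch only the orders for the matching customers from the ERP source.
        ids = [cid for cid, _ in customers] or [-1]
        placeholders = ",".join("?" * len(ids))
        orders = erp.execute(
            f"SELECT cust_id, order_id, amount FROM orders WHERE cust_id IN ({placeholders})",
            ids).fetchall()

        # Join the two result sets in the virtual layer.
        names = dict(customers)
        return [(names[cid], oid, amount) for cid, oid, amount in orders]

view = CustomerOrdersView()
print(view.query(country="UK"))  # [('Ada Lovelace', 10, 120.0), ('Ada Lovelace', 12, 30.0)]
</syntaxhighlight>

Note that the view stores nothing: each call to query() fetches fresh data from the sources, which is exactly the behavior the abstraction layer is meant to provide.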

As mentioned above, such a structure provides self-service capabilities for all consumers. Instead of making queries directly, they interact with the software used in their day-to-day operations, and that software, in turn, interacts with the virtual layer to get the required data. In this way, consumers don't need to worry about the format of the data or its location, as these complexities are masked from them.
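
As a final sketch, the example below exposes such a logical view through a small REST endpoint so that consuming tools can fetch abstracted data over HTTP without knowing where or how the underlying data is stored. Flask is used here purely as a convenient example web framework, and get_customer_orders is a hypothetical stand-in for a query against the virtual layer.

<syntaxhighlight lang="python">
from flask import Flask, jsonify

app = Flask(__name__)

def get_customer_orders(country):
    # Stand-in for a query against the virtual layer (for example, the
    # logical view sketched in the previous example).
    return [{"customer": "Ada Lovelace", "order_id": 10, "amount": 120.0}]

@app.route("/views/customer_orders/<country>")
def customer_orders(country):
    # Consumers such as BI tools or custom apps call this endpoint; the
    # location and format of the source data remain hidden from them.
    return jsonify(get_customer_orders(country))

if __name__ == "__main__":
    app.run(port=8080)
</syntaxhighlight>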


== Data Virtualization vs. Data Warehousing[3] ==

Some enterprise landscapes are filled with disparate data sources, including multiple data warehouses, data marts, and/or data lakes, even though a data warehouse, if implemented correctly, should serve as the single source of truth. Data virtualization can efficiently bridge data across data warehouses, data marts, and data lakes without having to create a whole new integrated physical data platform. The existing data infrastructure can continue performing its core functions while the data virtualization layer simply leverages the data from those sources. This aspect of data virtualization makes it complementary to all existing data sources and increases the availability and usage of enterprise data.[citation needed]

Data virtualization may also be considered an alternative to ETL and data warehousing, but for performance reasons it is generally not recommended for a very large data warehouse. Data virtualization is inherently aimed at producing quick and timely insights from multiple sources without having to embark on a major data project with extensive ETL and data storage. However, data virtualization may be extended and adapted to serve data warehousing requirements as well. This requires an understanding of the data storage and history requirements, along with planning and design to incorporate the right type of data virtualization, integration, and storage strategies and infrastructure/performance optimizations (e.g., streaming, in-memory, hybrid storage).


== See Also ==


== References ==