Actions

Data Warehouse

A data warehouse is a central repository of information that can be analyzed to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications.[1]


In a data warehouse, data from many heterogeneous sources is extracted into a single area, transformed according to the decision support system needs, and stored in the warehouse. For example, a company stores information pertaining to its employees, their salaries, developed products, customer information, sales, and invoices. The CEO might want to ask a question pertaining to the latest cost-reduction measures; the answers will involve an analysis of all of this data. This is a main service of the data warehouse, i.e., allowing executives to reach business decisions based on all these disparate raw data items. Thus, a data warehouse contributes to future decision-making. As in the above example, a firm administrator can query warehouse data to find out the market demand of a particular product, sales data by geographical region, or answer other inquiries. This provides insight into the required steps to market a particular product. Unlike an operational data store, a data warehouse contains aggregate historical data, which may be analyzed to reach critical business decisions. Despite associated costs and effort, most major corporations today use data warehouses.[2]


In the broadest sense, therefore, the term data warehouse is used to refer to a database that contains very large stores of historical data. The data is stored as a series of snapshots, in which each record represents data at a specific time. By analyzing these snapshots you can make comparisons between different time periods. You can then use these comparisons to help make important business decisions. Data warehouse databases are optimized for data retrieval. The duplication or grouping of data, referred to as database denormalization, increases query performance and is a natural outcome of the dimensional design of the data warehouse. By contrast, traditional online transaction processing (OLTP) databases automate day-to-day transactional operations. OLTP databases are optimized for data storage and strive to eliminate data duplication. Databases that achieve this goal are referred to as normalized databases. An enterprise data warehouse (EDW) is a data warehouse that services the entire enterprise. An enterprise data warehousing environment can consist of an EDW, an operational data store (ODS), and physical and virtual data marts. A data warehouse can be implemented in several different ways. You can use a single data management system for both transaction processing and business analytics. Or, depending on your system workload requirements, you can build a data warehousing environment that is separate from your transactional processing environment.[3]


Data Warehouse
source: Intellipat


Data Warehouse - Historical Background[4]
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that was tailored for ready access by users. Key developments in early years of data warehousing were:

  • 1960s – General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
  • 1970s – ACNielsen and IRI provide dimensional data marts for retail sales.
  • 1970s – Bill Inmon begins to define and discuss the term: Data Warehouse.
  • 1975 – Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports) is a database management and reporting system that includes the world's first 4GL. First platform designed for building Information Centers (a forerunner of contemporary data warehouse technology)
  • 1983 – Teradata introduced the DBC/1012 database computer specifically designed for decision support.
  • 1984 – Metaphor Computer Systems, founded by David Liddle and Don Massaro, released a hardware/software package and GUI for business users to create a database management and analytic system.
  • 1985 - Sperry Corporation publish an article (Martyn Jones and Philip Newman) on information centres, where they introduce the term MAPPER data warehouse in the context of information centres.
  • 1988 – Barry Devlin and Paul Murphy publish the article An architecture for a business and information system where they introduce the term "business data warehouse".
  • 1990 – Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
  • 1991 – Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
  • 1992 – Bill Inmon publishes the book Building the Data Warehouse.
  • 1995 – The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
  • 1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.
  • 2000 – Dan Linstedt release in the public domain the Data vault modeling conceived in 1990 as an alternative to Inmon and Kimball to provide long-term historical storage of data coming in from multiple operational systems, with emphasis on tracing, auditing and resiliance to change of the source data model.
  • 2012 – Bill Inmon developed and made public technology known as "textual disambiguation". Textual disambiguation applies context to raw text and reformats the raw text and context into a standard data base format. Once raw text is passed through textual disambiguation, it can easily and efficiently be accessed and analyzed by standard business intelligence technology. Textual disambiguation is accomplished through the execution of textual ETL. Textual disambiguation is useful wherever raw text is found, such as in documents, Hadoop, email, and so forth.


The Importance of Data Warehouse[5]
Many global corporations have turned to data warehousing to organize data that streams in from corporate branches and operations centers around the world. Cloud-based technology has revolutionized the business world, allowing companies to easily retrieve and store valuable data about their customers, products and employees. This data is used to inform important business decisions. Data warehousing is an increasingly important business intelligence tool, allowing organizations to:

  • Ensure Consistency: Data warehouses are programmed to apply a uniform format to all collected data, which makes it easier for corporate decision-makers to analyze and share data insights with their colleagues around the globe. Standardizing data from different sources also reduces the risk of error in interpretation and improves overall accuracy.
  • Make Better Business Decisions: Successful business leaders develop data-driven strategies and rarely make decisions without consulting the facts. Data warehousing improves the speed and efficiency of accessing different data sets and makes it easier for corporate decision-makers to derive insights that will guide the business and marketing strategies that set them apart from their competitors.
  • Improve Bottom Line: Data warehouse platforms allow business leaders to quickly access their organization's historical activities and evaluate initiatives that have been successful — or unsuccessful — in the past. This allows executives to see where they can adjust their strategy to decrease costs, maximize efficiency and increase sales to improve their bottom line.


Data Warehouse Design Methods[6]
In addition to Inmon's top-down approach to data warehouses and Kimball's bottom-up method, some organizations have also adopted hybrid options.

  • Top-down Approach: Inmon's method calls for building the data warehouse first. Data is extracted from operational and possibly third-party external systems and may be validated in a staging area before being integrated into a normalized data model. Data marts are created from the data stored in the data warehouse.
  • Bottom-up Method: Kimball's data warehousing architecture calls for dimensional data marts to be created first. Data is extracted from operational systems, moved to a staging area and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables. The data is then processed and loaded into data marts, each of which focuses on a specific business process. Data marts are integrated using a data warehouse bus architecture to form an enterprise data warehouse.
  • Hybrid Method: Hybrid approaches to data warehouse design include aspects from both the top-down and bottom-up methods. Organizations often seek to combine the speed of the bottom-up approach with the integration achieved in a top-down design.


Key Features of Data Warehouse (DW)[7]

  • Subject-Oriented: Data warehousing gives you an option of building your warehouse including the data as and what you want to extract and analyze. Thus, a subject matter expert can answer relevant questions from the da For example, a sales executive for an online website can develop a subject-oriented database including the data fields he wants to query. The salesperson can then excerpt data writing different queries like, “How many customers purchased Database Architect Course today?”
  • Integrated: Similar to the concept of subject-orientation, data warehouses supports consistency by arranging data from diverse sources in a uniform and rational forma It should not allow any conflicts in field names and other units of measure. Having accomplished this steadiness, we can refer it to be an integrated data warehouse.
  • Nonvolatile: As the name suggests, nonvolatile DW refers to the no data change once created. It is an important and relevant feature since the aim of developing a warehouse if to evaluate what has occurred until then.
  • Time-variant: DW believes in adopting and adapting to the changing trends and thus, allows inclusion of novel business patterns and also identifies what’s trending in business relationships, involving large volumes of data.


Data Warehouse Models[8]
From the perspective of data warehouse architecture, we have the following data warehouse models −

  • Virtual Warehouse: The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
  • Data Mart: Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects. Points to remember about data marts −
    • Window-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
    • The implementation data mart cycles is measured in short periods of time, i.e., in weeks rather than months or years.
    • The life cycle of a data mart may be complex in long run, if its planning and design are not organization-wide.
    • Data marts are small in size.
    • Data marts are customized by department.
    • The source of a data mart is departmentally structured data warehouse.
    • Data mart are flexible.
  • Enterprise Warehouse
    • An enterprise warehouse collects all the information and the subjects spanning an entire organization
    • It provides us enterprise-wide data integration.
    • The data is integrated from operational systems and external information providers.
    • This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.


Data Warehouse Vs. Other Storage Systems

Data Warehouse Vs. Other Storage Systems
source: SAS


Benefits of Data Warehousing[9]
Fundamentally, a data warehouse helps solve the on-going problem of pulling data out of transactional systems quickly and efficiently and converting that data into actionable information. Additionally, the data warehouse allows for processing of large and complex queries in a highly-efficient manner. Upon succesful implementation of a data warehouse or data mart, business will realize numerous improvements and positive gains. Benefits from a successful implementation of a data warehouse include: • Enhanced Business Intelligence: Insights will be gained through improved information access. Managers and executives will be freed from making their decisions based on limited data and their own “gut feelings”. Decisions that affect the strategy and operations of organizations will be based upon credible facts and will be backed up with evidence and actual organizational data. Moreover, decision makers will be better informed as they will be able to query actual data will retrieve information based upon their personal needs. In addition, data warehouses and related business intelligence can be used to can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales. • Increased Query and System Performance: Data warehouses are purposely designed and constructed with a focus on speed of data retrieval and analysis. Moreover, a data warehouse is designed for storing large volumes of data and being able to rapidly query the data. These analtical systems are constructed differently from operational systems which focus on creation and modification of data. In contrast, the data warehouse is built for analysis and retrieval of data rather than efficient upkeep of invidual records (i.e. transactions). Further, the data warehouse allows for a large system burden to be taken off the operational environment and effectively distributes system load across an entire organization’s technology infrastructure • Business Intelligence from Multiple Sources: For many organizations, enterprise information systems are comprised of multiple subsystems, physically separated and built on different platforms. Moreover, merging data from multiple disparate data sources is a common need when conducting business intelligence. To solve this problem, the data warehouse performs integration of existing disparate data sources and makes them accessible in one place. Consolidating enterprise data into a single data repository alleviates the burden of duplicating data gathering efforts, and enables the extraction of information that would otherwise be impossible. Additionally, the data warehouse becomes the “single view of truth” for the enterprise rather than the multiple truths that can come from reporting on individual subsystems. • Timely Access to Data: The data warehouse enables business users and decision makers to have access to data from many different sources as they need to have access to the data. Additionally, business users will spend little time in the data retrieval process. Scheduled data integration routines, known as ETL, are leveraged within a data warehouse environment. These routines consolidate data from multiple source systems and transform the data into a useful format. Subsequently, business users can then easily access data from one interface. Further, consumers of data will be able to query data directly with less information technology support. The wait time for information technology professionals to develop reports and queries is greatly diminished as the business users are given the ability to generate reports and queries on their own. The use of query and analysis tools against a consistent and consolidated data repository enables business users to spend more time performing data analysis and less time gathering data. • Enhanced Data Quality and Consistency: A data warehouse implementation typically includes the conversion of data from numerous source systems and data files and transformation of the disparate data into a common format. Data from the various business units and departments is standardized and the inconsistent nature of data from the unique source systems is removed. Moreover, individual business units and departments including sales, marketing, finance, and operations, will start to utilize the same data repository as the source system for their individual queries and reports. Thus each of these individual business units and departments will produce results that are consistent with the other business units within the organization. Subsequently the overall confidence in the organization’s data is substantially increased. • Historical Intelligence: Data warehouses generally contain many years worth of data that can neither be stored within nor reported from a transactional system. Typically transactional systems satisfy most operating reporting requirements for a given time-period but without the inclusion of historical data. In contrast, the data warehouse stores large amounts of historical data and can enable advanced business intelligence including time-period analysis, trend analysis, and trend prediction. The advantage of the data warehouse is that it allows for advanced reporting and analysis of multiple time-periods. • High Return on Investment: Return on investment (ROI) refers to the amount of increased revenue or decreased expenses a business will be able to realize from any project or investment of capital. Subsequently, implementations of data warehouses and complementary business intelligence systems have enabled business to generate higher amounts of revenue and provide substantial cost savings. According to a 2002 International Data Corporation (IDC) study “The Financial Impact of Business Analytics”, analytics projects have been achieving a substantial impact on a business’ financial status. Additionally, the study found that business analytics implementations have generated a median five-year return on investment of 112% with a mean payback of 1.6 years. Of the businesses included in the study, 54% have had a return on investment of 101% or more.


Challenges of Data Warehousing[10]
The problems associated with developing and managing a data warehousing are as follows:

  • Underestimation of resources of data loading: Some times we underestimate the time required to extract, clean, and load the data into the warehouse. It may take the significant proportion of the total development time, although some tools are there which are used to reduce the time and effort spent on this process.
  • Hidden problems with source systems: Some times hidden .problems associated with the source systems feeding the data warehouse may be identified after years of being undetected. For example, when entering the details of a new property, certain fields may allow nulls which may result in staff entering incomplete property data, even when available and applicable.
  • Required data not captured: In some cases the required data is not captured by the source systems which may be very important for the data warehouse purpose. For example the date of registration for the property may be not used in source system but it may be very important analysis purpose.
  • Increased end-user demands: After satisfying some of end-users queries, requests for support from staff may increase rather than decrease. This is caused by an increasing awareness of the users on the capabilities and value of the data warehouse. Another reason for increasing demands is that once a data warehouse is online, it is often the case that the number of users and queries increase together with requests for answers to more and more complex queries.
  • Data homogenization: The concept of data warehouse deals with similarity of data formats between different data sources. Thus, results in to lose of some important value of the data.
  • High demand for resources: The data warehouse requires large amounts of data.
  • Data ownership: Data warehousing may change the attitude of end-users to the ownership of data. Sensitive data that owned by one department has to be loaded in data warehouse for decision making purpose. But some time it results in to reluctance of that department because it may hesitate to share it with others.
  • High maintenance: Data warehouses are high maintenance systems. Any reorganization· of the business processes and the source systems may affect the data warehouse and it results high maintenance cost.
  • Long-duration projects: The building of a warehouse can take up to three years, which is why some organizations are reluctant in investigating in to data warehouse. Some only the historical data of a particular department is captured in the data warehouse resulting data marts. Data marts support only the requirements of a particular department and limited the functionality to that department or area only.
  • Complexity of integration: The most important area for the management of a data warehouse is the integration capabilities. An organization must spend a significant amount of time determining how well the various different data warehousing tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the data warehouse.


See Also

A Data Warehouse is a centralized repository that stores historical data from various sources within an organization, making it available for analysis and reporting. This system is designed to facilitate business intelligence activities, supporting decision-making processes by providing a unified, consistent view of data across the organization.

  • ETL (Extract, Transform, Load): Discussing the process used to gather data from multiple sources, transform it into a format suitable for analysis, and load it into the data warehouse.
  • Business Intelligence (BI): Covering the strategies and technologies used by enterprises for data analysis of business information, where data warehouses serve as the foundational infrastructure.
  • Data Lake: Exploring another data storage strategy, focusing on storing raw data in its native format. Comparison with data warehouses can highlight differences in data structure, processing, and analysis.
  • Data Modeling: Discussing the process of creating a data model for the data to be stored in a data warehouse, including dimensional modeling, which is commonly used in data warehouse design.
  • Online Analytical Processing (OLAP): Covering technologies that allow users to analyze data from multiple database systems at the same time, often used in conjunction with data warehouses for complex analytical queries.
  • Data Mart: A subset of a data warehouse focused on a specific business line or team. Links to this topic can illustrate how data marts and data warehouses are used together within an organization's data strategy.
  • Big Data Technologies: Discussing the tools and methodologies for processing and analyzing large datasets, such as Hadoop and Spark, and their relation to data warehousing solutions.
  • Cloud Data Warehousing: Exploring modern data warehouse solutions offered as cloud services, such as Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics, highlighting the benefits of scalability, flexibility, and cost-efficiency.
  • Data Governance: Covering the overall management of data availability, usability, integrity, and security in enterprise systems, a crucial aspect of managing a data warehouse.
  • Master Data Management (MDM): Discussing the process of creating a single source of truth for all critical business data, which is essential for ensuring the accuracy and consistency of data in a data warehouse.
  • Data Security: Highlighting the importance of securing data within a data warehouse, including access controls, encryption, and compliance with data protection regulations.
  • Data Warehouse Automation: Exploring tools and practices that reduce the time and effort required to maintain and update data warehouse systems, enabling faster delivery of data insights.
  • Data Cleansing
  • Data Management
  • Predictive Analytics
  • Customer Data Management (CDM)
  • Data Analysis
  • Data Analytics
  • Hybrid Online Analytical Processing (HOLAP)


References

  1. Defining Data Warehouse AWS
  2. What is Data Warehousing? Techopedia
  3. Understanding Data Warehouse IBM Knowledge Center
  4. The History of Data Warehouse Wikipedia
  5. Why is Data Warehousing Important? Herzing University
  6. Data Warehouse Design Methods Techtarget
  7. What are the Key Features of Data Warehouse (DW)? Intellipaat
  8. Data Warehouse Models Tutorials Point
  9. What are the Benefits of Data Warehousing? BI-Insider
  10. Problems of Data Warehousing ecomputernotes


Further Reading