Data Architecture

What is Data Architecture?

Data Architecture is a set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated within an organization and its database systems. It provides a formal approach to creating and managing the flow of data and how it is processed across an organization’s IT systems and applications.^[1] Businesses Use Data Architecture To:

Strategically prepare organizations to evolve quickly and to take advantage of business opportunities inherent in emerging technologies
Translate business needs into data and system requirements
Facilitate alignment of IT and business systems
Manage complex data and information delivery throughout the enterprise. Acts as agents for change transformation and agility
Depict the flow of information between people (users) and business processes

Overview of Data Architecture^[2]

A data architecture should set data standards for all its data systems as a vision or a model of the eventual interactions between those data systems. Data integration, for example, should be dependent upon data architecture standards since data integration requires data interactions between two or more data systems. A data architecture, in part, describes the data structures used by a business and its computer applications software. Data architectures address data in storage, data in use, and data in motion; descriptions of data stores, data groups, and data items; and mappings of those data artifacts to data qualities, applications, locations, etc.

Essential to realizing the target state, Data Architecture describes how data is processed, stored, and utilized in an information system. It provides criteria for data processing operations so as to make it possible to design data flows and also control the flow of data in the system.

The data architect is typically responsible for defining the target state, aligning during development, and then following up to ensure enhancements are done in the spirit of the original blueprint.

During the definition of the target state, the Data Architecture breaks a subject down to the atomic level and then builds it back up to the desired form. The data architect breaks the subject down by going through 3 traditional architectural processes:

Conceptual - represents all business entities.
Logical - represents the logic of how entities are related.
Physical - the realization of the data mechanisms for a specific type of functionality.

Data architecture should be defined in the planning phase of the design of a new data processing and storage system. The major types and sources of data necessary to support an enterprise should be identified in a manner that is complete, consistent, and understandable. The primary requirement at this stage is to define all of the relevant data entities, not to specify computer hardware items. A data entity is any real or abstracted thing about which an organization or individual wishes to store data.

Components in Data Architecture^[3]

“Data Lake”, “Data Warehouse”, and “Data Mart” are typical components in the architecture of a data platform. In this order, data produced in the business is processed and set to create another data implication.

The three components take responsibility for three different functionalities as such:

Data Lake: holds an original copy of data produced in the business. Data processing from the original should be minimal if any; otherwise, in case some data processing turned out to be wrong in the end, it will not be possible to fix the error retrospectively.
Data Warehouse: holds data processed and structured by a managed data model, reflecting the global (not specific) direction of the final use of the data. In many cases, the data is in tabular format.
Data Mart: holds a subpart and/or aggregated data set for the use of a particular business function, e.g. specific business unit or specific geographical area. A typical example is when we prepare the summary of KPIs for a specific business line followed by visualization in BI tool. Especially, preparing this kind of separate and independent component after the warehouse is worthwhile when the user wants the data mart regularly and frequently updated. On contrary, this portion can be skipped in cases the user only wants some set of data for ad hoc analysis done only once.

Roughly speaking, data engineers cover from data extraction produced in business to the data lake and data model building in the data warehouse as well as establishing the ETL pipeline; while data scientists cover from data extraction out of the data warehouse, building data mart, and to lead to further business application and value creation.

Of course, this role assignment between data engineers and data scientists is somewhat ideal and many companies do not hire both just to fit this definition. Actually, their job descriptions tend to overlap.

Data Architecture Principles

According to Joshua Klahr, vice president of product management, and core products, at Splunk, and former vice president of product management at AtScale, six principles form the foundation of modern data architecture:

Data is a shared asset. A modern data architecture needs to eliminate departmental data silos and give all stakeholders a complete view of the company.
Users require adequate access to data. Beyond breaking down silos, modern data architectures need to provide interfaces that make it easy for users to consume data using tools fit for their jobs.
Security is essential. Modern data architectures must be designed for security and they must support data policies and access controls directly on the raw data.
Common vocabularies ensure common understanding. Shared data assets, such as product catalogs, fiscal calendar dimensions, and KPI definitions, require a common vocabulary to help avoid disputes during analysis.
Data should be curated. Invest in core functions that perform data curation (modeling important relationships, cleansing raw data, and curating key dimensions and measures).
Data flows should be optimized for agility. Reduce the number of times data must be moved to reduce cost, increase data freshness, and optimize enterprise agility.^[4]

Data Architecture Frameworks

There are multiple enterprise architecture frameworks that are used as the foundation for building the data architecture framework of an organization.

DAMA-DMBOK 2: This refers to DAMA International's Data Management Body of Knowledge – a framework designed specifically for data management. It includes standard definitions of data management terminology, functions, deliverables, and roles, and also presents guidelines on data management principles.
Zachman Framework for Enterprise Architecture: John Zachman created this enterprise ontology at IBM during the 1980s. The 'data' column of this framework includes multiple layers like key architectural standards for the business, a semantic model or conceptual/enterprise data model, an enterprise or logical data model, a physical data model, and actual databases.
The Open Group Architecture Framework (TOGAF): TOGAF is the most used enterprise architecture methodology that offers a framework for designing, planning, implementing, and managing data architecture best practices. It helps define business goals and align them with architectural objectives.^[5]

Characteristics of Effective Data Architecture

Data architecture is “modern” if it's built around certain characteristics:

User-driven: In the past, data was static and access was limited to a select group of IT experts and data scientists. Decision makers didn't necessarily get what they wanted or needed — only what was available. In modern data architecture, business users can confidently define the requirements, because data architects can pool data into a data lake or data warehouse and create solutions to access it in ways that meet business objectives.
Built on shared data: Effective enterprise data architecture is built on data structures that encourage collaboration. Good data architecture eliminates silos by combining data from all parts of the organization, along with external sources as needed, into one place to eliminate competing versions of the same data. In this environment, data is not bartered among business units or hoarded, but is seen as a shared, companywide asset.
Automated: Automation removes the friction that made legacy data systems tedious to configure. Data processing functions that took months to build can now be completed in hours or days using cloud-based tools. If a user wants access to different data, automation enables the architect to quickly design data pipelines to deliver it. As new data is sourced, data architects can quickly integrate it into the architecture.
Driven by AI: Smart data architecture takes automation to a new level, using machine learning (ML) and artificial intelligence (AI) to adjust, alert, and recommend solutions to new conditions. ML and AI can identify data types, identify and fix data quality errors, create structures for incoming data, identify relationships for fresh insights, and recommend related datasets and analytics.
Elastic: Elasticity allows companies to scale up or down as needed. Here, the cloud is your best friend, as it allows on-demand scalability quickly and affordably. Elasticity allows administrators and data analysts to focus on troubleshooting and problem-solving rather than on exacting capacity calibration or overbuying hardware to keep up with demand.
Simple: Simplicity trumps complexity when considering efficient data architecture. Do you need a show dog or a workhorse? Strive for simplicity in data movement, data platforms, data assembly frameworks, and analytic platforms.
Secure: Security is built into modern data architecture, restricting data access and ensuring that sensitive data is available on a need-to-know basis as defined by the business. Good data architecture also recognizes existing and emerging threats to data security while ensuring regulatory compliance with legislation such as HIPAA and GDPR.^[6]

The Risks of Weak Data Architecture?

Rapid technological advances and changes in business needs can undermine effective data architectures. When we add new lines and shapes to a data architecture’s data flow, it becomes very difficult to keep them updated and consistent. This can lead to duplicate data processes, high costs, and increased maintenance time for businesses. Organizations should strive to balance the complexity of data architectures, business needs, and the new tools/platforms for data management by carefully adopting new technologies that align with their vision and goals.^[7]

References

[1] Defining Data Architecture

[2] An Overview of Data Architecture

[3] Three Components in Data Architecture

[4] The Principles of Data Architecture

[5] Data Architecture Frameworks

[6] Characteristics of Effective Data Architecture

[7] What Are the Risks of Weak Data Architecture?

[1]

[2]

[3]

[4]

[5]

[6]

[7]