Data Architecture

Data Architecture is a set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated within an organization and its database systems. It provides a formal approach to creating and managing the flow of data and how it is processed across an organization’s IT systems and applications.[1]

Overview of Data Architecture[2]
A data architecture should set data standards for all of an organization's data systems, serving as a vision or model of the eventual interactions between those systems. Data integration, for example, should depend on data architecture standards, since it requires interactions between two or more data systems. A data architecture also describes, in part, the data structures used by a business and its application software. Data architectures address data in storage, data in use, and data in motion; descriptions of data stores, data groups, and data items; and mappings of those data artifacts to data qualities, applications, locations, and so on.

Essential to realizing the target state, data architecture describes how data is processed, stored, and used in an information system. It provides criteria for data processing operations, making it possible to design data flows and to control the flow of data through the system.

The data architect is typically responsible for defining the target state, keeping development aligned with it, and then following up to ensure that enhancements are made in the spirit of the original blueprint.

During the definition of the target state, the data architecture breaks a subject down to the atomic level and then builds it back up to the desired form. The data architect does this by going through three traditional architectural processes:

  • Conceptual - represents all business entities.
  • Logical - represents the logic of how entities are related.
  • Physical - the realization of the data mechanisms for a specific type of functionality.
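
The three levels above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Customer` and `Order` entities, their attributes, and the choice of SQLite as the physical store do not come from the source. The conceptual level only names the entities, the logical level states their attributes and relationships, and the physical level realizes them in one particular storage technology.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual level: name the business entities only (hypothetical examples).
CONCEPTUAL_ENTITIES = ["Customer", "Order"]


# Logical level: attributes and relationships, independent of any database product.
@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # each Order belongs to exactly one Customer
    total_cents: int


# Physical level: one possible realization of the logical model, here as SQLite tables.
PHYSICAL_DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total_cents INTEGER NOT NULL
);
"""


def build_physical_store() -> sqlite3.Connection:
    """Create an in-memory database that implements the logical model."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHYSICAL_DDL)
    return conn


def save(conn: sqlite3.Connection, customer: Customer, orders: list[Order]) -> None:
    """Persist one customer and that customer's orders."""
    conn.execute(
        "INSERT INTO customer VALUES (?, ?)",
        (customer.customer_id, customer.name),
    )
    conn.executemany(
        "INSERT INTO customer_order VALUES (?, ?, ?)",
        [(o.order_id, o.customer_id, o.total_cents) for o in orders],
    )
    conn.commit()
```

The same logical model could equally be realized as files, a document store, or another database; only the physical level would change.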

Data architecture should be defined in the planning phase of the design of a new data processing and storage system. The major types and sources of data necessary to support an enterprise should be identified in a manner that is complete, consistent, and understandable. The primary requirement at this stage is to define all of the relevant data entities, not to specify computer hardware items. A data entity is any real or abstracted thing about which an organization or individual wishes to store data.

Components in Data Architecture[3]
“Data Lake”, “Data Warehouse”, and “Data Mart” are typical components in the architecture of a data platform. Data produced in the business passes through them in this order, and each stage processes the data into a form suited to a further use.

Figure: Components of data architecture

The three components are responsible for three different functions, as follows:

  • Data Lake: holds an original copy of data produced in the business. Processing applied to the original should be minimal, if any; otherwise, if some processing later turns out to be wrong, the error cannot be fixed retrospectively.
  • Data Warehouse: holds data processed and structured by a managed data model, reflecting the global (not specific) direction of the final use of the data. In many cases, the data is in tabular format.
  • Data Mart: holds a subset and/or an aggregated data set for the use of a particular business function, e.g. a specific business unit or geographical area. A typical example is a summary of KPIs prepared for a specific business line and then visualized in a BI tool. Building this separate, independent component after the warehouse is especially worthwhile when the user wants the data mart updated regularly and frequently. Conversely, it can be skipped when the user only needs a data set for a one-off ad hoc analysis.
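
The flow through the three components can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a reference implementation: the order records, the field names, and the helper functions `load_warehouse` and `build_region_mart` are all hypothetical. The lake keeps the raw records untouched, the warehouse applies one shared, typed data model, and the mart aggregates for a single business question (revenue per region).

```python
import json
from collections import defaultdict
from decimal import Decimal

# Data lake: raw records kept exactly as produced (hypothetical JSON lines).
DATA_LAKE = [
    '{"order_id": 1, "region": "EU", "amount": "19.99", "ts": "2024-01-05"}',
    '{"order_id": 2, "region": "US", "amount": "5.00", "ts": "2024-01-05"}',
    '{"order_id": 3, "region": "EU", "amount": "10.01", "ts": "2024-01-06"}',
]


def load_warehouse(raw_lines):
    """Parse raw records into typed rows that follow one shared data model."""
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        rows.append({
            "order_id": int(rec["order_id"]),
            "region": rec["region"],
            # Decimal avoids float rounding when converting to integer cents.
            "amount_cents": int(Decimal(rec["amount"]) * 100),
            "order_date": rec["ts"],
        })
    return rows


def build_region_mart(warehouse_rows):
    """Aggregate warehouse rows into revenue per region for one business use."""
    totals = defaultdict(int)
    for row in warehouse_rows:
        totals[row["region"]] += row["amount_cents"]
    return dict(totals)


warehouse = load_warehouse(DATA_LAKE)
region_mart = build_region_mart(warehouse)
print(region_mart)  # {'EU': 3000, 'US': 500}
```

Because the lake is never modified, a mistake in `load_warehouse` can be corrected and the warehouse and mart rebuilt from the original records.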

Roughly speaking, data engineers cover the path from extracting data produced in the business into the data lake, through building the data model in the data warehouse, to establishing the ETL pipeline; data scientists cover extracting data from the data warehouse, building the data mart, and turning the result into further business applications and value.

Of course, this division of roles between data engineers and data scientists is somewhat idealized, and many companies do not hire both just to fit this definition. In practice, their job descriptions tend to overlap.

See Also