The Architecture of Data Warehouse
architecture of data warehouse
Data Warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the “hub-and-spokes” Data Warehousing architecture that is popular in many organizations.
In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).
- Operational System:
The principal reason why businesses need to create Data Warehouses is that their corporate data assets are fragmented across multiple, disparate applications systems, running on different technical platforms in different physical locations. This situation does not enable good decision making.
When data redundancy exists in multiple databases, data quality often deteriorates. Poor business intelligence results in poor strategic and tactical decision making.
Individual business units within an enterprise are designated as “owners” of operational applications and databases. These “organizational silos” sometimes don’t understand the strategic importance of having well integrated, non-redundant corporate data. Consequently, they frequently purchase or build operational systems that do not integrate well with existing systems in the business.
Data Management issues have deteriorated in recent years as businesses deployed a parallel set of ebusiness and ecommerce applications that don’t integrate with existing “full service” operational applications.
Operational databases are normally “relational” – not “dimensional”. They are designed for operational, data entry purposes and are not well suited for online queries and analytics.
Due to globalization, mergers and outsourcing trends, the need to integrate operational data from external organizations has arisen. The sharing of customer and sales data among business partners can, for example, increase business intelligence for all business partners.
The challenge for Data Warehousing is to be able to quickly consolidate, cleanse and integrate data from multiple, disparate databases that run on different technical platforms in different geographical locations.
In the next section, the ETL portion of the Data Warehousing architecture is discussed.
- Etl Process:
The ETL software extracts data, transforms values of inconsistent data, cleanses “bad” data, filters data and loads data into a target database. The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.
ETL Technology (shown below with arrows) is an important component of the Data Warehousing Architecture. It is used to copy data from Operational Applications to the Data Warehouse Staging Area, from the DW Staging Area into the Data Warehouse and finally from the Data Warehouse into a set of conformed Data Marts that are accessible by decision makers.
- DW Stagigng area:
The Data Warehouse Staging Area is temporary location where data from source systems is copied. A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse.
Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.
For example, it might be reasonable to extract sales data on a daily basis, however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract “customer” data from a database in Singapore at noon eastern standard time, but this would not be feasible for “customer” data in a Chicago database.
Data in the Data Warehouse can be either persistent (i.e. remains around for a long period) or transient (i.e. only remains around temporarily).
Not all business require a Data Warehouse Staging Area. For many businesses it is feasible to use ETL to copy data directly from operational databases into the Data Warehouse.
- Data Warehouses:
The purpose of the Data Warehouse in the overall Data Warehousing Architecture is to integrate corporate data. It contains the “single version of truth” for the organization that has been carefully constructed from data stored in disparate internal and external operational databases.
The amount of data in the Data Warehouse is massive. Data is stored at a very granular level of detail. For example, every “sale” that has ever occurred in the organization is recorded and related to dimensions of interest.
This allows data to be sliced and diced, summed and grouped in unimaginable ways.Contrary to popular opinion, the Data Warehouse does not contain all the data in the organization.
Its purpose is to provide key business metrics that are needed by the organization for strategic and tactical decision making.
Decision makers don’t access the Data Warehouse directly. This is done through various front-end tools that read data from subject specific Data Marts.
The Data Warehouse can be either “relational” or “dimensional”. This depends on how the business intends to use the information.
- Data Marts:
ETL (Extract Transform Load) jobs extract data from the Data Warehouse and populate one or more Data Marts for use by groups of decision makers in the organizations. The Data Marts can be Dimensional (Star Schemas) or relational, depending on how the information is to be used and what “front end” Data Warehousing Tools will be used to present the information.
Each Data Mart can contain different combinations of tables, columns and rows from the Enterprise Data Warehouse. For example, a business unit or user group that doesn’t require a lot of historical data might only need transactions from the current calendar year. The Personnel Department might need to see all details about employees, whereas data such as “salary” or “home address” might not be appropriate for a Data Mart that focuses on Sales.