Table of Contents
Data Vault, in it's own creator's words, is a system of Business Intelligence, which includes modeling, methodology, architecture, and implementation best practices.
It is a data modeling technique, which relies on standartized entities - Hubs, Links, Satellites - building up layered Vaults in a strictly defined manner.
The components also known as pillars of Data Vault 2.0 are following:
- Modeling - focused on process and data models;
- Methodology - following Scrum and Agile ways of working;
- Architecture - includes NoSQL and big data systems;
- Implementation - pattern-based automation and generation.
The term "data vault" is merely a marketing term chosen in 2001 to represent the system to the market. The true name for the data vault system of business intelligence (BI) is common foundational warehouse modeling, methodology, arthitecture, and implementation1 or Common Foundational Warehouse Architecture2.
Pros
- Model facilitates reacting to business changes
- While has 3-6 times more objects than normalized Data Warehouse, data transformations can be easily automated and often run in parallel
- Most changes are compartmentalized to a specific layer
Cons
- Requires strong commitment and adherence to the standard
- Hardly possible without automation tools
Concepts
- Hub - represents a business element and holds a unique list of Business Keys.
- Link - represents unique identifiable business element (Hub) relationships (intersections). Similar to Facts in Dimensional Modeling as it contains associations or transactions between Business Keys. Does not contain descriptive attributes.
- Satellite - stores descriptive information in delta-driven (changes over time) manner. Satellites should only store attribute changes, not the attribute state of each loading cycle2. Does not contain Business Keys (references to Hubs or Links via Hash Keys). Satellite is the only type of entity to store history.
- Satellite on Hub - stores the descriptive information and the change of history of a business element.
- Satellite on Link - stores the descriptive information and the change of history of the relationship of business elements.
- Satellite on Same As Link - stores the descriptive information and the change of history of the relationship of the Same As Link.
- Satellite on Reference table - stores classifier detailed descriptive information.
- System-driven Satellite - a structure that is populated by hard business rules and is driven by systematic or system regulated process.
- Point-in-Time (PIT) table - query assistant structure, and is geared towards query performance. Structure can be modified to include computed columns as necessary. PIT table creates snapshots of data for dates specified by the data consumers upstream.
- Bridge table - another type of query assistant table. Their purpose is to improve the performance of queries on the Raw Data Vault by reducing the number of required joins for the query.
Business rules
- Hard Business Rules - technical rules that align the data domains, so called data type matching. Enforced when data is extracted from source systems and loaded into staging area. These business rules only affect only the enforcement of data types but don't convert any values to fit the analytical requirements of the business. Hard business rules never change the meaning of the incoming data, only the way data is stored;
- Soft Business Rules - enforces the business requirements that are stated by the business user. These rules change the data or the meaning of the data, e.g.: by modifying the grain or interpretation, aggregation, allocating data into categories, consolidation of data from multiple sources, transformations to meet business requirements.
Example
Entity-Relationsip Diagram, following Transaction example from Dimensional modeling.
erDiagram
"Hub Account" {
string AccountHashKey PK
timestamp LoadDate
string RecordSource
timestamp LastSeenDate
string AccountIBAN UK
}
"Hub Channel"
"Hub Customer"
"Hub Payment Method"
"Hub Transaction" {
string TransactionHashKey PK
timestamp LoadDate
string RecordSource
number TransactionId UK
}
%% Links together Hubs making up the Transaction
"Link Transaction" {
string TransactionHashKey PK
timestamp LoadDate
string RecordSource
string AccountHashKey PK, FK
string ChannelHashKey PK, FK
string CustomerHashKey PK, FK
string PaymentMethodHashKey PK, FK
}
"Sat Customer" {
string CustomerHashKey PK, FK
timestamp LoadDate PK
timestamp LoadEndDate
string RecordSource
string HashDiff
string FullName
string Address
string EmailAddress
}
"Hub Customer" ||--|{ "Sat Customer" : ""
%% Links are always many-to-many!
"Hub Account" ||--o{ "Link Transaction" : ""
"Hub Channel" ||--o{ "Link Transaction" : ""
"Hub Customer" ||--o{ "Link Transaction" : ""
"Hub Payment Method" ||--o{ "Link Transaction" : ""
"Hub Transaction" ||--|{ "Link Transaction" : ""
"Hub Transaction" ||--|{ "Sat Transaction" : ""
"Hub Account" ||--|{ "Sat Account" : ""
"Hub Channel" ||--|{ "Sat Channel" : ""
"Hub Payment Method" ||--|{ "Sat Payment Method" : ""
Notes:
- Transaction is modeled as both Hub and Link! While in general, linking Hubs together is done via adding a Link, in this case it is not enough: Transaction has a Business Key (
TransactionId), therefore it must exist as Hub as well. - Links are implemented by tables that represent many-to-many relationships, which provide benefits, such as flexibility to the model - changes to Business Rules don't require re-engineering of links.
LoadDateindicates when the Business Key initially arrived to the Data Warehouse. Should be the same for all data that arrived in the same batch.- For
RecordSourceavoid generalized data source - use the lowest level of granularity. When Business Key has multiple sources, it should indicate master data source. LastSeen(optional) - indicates when Business Key was last seen in source system.- When designing Satellites, it is recommended to split data by source system (into separate tables).
HashDiffin Satellite is optional.LoadEndDateindicates when Satellite entry becomes invalid.HashDiff- hash value of all descriptive data of a Satellite's entry. It is possible to ignore some of the descriptive attributes by leaving them out of the hash value calculation for the hash diff attribute.
Copyright © 2025-2026 Gytis Repečka and Inretio® with ❤️ from 🇱🇹. No AI slop here!
Content is licenced under Creative Commons BY-NC-ND 3.0 unless stated otherwise. Terms of use.