Data Warehouse Physical Design

Data Warehouse Physical Design

October 15, 2019

Data Warehouse Physical Design

Data files guidelines

Create files with an initial size

Based on the eventual size of the objects that will be stored on them
This pre-allocates sequential disk blocks and helps avoid fragmentation.

Disable autogrowth

If you begin to run out of space in a data file, it is more efficient to explicitly increase the file size by a large amount rather than rely on incremental autogrowth.

Filegroups guidelines

Create at least one filegroup in addition to the primary one, and then set it as the default filegroup so you can separate data tables from system tables.
Create dedicated filegroups for extremely large fact tables and using them to place those fact tables on their own logical disks.
If some tables in the data warehouse are loaded on a different schedule from others, consider using filegroups to separate the tables into groups that can be backed up independently.
If you intend to partition a large fact table, create a filegroup for each one so that older, stable rows can be backed up, and then set as read-only.

Staging tables

Separate staging database

Create it on a logical disk distinct from the data warehouse files.

Into the data warehouse database

Create a file and filegroup for them on a logical disk
Separate from the fact and dimension tables.
An exception to the previous guideline is made for staging tables that will be switched with partitions to perform fast loads.

These must be created on the same filegroup as the partition with which they will be switched.

TempDB

To avoid fragmentation of data files

Place it on a dedicated logical disk
Set its initial size based on how much it is likely to be used.

Set the growth increment to be quite large to ensure that performance is not interrupted by frequent growth of TempDB.
Creating multiple files for TempDB to help minimize contention during page free space (PFS) scans as temporary objects are created and dropped.

Transaction logs

Set the transaction mode of the Data Warehouse, Staging Database and TempDB to Simple
Helps to avoid having to truncate transaction logs
Additionally, most of the inserts in a data warehouse are typically performed as bulk load operations which are not logged.
To avoid disk resource conflicts between data warehouse I/O and logging, place the transaction log files for all databases on a dedicated logical disk.

Data Compression

SQL Server Enterprise Edition supports data compression at both page and row level.
Data compression benefits in a data warehouse

Reduced storage requirements.
Improved query performance

Best practices for data compression in a data warehouse

Use page compression on all dimension tables and fact table partitions.
If performance is CPU-bound, revert to row compression on frequently accessed partitions.

Table Partitioning

Improved query performance
More granular manageability
Improved data load performance
Best practices for partitioning in a DW

Partition Large Fact Tables
Partition on an incrementing date key
Design the partition scheme for ETL and manageability.
Maintain an empty partition at the start and end of the table

Indexes in DW

Indexes maximize query performance
Planning Indexes is the most important part of the database design process
Some inexperienced BI professionals are tempted to create many indexes on all tables to support queries.

Dimension table indexes

Create a clustered index on the surrogate key column.

This column is used to join the dimension table to fact tables, and a clustered index will help the query optimizer minimize the number of reads required to filter fact rows.

Create a non-clustered index on the alternate key column and include the SCD current flag, start date, and end date columns.

This index will improve the performance of lookup operations during ETL data loads that need to handle slowly-changing dimensions.

Create nonclustered indexes on frequently searched attributes, and consider including all members of a hierarchy in a single index.

Fact table indexes

Create a clustered index on the most commonly-searched date key.

Date ranges are the most common filtering criteria in the most data warehouse workloads, so a clustered index on this key should be particularly effective in improving overall query performance.

Create nonclustered indexes on other, frequently-searched dimension keys.
Columnstore index on all columns.

Using Views in a DW

Create a view for each dimension and fact table with NOLOCK query hint in the view definition Create views with a user-friendly view and column names
Do not include metadata columns in views
Create views to combine snowflake dimension tables
Partition-align indexed views
Use the SCHEMABINDING option
Security

Comments