Data Warehouse Physical Design

Data Warehouse Physical Design

Data files guidelines

  • Create files with an initial size 
    • Based on the eventual size of the objects that will be stored on them 
    • This pre-allocates sequential disk blocks and helps avoid fragmentation. 
  • Disable autogrowth 
    • If you begin to run out of space in a data file, it is more efficient to explicitly increase the file size by a large amount rather than rely on incremental autogrowth. 

Filegroups guidelines

  • Create at least one filegroup in addition to the primary one, and then set it as the default filegroup so you can separate data tables from system tables.
  • Create dedicated filegroups for extremely large fact tables and using them to place those fact tables on their own logical disks.
  • If some tables in the data warehouse are loaded on a different schedule from others, consider using filegroups to separate the tables into groups that can be backed up independently.
  • If you intend to partition a large fact table, create a filegroup for each one so that older, stable rows can be backed up, and then set as read-only.

Staging tables

  • Separate staging database  
    • Create it on a logical disk distinct from the data warehouse files. 
  • Into the data warehouse database 
    • Create a file and filegroup for them on a logical disk 
    • Separate from the fact and dimension tables. 
    • An exception to the previous guideline is made for staging tables that will be switched with partitions to perform fast loads. 
      • These must be created on the same filegroup as the partition with which they will be switched.

TempDB

  • To avoid fragmentation of data files 
    • Place it on a dedicated logical disk 
    • Set its initial size based on how much it is likely to be used. 
  • Set the growth increment to be quite large to ensure that performance is not interrupted by frequent growth of TempDB. 
  • Creating multiple files for TempDB to help minimize contention during page free space (PFS) scans as temporary objects are created and dropped.

Transaction logs

  • Set the transaction mode of the Data Warehouse, Staging Database and TempDB to Simple 
  • Helps to avoid having to truncate transaction logs 
  • Additionally, most of the inserts in a data warehouse are typically performed as bulk load operations which are not logged. 
  • To avoid disk resource conflicts between data warehouse I/O and logging, place the transaction log files for all databases on a dedicated logical disk.

Data Compression

  • SQL Server Enterprise Edition supports data compression at both page and row level.
  • Data compression benefits in a data warehouse
    • Reduced storage requirements.
    • Improved query performance 
  • Best practices for data compression in a data warehouse 
    • Use page compression on all dimension tables and fact table partitions. 
    • If performance is CPU-bound, revert to row compression on frequently accessed partitions.

Table Partitioning

  • Improved query performance 
  • More granular manageability
  • Improved data load performance
  • Best practices for partitioning in a DW
    • Partition Large Fact Tables
    • Partition on an incrementing date key
    • Design the partition scheme for ETL and manageability.
    • Maintain an empty partition at the start and end of the table

Indexes in DW

  • Indexes maximize query performance 
  • Planning Indexes is the most important part of the database design process 
  • Some inexperienced BI professionals are tempted to create many indexes on all tables to support queries.

Dimension table indexes 

  • Create a clustered index on the surrogate key column.
    • This column is used to join the dimension table to fact tables, and a clustered index will help the query optimizer minimize the number of reads required to filter fact rows.
  • Create a non-clustered index on the alternate key column and include the SCD current flag, start date, and end date columns.
    • This index will improve the performance of lookup operations during ETL data loads that need to handle slowly-changing dimensions.
  • Create nonclustered indexes on frequently searched attributes, and consider including all members of a hierarchy in a single index.

Fact table indexes

  • Create a clustered index on the most commonly-searched date key.
    • Date ranges are the most common filtering criteria in the most data warehouse workloads, so a clustered index on this key should be particularly effective in improving overall query performance.
  • Create nonclustered indexes on other, frequently-searched dimension keys.
  • Columnstore index on all columns.

Using Views in a DW

  • Create a view for each dimension and fact table with NOLOCK query hint in the view definition Create views with a user-friendly view and column names
  • Do not include metadata columns in views
  • Create views to combine snowflake dimension tables
  • Partition-align indexed views
  • Use the SCHEMABINDING option
  • Security

Comments

Popular posts from this blog

Maxpooling vs minpooling vs average pooling

Understand the Softmax Function in Minutes

Percentiles, Deciles, and Quartiles