« Airline data
Locale names »


DWH optimization

Posted by Oleg Solovyev on Apr 23, 2011

Indexes

The first thing one should start with is indexes. They decrease table read time if one of the columns in the where statement is indexed. They also decrease tables join/merge time if one of the ID columns is indexed. The list of DWH indexes is available in the system table DICTIONARY.INDEXES:

proc sql;
  create table work.indexes_list as
  select *
  from dictionary.indexes;
quit;

Indexes can be simple and combined. Simple index is created on one column. Combined index involves several columns. The main difference is that combined index is more efficient then simple index when query involves filter or join based on several columns. SAS compiler decides what index to use depending on the query code and indexes available.

Indexes have advantages and draw backs. Indexes do reduce query time but it takes time and additional disk space to calculate and store them. Ideally indexes should be created on every column that can be used in a query where or join statements. But calculation of all possible indexes can take too much time. One has to rely on his own experience and gut feel to decide what columns to index.

Index renewal

DWH is updated with new data every day and old indexes become useless. That is why indexes are modified each time the table updates. There are two ways of updating indexes:

  • remove the index and calculate it again
  • update the existing index without removal

The first way takes the most time but results in better index quality. The second approach updates index with the new information that takes less time but results in worse index quality. SAS “optimizer” can delete and recalculate the index automatically if it decides that its quality is too low.

Compress option

SAS file format is very rarefied. One can compress the SAS table 10 times using RAR or ZIP archiver. SAS also has built in algorithm that is activated with compress option:

data work.test(compress = yes);
  ...
run;

or in the libname statement:

libname example 'c:\dwh\data\example' compress = yes;

SPDE

SPDE (Scalable Performance Data Engine) is a new SAS library type. The old and default type is V9. But SPDE can process data faster than V9. That is done by splitting SAS tables into several files and using of several CPU to process the table. SPDE does not require additional license and comes with Base SAS starting from SAS 9.1. More about SPDE here.

Defragmentation

One can imagine hard disk as a huge line. When new file is saved the beginning of the file is written in any free segment of the line. When the file is too big to fit the segment the file is being split into several fragments. The more fragments the more time it takes to read the file. When the DWH is updated a lot of new files are being written to the disk.

Defragmentation is a process that rewrites the files to reduce the number of fragmented ones. It is probably the most simple and the most efficient DWH optimization technique.

Optimizing hard disk usage

DWH server hard disks as a rule have different characteristics. The most important for the DWH are the size and read/write speed. The fastest disk should be used for the WORK library as it is the most extensively used by users.

The slowest disks should contain OS files, DWH software (SAS) and the tables being copied from sources as network speed is usually slower than the disk write speed.

One can also store tables within one library but on different disks. It is useful when library size grows bigger than any disk available:

libname example ('c:\DWH\data\example' 'd:\DWH\data\example');
libname sample (sasuser sashelp);

Metrics

The article started with the optimization methods. But one should start optimization project with defining the DWH metrics. Good examples are process load and the volume of data red/written to the DWH. Metrics are used to assess whether the optimization methods work or not. A lot of metrics are available in the Windows Performance Monitor: Start → Control Panel → Administrative Tools → Performance.

Leave a Reply

Please type the word if you are not a spam-robot