« Pass-Through Facility
Enterprise Miner vs SAS/STAT »


Text mining log files

Posted by Oleg Solovyev on Dec 13, 2009

The problem described in this article was solved during the DWH optimization project. At the beginning of the project the DWH was too slow and users often complained that it takes too much time to execute a query. We have started DWH optimization with the development of a set of indexes. There were very few indexes at that time.

One should create indexes for those tables that are read most often. The problem is that many SAS DBA’s don’t know how often the tables are queried. We also had neither data schema nor lists of users and tables that users have access to.

Therefore we had to collect that information ourselves. And we decided to use DWH log files for that. Log files contain a lot if information including code submitted to the DWH, users’ names, submission time, etc. The program described bellow imports log files into SAS and creates the following reports.

Number of queries by user and table

Number of queries by user and table This report contains information about how many times each table was queried by each user. First column contains tables’ names. Second column contains total number of quires that users submitted into the DWH. Usually most frequently queried tables are indexed first.

CPU usage

CPU usage

 The report shows server CPU, user CPU and real time taken to execute users queries. The report helps to find users that consume DWH resources more than others. Some of them may execute queries that are not optimal and need some refactoring to reduce execution time. Some users may need additional, separate server. Therefore they won’t influence others and won’t slow down their queries.

Most popular tables

Most popular tables

The report contains list of tables that are not only queried most often but also consume a lot of CPU time. If the number of times the table is queried is high but the total CPU time is low one should think twice before indexing them. We will discuss three main issues in this article:

  • configuring log files output,
  • log files structure,
  • importing log files into SAS

Management Console Settings

Spawner “create log files” option is turned off by default. You can turn it on using SAS Management Console:

  1. Run SAS Management Console: Start → All programs → SAS → SAS Management Console.
  2. Enter you user name and password in the Enter window and click Ok.
  3. On the left side window click right mouse button on the list item SASMain – Workspace Server: Environment Management → Server Manager → SAS Main → SAS Main – Logical Workspace Server → SAS Main – Workspace Server.
  4. Click on the Properties menu item.
  5. Open Options tab.
  6. In the command field type the command for Spawner to run Base SAS: sas -noxwait -xcmd -config “C:\SAS\EMiner\Lev1\SASMain\sasv9.cfg” -log “C:\SAS logs\log_%Y_%m_%d_%H_%M_%s_%v.log” -logparm “rollover=session write=immediate”.
  7. Type “APPLEVEL=2” in the Object Server Parameters. It sets the log files detail level.
  8. Click Ok.

SAS Management ConsoleRun Enterprise Guide/Miner to test new settings. You should find SAS log files in the “C:\SAS logs” catalog. The catalog that you specified in the settings.

Spawner settings

If you don’t have SAS Management Console installed you can configure Spawner manually. This document describes Spawner installation and configuration. In brief you will need to put command described in the above section into Spawner’s config file.

Log file structure

Let’s take a look at the information saved into the log file after the execution of the following code through Enterprise Guide:

data work.sales;
  set sales.year2008(where = (date = '01mar2008'd));
run;

Log file contains the following:

  • Date and time SAS session have started.
20081012:11.16.48,06: ... Inherited client connection (1) for user OLEG-LAPTOP\Oleg.
  • User name
20081012:11.16.49,39: ... %LET _EGUSERID = %NRBQUOTE(Oleg);
20081012:11.16.49,39: ... %LET _EGUSERNAME = %NRBQUOTE(Oleg Solovyev);
  • Submitted code
20081012:11.17.34,92: ... data work.sales;
20081012:11.17.35,09: ... set sales.year2008(where = (date = "01mar2008"d));
20081012:11.17.35,11: ... run;
  • Information about index usage
20081012:11.17.35,11: ... INFO: Index date selected for WHERE clause optimization.
  • Execution time
20081012:11.17.35,80: ... NOTE: DATA statement used (Total process time):
20081012:11.17.35,80: ... real time 0.87 seconds
20081012:11.17.35,80: ... cpu time 0.23 seconds
  • Date and time SAS session was closed
20081012:11.27.50,42: 00000005: Client connection (1) closed.

Reading log files

Spawner creates separate log file for every SAS session. Before parsing the file one should create a list of all log files. You can do it the following way.

%let dir_name = "C:\SAS logs";
data work.file_names(drop = rc did memcount i);
  length dir_name $200 file_name $200;
  dir_name = &dir_name;
  if fileref("dref")<=0 then
    rc = filename("dref");
  if filename("dref", dir_name) then do;
    put "Error assigning fileref!";
    abort;
  end;
  did = dopen("dref");
  if ^did then do;
    put "Cannot access directory" dir_name;
    rc = filename("dref");
    abort;
  end;
  memcount = dnum(did);
  do i=1 to memcount;
    file_name = dread(did, i);
    output;
    end;
  rc = filename("dref");
run;

This data step was written by my colleague Mikhail Konovalov. Next you can read each log file one by one the following way.

data work.log_data;
  set work.file_names;
  length fname $256;
  fname = catx('\',dir_name,file_name);
  infile dummy filevar=fname end=eof lrecl=512;
  do until(eof);
    input;
    put _infile_;
  end;
run;

This code was proposed by Alan Churchill. Mine was too complicated.

The data step outputs a string from a log file into the SAS Log window using put _infile_ operator. We need to put the data from that file into the SAS table. There are a lot of ways to do that with SAS string functions. Below is described a way to do the same thing using regular expressions.

Regular expressions

Regular expressions is a language to find patterns in text. One sets a pattern he is looking for and expression returns the number of the first letter of the pattern in the string.

Let’s take a look how to find user name in the log file that is described in the section “Log files structure”. You can see that user name is preceded by the string “_EGUSERNAME = %NRBQUOTE(” and contains only letters and one space. User name is also followed by the symbols “);”.Below is an example of regular expression that finds user name in the log file.

data _null_;
  input;

  regexp = '/_EGUSERNAME = %NRBQUOTE\([A-z, ]{1,20}/';
  RegularExpressionId = prxparse(regexp);

  call PRXSUBSTR(RegularExpressionId, _INFILE_, pos, len);

  put _infile_;
  put pos= len=;

  user_name = substr(_infile_,pos+24,len-24);
  put user_name=;

datalines4;
20081012:11.16.49,39: 00000002:         :: IOM LOG: 38         %LET _EGUSERNAME = %NRBQUOTE(Oleg Solovyev);
;;;;
run;

After execution of the code one should find user name “Oleg Solovyev” in the SAS Log window. Below are some other regular expressions to parse Spawner log files.

Information example from Log file regular expression
User name _EGUSERNAME = %NRBQUOTE(Oleg Solovyev); ‘/_EGUSERNAME = %NRBQUOTE\([A-z, ]{1,20}/’
table name sales.year2008 ‘/[A-z_][A-z_\d]{0,7}[.][A-z_][A-z_\d]{0,31}[ (,;.]/’
execution time real time 0.87 seconds ‘/real time[ ]{0,11}[\d:.]{0,11}/’

Conclusion

There was a guy that worked at the same company when I was working on the project. It took several hours to execute one of his reports. He submitted his code in the morning and got results only in the evening.

During the project we have indexed the DWH, changed the structure of historic tables and scheduled his report in the morning when ETL-processes are finished and users haven’t yet started working with the DWH. The next morning that guy called my boss to thank him because his report was ready in several seconds.

We used log files to figure out what are the most queried tables. However log files can be used for a lot of other purposes. For instance one can monitor how often are the current indexes are used or how optimal is the submitted code. I would like to continue working on the program described in this article but I need spawner log files for testing.

If you can get those log files send me an email to oleg@sasbi.net. You will get my code and reports in response. I promise not to distribute your files and guarantee its safety. Al the important information on my computer is encrypted. It is unreadable even if one steals the hard drive.

It often happens that the same idea occurs to different people at the same time. Working on the program I found three other guys thinking about the same problem:

Leave a Reply

Please type the word if you are not a spam-robot