<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>SASBI.net</title>
	<atom:link href="http://sasbi.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://sasbi.net</link>
	<description>STATS and Business Intelligence</description>
	<pubDate>Sun, 14 Aug 2011 15:54:21 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Social Network Analysis</title>
		<link>http://sasbi.net/social-network-analysis/</link>
		<comments>http://sasbi.net/social-network-analysis/#comments</comments>
		<pubDate>Sun, 14 Aug 2011 15:54:21 +0000</pubDate>
		<dc:creator>Oleg Solovyev</dc:creator>
		
		<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://sasbi.net/?p=296</guid>
		<description><![CDATA[Social Network Analysis uses graphs to understand relationships between people.]]></description>
			<content:encoded><![CDATA[<p>One of the newest fields in data mining is Social Network Analysis (SNA). The task is to identify your friends (the first circle), then the friends of your friends (the second circle), and so on. Mathematicians describe this as building a graph made of nodes (the people) and edges (the ties between people).</p>
<p><iframe title="YouTube video player" width="489" height="390" src="http://www.youtube.com/embed/oLto_eY03rg" frameborder="0" allowfullscreen></iframe></p>
<p>In telecom, for example, graphs can be built from call records. The people you call are your first circle: relatives, colleagues or friends. You value these people and listen to their opinions. If one of your friends uses mobile internet, the telecom operator can offer you the same service with a high probability of purchase.</p>
<p><span id="more-296"></span></p>
<p>Social networks like Facebook can identify your first circle from your “friends list” or by monitoring the personal pages you visit. An advertisement you saw on Facebook may have been shown to you because one of your friends clicked on it earlier.</p>
<p>Social networks are also important in debt collection. Colleagues and friends can influence the debtor and persuade him to pay. This is why banks and collection agencies actively collect contact information for your friends, neighbors and colleagues. Sometimes the debt is paid by friends or relatives rather than by the debtor.</p>
<p>For software companies, experts are the most valued asset. Every person in the company should have access to an expert’s knowledge. The social network graph can show whether an expert is actively helping colleagues or is isolated from them. Such a graph can be built from data on internal mail and phone conversations.</p>
<p>For example, the graph above is based on the internet forum <a href="http://www.sql.ru/forum/actualtopics.aspx?bid=26" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.sql.ru');">SQL.ru&nbsp;&rarr; OLAP and DWH</a>. The nodes are the forum members, and an edge shows that one member took part in another member’s thread. Every edge has a weight equal to the number of posts by member A in member B’s threads plus the number of posts by member B in member A’s threads.</p>
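<p>As a rough illustration of this weighting, here is a minimal SAS sketch. The table work.posts and its columns thread_author and poster are made up for the example, and leaving out posts in one’s own thread is an assumption of the sketch, not something stated about the original graph.</p>
<pre class="brush: text">
/* Hypothetical input: one row per post, with the thread owner and the poster. */
data work.posts;
  input thread_author $ poster $;
datalines;
anna boris
anna boris
boris anna
boris carl
carl carl
;
run;

/* Order each pair alphabetically so that posts in both directions
   add up to the weight of one undirected edge. */
proc sql;
  create table work.edges as
  select case when poster lt thread_author then poster else thread_author end as member_a,
         case when poster lt thread_author then thread_author else poster end as member_b,
         count(*) as weight
  from work.posts
  where poster ne thread_author   /* assumption: own-thread posts are not edges */
  group by calculated member_a, calculated member_b;
quit;
</pre>
<p>With these sample rows the edge anna-boris gets a weight of 3 and the edge boris-carl a weight of 1.</p>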
<p>First I made a list of all 3 000+ forum members and added the edges. The graph looked like a black spot on the monitor. I removed all edges with a weight below 10 and deleted all members left without edges. That is the last graph in the video. Then I kept deleting the edges with the lowest weight until only one edge was left. That is the first graph in the video. Finally I put the graphs into the video in reverse order, from the smallest graph to the biggest one.</p>
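<p>The pruning step can be sketched in SAS roughly as follows. The table work.edges and its sample rows are hypothetical; only the threshold of 10 comes from the description above.</p>
<pre class="brush: text">
/* Hypothetical edge list: two members and the weight of the tie between them. */
data work.edges;
  input member_a $ member_b $ weight;
datalines;
anna boris 25
boris carl 12
carl dina 3
;
run;

proc sql;
  /* Keep only the edges with a weight of at least 10. */
  create table work.edges_kept as
  select *
  from work.edges
  where weight ge 10;

  /* Members that still have at least one edge; everyone else drops out. */
  create table work.members_kept as
  select member_a as member from work.edges_kept
  union
  select member_b as member from work.edges_kept;
quit;
</pre>
<p>Here the edge carl-dina is removed, and dina disappears from the member list because no edge is left for her. Raising the threshold step by step gives the sequence of smaller and smaller graphs.</p>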
<p>The video below is based on the forum <a href="http://www.sql.ru/forum/actualtopics.aspx?bid=16" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.sql.ru');">SQL.ru&nbsp;&rarr; Просто треп</a> (just&nbsp;chat).</p>
<p><iframe title="YouTube video player" width="489" height="390" src="http://www.youtube.com/embed/hSKW0EoImks" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://sasbi.net/social-network-analysis/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Banks and Credit bureaus</title>
		<link>http://sasbi.net/banks-and-credit-bureaus/</link>
		<comments>http://sasbi.net/banks-and-credit-bureaus/#comments</comments>
		<pubDate>Sun, 07 Aug 2011 08:16:49 +0000</pubDate>
		<dc:creator>Oleg Solovyev</dc:creator>
		
		<category><![CDATA[Text Mining]]></category>

		<guid isPermaLink="false">http://sasbi.net/?p=289</guid>
		<description><![CDATA[According to Russian legislation, every bank has to report to one of the credit bureaus (CB). The reported credit histories (CH) contain the credit amount, monthly payments and other information. Any bank can request your credit history to assess your creditworthiness and decide whether to issue you another loan.

Consumers with good [...]]]></description>
			<content:encoded><![CDATA[<p>According to <a href="http://base.consultant.ru/cons/cgi/online.cgi?req=doc;base=LAW;n=70212" onclick="javascript:pageTracker._trackPageview('/outbound/article/base.consultant.ru');">Russian legislation</a>, every bank has to report to one of the credit bureaus (CB). The reported credit histories (CH) contain the credit amount, monthly payments and other information. Any bank can request your credit history to assess your creditworthiness and decide whether to issue you another loan.</p>
<p><img src="http://sasbi.net/wp-content/uploads/2011/08/bank_cb.png" alt="banks and credit bureaus" title="banks and credit bureaus" width="480" height="633"/></p>
<p>Consumers with good credit histories can get a new loan at a lower interest rate. But one has to know which CB stores one’s credit history and which banks can request that history from the CB. If your credit history is poor, for example because you had delinquent loans, you had better look for a bank that doesn’t request your credit history.</p>
<p><span id="more-289"></span></p>
<p>The Russian Central Bank (RCB) <a href="http://ckki.www.cbr.ru/?m_ParsSelectorState=1" onclick="javascript:pageTracker._trackPageview('/outbound/article/ckki.www.cbr.ru');">web site</a> lets you find the CBs where your credit histories are stored. According to the RCB web site there are 800+ credit organizations and 30+ credit bureaus in Russia. Most credit histories are stored in the five biggest CBs: Equifax, Experian-Interfax, NBKI, Infocredit and MBKI.</p>
<p>Banks don’t like to publish the list of CBs they work with, but some information is available online. My task was to develop a schema of cooperation between banks and credit bureaus. For instance, if one sentence contains both a bank name and a credit bureau name, it is very probable that they exchange information with each other. But I had to exclude all sentences containing more than one bank name or more than one CB name, because such a sentence may simply be a list, for example of the participants of some forum.</p>
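<p>A minimal SAS sketch of this co-occurrence rule is shown below. The bank names and the sample sentences are invented for illustration, and the CB list is shortened to two names; the real name lists and pages are not part of this post.</p>
<pre class="brush: text">
/* Hypothetical sentences extracted from web pages. */
data work.sentences;
  infile datalines truncover;
  input sentence $char80.;
datalines;
Alpha Bank sends credit histories to NBKI.
Alpha Bank, Beta Bank and Gamma Bank all mentioned Equifax.
Nothing about lending here.
;
run;

data work.pairs(keep = bank cb);
  set work.sentences;
  length bank cb $20;
  /* Made-up bank names and a shortened CB list. */
  array banks[3] $20 _temporary_ (&#039;Alpha Bank&#039; &#039;Beta Bank&#039; &#039;Gamma Bank&#039;);
  array cbs[2] $20 _temporary_ (&#039;Equifax&#039; &#039;NBKI&#039;);
  n_banks = 0;
  n_cbs = 0;
  /* Count how many different bank names occur in the sentence. */
  do i = 1 to dim(banks);
    if find(sentence, strip(banks[i]), &#039;i&#039;) gt 0 then do;
      n_banks = n_banks + 1;
      bank = banks[i];
    end;
  end;
  /* Do the same for the credit bureau names. */
  do i = 1 to dim(cbs);
    if find(sentence, strip(cbs[i]), &#039;i&#039;) gt 0 then do;
      n_cbs = n_cbs + 1;
      cb = cbs[i];
    end;
  end;
  /* Exactly one bank and one CB: treat the sentence as evidence of a link. */
  if n_banks = 1 and n_cbs = 1 then output;
run;
</pre>
<p>Only the first sentence produces a pair (Alpha Bank, NBKI). The second one names three banks and is skipped as a probable list.</p>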
<p>The schema above was developed using 5 000+ HTML pages that mention at least one of the five biggest CBs. Unfortunately the schema doesn’t show the direction of the information exchange: a bank can request information from one CB and report credit histories to another. Determining the direction of the data exchange is the next task, as is increasing the number of banks and CBs covered.</p>
]]></content:encoded>
			<wfw:commentRss>http://sasbi.net/banks-and-credit-bureaus/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Gender cleaning</title>
		<link>http://sasbi.net/gender-cleaning/</link>
		<comments>http://sasbi.net/gender-cleaning/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 06:59:11 +0000</pubDate>
		<dc:creator>Oleg Solovyev</dc:creator>
		
		<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://sasbi.net/?p=284</guid>
		<description><![CDATA[While investigating the ABT table I found an anomaly in the gender column: only 5% of the sample were male and 95% female, while the expected ratio was 50%/50%. After comparing clients’ names with the recorded gender I was sure that the gender values were wrong.
I couldn’t simply delete the gender because it is often an [...]]]></description>
			<content:encoded><![CDATA[<p>While investigating the ABT table I found an anomaly in the gender column: only 5% of the sample were male and 95% female, while the expected ratio was 50%/50%. After comparing clients’ names with the recorded gender I was sure that the gender values were wrong.</p>
<p>I couldn’t simply delete the gender because it is often an important factor in the model. So I decided to replace it with a new column calculated from the clients’ patronymics. Most Russian male patronymics are derived from the father’s name and end in “ich”, like “Ivanovich” and “Ilyich”. Most female patronymics end in “na”, like “Ivanovna” and “Ilyinichna”.</p>
<p><span id="more-284"></span></p>
<p>Most of the remaining patronymics belong to Turkic peoples. They end in “-ogli”, meaning son, or “-kizi”, meaning daughter. Thus one can recalculate the gender in SAS the following way:</p>
<pre class="brush: text">
data test;
  input patronymic $ 1-20;

  length gender $10;

  /* Compare the endings in lower case; trim() drops trailing blanks so that $ can anchor. */
  if prxmatch(&#039;/ich$|ogli$/&#039;, lowcase(trim(patronymic))) &gt; 0 then gender = &#039;male&#039;;
  else if prxmatch(&#039;/na$|kizi$/&#039;, lowcase(trim(patronymic))) &gt; 0 then gender = &#039;female&#039;;
  else gender = &#039;unknown&#039;;

datalines;
Ivanovich
Ilyich
Ivanovna
Ilyinichna
Hamzat-ogli
Suleyman-kizi
;
run;
</pre>
<p>The Oracle code:</p>
<pre class="brush: text">
-- Raises ORA-00942 on the first run, when the table does not exist yet; this can be ignored.
drop table names;
create table names(patronymic varchar2(20));

insert into names values(&#039;Ivanovich&#039;);
insert into names values(&#039;Ilyich&#039;);
insert into names values(&#039;Ivanovna&#039;);
insert into names values(&#039;Ilyinichna&#039;);
insert into names values(&#039;Hamzat-ogli&#039;);
insert into names values(&#039;Suleyman-kizi&#039;);

-- Classify by the ending of the lower-cased patronymic.
select patronymic,
       case when REGEXP_LIKE(lower(patronymic), &#039;ich$|ogli$&#039;) then &#039;male&#039;
            when REGEXP_LIKE(lower(patronymic), &#039;na$|kizi$&#039;) then &#039;female&#039;
            else                                                  &#039;unknown&#039;
       end as gender
from names;
</pre>
<p>This algorithm recalculated the gender for 99% of the clients. For the remaining 1% the patronymic was missing. The algorithm could be improved further by taking first names and family names into account.</p>
]]></content:encoded>
			<wfw:commentRss>http://sasbi.net/gender-cleaning/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Locale names</title>
		<link>http://sasbi.net/locale-names/</link>
		<comments>http://sasbi.net/locale-names/#comments</comments>
		<pubDate>Wed, 27 Apr 2011 02:54:16 +0000</pubDate>
		<dc:creator>Oleg Solovyev</dc:creator>
		
		<category><![CDATA[SAS]]></category>

		<guid isPermaLink="false">http://sasbi.net/?p=271</guid>
		<description><![CDATA[Once at an interview I was asked to show my programming skills on the production database. The interviewer got up from his desk and asked me to write any code I wanted. I sat down at his computer, opened the SAS window and wrote a simple query that returns a list of all columns in the database:

proc [...]]]></description>
			<content:encoded><![CDATA[<p>Once at an interview I was asked to show my programming skills on the production database. The interviewer got up from his desk and asked me to write any code I wanted. I sat down at his computer, opened the SAS window and wrote a simple query that returns a list of all columns in the database:</p>
<pre class="brush: text">
proc sql;
  create table test as
  select *
  from dictionary.columns;
quit;
</pre>
<p>&nbsp;&nbsp;“Well, you rashly gave me access to your production DB, and now I can get any information out of it.”<br />
&nbsp;&nbsp;“Hey, be careful!”<br />
&nbsp;&nbsp;“For instance I can get all the column names in your database.”<br />
&nbsp;&nbsp;“Hm… Nice!”</p>
<p><span id="more-271"></span></p>
<p>I was impressed too, but not by the query.</p>
<p>&nbsp;&nbsp;“Looks like some of your columns have Russian names.”<br />
&nbsp;&nbsp;“So what?”<br />
&nbsp;&nbsp;“I thought that SAS columns should contain English characters only.”<br />
&nbsp;&nbsp;“Well, seems this interview is not a waste of time for you. You’ve learned something new.”
</p>
<p>This feature really impressed me. At SAS courses I had learned that column names can contain only English letters, digits and underscores. Russian names clearly didn’t fit this rule.</p>
<p>I forgot about this incident until I had to import a lot of Excel files with Russian column names. By default the IMPORT procedure converts Russian names into F1, F2, F3, … according to the order of the columns in the Excel file. But the VALIDVARNAME option allows columns with Russian names, for example:</p>
<pre class="brush: text">
options validvarname=any;
data work.test;
  &#039;да - это русское имя&#039;n = 1;
run;
</pre>
<p>One can also import Excel files with Russian column names and use those names in code:</p>
<pre class="brush: text">
proc sql;
  select sum(&#039;да - это русское имя&#039;n)
  from test;
quit;
</pre>
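<p>The import step itself might look roughly like the sketch below. The file path is a placeholder, and DBMS=EXCEL assumes a Windows machine with SAS/ACCESS Interface to PC Files; other setups may need a different DBMS value.</p>
<pre class="brush: text">
/* Allow any characters in column names before the import runs. */
options validvarname=any;

/* Placeholder path: point it at a real workbook. */
proc import datafile = &#039;c:\data\russian_names.xls&#039;
            out = work.test
            dbms = excel
            replace;
  getnames = yes;   /* take column names from the first row */
run;
</pre>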
<p>Looks like SAS has its own double standards.</p>
]]></content:encoded>
			<wfw:commentRss>http://sasbi.net/locale-names/feed/</wfw:commentRss>
		</item>
		<item>
		<title>DWH optimization</title>
		<link>http://sasbi.net/dwh-optimization/</link>
		<comments>http://sasbi.net/dwh-optimization/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 09:16:33 +0000</pubDate>
		<dc:creator>Oleg Solovyev</dc:creator>
		
		<category><![CDATA[SAS]]></category>

		<guid isPermaLink="false">http://sasbi.net/?p=261</guid>
		<description><![CDATA[Indexes
The first thing to start with is indexes. They reduce table read time when one of the columns in the WHERE clause is indexed. They also reduce table join/merge time when one of the ID columns is indexed. The list of DWH indexes is available in the system table DICTIONARY.INDEXES:

proc sql;
  create table [...]]]></description>
			<content:encoded><![CDATA[<h4>Indexes</h4>
<p>The first thing to start with is indexes. They reduce table read time when one of the columns in the WHERE clause is indexed. They also reduce table join/merge time when one of the ID columns is indexed. The list of DWH indexes is available in the system table DICTIONARY.INDEXES:</p>
<pre class="brush: text">
proc sql;
  create table work.indexes_list as
  select *
  from dictionary.indexes;
quit;
</pre>
<p>Indexes can be simple or combined. A simple index is created on one column; a combined (composite) index involves several columns. The main difference is that a combined index is more efficient than a simple one when the query filters or joins on several columns. SAS decides which index to use depending on the query code and the indexes available.</p>
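<p>As a small sketch, both kinds of index can be created with PROC SQL. The table work.payments and its columns are hypothetical. On a table this small SAS may well decide not to use an index at all; the MSGLEVEL=I option only makes it report its choice in the log.</p>
<pre class="brush: text">
/* Hypothetical table used only for illustration. */
data work.payments;
  input client_id pay_date :date9. amount;
  format pay_date date9.;
datalines;
1 01JAN2011 100
1 15FEB2011 250
2 03MAR2011 75
;
run;

proc sql;
  /* Simple index: in SAS its name must match the column name. */
  create index client_id on work.payments(client_id);
  /* Combined index on two columns, with a name of its own. */
  create index client_date on work.payments(client_id, pay_date);
quit;

/* Ask SAS to note in the log which index, if any, a step uses. */
options msglevel = i;

data work.one_client;
  set work.payments;
  where client_id = 1;
run;
</pre>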
<p><span id="more-261"></span></p>
<p>Indexes have advantages and drawbacks. They reduce query time, but it takes time and additional disk space to build and store them. Ideally, indexes would exist on every column that can appear in a query’s WHERE or JOIN clauses, but building all possible indexes can take too much time. One has to rely on experience and gut feeling to decide which columns to index.</p>
<h4>Index renewal</h4>
<p>The DWH is updated with new data every day, and old indexes become stale. That is why indexes are modified each time a table is updated. There are two ways of updating indexes:</p>
<ul>
<li>remove the index and calculate it again</li>
<li>update the existing index without removal</li>
</ul>
<p>The first way takes more time but results in better index quality. The second way only adds the new information to the index, which takes less time but results in worse index quality. The SAS &#8220;optimizer&#8221; can delete and recalculate the index automatically if it decides that its quality is too low.</p>
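<p>The two ways can be sketched in SAS roughly as follows. The tables work.payments and work.new_payments are hypothetical, and PROC DATASETS and PROC APPEND are just one possible choice of tools for each way.</p>
<pre class="brush: text">
/* Hypothetical base table with a simple index on client_id. */
data work.payments(index = (client_id));
  input client_id amount;
datalines;
1 100
2 75
;
run;

/* Hypothetical daily increment. */
data work.new_payments;
  input client_id amount;
datalines;
3 40
;
run;

/* Way 1: remove the index and build it again. */
proc datasets library = work nolist;
  modify payments;
  index delete client_id;
  index create client_id;
quit;

/* Way 2: add the rows in place; SAS updates the existing index as it appends. */
proc append base = work.payments data = work.new_payments;
run;
</pre>
<p>Note that rebuilding a table with a plain DATA step creates a new file without the index, so in that case the index has to be created again.</p>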
<h4>Compress option</h4>
<p>The SAS file format is very sparse: one can shrink a SAS table to about a tenth of its size with a RAR or ZIP archiver. SAS also has a built-in compression algorithm that is activated with the COMPRESS option:</p>
<pre class="brush: text">
data work.test(compress = yes);
  ...
run;
</pre>
<p>or in the libname statement:</p>
<pre class="brush: text">
libname example &#039;c:\dwh\data\example&#039; compress = yes;
</pre>
<h4>SPDE</h4>
<p>SPDE (Scalable Performance Data Engine) is a newer SAS library type. The old and default type is V9, but SPDE can process data faster. It does so by splitting SAS tables into several files and using several CPUs to process a table. SPDE does not require an additional license and comes with Base SAS starting from SAS 9.1. More about <a href="http://sasbi.net/spde-library/" >SPDE here</a>.</p>
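<p>A minimal sketch of an SPDE library definition is shown below. All paths are placeholders and the directories must already exist; spreading DATAPATH= over several disks is what lets the engine split a table into several files.</p>
<pre class="brush: text">
/* Placeholder paths: metadata, data partitions and indexes on separate disks. */
libname fast spde &#039;c:\dwh\data\spde_meta&#039;
  datapath = (&#039;d:\dwh\data\spde_data1&#039; &#039;e:\dwh\data\spde_data2&#039;)
  indexpath = (&#039;f:\dwh\data\spde_index&#039;);

/* Tables in the library are used like any other SAS table. */
data fast.test;
  set sashelp.class;
run;
</pre>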
<h4>Defragmentation</h4>
<p>One can imagine a hard disk as one long line. When a new file is saved, its beginning is written to any free segment of the line. When the file is too big to fit into the segment, it is split into several fragments. The more fragments, the more time it takes to read the file. When the DWH is updated, a lot of new files are written to the disk.</p>
<p>Defragmentation is a process that rewrites files to reduce the number of fragmented ones. It is probably the simplest and the most efficient DWH optimization technique.</p>
<h4>Optimizing hard disk usage</h4>
<p>The hard disks of a DWH server usually differ in their characteristics. The most important ones for the DWH are size and read/write speed. The fastest disk should be used for the WORK library, as it is the library users work with most intensively.</p>
<p>The slowest disks should hold the OS files, the DWH software (SAS) and the tables being copied from source systems, as network speed is usually lower than disk write speed.</p>
<p>One can also store the tables of one library on different disks. This is useful when the library grows bigger than any available disk:</p>
<pre class="brush: text">
libname example (&#039;c:\DWH\data\example&#039; &#039;d:\DWH\data\example&#039;);
libname sample (sasuser sashelp);
</pre>
<h4>Metrics</h4>
<p>This article started with the optimization methods, but an optimization project should start with defining the DWH metrics. Good examples are processor load and the volume of data read from and written to the DWH. Metrics are used to assess whether the optimization methods work. A lot of metrics are available in the Windows Performance Monitor: Start&nbsp;&rarr; Control Panel&nbsp;&rarr; Administrative Tools&nbsp;&rarr; Performance.</p>
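<p>Inside SAS itself, step timings written to the log can serve as a simple additional metric. The sketch below assumes nothing beyond Base SAS; the sorted table is only a stand-in for a real DWH job.</p>
<pre class="brush: text">
/* Write real time, CPU time and memory usage for every step to the log. */
options fullstimer;

/* Stand-in workload: compare its log timings before and after an optimization. */
proc sort data = sashelp.class out = work.class_sorted;
  by age;
run;
</pre>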
]]></content:encoded>
			<wfw:commentRss>http://sasbi.net/dwh-optimization/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

