Social Network Analysis
Posted by Oleg Solovyev on Aug 14, 2011
One of the newest fields in data mining is Social Network Analysis (SNA). The task is to identify your friends (the first circle), then the friends of your friends (the second circle), and so on. Mathematicians call this building a graph made of nodes (the people) and edges (the ties between people).
For example, in telecom, graphs can be built from phone call data. The people you call are your first circle: relatives, colleagues, or friends. You value those people and listen to their opinions. If one of your friends uses mobile internet, the telecom operator can offer this service to you with a high probability that you will buy it.
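The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the call records, the names, and the helper functions `build_graph` and `circles` are all hypothetical, and a tie is assumed to exist after a single call in either direction.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee) pairs.
calls = [
    ("anna", "boris"),
    ("anna", "clara"),
    ("boris", "dmitri"),
    ("clara", "elena"),
    ("elena", "fedor"),
]

def build_graph(call_records):
    """Undirected graph: an edge joins two people with at least one call."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        if caller != callee:
            graph[caller].add(callee)
            graph[callee].add(caller)
    return graph

def circles(graph, person, depth=2):
    """Return [first_circle, second_circle, ...] by breadth-first expansion."""
    seen = {person}
    frontier = {person}
    result = []
    for _ in range(depth):
        reached = set()
        for p in frontier:
            reached |= graph.get(p, set())
        reached -= seen          # keep only people not met in a closer circle
        result.append(reached)
        seen |= reached
        frontier = reached
    return result

graph = build_graph(calls)
first, second = circles(graph, "anna")

# Assumed set of current mobile internet users (illustrative only).
mobile_internet_users = {"boris"}
# Offer candidates: non-users with at least one user in their first circle.
targets = {p for p in graph
           if p not in mobile_internet_users and graph[p] & mobile_internet_users}
```

Here `first` holds the people Anna called directly, `second` holds their contacts, and `targets` lists the non-users who have a friend already using the service. A real telecom graph would likely weight edges by call frequency or duration rather than treat every call as an equal tie.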
Gender cleaning
Posted by Oleg Solovyev on Jul 5, 2011
While investigating the ABT table, I found an anomaly in the gender column. Only 5% of the sample were male and 95% were female, whereas the expected ratio was about 50%/50%. After comparing clients’ names with their recorded gender, I was sure that the gender values were wrong.
I couldn’t simply delete the gender column because gender is often an important factor in the model. So I decided to replace it with a new column calculated from the clients’ patronymics. Most Russian male patronymics are derived from the father’s name and end in “ich”, like “Ivanovich” and “Ilyich”. Most female patronymics end in “na”, like “Ivanovna” and “Ilyinichna”.
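A minimal sketch of such a rule in Python follows. The function name `gender_from_patronymic` and the "M"/"F" codes are hypothetical, and the Cyrillic suffixes are an assumption added for the case where names are stored untransliterated. Because the rule covers most patronymics rather than all of them, anything that matches neither suffix is left as unknown instead of being guessed.

```python
# Suffixes from the rule above, plus their assumed Cyrillic spellings.
MALE_SUFFIXES = ("ich", "ич")
FEMALE_SUFFIXES = ("na", "на")

def gender_from_patronymic(patronymic):
    """Return "M", "F", or None when the patronymic matches neither rule."""
    if not patronymic:
        return None
    p = patronymic.strip().lower()
    if p.endswith(MALE_SUFFIXES):
        return "M"
    if p.endswith(FEMALE_SUFFIXES):
        return "F"
    return None

# Recompute the column for a few sample clients.
patronymics = ["Ivanovich", "Ilyinichna", "Ilyich", "Ivanovna", ""]
genders = [gender_from_patronymic(p) for p in patronymics]
```

After recomputing the column, it is worth checking the new male/female split and the share of unknown values before feeding the column into a model.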
Most important var
Posted by Oleg Solovyev on Mar 15, 2011
They say that Business and IT speak different languages. I would add a third group: Statisticians. All three have trouble communicating with each other, which slows down any joint project.
I once worked on optimizing a marketing campaign. The task was to find the group of customers most likely to buy a product. The problem involved about fifty variables, and Business kept pressing me to find the most important one. They thought there was a single variable that could, on its own, split customers into groups with the highest and lowest response rates.

I told them that models can include dozens of variables. The importance of each one can depend on the presence of other variables (multicollinearity). Moreover, each modeling method has its own algorithm for picking the “best” ones. But I wasn’t able to convince the Business. They kept looking for the most important variable.
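The point about multicollinearity can be illustrated with a toy example in Python. The data are invented for the illustration: `x1` and `x2` are correlated, and the response `y` depends only on `x2`. Ordinary least squares stands in for whatever model is actually used; the helper `cov_sum` is hypothetical.

```python
# Made-up data: x1 and x2 are correlated, y is driven by x2 alone.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 3, 2, 5, 4, 6]
y = [2 * v for v in x2]

def cov_sum(a, b):
    """Sum of products of deviations from the mean (unnormalized covariance)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))

# Model 1: y regressed on x1 alone. x1 looks like a strong predictor.
slope_x1_alone = cov_sum(x1, y) / cov_sum(x1, x1)

# Model 2: y regressed on x1 and x2 together (2x2 normal equations).
s11, s22, s12 = cov_sum(x1, x1), cov_sum(x2, x2), cov_sum(x1, x2)
s1y, s2y = cov_sum(x1, y), cov_sum(x2, y)
det = s11 * s22 - s12 * s12
coef_x1_joint = (s22 * s1y - s12 * s2y) / det
coef_x2_joint = (s11 * s2y - s12 * s1y) / det

print(f"x1 alone:  slope = {slope_x1_alone:.3f}")
print(f"x1 and x2: coefficients = {coef_x1_joint:.3f}, {coef_x2_joint:.3f}")
```

On its own, `x1` gets a sizeable slope, yet once `x2` enters the model the coefficient of `x1` drops to zero. Whether `x1` counts as "important" depends entirely on which other variables are present, which is why a single most important variable is hard to name.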
