Gender cleaning
Posted by Oleg Solovyev on Jul 5, 2011
Investigating the ABT table I’ve found anomaly in the gender column. There were only 5% of males and 95% of females in the sample. The expected ratio was 50%/50%. After comparing client’s names and gender I was sure that values of the gender are wrong.
I couldn’t simply delete the gender because it is often an important factor in the model. Thus I decided to replace the gender with a new column calculated using clients’ patronymics. The thing is that most Russian male patronymics are derived from father’s name by adding “ich” like “Ivanovich” and “Ilyich”. Most female patronymics end in “na” like “Ivanovna” and “Ilyinichna”.
Most of the rest patronymics belong to Turkic peoples. Their surnames end in “-ogli” meaning son or “-kizi” meaning daughter. Thus one can recalculate the gender using SAS the following way:
data test;
input patronymic $ 1-20;
length gender $10;
if prxmatch('/ich$|ogli$/', lowcase(trim(patronymic))) > 0 then gender = 'male';
else if prxmatch('/na$|kizi$/', lowcase(trim(patronymic))) > 0 then gender = 'female';
else gender = 'unknown';
datalines;
Ivanovich
Ilyich
Ivanovna
Ilyinichna
Hamzat-ogli
Suleyman-kizi
;
run;
The Oracle code:
drop table names;
create table names(patronymic varchar2(20));
insert into names values('Ivanovich');
insert into names values('Ilyich');
insert into names values(‘Ivanovna’);
insert into names values('Ilyinichna’);
insert into names values('Hamzat-ogli ');
insert into names values(‘Suleyman-kizi’);
select patronymic,
case when REGEXP_LIKE(lower(patronymic), 'ich$|ogli$') then 'male'
when REGEXP_LIKE(lower(patronymic), 'na$|kizi$') then 'female'
else 'unknown'
end as gender
from names;
This algorithm recalculated the gender for the 99% of the clients. Patronymics column contained missing for the rest 1% of the clients. But the algorithm can be improved to take into account names and family names.