This project uses data from the TalkingData Mobile User Demographics Kaggle competition.

TalkingData is a China-based mobile data platform. From the competition site:

In this competition, Kagglers are challenged to build a model predicting users' demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.

The Data

The data for the competition can be downloaded here: https://www.kaggle.com/c/talkingdata-mobile-user-demographics/data .

I have downloaded and read each file into R, giving each data.frame the same name as the file. Below are the first 3 records of each file.
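As a rough sketch of the loading step, the snippet below reads one CSV with base R. The file contents and the IDs are made up so the example runs on its own, and reading `device_id` as character is an assumption, not necessarily how the files were read for this project. The motivation is that the identifiers are on the order of 1e18, which is too large for a double to hold exactly.

```r
# A double cannot distinguish neighbouring integers near 1e18.
collides <- (1e18 + 1) == 1e18

# Write a tiny, made-up stand-in for gender_age_train.csv to a temp file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("device_id,gender,age,group",
             "1234567890123456789,F,23,F23-",
             "1234567890123456790,M,40,M39+"),
           csv_path)

# Read the ID column as text so every digit is preserved.
gender_age_train <- read.csv(csv_path,
                             colClasses = c(device_id = "character"),
                             stringsAsFactors = FALSE)

head(gender_age_train, 3)
```

With character IDs, joins on `device_id` compare the full identifier, not a rounded value.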

head(app_events, 3)

event_id	app_id	is_installed	is_active
1924946	3.489720e+18	1	1
19620	7.723979e+18	1	1
53629	7.723979e+18	1	1

head(app_labels, 3)

app_id	label_id
-2.600988e+18	2
-2.600988e+18	4
4.214070e+18	5

head(events, 3)

event_id	device_id	timestamp	longitude	latitude	region	date	hour	time
1924946	-7.013555e+18	2016-05-06 07:39:53	109.46	21.89	NA	2016-05-06	7	morning
19620	8.427965e+18	2016-05-07 17:38:53	117.35	25.34	NA	2016-05-07	17	midday
53629	-6.506152e+18	2016-05-03 18:55:04	0.00	0.00	NA	2016-05-03	18	evening

head(gender_age_train, 3)

device_id	gender	age	group
8.427965e+18	F	23	F23-
-3.293695e+18	M	40	M39+
7.104535e+18	M	25	M23-26

head(phone_brand_device_model, 3) # Chinese characters displayed in UTF-8 codes

device_id	phone_brand	device_model
8.427965e+18	vivo	X5M
1.186608e+18		2
1.186608e+18

The number of records in each data.frame can be seen in the table below.

Table	Records
app_events	12,732,996
app_labels	459,943
events	3,252,950
gender_age_train	74,645
phone_brand_device_model	187,245
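A table like the one above can be produced by collecting the data.frames in a named list and counting rows. This is a minimal sketch with tiny placeholder data.frames standing in for the real tables; the object name `record_counts` is hypothetical.

```r
# Placeholder data.frames; in the project these would be the full tables.
tables <- list(
  app_events       = data.frame(event_id = 1:4),
  gender_age_train = data.frame(device_id = c("a", "b"))
)

# One row per table with its record count.
record_counts <- data.frame(
  Table   = names(tables),
  Records = vapply(tables, nrow, integer(1)),
  row.names = NULL
)

# Add thousands separators for display only.
print(transform(record_counts,
                Records = format(Records, big.mark = ",", trim = TRUE)))
```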

And below is the data schema provided on the Kaggle site.

Tidy Data

For our purpose, it is more convenient to look at the data in one nice “tidy” table. I have joined the tables and created some new columns that help make sense of the data. You can see the data by clicking on “Data Table” in the banner above. Below is a data dictionary.

Fields	Description
event_id	Unique mobile event identifier. The primary key for this table.
device_id	The mobile device identifier. Demographics are associated with this number.
longitude; latitude	Geolocation
region	Feature generated from the data. Most events in China take place in four cities: Beijing, Chengdu, Hong Kong, and Shanghai.
date; hour; time	Time
gender; age	Demographics
phone_brand; device_model	Mobile device brand information.
apps	Feature generated from the data. The number of active apps during the event.
Custom; Education; etc.	These columns are also features generated from the data. These are categories of apps that are active during the event. A value of 1 means that at least 1 app from this category is active. See the file generalizeCategories.R in the GitHub repository for the R code that generated these features.

Note: The Chinese characters for the phone brand and device model are rendered as their UTF-8 code or sometimes not at all.
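The joins behind a table like this could look roughly like the sketch below. It uses toy stand-ins for three of the competition tables and only derives `date`, `hour`, and `apps`. The join types and the use of `substr()` on the timestamp are assumptions for illustration; the project's actual code (including the region, time-of-day, and category features) may differ.

```r
library(dplyr)

# Toy stand-ins for the competition tables (made-up values).
events <- data.frame(
  event_id  = c(1, 2, 3),
  device_id = c("a", "b", "a"),
  timestamp = c("2016-05-06 07:39:53", "2016-05-07 17:38:53",
                "2016-05-03 18:55:04"),
  stringsAsFactors = FALSE
)
app_events <- data.frame(
  event_id  = c(1, 1, 2, 3),
  app_id    = c("x", "y", "x", "z"),
  is_active = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)
gender_age_train <- data.frame(
  device_id = c("a", "b"),
  gender    = c("F", "M"),
  age       = c(23, 40),
  stringsAsFactors = FALSE
)

# Number of active apps during each event.
active_apps <- app_events %>%
  group_by(event_id) %>%
  summarise(apps = sum(is_active == 1))

# One row per event, with time fields, app count, and demographics.
tidy <- events %>%
  mutate(date = as.Date(substr(timestamp, 1, 10)),
         hour = as.integer(substr(timestamp, 12, 13))) %>%
  left_join(active_apps, by = "event_id") %>%
  inner_join(gender_age_train, by = "device_id")

tidy
```

The `inner_join()` keeps only events whose device has known demographics, which is one reasonable choice when the table is meant for training.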

Data Exploration

The tidy data set can be visualized using the TalkingData Explorer found here: https://natebyers.shinyapps.io/TalkingData_Explorer/ .

Below are plots that visualize the distribution of gender and age.
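Plots of this kind can be drawn with base R graphics. The sketch below uses simulated demographics (the proportions and the age distribution are made up), so it only illustrates the calls, not the actual figures.

```r
set.seed(1)

# Simulated demographics; not the competition data.
demo <- data.frame(
  gender = sample(c("F", "M"), 200, replace = TRUE, prob = c(0.25, 0.75)),
  age    = round(rnorm(200, mean = 31, sd = 9))
)

# Counts per gender.
gender_counts <- table(demo$gender)
barplot(gender_counts, main = "Gender", ylab = "Count")

# Age distribution overall and by gender.
hist(demo$age, breaks = 20, main = "Age", xlab = "Age (years)")
boxplot(age ~ gender, data = demo, main = "Age by gender", ylab = "Age (years)")
```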

Prediction

Random forest model for predicting gender.

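library(caret)
library(dplyr)  # provides %>%, arrange(), sample_n(), and select()

# Convert character columns to factors so they are treated as categorical.
for(i in names(full_data)){
  if(is.character(full_data[[i]])){
    full_data[[i]] <- as.factor(full_data[[i]])
  }
}

# Random sample of 2,000 rows with the response and six predictors.
gender_subset <- full_data %>%
  arrange(gender) %>%
  sample_n(2000) %>%
  select(gender, region, hour, phone_brand, Games, Education, Finance)

# 2-fold cross-validation; choose the simplest model within one SE of the best.
ctrl <- trainControl(method="repeatedcv", number=2, repeats=1,
                     selectionFunction = "oneSE")
# Use 80% of the rows for training and hold out the rest.
in_train <- createDataPartition(gender_subset$gender, p=.80, list=FALSE)

rf <- train(gender ~ ., data = gender_subset, method = "rf",
            metric = "Kappa", trControl = ctrl, subset = in_train)

# Evaluate on the held-out rows.
test <- gender_subset[-in_train,]
test$pred <- predict(rf, test, type = "raw")
confusionMatrix(test$pred, test$gender)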

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   F   M
##          F  10  19
##          M  90 280
##                                         
##                Accuracy : 0.7268        
##                  95% CI : (0.6803, 0.77)
##     No Information Rate : 0.7494        
##     P-Value [Acc > NIR] : 0.8634        
##                                         
##                   Kappa : 0.0477        
##  Mcnemar's Test P-Value : 2.017e-11     
##                                         
##             Sensitivity : 0.10000       
##             Specificity : 0.93645       
##          Pos Pred Value : 0.34483       
##          Neg Pred Value : 0.75676       
##              Prevalence : 0.25063       
##          Detection Rate : 0.02506       
##    Detection Prevalence : 0.07268       
##       Balanced Accuracy : 0.51823       
##                                         
##        'Positive' Class : F             
##
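To make the statistics above easier to read, the sketch below recomputes several of them directly from the four cell counts in the confusion matrix. The variable names are chosen here for illustration; the formulas are the standard definitions with F as the positive class.

```r
# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(10, 90, 19, 280), nrow = 2,
             dimnames = list(Prediction = c("F", "M"),
                             Reference  = c("F", "M")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n             # share of correct predictions
nir         <- max(colSums(cm)) / n          # always predicting the majority class
sensitivity <- cm["F", "F"] / sum(cm[, "F"]) # share of actual F predicted as F
specificity <- cm["M", "M"] / sum(cm[, "M"]) # share of actual M predicted as M

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

Since the accuracy is below the no-information rate and kappa is close to zero, the model does little better than always predicting M; only 10 of the 100 female test cases are identified.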

prd <- data.frame(region = "Hong Kong", hour = 12, phone_brand = "vivo",
                  Games = 1, Education = 1, Finance = 0)
predict(rf, prd)

## [1] M
## Levels: F M

Linear model for predicting age.

# Random sample of 2,000 rows.
age_subset <- full_data %>%
  arrange(age) %>%
  sample_n(2000)

# 60/40 split into development and validation sets.
split <- createDataPartition(age_subset$age, p = 0.6, list = FALSE)

dev <- age_subset[split,] %>%
  select(age, region, hour, phone_brand, Games, Education,
         Finance)

val <- age_subset[-split,] %>%
  select(age, region, hour, phone_brand, Games, Education,
         Finance)

# 2-fold cross-validation on the development set.
ctrl <- trainControl(method = "cv", number = 2)

lm <- train(age ~ ., data = dev, method = "lm", trControl = ctrl)

lm

## Linear Regression 
## 
## 1202 samples
##    6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold) 
## Summary of sample sizes: 602, 600 
## Resampling results:
## 
##   RMSE      Rsquared  
##   9.129228  0.03066149
## 
##

predict(lm, prd)

##        1 
## 27.83763
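The code above creates a validation set `val` but reports only cross-validated results. One way to check a linear model on held-out rows is sketched below, using simulated data and plain `stats::lm()` so it runs without the project data; the predictors, coefficients, and noise level are all made up.

```r
set.seed(42)

# Simulated data with weak predictors and noisy ages (illustrative only).
n <- 500
toy <- data.frame(
  hour    = sample(0:23, n, replace = TRUE),
  Games   = rbinom(n, 1, 0.3),
  Finance = rbinom(n, 1, 0.2)
)
toy$age <- 30 + 0.1 * toy$hour - 2 * toy$Games + 3 * toy$Finance +
  rnorm(n, sd = 9)

# 60/40 split into development and validation rows.
idx <- sample(n, size = 0.6 * n)
dev <- toy[idx, ]
val <- toy[-idx, ]

fit  <- stats::lm(age ~ ., data = dev)
pred <- predict(fit, newdata = val)

# Held-out error, plus a baseline that always predicts the mean development age.
rmse          <- sqrt(mean((val$age - pred)^2))
rsq           <- cor(val$age, pred)^2
baseline_rmse <- sqrt(mean((val$age - mean(dev$age))^2))

c(rmse = rmse, rsq = rsq, baseline_rmse = baseline_rmse)
```

Comparing `rmse` with `baseline_rmse` shows how much the predictors add over simply guessing the average age.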

Application

The models can be used to predict a user's gender and age. Below is an application that gives an estimated age and gender based on the user's inputs.