This project uses data from the TalkingData Mobile User Demographics Kaggle competition.
TalkingData is a China-based mobile data platform. From the competition site:
> In this competition, Kagglers are challenged to build a model predicting users’ demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.
The data for the competition can be downloaded here: https://www.kaggle.com/c/talkingdata-mobile-user-demographics/data .
I have downloaded and read each file into R, giving each data.frame the same name as the file. Below are the first 3 records of each file.
head(app_events, 3)
| event_id | app_id | is_installed | is_active |
|---|---|---|---|
| 1924946 | 3.489720e+18 | 1 | 1 |
| 19620 | 7.723979e+18 | 1 | 1 |
| 53629 | 7.723979e+18 | 1 | 1 |
head(app_labels, 3)
| app_id | label_id |
|---|---|
| -2.600988e+18 | 2 |
| -2.600988e+18 | 4 |
| 4.214070e+18 | 5 |
head(events, 3)
| event_id | device_id | timestamp | longitude | latitude | region | date | hour | time |
|---|---|---|---|---|---|---|---|---|
| 1924946 | -7.013555e+18 | 2016-05-06 07:39:53 | 109.46 | 21.89 | NA | 2016-05-06 | 7 | morning |
| 19620 | 8.427965e+18 | 2016-05-07 17:38:53 | 117.35 | 25.34 | NA | 2016-05-07 | 17 | midday |
| 53629 | -6.506152e+18 | 2016-05-03 18:55:04 | 0.00 | 0.00 | NA | 2016-05-03 | 18 | evening |
head(gender_age_train, 3)
| device_id | gender | age | group |
|---|---|---|---|
| 8.427965e+18 | F | 23 | F23- |
| -3.293695e+18 | M | 40 | M39+ |
| 7.104535e+18 | M | 25 | M23-26 |
head(phone_brand_device_model, 3) # Chinese characters displayed in UTF-8 codes
| device_id | phone_brand | device_model |
|---|---|---|
| 8.427965e+18 | vivo | X5M |
| 1.186608e+18 | 2 | |
| 1.186608e+18 | | |
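The identifiers above print in scientific notation, which suggests they were read as doubles. A double represents integers exactly only up to 2^53 (about 9.0e15), so 19-digit IDs can collapse onto the same value and break joins. A minimal sketch of one way around this is to read the ID columns as text. The file contents and IDs below are made up for illustration.

```r
# Hypothetical 19-digit identifiers written to a temporary CSV (not real data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("device_id,gender,age",
             "8427965000000000001,F,23",
             "8427965000000000002,M,40"), csv_path)

# Default parsing converts the IDs to doubles, which cannot tell them apart
as_double <- read.csv(csv_path)

# Forcing the ID column to character keeps every digit
as_text <- read.csv(csv_path, colClasses = c(device_id = "character"))

n_distinct_double <- length(unique(as_double$device_id))
n_distinct_text   <- length(unique(as_text$device_id))
```

Character keys join correctly with `merge()` or dplyr joins. The `bit64` package's `integer64` type is another option if numeric IDs are needed.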
The number of records in each data.frame can be seen in the table below.
| Table | Records |
|---|---|
| app_events | 12,732,996 |
| app_labels | 459,943 |
| events | 3,252,950 |
| gender_age_train | 74,645 |
| phone_brand_device_model | 187,245 |
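A table like this can be produced directly from the data.frames. The sketch below assumes the tables are collected in a named list; the tiny frames here are stand-ins, not the competition data.

```r
# Stand-in data.frames; in practice these would be the tables read from the CSV files
tables <- list(
  app_events       = data.frame(event_id = 1:4),
  app_labels       = data.frame(app_id = 1:2),
  events           = data.frame(event_id = 1:3),
  gender_age_train = data.frame(device_id = 1:2)
)

# Count rows per table and format the counts with thousands separators
n_rows <- vapply(tables, nrow, integer(1))
record_counts <- data.frame(
  Table   = names(tables),
  Records = format(n_rows, big.mark = ","),
  row.names = NULL
)
```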
And below is the data schema provided on the Kaggle site.
For our purpose, it is more convenient to look at the data in one nice “tidy” table. I have joined the tables and created some new columns that help make sense of the data. You can see the data by clicking on “Data Table” in the banner above. Below is a data dictionary.
| Fields | Description |
|---|---|
| event_id | Unique mobile event identifier. The primary key for this table. |
| device_id | The mobile device identifier. Demographics are associated with this number. |
| longitude; latitude | Geolocation |
| region | Feature generated from the data. Most events in China take place in four cities: Beijing, Chengdu, Hong Kong, and Shanghai. |
| date; hour; time | Time |
| gender; age | Demographics |
| phone_brand; device_model | Mobile device brand information. |
| apps | Feature generated from the data. The number of active apps during the event. |
| Custom; Education; etc. | These columns are also features generated from the data. These are categories of apps that are active during the event. A value of 1 means that at least 1 app from this category is active. See the file generalizeCategories.R in the GitHub repository for the R code that generated these features. |
Note: The Chinese characters for the phone brand and device model are rendered as their UTF-8 code or sometimes not at all.
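To make the data dictionary concrete, here is a minimal sketch of how the tidy table could be assembled. It is not the code from generalizeCategories.R: the toy rows, the `label_categories` lookup, and the category names assigned to each label are assumptions made for illustration, and base R is used so the sketch has no package dependencies.

```r
# Toy versions of the competition tables (made-up rows, IDs kept as text)
events <- data.frame(event_id = 1:3, device_id = c("d1", "d2", "d1"),
                     hour = c(7, 17, 18))
app_events <- data.frame(event_id = c(1L, 1L, 2L, 3L, 3L, 3L),
                         app_id = c("a1", "a2", "a1", "a2", "a3", "a1"),
                         is_active = c(1, 0, 1, 1, 1, 0))
app_labels <- data.frame(app_id = c("a1", "a2", "a3"), label_id = c(10, 20, 20))
gender_age_train <- data.frame(device_id = c("d1", "d2"),
                               gender = c("M", "F"), age = c(40, 23))
phone_brand_device_model <- data.frame(device_id = c("d1", "d2"),
                                       phone_brand = "vivo", device_model = "X5M")

# Hypothetical lookup from label_id to a broad category
label_categories <- data.frame(label_id = c(10, 20),
                               category = c("Games", "Finance"))

# apps: number of active apps during each event
apps <- aggregate(is_active ~ event_id, data = app_events, FUN = sum)
names(apps)[2] <- "apps"

# One row per event, with demographics and device information attached
tidy <- merge(events, apps, by = "event_id", all.x = TRUE)
tidy <- merge(tidy, gender_age_train, by = "device_id", all.x = TRUE)
tidy <- merge(tidy, phone_brand_device_model, by = "device_id", all.x = TRUE)
tidy <- tidy[order(tidy$event_id), ]

# Category flags: 1 if at least one active app in the category, else 0
active <- app_events[app_events$is_active == 1, ]
active_cat <- merge(merge(active, app_labels, by = "app_id"),
                    label_categories, by = "label_id")
for (category in unique(label_categories$category)) {
  ids <- unique(active_cat$event_id[active_cat$category == category])
  tidy[[category]] <- as.integer(tidy$event_id %in% ids)
}
```

Left joins (`all.x = TRUE`) keep every event even when a device has no demographic or brand record, which leaves `NA` in those columns rather than dropping the row.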
The tidy data set can be visualized using the TalkingData Explorer found here: https://natebyers.shinyapps.io/TalkingData_Explorer/ .
Below are plots that visualize the distribution of gender and age.
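For readers who want to reproduce this kind of plot, a rough sketch with base R graphics follows. The `demo` data.frame is simulated with arbitrary parameters and only stands in for the `gender` and `age` columns of the tidy table.

```r
# Simulated stand-in for the gender and age columns (not competition data)
set.seed(1)
demo <- data.frame(
  gender = sample(c("F", "M"), 200, replace = TRUE, prob = c(0.25, 0.75)),
  age    = pmax(round(rnorm(200, mean = 31, sd = 9)), 10)
)

# Counts per gender as a bar chart
gender_counts <- table(demo$gender)
barplot(gender_counts, main = "Events by gender", xlab = "Gender", ylab = "Count")

# Overall age distribution, then age split by gender
hist(demo$age, main = "Age distribution", xlab = "Age")
boxplot(age ~ gender, data = demo, main = "Age by gender",
        xlab = "Gender", ylab = "Age")
```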



Random forest model for predicting gender.
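library(caret)
library(dplyr)  # provides %>%, arrange(), sample_n(), and select()

# Convert every character column to a factor so it is treated as categorical
for (i in names(full_data)) {
  if (is.character(full_data[[i]])) {
    full_data[[i]] <- as.factor(full_data[[i]])
  }
}

# Work on a random sample of 2,000 events to keep training time short
gender_subset <- full_data %>%
  arrange(gender) %>%
  sample_n(2000) %>%
  select(gender, region, hour, phone_brand, Games, Education, Finance)

# 2-fold cross-validation; "oneSE" picks the simplest model within one
# standard error of the best one
ctrl <- trainControl(method = "repeatedcv", number = 2, repeats = 1,
                     selectionFunction = "oneSE")

# Train on 80% of the sample and hold out the remaining 20% for testing
in_train <- createDataPartition(gender_subset$gender, p = .80, list = FALSE)
rf <- train(gender ~ ., data = gender_subset, method = "rf",
            metric = "Kappa", trControl = ctrl, subset = in_train)

# Score the held-out rows and compare predictions with the true labels
test <- gender_subset[-in_train, ]
test$pred <- predict(rf, test, "raw")
confusionMatrix(test$pred, test$gender)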
## Confusion Matrix and Statistics
##
## Reference
## Prediction F M
## F 10 19
## M 90 280
##
## Accuracy : 0.7268
## 95% CI : (0.6803, 0.77)
## No Information Rate : 0.7494
## P-Value [Acc > NIR] : 0.8634
##
## Kappa : 0.0477
## Mcnemar's Test P-Value : 2.017e-11
##
## Sensitivity : 0.10000
## Specificity : 0.93645
## Pos Pred Value : 0.34483
## Neg Pred Value : 0.75676
## Prevalence : 0.25063
## Detection Rate : 0.02506
## Detection Prevalence : 0.07268
## Balanced Accuracy : 0.51823
##
## 'Positive' Class : F
##
# A single example event to score with the fitted model
prd <- data.frame(region = "Hong Kong", hour = 12, phone_brand = "vivo",
                  Games = 1, Education = 1, Finance = 0)
predict(rf, prd)
## [1] M
## Levels: F M
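The summary statistics in the confusion matrix output can be recomputed by hand, which makes it easier to see why an accuracy of 0.73 is not a good result here: always predicting M would score 0.75. The sketch below uses only the four cell counts printed above. The variable names are mine, and the Kappa formula is the standard Cohen's kappa.

```r
# Cell counts copied from the confusion matrix (rows = prediction, cols = reference)
cm <- matrix(c(10, 90, 19, 280), nrow = 2,
             dimnames = list(Prediction = c("F", "M"), Reference = c("F", "M")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n             # share of correct predictions
nir         <- max(colSums(cm)) / n          # always guessing the majority class
sensitivity <- cm["F", "F"] / sum(cm[, "F"]) # share of actual F predicted as F
specificity <- cm["M", "M"] / sum(cm[, "M"]) # share of actual M predicted as M
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)
```

Only 10 of the 100 female users in the test set are identified, so the model is close to a majority-class guesser, and the Kappa near zero reflects that.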
Linear model for predicting age.
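# Random sample of 2,000 events for the age model
age_subset <- full_data %>%
  arrange(age) %>%
  sample_n(2000)

# 60/40 split into development and validation sets
split <- createDataPartition(age_subset$age, p = 0.6, list = FALSE)
dev <- age_subset[split, ] %>%
  select(age, region, hour, phone_brand, Games, Education, Finance)
val <- age_subset[-split, ] %>%
  select(age, region, hour, phone_brand, Games, Education, Finance)

# Fit a linear regression with 2-fold cross-validation.
# The model object shares its name with stats::lm(); calls to lm() still
# resolve to the function.
ctrl <- trainControl(method = "cv", number = 2)
lm <- train(age ~ ., data = dev, method = "lm", trControl = ctrl)
lm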
## Linear Regression
##
## 1202 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (2 fold)
## Summary of sample sizes: 602, 600
## Resampling results:
##
## RMSE Rsquared
## 9.129228 0.03066149
##
##
predict(lm, prd)
## 1
## 27.83763
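The `val` split is created above but never scored. A held-out check might look like the sketch below, which uses simulated data and plain `stats::lm()` so it runs on its own. The column names mirror the tidy table, but the values and effect sizes are invented. R-squared is computed as the squared correlation between predictions and observations, which I believe matches caret's default summary.

```r
# Simulated stand-in for the age data (invented values and effect sizes)
set.seed(42)
n <- 500
toy <- data.frame(hour  = sample(0:23, n, replace = TRUE),
                  Games = rbinom(n, 1, 0.3))
toy$age <- 30 + 2 * toy$Games + rnorm(n, sd = 9)

# 60/40 split into development and validation rows
train_idx <- sample(n, size = 0.6 * n)
dev <- toy[train_idx, ]
val <- toy[-train_idx, ]

# Fit on the development rows, predict the validation rows
fit  <- stats::lm(age ~ hour + Games, data = dev)
pred <- predict(fit, newdata = val)

# RMSE and R-squared on the held-out rows
rmse <- sqrt(mean((val$age - pred)^2))
rsq  <- cor(pred, val$age)^2

# Baseline: always predict the mean age of the development rows
baseline_rmse <- sqrt(mean((val$age - mean(dev$age))^2))
```

Comparing `rmse` with `baseline_rmse` shows how much the predictors add over a constant guess. With an R-squared as low as the 0.03 reported above, the two are expected to be close.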
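The models can be used to predict a user's gender and age, although both are weak on this 2,000-row sample: the gender model's Kappa is 0.048 and the age model's cross-validated R-squared is 0.03. Below is an application that gives an estimated age and gender based on the user inputs.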