Thursday, 13 April 2017

Introduction to H2O with R

H2O is a scalable, open-source machine learning framework with interfaces in Python, R, Java, Scala and C++. It sits on top of other major ML frameworks (MXNet, Caffe, TensorFlow, etc.) and adds a layer of abstraction that unifies and simplifies the API for client/consumer applications. H2O can run in standalone mode, on Hadoop, or within a Spark cluster.

Prerequisites

RStudio
Installed package: R interface for H2O

Installation in RStudio:

install.packages("h2o")

Launching

To load the h2o package and its namespace:

library(h2o)

To start an H2O instance on localhost, listening on port 54321, and connect to it:

h2o.init()

Connection successful!

R is connected to the H2O cluster:
H2O cluster uptime: 3 days 6 hours
H2O cluster version: 3.10.3.6
H2O cluster version age: 1 month and 19 days
H2O cluster name: H2O_started_from_R_bojan_lsr768
H2O cluster total nodes: 1
H2O cluster total memory: 3.46 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
R Version: R version 3.2.3 (2015-12-10)

By default, this command starts H2O on at most 2 CPUs. To use all CPUs on the system, set the nthreads argument to -1:

h2o.init(nthreads = -1)
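A slightly fuller start-up call might look like the sketch below. The max_mem_size value of "2g" is an arbitrary assumption chosen for illustration, not a setting taken from this post; adjust it to the machine at hand.

```r
library(h2o)

# Start (or connect to) a local H2O instance using all available cores.
# max_mem_size = "2g" is an illustrative value, not a recommendation.
h2o.init(nthreads = -1, max_mem_size = "2g")

# Check that the cluster is reachable and print its status.
cluster_up <- h2o.clusterIsUp()
h2o.clusterInfo()
```

When the session is finished, h2o.shutdown(prompt = FALSE) stops the local instance.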

Importing Data

To import data from a file into the H2O cloud, we can use the h2o.importFile or h2o.uploadFile function.

If the file resides on the server, we have to use h2o.importFile and specify the file's absolute path (on the server):

frame <- h2o.importFile(file_absolute_path)

This file can be, e.g., a CSV (comma-separated values) file.
The return value is an instance of the H2OFrame class, which represents a table (2D array).
If the CSV file has no header row with column names, H2O automatically names the columns C1, C2, ...

If we want to push a file from the client onto the server, we have to use h2o.uploadFile and specify the file's absolute path (on the client):

h2o.uploadFile(file_absolute_path)
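The two functions can be compared on a throwaway file. This is a minimal sketch assuming a local H2O instance, where client and server are the same machine, so both calls can read the same path; the CSV contents and the variable names are made up for illustration.

```r
library(h2o)
h2o.init()

# Write a tiny CSV with a header row to a temporary file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(Height = c(150, 160, 170), Weight = c(50, 60, 70)),
          csv_path, row.names = FALSE)
csv_path <- normalizePath(csv_path, winslash = "/")

# importFile: the path is resolved on the machine that runs H2O.
imported <- h2o.importFile(csv_path)

# uploadFile: the path is resolved on the R client and the file is pushed to H2O.
uploaded <- h2o.uploadFile(csv_path)

dim(imported)        # number of rows and columns
colnames(uploaded)   # column names taken from the header row
```

With a remote cluster the two paths would generally differ, which is the reason both functions exist.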

Data Exploration

To get a character vector containing the column names of the H2OFrame object:

h2o.colnames(frame)

[1] "Creditability" "Account Balance" "Duration of Credit (month)"
[4] "Payment Status of Previous Credit" "Purpose" "Credit Amount"
[7] "Value Savings/Stocks" "Length of current employment" "Instalment per cent"
[10] "Sex & Marital Status" "Guarantors" "Duration in Current address"
[13] "Most valuable available asset" "Age (years)" "Concurrent Credits"
[16] "Type of apartment" "No of Credits at this Bank" "Occupation"
[19] "No of dependents" "Telephone" "Foreign Worker"

To print the first 6 rows of the H2OFrame object, we can use h2o.head:

h2o.head(frame)

To get a detailed report on each column (type, number of missing values, etc.), use h2o.describe:

h2o.describe(frame)

  Label                             Type Missing Zeros PosInf NegInf Min   Max              Mean             Sigma Cardinality
1 Creditability                     enum       4   300      0      0   0     1 0.698795180722892 0.459011997978603           2
2 Account Balance                   enum       0   274      0      0   0     3                                               4
3 Duration of Credit (month)        int        0     0      0      0   4    72            20.903  12.0588144527564
4 Payment Status of Previous Credit enum       0    40      0      0   0     4                                               5
5 Purpose                           enum       0   234      0      0   0     9                                              10
6 Credit Amount                     int        0     0      0      0 250 18424          3271.248  2822.75175989565
7 Value Savings/Stocks              enum       0   603      0      0   0     4                                               5
...

h2o.summary prints information for each column. It treats factor (enum) columns differently from columns of other types. For factor columns it prints how many times each enum value occurs and how many values are missing ("NA"). For other columns it shows the minimum, maximum, median, mean, and 1st and 3rd quartiles:

h2o.summary(frame)

Creditability Account Balance Duration of Credit (month) Payment Status of Previous Credit Purpose Credit Amount
1 :696        4:394           Min.   : 4.0               2:530                             3:280   Min.   :  250
0 :300        1:274           1st Qu.:12.0               4:293                             0:234   1st Qu.: 1359
NA:  4        2:269           Median :18.0               3: 88                             2:181   Median : 2304
              3: 63           Mean   :20.9               1: 49                             1:103   Mean   : 3271
                              3rd Qu.:24.0               0: 40                             9: 97   3rd Qu.: 3958
                              Max.   :72.0                                                 6: 50   Max.   :18424
...

h2o.str displays the structure of an H2OFrame object:

h2o.str(frame)

Class 'H2OFrame'
- attr(*, "op")= chr ":="
- attr(*, "eval")= logi TRUE
- attr(*, "id")= chr "RTMP_sid_a051_45"
- attr(*, "nrow")= int 1000
- attr(*, "ncol")= int 21
- attr(*, "types")=List of 21
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "enum"
..$ : chr "int"
- attr(*, "data")='data.frame': 10 obs. of 21 variables:
..$ Creditability : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
..$ Account Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2
..$ Duration of Credit (month) : num 18 9 12 12 12 10 8 6 18 24
..$ Payment Status of Previous Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3
..$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4
..$ Credit Amount : num 1049 2799 841 2122 2171 ...
..$ Value Savings/Stocks : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 3
..$ Length of current employment : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 3 3 2 4 2 1 1
..$ Instalment per cent : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 4 1 1 2 4 1
..$ Sex & Marital Status : Factor w/ 4 levels "1","2","3","4": 2 3 2 3 3 3 3 3 2 2
..$ Guarantors : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1
..$ Duration in Current address : Factor w/ 4 levels "1","2","3","4": 4 2 4 2 4 3 4 4 4 4
..$ Most valuable available asset : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 3 4
..$ Age (years) : num 21 36 23 39 38 48 39 40 65 23
..$ Concurrent Credits : Factor w/ 3 levels "1","2","3": 3 3 3 3 1 3 3 3 3 3
..$ Type of apartment : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 2 2 2 1
..$ No of Credits at this Bank : Factor w/ 4 levels "1","2","3","4": 1 2 1 2 2 2 2 1 2 1
..$ Occupation : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 2 2 2 1 1
..$ No of dependents : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 1
..$ Telephone : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1
..$ Foreign Worker : num 1 1 1 2 2 2 2 2 1 1

To draw a histogram of values of some column:

h2o.hist(data[, "Height"])

We can also use dollar notation to specify the desired column:

h2o.hist(data$Height)

This draws one vertical bar per sub-range of the values in column "Height", showing how often values fall into that sub-range. By default the number of sub-ranges is chosen by Sturges' rule; passing breaks = 10 divides the range of values into 10 equal sub-ranges. Instead of specifying the column name, we can specify the column number:

h2o.hist(data[, 3])
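The binning behind a histogram with 10 equal-width sub-ranges can be sketched in base R, without a cluster. The sample values below are synthetic stand-ins for a "Height" column, and the code only illustrates the idea; H2O's own implementation may handle bin edges differently.

```r
# Hypothetical sample standing in for the "Height" column.
set.seed(1)
height <- rnorm(1000, mean = 170, sd = 10)

# Split the observed range into 10 equal-width sub-ranges.
n_bins <- 10
lo <- min(height)
width <- (max(height) - lo) / n_bins

# Map every value to the index of its sub-range (the maximum goes into the last one).
bin_index <- pmin(floor((height - lo) / width) + 1, n_bins)

# Count the values per sub-range; these counts are the bar heights.
counts <- tabulate(bin_index, nbins = n_bins)
print(counts)
```

Calling barplot(counts) on the result would draw the corresponding bars locally.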

Data Manipulation

A factor column is one whose possible values belong to a finite set of predefined values (like an enum type in some programming languages). If we want to convert column i of an H2OFrame data set (which is of type, e.g., int) into enum, we can use h2o.asfactor:

data[, i] <- h2o.asfactor(data[, i])
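The effect of the conversion can be checked with h2o.isfactor and h2o.levels. The sketch below uses a small made-up frame; the column names grade and score are hypothetical.

```r
library(h2o)
h2o.init()

# Hypothetical frame: "grade" holds integer codes that are really categories.
data <- as.h2o(data.frame(grade = c(1L, 2L, 3L, 2L, 1L),
                          score = c(0.5, 0.7, 0.9, 0.6, 0.4)))

i <- 1                                # index of the column to convert
before <- h2o.isfactor(data[, i])     # is the column an enum before conversion?
data[, i] <- h2o.asfactor(data[, i])
after <- h2o.isfactor(data[, i])      # and after conversion?

h2o.levels(data[, i])                 # the finite set of category labels
```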

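To split a single data set into multiple smaller ones, use h2o.splitFrame(frame, ratios, destination_frames, seed). frame is the source data set (an H2OFrame object). ratios is a scalar or a vector of fractions, one per part; if it is a scalar, it denotes the fraction of rows in the first part and the rest goes to the second part; if it is a vector, the sum of its elements must be less than 1 and the remainder forms the last part. seed is the seed for the random number generator, which makes the split reproducible.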

frame.split = h2o.splitFrame(frame.hex, 0.7)

frame.split = h2o.splitFrame(frame.hex, ratios = c(0.2, 0.5))

The result is a list of H2OFrame objects. In the session below, credit_samples holds such a result:

> typeof(credit_samples)
[1] "list"
> typeof(credit_samples[1])
[1] "list"
> typeof(credit_samples[[1]])
[1] "environment"

To extract an H2OFrame object, use double square bracket notation (single brackets return a one-element list):

frame.training_set <- frame.split[[1]]
frame.test_set <- frame.split[[2]]
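A split with two ratios can be inspected as follows. This is a sketch on a synthetic 1000-row frame (the columns id and value are made up); the seed value 42 is arbitrary.

```r
library(h2o)
h2o.init()

# Hypothetical 1000-row frame used only to illustrate the split.
frame <- as.h2o(data.frame(id = 1:1000, value = rnorm(1000)))

# Two ratios yield three parts: roughly 20%, roughly 50%, and the remainder.
# Rows are assigned at random, so the part sizes are approximate.
parts <- h2o.splitFrame(frame, ratios = c(0.2, 0.5), seed = 42)

part_sizes <- sapply(parts, nrow)
print(part_sizes)
```

Every row lands in exactly one part, so the sizes always add up to the number of rows in the source frame.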

To create a new frame in which rows are grouped by the values in a specific column, we can use h2o.group_by (similar to SQL's GROUP BY):

h2o.group_by(frame, by="Creditability", nrow("Creditability"))

  Creditability nrow_Creditability
1                                4
2             0                300
3             1                696

h2o.group_by's arguments are: the original frame, the column whose values are used for grouping, and an aggregate function that maps the rows of each group to a single aggregate value, one per group.
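For a small column pulled into R, the same row count per group can be reproduced in base R as a sanity check. The values below are made up; the unlabeled first row in the H2O output above appears to be the group of missing values, which is the assumption this sketch mirrors.

```r
# Hypothetical local copy of the grouping column, including one missing value.
creditability <- factor(c("1", "0", "1", NA, "1", "0"))

# Count rows per group; useNA = "ifany" keeps missing values as their own group.
group_counts <- table(creditability, useNA = "ifany")
print(group_counts)
```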

To calculate the natural logarithm of the values in a specific column of an H2OFrame object, we can use h2o.log. The output is a new column vector with the same number of elements as the source vector:

h2o.log(data[, "Velocity"])

log(Velocity)
1 6.955593
2 7.937017
3 6.734592
4 7.660114
5 7.682943
6 7.714677

[1000 rows x 1 column]

Machine Learning Algorithms

Generalized Linear Model

For a Generalized Linear Model use h2o.glm. Its arguments are:
y - dependent variable. This is a string, the name of the response column in the frame.
x - predictors (independent variables). This is a vector of strings, each of which is the name of a predictor column in the frame.
training_frame - training data set; an H2OFrame object that represents the table containing the columns mentioned above.
family - the response's distribution family, a member of the exponential family. Supported values are: "gaussian", "poisson", "binomial", "multinomial", "gamma", "tweedie".

model <- h2o.glm(y = "VOL", x = c("AGE", "RACE", "PSA", "GLEASON"), training_frame = frame, model_id = "glm_model1", family = "binomial")

|==============================================================================================================================| 100%
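The arguments listed above can be tried on synthetic data. In this sketch the data frame, its columns x1, x2 and y, and the name glm_sketch are all made up for illustration; the response is built so that it depends on x1 only.

```r
library(h2o)
h2o.init()

# Hypothetical data: a binary response "y" driven by "x1" plus noise.
set.seed(1)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))
train <- as.h2o(df)

# family = "binomial" expects a two-level categorical response column.
glm_sketch <- h2o.glm(y = "y", x = c("x1", "x2"),
                      training_frame = train, family = "binomial")

glm_coefficients <- h2o.coef(glm_sketch)   # named vector of fitted coefficients
print(glm_coefficients)
```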

The return value is a GLM model, here an object of type H2OBinomialModel:

summary(model)

Model Details:
==============

H2OBinomialModel: glm
Model Key: glm_model1
GLM Model: summary
family link regularization number_of_predictors_total number_of_active_predictors
1 binomial logit Elastic Net (alpha = 0.5, lambda = 0.02103 ) 71 24
number_of_iterations training_frame
1 5 RTMP_sid_be14_25

H2OBinomialMetrics: glm
** Reported on training data. **

MSE: 0.1604356
RMSE: 0.4005442
LogLoss: 0.4897464
Mean Per-Class Error: 0.275817
AUC: 0.8095359
Gini: 0.6190719
R^2: 0.2323731
Null Deviance: 736.5862
Residual Deviance: 592.5932
AIC: 642.5932

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 104 76 0.422222 =76/180
1 55 370 0.129412 =55/425
Totals 159 446 0.216529 =131/605

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.601181 0.849598 266
2 max f2 0.412327 0.926251 373
3 max f0point5 0.671558 0.848516 221
4 max accuracy 0.605270 0.783471 264
5 max precision 0.947226 1.000000 0
6 max recall 0.258939 1.000000 394
7 max specificity 0.947226 1.000000 0
8 max absolute_mcc 0.671558 0.470868 221
9 max min_per_class_accuracy 0.683051 0.744444 213
10 max mean_per_class_accuracy 0.671558 0.750196 221

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Scoring History:
timestamp duration iteration negative_log_likelihood objective
1 2017-04-13 08:03:21 0.000 sec 0 368.29309 0.60575
2 2017-04-13 08:03:21 0.007 sec 1 308.02649 0.54301
3 2017-04-13 08:03:21 0.010 sec 2 305.03591 0.54149
4 2017-04-13 08:03:21 0.013 sec 3 304.82054 0.54148
5 2017-04-13 08:03:21 0.023 sec 4 296.50608 0.53809
6 2017-04-13 08:03:21 0.027 sec 5 296.29658 0.53809

Variable Importances: (Extract with `h2o.varimp`)
=================================================

Standardized Coefficient Magnitudes: standardized coefficient magnitudes
names coefficients sign
1 Account Balance.4 0.669816 POS
2 Account Balance.1 0.419182 NEG
3 Duration of Credit (month) 0.294272 NEG
4 Purpose.3 0.273682 POS
5 Payment Status of Previous Credit.4 0.269195 POS

---
names coefficients sign
66 Guarantors.2 0.000000 POS
67 Guarantors.3 0.000000 POS
68 Concurrent Credits.2 0.000000 POS
69 No of dependents.1 0.000000 POS
70 No of dependents.2 0.000000 POS
71 credit_amount_trnsf 0.000000 POS

Neural networks

For feed-forward neural networks (deep learning), use h2o.deeplearning.
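A minimal call might look like the sketch below. It uses R's built-in iris data set rather than the credit data from this post, and the network size (two hidden layers of 10 units), the number of epochs and the seed are arbitrary assumptions.

```r
library(h2o)
h2o.init()

# Built-in iris data: four numeric predictors and a three-level response.
iris_h2o <- as.h2o(iris)
predictors <- setdiff(colnames(iris), "Species")

# hidden, epochs and seed are illustrative values, not tuned settings.
dl_sketch <- h2o.deeplearning(x = predictors, y = "Species",
                              training_frame = iris_h2o,
                              hidden = c(10, 10), epochs = 20, seed = 1)

print(class(dl_sketch))
```

Because the response has three levels, the returned object is a multinomial model rather than a binomial one.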

Random Forest

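For Random Forest use h2o.randomForest. In the call below, features (a vector of predictor column names), trainHex and validHex (H2OFrame objects) have to be defined beforehand:

rfHex <- h2o.randomForest(x = features, y = "logSales", ntrees = 500, max_depth = 30, nbins_cats = 1115, training_frame = trainHex, validation_frame = validHex)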

Model Analysis

Once a model is trained, we can calculate its performance on a new (unseen) data set by using h2o.performance. This new data set has to have the same column names and types as the data set used for training (the number of rows may differ). Arguments are:
model - one of the H2O objects representing a trained model (e.g. H2OBinomialModel)
newdata - H2OFrame object representing a table with unseen data
train, valid, xval - logical (boolean) values indicating whether the function should return the training, validation, or cross-validation metrics (all computed during training)

The return value is an object of one of the H2O metrics types. E.g. if the model is of type H2OBinomialModel, then the metrics object is of type H2OBinomialMetrics.

performance <- h2o.performance(model, newdata = test_frame)
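Individual metrics can be pulled out of the returned object with accessor functions such as h2o.auc and h2o.confusionMatrix. The sketch below trains a small GLM on synthetic data and scores it on a held-out part; the data, the 70/30 split and all variable names are assumptions made for illustration.

```r
library(h2o)
h2o.init()

# Hypothetical data: a binary response "y" driven by "x1" plus noise.
set.seed(1)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Hold out roughly 30% of the rows for evaluation.
parts <- h2o.splitFrame(as.h2o(df), ratios = 0.7, seed = 42)
model_sketch <- h2o.glm(y = "y", x = c("x1", "x2"),
                        training_frame = parts[[1]], family = "binomial")

# Score the held-out part and extract individual metrics.
perf <- h2o.performance(model_sketch, newdata = parts[[2]])
auc_value <- h2o.auc(perf)
print(auc_value)
h2o.confusionMatrix(perf)
```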

> performance
H2OBinomialMetrics: glm

MSE: 0.1747882
RMSE: 0.4180768
LogLoss: 0.5196604
Mean Per-Class Error: 0.3554121
AUC: 0.7768143
Gini: 0.5536285
R^2: 0.1782966
Null Deviance: 482.3467
Residual Deviance: 406.3744
AIC: 456.3744

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 44 76 0.633333 =76/120
1 21 250 0.077491 =21/271
Totals 65 326 0.248082 =97/391

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.547567 0.837521 325
2 max f2 0.283012 0.918644 390
3 max f0point5 0.613870 0.811103 278
4 max accuracy 0.558243 0.751918 319
5 max precision 0.968788 1.000000 0
6 max recall 0.283012 1.000000 390
7 max specificity 0.968788 1.000000 0
8 max absolute_mcc 0.613870 0.387921 278
9 max min_per_class_accuracy 0.692120 0.683333 223
10 max mean_per_class_accuracy 0.772244 0.705028 159

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

To calculate the accuracy of the model (the only supported model at the moment is H2OBinomialModel), we can use h2o.accuracy. Arguments are:
object - H2OModelMetrics object (H2OBinomialMetrics is currently the only one supported)
thresholds - a value or a list of values between 0.0 and 1.0

h2o.accuracy(performance, 0.95)

[[1]]
[1] 0.314578
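What accuracy at a given threshold means can be sketched in base R. The probabilities and labels below are made up, and the rule "predict 1 when p1 >= threshold" is an assumption for illustration; H2O may treat values exactly at the threshold differently.

```r
# Hypothetical predicted probabilities of class 1 and the true labels.
p1     <- c(0.97, 0.80, 0.65, 0.40, 0.96, 0.30, 0.75, 0.55)
actual <- c(1,    1,    1,    0,    1,    0,    1,    0)

# Accuracy at a threshold: predict 1 when p1 >= threshold, then take
# the share of rows where the prediction matches the true label.
accuracy_at <- function(threshold) {
  mean(as.integer(p1 >= threshold) == actual)
}

acc_mid  <- accuracy_at(0.60)
acc_high <- accuracy_at(0.95)
print(c(acc_mid, acc_high))
```

A very high threshold such as 0.95 labels almost every row as class 0, so most true positives are missed. This is consistent with the low accuracy reported by h2o.accuracy above.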

To use the trained model on a test set in order to make predictions, we can use h2o.predict.

pred_creditability <- h2o.predict(glm_model1,credit_test)

|==============================================================================================================================| 100%
> pred_creditability
predict p0 p1
1 1 0.3593716 0.6406284
2 1 0.2807624 0.7192376
3 1 0.2209632 0.7790368
4 1 0.1332073 0.8667927
5 1 0.3779753 0.6220247
6 1 0.3902468 0.6097532

[392 rows x 3 columns]

References

https://www.rdocumentation.org/packages/h2o/versions/3.10.3.6/topics/h2o.init
https://www.rdocumentation.org/packages/h2o/versions/3.10.3.6/topics/h2o.importFile
https://www.rdocumentation.org/packages/h2o/versions/3.10.3.6/topics/h2o.str
https://www.rdocumentation.org/packages/h2o/versions/3.10.3.6/topics/h2o.group_by
https://www.rdocumentation.org/packages/h2o/versions/3.10.3.6/topics/h2o.log
https://www.rdocumentation.org/packages/h2o/versions/3.10.3.6/topics/h2o.colnames
https://rdrr.io/cran/h2o/man/h2o.splitFrame.html
http://h2o-release.s3.amazonaws.com/h2o/master/3574/docs-website/h2o-docs/data-munging/splitting-datasets.html
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GLMBooklet.pdf
https://h2o-release.s3.amazonaws.com/h2o/rel-slater/9/docs-website/h2o-py/docs/frame.html