NCI Thesaurus & Naive Bayes
Following our previous kernel Preliminary Investigation: NCIt & Binary Encoding, we present another attempt at solving the Personalized Medicine: Redefining Cancer Treatment problem, but this time using naive Bayes.
A quick look at the literature reveals that naive Bayes has long been a popular choice for text classification problems; see, e.g., the work of Cohen et al. (2005)1 or Hotho et al. (2005)2 and references therein. Although we are not dealing with a pure text classification problem here, there are some obvious similarities; hence our interest in trying out a naive Bayes classifier…
Apparently, the main reason behind the popularity of the naive Bayes method is “because it is fast and easy to implement,” to quote Rennie et al. (2003, abstract)3. That reason is hardly compelling when predictive performance is of critical importance. The authors open their introduction with the sobering statement (page 1):
Naive Bayes has been denigrated as “the punching bag of classifiers” (Lewis, 1998), and has earned the dubious distinction of placing last or near last in numerous head-to-head classification papers (Yang & Liu, 1999; Joachims, 1998; Zhang & Oles, 2001).
One reason behind its reported poor performance seems to lie in the strong assumptions required by the method, which are in most cases not satisfied (typically independence between features, plus further assumptions about the distribution of those features). It is worth noting, though, that “naive Bayes assumptions can yield an optimal classifier in a variety of situations where the assumptions are widely violated” (Lewis, 1998, page 9)4.
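To make the assumption explicit: given a class $c$ and features $x_1, \ldots, x_n$, naive Bayes assumes the features are conditionally independent given the class, so the predicted class reduces to

$$\hat{c} = \underset{c}{\operatorname{arg\,max}} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),$$

where each $P(x_i \mid c)$ is estimated separately per feature (e.g., a Gaussian estimate for a numerical feature).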
Given that we do not implement any of the proposed remedies to mitigate those common drawbacks of the naive Bayes method in text classification problems (Rennie et al., 2003)3, we do not expect a winning outcome here…
The feature tinkering part is very similar to the one in our previous kernel. In particular, the NCI Thesaurus5 (NCIt for short) is used to create a specialized dictionary of medical terms, which then limits the terms extracted from the text data.
Feature Tinkering
Processing Corpora
Two corpora will need to be processed in the same way (one to generate a specialized dictionary and the other to extract features from the clinical evidence provided in plain text form).
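The processing relies on the tm package (plus a few others used further down). The exact preamble is not shown in the original listings; a minimal sketch of what is assumed, including the kIsOnKaggle flag used later on:

# assumed preamble (not part of the original listings)
library(tm)          # Corpus, tm_map, DocumentTermMatrix, removeSparseTerms
library(SnowballC)   # stemming backend used by stemDocument
library(data.table)  # data.table, as.data.table, fwrite
library(h2o)         # naive Bayes, RF, GBM, GLM and DL models
library(MLmetrics)   # MultiLogLoss

h2o.init()           # start (or connect to) a local H2O cluster

# when TRUE, the workload is reduced to fit within Kaggle's limits (assumed definition)
kIsOnKaggle <- TRUE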
processCorpus <- function (corpus) {
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))
  # remove punctuation, but keep intra-word dashes (e.g., "non-small-cell")
  corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
  # remove stopwords before stemming (stemmed stopwords would no longer match)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stemDocument, language = "english")
  # remove standalone numbers (replace with a space to avoid merging adjacent words)
  corpus <- tm_map(corpus, content_transformer(function (x) {
    gsub("\\s*(?<!\\B|-)\\d+(?!\\B|-)\\s*", " ", x, perl = TRUE)
  }))
  corpus <- tm_map(corpus, stripWhitespace)
  return (corpus)
}
Get Dictionary from NCI Thesaurus (NCIt)
We get the NCI Thesaurus (the ReadMe file provides information regarding the format).
Bear in mind that the NCI Thesaurus’s license terms and conditions need to be accepted prior to using the file. (As stated in the FAQ: “basically under the license you can do anything you want with the NCI Thesaurus content except to change it while still calling it NCI Thesaurus.”)
getDictionary <- function (filePath) {
  if (is.null(filePath)) return (NULL)
  # the flat-file thesaurus is tab-separated (see the ReadMe for the column layout)
  ret <- do.call(rbind, strsplit(readLines(filePath), "\t", fixed = TRUE))
  ret <- setNames(data.table(ret, stringsAsFactors = FALSE),
                  c("code", "concept_name", "parents",
                    "synonyms", "definition", "display_name",
                    "concept_status", "semantic_type"))
  # build a corpus from the synonyms column (synonyms are pipe-separated)
  corpus <- Corpus(VectorSource(gsub("\\|", " ", ret$synonyms)))
  corpus <- processCorpus (corpus)
  terms <- DocumentTermMatrix(
    corpus,
    control = list(
      minWordLength = 3,
      weighting = function(x) weightTfIdf(x, normalize = FALSE)))
  # the dictionary is the vocabulary of the processed thesaurus corpus
  return (as.vector(terms$dimnames$Terms))
}
filePath <- file.path(
  ".",
  "Personalized Medicine-Redefining Cancer Treatment",
  "data", "Thesaurus.txt", fsep = .Platform$file.sep)
tryCatch({
  if (is.null(filePath) || !file.exists(filePath)) {
    # download and unzip the flat-file version of the NCI Thesaurus
    destfile <- file.path(".", "thesaurus.zip", fsep = .Platform$file.sep)
    download.file(
      url = "https://evs.nci.nih.gov/ftp1/NCI_Thesaurus/Thesaurus_17.06d.FLAT.zip",
      destfile = destfile)
    filePath <<- unzip(destfile, exdir = dirname("."))
    file.remove(destfile)
    rm(destfile)
  }
}, error = function(e) {
  # e.g., no Internet access (as on Kaggle): fall back to no dictionary
  print (e)
  filePath <<- NULL
})
medDico <- getDictionary(filePath)
Extract Features from Text Files
Note that we use the test set in the feature tinkering process. Bear in mind that this is definitely not a good practice, for the test set becomes part of the training process (basically, the resulting model might perform well on this test set, but could fail miserably to generalize to new, unseen data).
getFeaturesFromTxt <- function (data, dictionary, sparse = NULL) {
  corpus <- Corpus(VectorSource(data$txt))
  corpus <- processCorpus (corpus)
  if (!is.null(dictionary)) {
    # restrict the extracted terms to those present in the specialized dictionary
    terms <- DocumentTermMatrix(
      corpus,
      list(dictionary = dictionary,
           weighting = function(x) weightTfIdf(x, normalize = TRUE)))
  } else {
    terms <- DocumentTermMatrix(
      corpus,
      control = list(
        minWordLength = 3,
        weighting = function(x) weightTfIdf(x, normalize = FALSE)))
  }
  # optionally drop the sparsest terms
  if (!is.null(sparse)) {
    terms <- removeSparseTerms(terms, sparse)
  }
  return (as.matrix(terms))
}
extractedFeatures <- getFeaturesFromTxt(
  data = data,
  dictionary = medDico,
  # large dictionaries get a more aggressive sparsity threshold
  sparse = ifelse(length(medDico) > 5000, 0.95, 0.99))
TF-IDF-based wordcloud (doc ID 152):
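The wordcloud image itself is not reproduced here; a minimal sketch of how such a cloud could be drawn from the TF-IDF matrix (the wordcloud package and the use of row 152 as the document index are assumptions):

library(wordcloud)

# sketch (assumed): plot the TF-IDF weights of document 152 as a wordcloud
docWeights <- extractedFeatures[152, ]
docWeights <- docWeights[docWeights > 0]
wordcloud(words = names(docWeights), freq = docWeights,
          min.freq = 0, max.words = 100, random.order = FALSE,
          colors = RColorBrewer::brewer.pal(8, "Dark2"))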
data <- cbind(data, extractedFeatures)
data[, txt := NULL]
Naive Bayes Classifier
In R, there are several implementations of naive Bayes, for example in the packages e1071, klaR, naivebayes, and h2o. Here, we use the one from h2o.
The predictors are (truncated): Gene, Variation, onset, pci, percutan, stablhrs, stemi, symptom, fulldos, infarctionst, stabl, success, therapi, thrombolyt, infarctionunstableovhour, unstablhrs, mitral, repair, valv, pericardi, strip, evalu, postcardiac, transplant, noncardiovascular, …, bpr01blunt, disorient, methotrexateprednisonethioguaninevincristin, mtxpredtgvcr, qtcun, eq5d01mobl, mobil, eq5d0102, selfcar, eq5d0104, vas, hamd101, hamd1depress, r24, ansamycin, antibiot, antineoplast, hamd103, hamd108, ch50, plathct, a1aglp, npap, iohexol, rubriblast (there are 728 features); we ignore the columns ID, Class, and txt (toIgnore).
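The vectors predictors, response, and toIgnore are not defined in the listings above; a plausible sketch of their construction (an assumption, not the original code):

# sketch (assumed): columns to exclude from the predictors
# (txt has already been dropped above, but is listed for completeness)
toIgnore   <- c("ID", "Class", "txt")
response   <- "Class"
predictors <- setdiff(colnames(data), toIgnore)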
Data Split
A validation set is extracted from the train set; we need it to select the best model out of the grid search.
splits <- h2o.splitFrame(
  data = as.h2o(train),
  ratios = c(0.8),
  seed = 1)
trainH2o <- splits[[1]]
validH2o <- splits[[2]]
testH2o <- as.h2o(test)
Grid Search
hyperParamsNaiveBayes <- list(
  laplace = c(0, 5),
  min_sdev = c(0.001, 0.004, 0.006),
  # eps_sdev = c(0, 0.002),
  min_prob = c(0.001, 0.006),
  # eps_prob = c(0, 0.002),
  compute_metrics = c(TRUE))
# remove any grid with the same id left over from a previous run
h2o.rm("grid_naivebayes")

gridNaiveBayes <- h2o.grid(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  hyper_params = hyperParamsNaiveBayes,
  search_criteria = list(strategy = "Cartesian"),
  algorithm = "naivebayes",
  grid_id = "grid_naivebayes",
  balance_classes = FALSE,
  seed = 1)
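The models of the grid are then sorted by accuracy; sortedGridNb, used further down, can be retrieved as follows (a sketch relying on h2o.getGrid):

# retrieve the grid with its models sorted by decreasing accuracy
sortedGridNb <- h2o.getGrid(grid_id = "grid_naivebayes",
                            sort_by = "accuracy",
                            decreasing = TRUE)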
The following table summarizes the grid results (sorted by decreasing accuracy):
| | compute_metrics | laplace | min_prob | min_sdev | accuracy |
| --- | --- | --- | --- | --- | --- |
| 1 | true | 5.0 | 0.001 | 0.006 | 0.4810126582278481 |
| 2 | true | 0.0 | 0.001 | 0.006 | 0.4810126582278481 |
| 3 | true | 5.0 | 0.006 | 0.006 | 0.4810126582278481 |
| 4 | true | 0.0 | 0.006 | 0.006 | 0.4810126582278481 |
| 5 | true | 5.0 | 0.001 | 0.004 | 0.4683544303797469 |
| 6 | true | 0.0 | 0.001 | 0.004 | 0.4683544303797469 |
| 7 | true | 5.0 | 0.006 | 0.004 | 0.4683544303797469 |
| 8 | true | 0.0 | 0.006 | 0.004 | 0.4683544303797469 |
| 9 | true | 0.0 | 0.001 | 0.001 | 0.45569620253164556 |
| 10 | true | 5.0 | 0.006 | 0.001 | 0.44303797468354433 |
| 11 | true | 5.0 | 0.001 | 0.001 | 0.44303797468354433 |
| 12 | true | 0.0 | 0.006 | 0.001 | 0.44303797468354433 |
Performance on the Validation Set
bestNbModel <- h2o.getModel(sortedGridNb@model_ids[[1]])
h2o.confusionMatrix(bestNbModel, valid = TRUE)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## 1 2 3 4 5 6 7 8 9 Error Rate
## 1 6 1 0 2 0 0 3 0 0 0.5000 = 6 / 12
## 2 1 2 0 1 0 0 6 0 0 0.8000 = 8 / 10
## 3 0 1 0 0 0 0 1 0 0 1.0000 = 2 / 2
## 4 6 0 0 7 1 2 3 0 0 0.6316 = 12 / 19
## 5 1 1 0 1 3 0 0 0 0 0.5000 = 3 / 6
## 6 1 1 0 0 0 3 1 0 0 0.5000 = 3 / 6
## 7 2 1 1 2 0 0 17 0 0 0.2609 = 6 / 23
## 8 0 0 0 0 0 0 0 0 0 = 0 / 0
## 9 0 0 0 0 0 0 1 0 0 1.0000 = 1 / 1
## Totals 17 7 1 13 4 5 32 0 0 0.5190 = 41 / 79
predValid <- as.data.table(h2o.predict (bestNbModel, newdata = validH2o))
predValid[, predict := NULL]

# here we make sure that y_true has all the classes
# (as one could be missing from the smaller validation set)
y_true <- as.vector(validH2o[[response]])
y_true <- c(y_true, paste0(seq(1, 9)))
y_true <- model.matrix(~ 0 + ., data.frame(as.character(y_true)))
# drop the nine dummy observations appended above
y_true <- y_true[-((nrow(y_true) - 9 + 1):nrow(y_true)), ]

nbScore <- MultiLogLoss(y_pred = as.matrix(predValid), y_true = y_true)
## [1] 17.92519
Hmm… another example that won’t contradict the unflattering title of “the punching bag of classifiers.” One could argue that the feature extraction from text is overly simplistic here, which certainly (partly) explains those poor results. On the other hand, our previous kernel describes a solution using XGBoost with almost the same features, and it delivers a much better outcome…
Retrain the Best Model Using the Entire Train Set
Here we use k-fold cross-validation to evaluate the final model (trained using the entire train set, i.e., rbind(trainH2o, validH2o)).
nfolds <- ifelse(kIsOnKaggle, 2, 8)

finalModel <- do.call(h2o.naiveBayes,
  {
    # reuse the parameters of the best grid model, retrain on the whole train set
    p <- bestNbModel@parameters
    p$model_id <- NULL
    p$training_frame <- h2o.rbind(trainH2o, validH2o)
    p$validation_frame <- NULL
    p$nfolds <- nfolds
    p
  })
finalModel@model$cross_validation_metrics_summary
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.377501 3.536949E-5 0.37755102 0.37745097
## err 0.622499 3.536949E-5 0.622449 0.622549
## err_count 124.5 1.767767 122.0 127.0
## logloss 20.946302 0.25053015 20.592 21.300606
## max_per_class_error 1.0 0.0 1.0 1.0
## mean_per_class_accuracy 0.33982143 0.054649327 0.2625356 0.41710725
## mean_per_class_error 0.66017854 0.054649327 0.73746437 0.5828928
## mse 0.6211964 9.262516E-4 0.61988646 0.6225063
## r2 0.8857343 0.00105865 0.88723147 0.8842372
## rmse 0.78815967 5.87604E-4 0.7873287 0.7889907
Test Set & Submission
predTest <- as.data.table(h2o.predict (finalModel, newdata = testH2o))
predTest[, predict := NULL]
We save the predictions of the final model.
Note: To avoid a timeout, the size of the test set has been significantly reduced; consequently, the saved result is incomplete. Run the script locally with the flag kIsOnKaggle set to FALSE.
result <- setNames(
  predTest,
  c(paste0("class", seq(1, 9))))

fwrite(
  data.table(
    ID = as.vector(testH2o[["ID"]]),
    result),
  file = file.path(".", paste0("result-naivebayes_", Sys.Date(), ".csv"),
                   fsep = .Platform$file.sep),
  row.names = FALSE, quote = FALSE)
Comparing Against Alternative Methods
Here we briefly compare popular algorithms in terms of both score (multi-class logloss) and time required for training. This comparison should be taken loosely, for we do not optimize the various parameters of each algorithm (e.g., by performing a grid search). Consequently, the results presented here should be seen as an easily achievable baseline for each method. (Bear in mind, however, that we did a bit of optimization for the naive Bayes one.)
The Challengers
Random Forest (RF)
rf <- h2o.randomForest(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  nfolds = nfolds,
  ntrees = ifelse(kIsOnKaggle, 100, 500),
  max_depth = 10,
  seed = 1)
predValidRf <- as.data.table(h2o.predict (rf, newdata = validH2o))
predValidRf[, predict := NULL]
rfScore <- MultiLogLoss(y_pred = as.matrix(predValidRf), y_true = y_true)
rfRunTime <- rf@model$run_time
Gradient Boosting Machine (GBM)
gbm <- h2o.gbm(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  ntrees = ifelse(kIsOnKaggle, 50, 500),
  max_depth = 5,
  seed = 1)
predValidGbm <- as.data.table(h2o.predict (gbm, newdata = validH2o))
predValidGbm[, predict := NULL]
gbmScore <- MultiLogLoss(y_pred = as.matrix(predValidGbm), y_true = y_true)
gbmRunTime <- gbm@model$run_time
Generalized Linear Model (GLM)
glm <- h2o.glm(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  family = "multinomial",
  seed = 1)
predValidGlm <- as.data.table(h2o.predict (glm, newdata = validH2o))
predValidGlm[, predict := NULL]
glmScore <- MultiLogLoss(y_pred = as.matrix(predValidGlm), y_true = y_true)
glmRunTime <- glm@model$run_time
Deep Learning (DL)
dl <- h2o.deeplearning(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  standardize = TRUE,
  hidden = c(200, 200),
  epochs = ifelse(kIsOnKaggle, 30, 800),
  activation = "Rectifier",
  categorical_encoding = "Binary",
  seed = 1)
predValidDl <- as.data.table(h2o.predict (dl, newdata = validH2o))
predValidDl[, predict := NULL]
dlScore <- MultiLogLoss(y_pred = as.matrix(predValidDl), y_true = y_true)
dlRunTime <- dl@model$run_time
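The scores and run times gathered above can be assembled into a summary table; a sketch (assuming h2o reports run_time in milliseconds, hence the division by 1000):

# sketch (assumed): gather the scores and training times of the compared models
nbRunTime <- bestNbModel@model$run_time
summaryDt <- data.table(
  algo    = c("NB", "RF", "GBM", "GLM", "DL"),
  score   = c(nbScore, rfScore, gbmScore, glmScore, dlScore),
  runtime = c(nbRunTime, rfRunTime, gbmRunTime, glmRunTime, dlRunTime) / 1000)
summaryDt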
Verdict
Summary of the results
## algo score runtime
## NB 17.925188 0.340
## RF 1.294498 6.978
## GBM 1.974667 24.586
## GLM 2.067138 2.563
## DL 4.328637 13.366
Let’s chart those results (the xkcd style is used, inspired by the “First week progress” section of beluga’s kernel).
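The original xkcd-styled chart is not embedded here; a plain ggplot2 sketch of a comparable bar chart, based on the summaryDt table sketched above and without the xkcd styling (which would additionally require the xkcd package and its fonts):

library(ggplot2)

# sketch: bar chart of the validation logloss per algorithm (no xkcd styling)
ggplot(summaryDt, aes(x = reorder(algo, score), y = score)) +
  geom_col() +
  labs(x = "Algorithm", y = "Multi-class logloss",
       title = "Validation logloss of the compared algorithms")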
Those results are not really in favor of naive Bayes… It was, admittedly, by far the quickest to train… Yet, the other techniques would certainly remain more appealing in many applications… ■
License and Source Code
© 2017 Loic Merckel, Apache v2 licensed. The source code is available on both Kaggle and GitHub.
1. Cohen, Aaron M., and William R. Hersh. “A survey of current work in biomedical text mining.” Briefings in Bioinformatics 6.1 (2005): 57-71. ↩
2. Hotho, Andreas, Andreas Nürnberger, and Gerhard Paaß. “A brief survey of text mining.” LDV Forum 20.1 (2005). ↩
3. Rennie, Jason D., et al. “Tackling the poor assumptions of naive Bayes text classifiers.” Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003. ↩ ↩2
4. Lewis, David D. “Naive (Bayes) at forty: The independence assumption in information retrieval.” European Conference on Machine Learning. Springer, Berlin, Heidelberg, 1998. ↩
5. NCI stands for National Cancer Institute. ↩