Fully Fledged xkcd Theming
Although this kernel focuses on a topic that is not directly related to data science (at least not in an obvious way), I still think it might be of interest to some folks here.
It details a way to use the xkcd theme on Kaggle. In particular, it describes a trick to use any font that is not available in Kaggle's script environment.
The idea first came to me while skimming through beluga's nice kernel, whose rather bold opening section titled First week progress shows the xkcd style shining at just the right intensity, without dazzling. (But the font was wrong!)
This Taxi competition is (i) ongoing and (ii) a kernel competition; it therefore seems the right place to present a tool to enhance kernels.
Besides, the application section compares how well some popular algorithms perform at solving this problem (essentially, the suite of regression tools offered by the H2O framework). Unlike our previous comparison of this sort, NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL), where naive Bayes (NB) appeared a clear loser, here the verdict is less clear-cut: GLM demonstrates lower performance RMSLE-wise than its counterparts, but it is markedly quicker to train. (NB is absent here because it focuses on classification problems, although there are attempts described in the literature at using NB for regression problems…)
The xkcd Theme
How-To
A quick search revealed that there exists an xkcd package for R that does just that. When I tried, however, everything seemed to go smoothly until I realized that the xkcd font was missing from the Kaggle environment…
My rather short-lived persistence in trying to install it with the extrafont package was in vain, and was interrupted by the simpler idea of using the SVG format and the CSS @font-face capability in concert with the svglite package.
First, the font is converted to the WOFF format (we used Font Squirrel for that purpose) and embedded as a Base64-encoded string.
```html
<style type="text/css">
@font-face {
  font-family: 'xkcd';
  src: url(data:application/font-woff;charset=utf-8;base64, ...) format('woff');
  font-weight: normal;
  font-style: normal;
}
</style>
```
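For completeness, the Base64 payload itself can be generated in R. This is a minimal sketch, assuming the base64enc package is available and the converted font was saved as xkcd.woff (an illustrative file name, not from the original kernel):

```r
# Encode the WOFF file as a Base64 string for the data: URL in the
# @font-face rule above ("xkcd.woff" is an illustrative file name).
library(base64enc)

woff64 <- base64encode("xkcd.woff")
cat(substr(woff64, 1, 60), "...")  # paste the full string into the CSS
```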
Second, knitr is configured to generate images as SVG. Besides, we assume that the three packages svglite, xkcd and (optionally) xkcdcolors are installed and loaded.
```r
knitr::opts_chunk$set(
  dev = "svglite",
  fig.ext = "svg"  # knitr expects the extension without the leading dot
)
```
Finally, the function svgstring (which takes the dimensions of the image as parameters) is used to inline the SVG in the HTML document. That is to say, a chunk previously written as
```{r echo=TRUE, warning=FALSE, results='show', message=FALSE, fig.width=10, fig.height=5}
ggplot()
```
becomes
```{r echo=TRUE, warning=FALSE, results='show', message=FALSE}
s <- svgstring(width = 10, height = 5)  # open an in-memory SVG device
ggplot()                                # draw the plot on that device
invisible(dev.off())                    # close the device
htmltools::HTML(s())                    # inline the resulting SVG
```
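Putting it all together, here is a minimal, self-contained sketch (with made-up data); the SVG produced by svglite merely references the 'xkcd' font family, and the browser resolves it through the @font-face rule embedded above:

```r
library(ggplot2)
library(svglite)
library(xkcd)  # provides theme_xkcd() and xkcdaxis()

d <- data.frame(x = 1:10, y = cumsum(rnorm(10)))  # made-up data

s <- svgstring(width = 10, height = 5)  # in-memory SVG device
print(
  ggplot(d, aes(x, y)) +
    geom_line() +
    theme_xkcd() +                     # sets the 'xkcd' font family
    xkcdaxis(range(d$x), range(d$y)))  # hand-drawn-looking axes
invisible(dev.off())
htmltools::HTML(s())  # inline the SVG; the embedded CSS supplies the font
```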
When to Use
The xkcd style is not only amusing and a mark of good taste; it also sets the tone of the argument being made (much as a well-chosen emoticon can drastically influence the interpretation of a message).
I found this style quite well suited for the kernel NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL), where a naive Bayes approach is contrasted with other popular methods, illustrating, without much room for ambiguity, the inferiority of the approach¹. Such a finding, as objective and rational as it may be, could be perceived as an abject smear campaign by proponents of the method, and could easily wound the sensibilities of Thomas Bayes' disciples and/or naive Bayes activists (two large groups by my reckoning; and even with a substantial non-empty intersection, their union is larger still, which is to say, wiser not to mess with them). Given this extra-sensitive context, and being not that temerarious, relying on the xkcd theme to mitigate the exposure to undesirable outcomes appeared adequate.
To generalize, any tendentious claim that has the potential to (i) bring down the wrath of a large group or (ii) eventually become embarrassing for the author² had better use such a theme. That way, should the whole thing go south, the humor card can be played…
Application: Comparing Algorithms
To illustrate the first section, we compare how well some popular methods perform on the Taxi dataset (as we did in the NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL) kernel mentioned above). Furthermore, this reinforces the relevance of this kernel here.
Feature Tinkering
This part is identical to the section of the same name in the previous kernel Autoencoder and Deep Features.
The predictors are: vendor_id, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag, distance_as_the_crow_flies, wday and hour (10 features in total); we ignore the columns id, trip_duration and pickup_datetime (toIgnore).
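The exact code lives in that kernel; the following is only a hedged sketch of how the three derived features could be computed, assuming the raw data has been read into a data.table named train (haversineKm is an illustrative helper name, not from the original):

```r
library(data.table)

# Great-circle ("as the crow flies") distance in km, via the haversine formula
haversineKm <- function(lon1, lat1, lon2, lat2) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

train[, distance_as_the_crow_flies := haversineKm(
  pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude)]
train[, wday := wday(as.IDate(pickup_datetime))]  # day of the week (1-7)
train[, hour := hour(as.ITime(pickup_datetime))]  # hour of the day (0-23)
```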
The Algorithms
Here we briefly compare the performance, in terms of both score (RMSLE) and training time, of popular algorithms. This comparison has to be considered loosely, for we do not optimize the various parameters of each algorithm (e.g., by performing a grid search). Consequently, we contend that the results yielded here are an easy-to-achieve minimum for each method.
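All four models below share the same setup. As a minimal sketch, assuming the feature-engineered data sits in the data.table train from the previous section, and assuming the response is the log1p-transformed trip duration (a common choice when the metric is RMSLE; the name log_trip_duration and the 90/10 split ratio are assumptions):

```r
library(h2o)
h2o.init(nthreads = -1)  # start a local H2O cluster on all cores

toIgnore <- c("id", "trip_duration", "pickup_datetime")
response <- "log_trip_duration"  # illustrative name, not from the original
train[, log_trip_duration := log1p(trip_duration)]
predictors <- setdiff(colnames(train), c(toIgnore, response))

# 90/10 train/validation split (the ratio is an assumption)
frames <- h2o.splitFrame(as.h2o(train), ratios = 0.9, seed = 1)
trainH2o <- frames[[1]]
validH2o <- frames[[2]]
```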
Random Forest (RF)
```r
rf <- h2o.randomForest(
  x = predictors, y = response,
  training_frame = trainH2o, validation_frame = validH2o,
  ntrees = ifelse(kIsOnKaggle, 100, 500),  # fewer trees when running on Kaggle
  max_depth = 10, seed = 1)
```
Gradient Boosting Machine (GBM)
```r
gbm <- h2o.gbm(
  x = predictors, y = response,
  training_frame = trainH2o, validation_frame = validH2o,
  ntrees = ifelse(kIsOnKaggle, 50, 500),  # fewer trees when running on Kaggle
  max_depth = 5, seed = 1)
```
Generalized Linear Model (GLM)
```r
glm <- h2o.glm(
  x = predictors, y = response,
  training_frame = trainH2o, validation_frame = validH2o,
  family = "gaussian", seed = 1)
```
Deep Learning (DL)
```r
dl <- h2o.deeplearning(
  x = predictors, y = response,
  training_frame = trainH2o, validation_frame = validH2o,
  standardize = TRUE, hidden = c(10, 10), epochs = 70,
  activation = "Rectifier", seed = 1)
```
Results
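As a hedged sketch, assuming the response is log1p(trip_duration) as above (so that the validation RMSE coincides with the RMSLE), the scores and training times could be tabulated as follows; the run_time slot (in milliseconds) is an assumption about the H2O model object:

```r
# Collect validation RMSE (= RMSLE under the log1p assumption) and training
# times; m@model$run_time (milliseconds) is assumed to be populated by H2O.
models <- list(RF = rf, GBM = gbm, GLM = glm, DL = dl)

results <- data.frame(
  model = names(models),
  rmsle = sapply(models, function(m) h2o.rmse(m, valid = TRUE)),
  train.seconds = sapply(models, function(m) m@model$run_time / 1000))

results[order(results$rmsle), ]  # best score first
```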
License and Source Code
© 2017 Loic Merckel, Apache v2 licensed. The source code is available on both Kaggle and GitHub.