Machine Learning: An Introduction

Thomas Keck (

Japenese vs. Chinese


Training / Fitting

Japanese (hiragana)

ま ち ね ら に ご

Chinese (kanji)

电 买 开 东 车 红 马

Application / Inference

Japanese (hiragana)

Japanese (hiragana)

Chinese (kanji)

Chinese (kanji)

Japanese (hiragana)

Chinese (kanji)

Take a moment to appreciate what you just did

Let's build a machine that can do this!


Multivariate Classification

Decision Tree
Model Complexity

Decision Tree (Inference)

Decision Tree (Fitting)

Decision Tree (Summary)

  • Classifies using a number of consecutive rectangular cuts
  • Each cut locally maximizes a separation gain measure
  • Signal probability given by the purity in each leaf
  • Interpretable (white box) model

Misclassification Rate: 16%
SKLearn Example

            from sklearn import tree
            dt = tree.DecisionTreeClassifier(max_depth=4)
  , y)

Model Complexity

Training Dataset

Misclassification Rate: 0%
Independent Test Dataset

Misclassification Rate: 21%


  • Model is too complex
  • Statistical fluctuations in the training data dominate predictions
  • Model does not generalize → poor performance on new data
  • Need to check for this on an independent test dataset!


  • Model is too simple
  • Relevant aspects of the data are ignored

Training vs. Test Error

Bias-Variance Dilemma

  • Bias due to wrong modeling of the data (underfitting)
  • Variance due to sensitivity to statistical fluctuations (overfitting)
  • Irreducible error due to noise in the problem itself
$$ \mathrm{E} \left\lbrack (y- \widehat{f}(\vec{x}))^2 \right\rbrack = \mathrm{Bias} \left\lbrack \widehat{f}(\vec{x}) \right\rbrack^2 + \mathrm{Var} \left\lbrack \widehat{f}(\vec{x}) \right\rbrack + \mathrm{Var} \left\lbrack y \right\rbrack $$

Model Complexity

Number of Degrees of freedom (NDF) of the model
(≈ number of parameters)

  • Input dataset
    • Reduce dimensionality
    • Higher statistic
  • Hyperparameters (control NDF)
    • E.g. depth of the tree
    • Optimized using search-algorithm
  • Regularization (reduce effective NDF)
    • E.g. Include tree structure in separation gain measure
    • Ensemble methods

Model Complexity

(All you have to know)

Always test on an independent test dataset in the end!

Boosted Decision Tree
Ensemble Methods

Ensemble Methods

Average many simple models to obtain a robust complex model
$$ F\left( \vec{x} \right) = \sum_m \gamma_m f_m(\vec{x}) $$



  • Reweight events w.r.t current prediction
  • Individual classifiers are simple (weak-learners)
  • Focus on events near the optimal separation hyper-plane
  • Loss function L is crucial
    • Least square → Regression
    • Binomial deviance → GradientBoost Classification
    • Exponential loss → AdaBoost classification



  • Use only a fraction of events / features per classifier
  • Robustness against statistical fluctuations
  • Embarrassingly parallel
  • Sampling method is crucial:
    • Bagging:
      random events with replacements
    • Pasting:
      random events without replacement
    • Random Subspaces:
      random features

Stochastic Boosted Decision Tree (Summary)

  • Good out-of-the-box performance
  • Robust against over-fitting
  • Supports classification and regression
  • Widely used in HEP

Misclassification Rate: 14.5%
SKLearn Example

            from sklearn import ensemble
            bdt = ensemble.GradientBoostingClassifier(subsample=0.5,
  , y)

Further Ensemble Methods

  • Divide feature-space into sub-spaces
  • Different behavior of the data in the chosen subspaces
  • e.g. train separate classifiers for Barrel and Endcap
  • Combination
  • Combine different classifiers
  • Different regularization methods learn different aspects of the data
  • e.g. combine neural network, BDT and SVM

Support Vector Machine
Kernel Trick

Support Vector Machine

Misclassification Rate: 24%

Kernel Trick

  • SVM Algorithm depends only on scalar product!
  • Replace scalar product with an arbitrary kernel function
  • Solves problem in implicitly high-dimensional space
$$ \max g(c_1, \dots, c_n) $$ $$ g(c_1, \dots, c_n)= \sum_i c_i - \frac{1}{2} \sum_i \sum_j y_i c_i (\vec{x}_i \cdot \vec{x}_j) y_j c_j$$

Kernel Trick

Polynomial Kernel
$k(x_i, x_j) ) = (x_i \cdot x_j)^d$

Misclassification Rate: 19%
Gaussian Kernel
$k(x_i, x_j) ) = \exp(-\gamma ||x_i - x_j||^2)$

Misclassification Rate: 15%

Support Vector Machine (Summary)

  • Maximum margin classifier
  • Quadratic problem: can be solved efficiently in $O(N^2)$
  • Optimal for linearly separable problems
  • Kernel trick allows solving of non-linear problems
  • Solution depends only on the support-vectors

Misclassification Rate: 15%
SKLearn Example

            from sklearn import svm
            svc = svm.SVC(kernel='rbf')
  , y)

Artificial Neural Networks
Universal Function Approximator

Artificial Neural Network

Universal Function Approximator

$$f(\vec{x}) = \sigma \left(\sum_h w_{oh} \sigma_h \left( \sum_i w_{hi} x_i \right) \right)$$
→ can approximate any reasonable function $f: \mathrm{R}^\mathrm{N} \rightarrow [0, 1]$
Usually visualized as a network
Link $\hat{=}$$w_{ij}$
Neuron $\hat{=}$$\sigma \left( \sum \dots \right)$

Activation Functions

$$\sigma (\underbrace{\sum w x}_{a})$$
  • Desirable properties
    : Nonlinear, Differentiable, Monotonic, Smooth
  • Examples for the hidden layer
    • sigmoid $\frac{1}{1+e^{-a}}$:
      not zero-centered
      $\rightarrow$ don't use this
    • $\tanh(a)$:
      $\rightarrow$ better than sigmoid
    • ReLU $\max(0, a)$:
      constant gradient
      can die out
    • PReLU $\max(0, a) - \beta \max(0, -a)$:
      constant gradient
      additional parameter

Backpropagation of Error

$$ \Delta w = - \eta \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}w} $$
→ choose weights so that they minimize a loss-function
$ \mathcal{L} = H\left(y, f(\vec{x})\right) = $
$-y \ln (f(\vec{x}))$
$ - (1-y) \ln (1 - f(\vec{x}))$
$\quad (\textrm{with }\ y = 0, 1)$

Artificial Neural Network (Summary)

  • Universal function approximator
  • Adjust weights to minimize loss-function
  • Fast and small model
  • Fitting can be challenging
  • Ubiquitous in all modern ML applications

Misclassification Rate: 15.5%
SKLearn Example

            from sklearn import neural_network
            ann = neural_network.MLPClassifier(activation='tanh',
  , y)

Artificial Neural Networks
Stochastic Gradient Descent

Stochastic Gradient Descent

  1. Feed $N$ samples to the network ($N \hat{=} $ batch-size $\rightarrow$
  2. Calculate the
    of the average loss with respect to each weight using the chain-rule of analysis
  3. Adjust the weights in the opposite direction (
    ) with a small step-size (learning-rate) $\eta$
  4. Repeat until convergence

Stochastic Gradient Descent

  1. Feed $N$ samples to the network ($N \hat{=} $ batch-size $\rightarrow$
  2. Calculate the
    of the average loss with respect to each weight using the chain-rule of analysis
  3. Adjust the weights in the opposite direction (
    ) with a small step-size (learning-rate) $\eta$
  4. Repeat until convergence
Many more advanced variants of Stochastic Gradient Descent exists
How should you choose the associated hyper-parameters?
  • Learning Rate and Learning Rate Schedule
  • Batch-Size
  • Momentum Term

Learn-Rate Schedule

Learn-Rate vs. Batch-Size vs. Momentum

$$ \Delta w_t = - \underbrace{\eta}_{\textrm{Learn-Rate}} \cdot \underbrace{ \frac{\mathrm{d}}{\mathrm{d}w} \frac{1}{N} \sum_i^N \mathcal{L}_i}_{\textrm{Gradient averaged over Batch-Size}} + \underbrace{m \cdot \Delta w_{t-1}}_{\textrm{Momentum term}} $$

Initial noisy optimization phase is similar to simulated annealing
$\rightarrow$ explored region is determined by the noise scale $g$

$$ g \sim \frac{\eta}{N (1-m)} $$ S. L. Smith, P. Kindermans, C. Ying, Q. V. Le (02/2018)
If you increase the Batch-Size you must increase the Learn-Rate!

Early Stopping

Idea: Prevent overfitting by stopping the training before convergence

Very effective and simple!

$L_1$ and $L_2$ Penalty Terms

Idea: Prevent overfitting by penalize large weights
Optimal Brain Damage: Forget unimportant stuff
$$\mathcal{L} \rightarrow \mathcal{L} + L_{\textrm{Penalty}}$$

$$L_2 = \beta \sum_i w_i^2 $$
  • Requires careful choice of $\beta$
  • Also called Ridge leads to dense representations
  • Equivalent to
    for Stochastic Gradient Descent
$$L_1 = \alpha \sum_i | w_i | $$
  • Requires careful choice of $\alpha$
  • Also called LASSO leads to sparse representations

Generative Models
Neyman-Pearson Lemma

Neyman-Pearson Lemma

On the Problem of the most Efficient Tests of Statistical Hypotheses
By J. Neyman and E. S. Pearson
$$ f\left(\vec{x}\right) = \frac{\mathrm{PDF}\left( \vec{x} | \mathrm{S} \right)}{\mathrm{PDF}\left( \vec{x} | \mathrm{B} \right)} $$

Most powerful test at a given significance level to distinguish between two simple hypotheses (signal or background)

Problem solved? No!

Signal and Background PDF are usually unknown

  • High dimensional $\rightarrow$ cannot be sampled due to the curse of dimensionality
  • Multiple sources for signal and background
  • Mislabelled training data / Simulation uncertainties

Solution: Approximate Neyman-Pearson Lemma

  • Neyman-Pearson Lemma
$$ f\left(\vec{x}\right) = \frac{\mathrm{PDF}\left( \vec{x} | \mathrm{S} \right)}{\mathrm{PDF}\left( \vec{x} | \mathrm{B} \right)} $$
  • Discriminative Models
    • (Boosted) Decision Trees
    • Support Vector Machines
    • Artificial Neural Networks
$$ f\left(\vec{x}\right) \approx \frac{\mathrm{PDF}\left( \vec{x} | \mathrm{S} \right)}{\mathrm{PDF}\left( \vec{x} | \mathrm{B} \right)} $$
  • Generative Models
    • Analytical approx. (LDA, QDA)
    • Kernel density estimator
    • Gaussian mixture model
$$ f\left(\vec{x} | \mathrm{S}\right) \approx \mathrm{PDF}\left( \vec{x} | \mathrm{S} \right) $$ $$ f\left(\vec{x} | \mathrm{B}\right) \approx \mathrm{PDF}\left( \vec{x} | \mathrm{B} \right) $$

Linear Discriminant Analysis

  • Assumes conditional PDFs are normally distributed
  • Assumes identical covariances
  • Equivalent to Fisher’s discriminant
  • Requires only means and covariances of sample
  • Separating hyperplane is linear

  • $ f(x) = x^{\mathrm{T}} \cdot \Sigma^{-1} (\mu_{\mathrm{S}} - \mu_{\mathrm{B}}) $

Misclassification Rate: 24%
SKLearn Example

            from sklearn import lda
            ld = lda.LDA()
  , y)

Quadratic Discriminant Analysis

  • Assumes conditional PDFs are normally distributed
  • Requires only means and covariances of sample
  • Separating hyperplane is quadratic
  • $ f(x) = \frac{\sqrt{2 \pi | \Sigma_{\mathrm{B}} |} \exp\left( - \frac{1}{2} \left(x - \mu_{\mathrm{S}}\right)^{\mathrm{T}} \Sigma^{-1}_{\mathrm{S}} \left(x - \mu_{\mathrm{S}}\right) \right) }{ \sqrt{2 \pi | \Sigma_{\mathrm{S}} |} \exp\left( - \frac{1}{2} \left(x - \mu_{\mathrm{B}}\right)^{\mathrm{T}} \Sigma^{-1}_{\mathrm{B}} \left(x - \mu_{\mathrm{B}}\right) \right) }$

Misclassification Rate: 21%
SKLearn Example

            from sklearn import qda
            qd = qda.QDA()
  , y)

Kernel Density Estimator

  • Every training sample is replaced with a small gaussian sphere
  • Bandwith (variance of gaussian) is key
  • Works well for low dimensions

Misclassification Rate: 16%
SciPy Example

            from scipy.stats import gaussian_kde
            signal = gaussian_kde(signal_samples)
            background = gaussian_kde(background_samples)

Multivariate Regression

(Generalized) Linear Regression

  • Linear Regression of Base-Functions
  • $y = \beta_0 + \beta_1 \phi_1(x_1) + \dots \beta_n \phi_n(x_n)$
  • Fitted with least-square method

(Boosted) Decision Trees

  • Various different algorithms exists
  • Easiest: calculate average and right of each possible split points
  • Minimize $\left(y - \bar{y}_{\mathrm{L}}\right)^2 + \left(y - \bar{y}_{\mathrm{R}}\right)^2$

Support Vector Regression

  • Search for maximum-margin hyper-band incorporating all data-points

Artificial Neural Networks

  • Neural Networks are still universal function approximators
  • Output activation function: linear
  • Loss Function: Square $\left(f(\vec{x}) - y\right)^2$ or Absolute $\left|f(\vec{x}) - y\right|$
$$f(\vec{x}) = \sum_h w_{oh} \sigma_h \left( \sum_i w_{hi} x_i \right)$$
→ can approximate any reasonable function $f: \mathrm{R}^\mathrm{N} \rightarrow \mathrm{R}$


All concepts we encountered are still valid

  • Model Complexity - Bias Variance Tradeoff
  • Ensemble Methods - Boosting und Bagging
  • SVM: Kernel Trick
  • ANN: Stochastic Gradient Descent



There will be a second lecture on
Deep Learning
after a short break
There will be workshop on
in the afternoon