Activation function

Soumit Kar
4 min read · Jun 30, 2021

An activation function is a mathematical function attached to each hidden and output neuron in the network.

Transfer function — the transfer function calculates the weighted sum of the neuron's inputs.

Activation function — the activation function decides whether or not the neuron should fire.

Range: specifies the range of the function's outputs.

0-centered: indicates whether the function is zero-centered or not (e.g. in standard-normal data, mean = 0 and standard deviation = 1, so the data is centered around 0 with both negative and positive values). If the data is not zero-centered, it takes more time to reach the convergence point (the global minimum).

Saturation: indicates whether the function suffers from saturation or not. A neuron is considered saturated if it reaches its maximum or minimum value.

Vanishing gradient: specifies whether the activation function causes the vanishing gradient problem or not.

Computation: how easily or cheaply the function can be computed (e.g. time, cost).

Sigmoid activation function

It normalizes the output of the neuron to a range between 0 and 1; 0.5 is typically used as the decision threshold.

f(x) = 1 / (1 + e^(-x))

f'(x) = e^x / (e^x + 1)² = f(x)(1 - f(x))

The derivative of the sigmoid function ranges between 0 and 0.25.
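A minimal NumPy sketch (my own code, not from the original post) of the sigmoid and its derivative, just to illustrate the (0, 1) output range and the 0–0.25 derivative range:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)): peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 5)
print(sigmoid(x))             # values between 0 and 1
print(sigmoid_derivative(x))  # values between 0 and 0.25
```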

Pros

1. Smooth gradient.

2. Clear predictions, with outputs pushed close to 0 or 1.

Cons

1. Not zero-centered.

2. Prone to the vanishing gradient problem.

3. Computationally expensive because of the exponential function.

Use a “Xavier Normal” or “Xavier Uniform” weight initialization when using the sigmoid activation function.
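As a rough illustration (my own NumPy sketch, function name is made up), Xavier/Glorot normal initialization draws weights with a variance scaled by the layer's fan-in and fan-out:

```python
import numpy as np

def xavier_normal(fan_in, fan_out):
    # Xavier/Glorot normal: std = sqrt(2 / (fan_in + fan_out)),
    # keeps sigmoid/tanh pre-activations away from the saturated zones
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0.0, std, size=(fan_in, fan_out))

W = xavier_normal(784, 128)   # e.g. a 784 -> 128 hidden layer
print(W.std())                # close to sqrt(2 / (784 + 128)) ~= 0.047
```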

Tanh( Hyperbolic tangent function)

Similar to sigmoid, but the range is different.

When the input is very large or very small, the output is flat and the gradient is small, which is not conducive to weight updates. The output of tanh lies in (-1, 1) and the function is zero-centered, which is better than sigmoid.

The output ranges between -1 and 1, and the derivative of the function ranges between 0 and 1.
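A quick sketch (assumed NumPy code) showing tanh's (-1, 1) output and its derivative 1 - tanh²(x), which lies in (0, 1]:

```python
import numpy as np

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)^2: equals 1 at x = 0 and approaches 0 for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))           # outputs in (-1, 1), zero-centered
print(tanh_derivative(x))   # derivatives in (0, 1]
```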

PROS

1. Zero-centered.

2. Output range is (-1, 1).

3. Often used in hidden layers for binary classification.

CONS

1.Vanishing Gradient problem.

RELU (Rectified Linear unit)

The ReLU activation function outputs the maximum of zero and its input. It is not differentiable over its whole domain (it is not differentiable at x = 0).

Relu(x) = max(0, x)

d(Relu(x))/dx = { 0 if x < 0,

1 if x > 0,

does not exist if x = 0 } [Why is the ReLU function not differentiable at x = 0?]

Dying ReLU: during backpropagation, the derivative is zero for any neuron whose input is negative. So in the weight update w_new = w_old - n * (dL/dw_old) (n is the learning rate), the new weight equals the old weight and the neuron stops learning.

It is used in hidden layers as it mitigates the vanishing gradient problem (the derivative of ReLU is either 0 or 1).
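A minimal sketch (my own NumPy code) of ReLU and its derivative, including the dead-neuron case where a negative input gives a zero gradient:

```python
import numpy as np

def relu(x):
    # max(0, x), element-wise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where x > 0, 0 where x < 0 (the value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]  -> negative inputs get no gradient
```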

PROS

1. Faster; when the input is positive there is no gradient saturation problem.

CONS:

1. When the input is negative, ReLU is inactive (outputs zero).

2. Not zero-centered, as the output is 0 or a positive number.

Leaky ReLU

To solve the dead ReLU problem, the negative half of ReLU is set to 0.01x instead of 0.

f(x) = max(0.01x, x)

f'(x) = { 0.01 (i.e. alpha) for x < 0,

1 for x > 0 }

Another idea is the parametric method (PReLU), f(x) = max(alpha * x, x), where alpha is learned during training.
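A small sketch (assumed NumPy code, with alpha = 0.01 as in the text) of Leaky ReLU and its derivative; the parametric version simply treats alpha as a learnable parameter instead of a constant:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x > 0, alpha * x otherwise: the negative half is no longer flat
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for x > 0, alpha for x < 0, so the gradient never becomes exactly zero
    return np.where(x > 0, 1.0, alpha)

x = np.array([-10.0, -1.0, 1.0, 10.0])
print(leaky_relu(x))             # [-0.1  -0.01  1.   10.  ]
print(leaky_relu_derivative(x))  # [0.01  0.01  1.    1.  ]
```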

If most of the weights are negative and only a few are positive, the chained derivatives look like [0.01 * 0.01 * 1 * 0.01 ...] = a very small number (i.e. w_new ≈ w_old).

So a vanishing gradient problem can still occur.

1. Solves the dying ReLU problem at negative values.

2. It is easy to compute.

3. It is close to a zero-centered function.

4. Cannot handle complex classification well because of its near-linear behaviour.

When using ReLU-family activations, use a “He Normal” or “He Uniform” weight initialization and scale the input data to the range 0–1 (normalize) prior to training.
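A rough sketch (my own NumPy code, helper names are made up) of He-normal initialization, which scales the weight variance by the layer's fan-in, together with simple 0–1 scaling of the inputs:

```python
import numpy as np

def he_normal(fan_in, fan_out):
    # He normal: std = sqrt(2 / fan_in), compensating for ReLU zeroing half its inputs
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, std, size=(fan_in, fan_out))

def min_max_scale(X):
    # scale each input feature to the 0-1 range before training, as suggested above
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

W = he_normal(784, 128)                       # e.g. a 784 -> 128 ReLU layer
X = min_max_scale(np.random.rand(100, 784))   # e.g. normalize raw input features
print(W.std(), X.min(), X.max())              # ~0.05, 0.0, 1.0
```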

ELU (Exponential Linear Unit)

ELU is also proposed to solve the problems of ReLU.

f(x) = { x if x > 0,

alpha(e^x - 1) if x ≤ 0 }

f'(x) = { 1 if x > 0,

alpha * e^x if x < 0 }

It is theoretically better than ReLU, but in practice there is little evidence that it is consistently better.
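For concreteness, a short sketch (assumed NumPy code, with alpha = 1.0) of ELU and its derivative:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) otherwise: smooth and allows negative outputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=1.0):
    # 1 for x > 0, alpha * e^x otherwise (equivalently elu(x) + alpha)
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))             # negative inputs saturate toward -alpha instead of 0
print(elu_derivative(x))  # the gradient stays positive everywhere
```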

PROS:

1. Unlike ReLU, ELU can produce negative outputs.

CONS:

1. For x > 0, it can blow up the activation, with an output range of [0, inf).

2. Computationally expensive because of the exponential function.
