What is a cost function?
-
In deep learning, a cost function, also known as a loss function, is a mathematical function that measures the difference between the predicted output of a model and the actual target output. The goal of training a deep learning model is to minimize the value of the cost function, so that the model's predictions are as close as possible to the target outputs.
-
The cost function is used to guide the optimization process during training by providing a measure of the error in the model's predictions. The optimization algorithm, such as stochastic gradient descent (SGD), updates the model's weights and biases in the direction that reduces the value of the cost function.
-
There are many different cost functions that can be used in deep learning, depending on the type of problem being solved and the type of model being used. Common examples include mean squared error for regression problems, cross-entropy for binary classification problems, and categorical cross-entropy for multi-class classification problems.
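A rough sketch of the two most common cases in plain NumPy (not tied to any particular framework; the function names are just illustrative):
```python
import numpy as np

# Mean squared error for regression: average squared difference
# between predictions and targets.
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy for classification: y_pred holds predicted
# probabilities of the positive class. A small epsilon avoids log(0).
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))               # 0.25
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # ~0.164
```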
Cost vs Loss function?
- Loss function is defined on a single training example.
- Cost function is defined on the entire training set.
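A minimal sketch of that distinction, again in plain NumPy with illustrative names:
```python
import numpy as np

# "Loss": squared error for a single training example.
def loss(y_true_i, y_pred_i):
    return (y_true_i - y_pred_i) ** 2

# "Cost": the mean of the per-example losses over the whole training set.
def cost(y_true, y_pred):
    return np.mean([loss(t, p) for t, p in zip(y_true, y_pred)])
```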
What are some Regression Loss Functions?
- Mean Absolute Error/L1
- Mean Squared Error/L2
- Root Mean Squared Error
- Mean Bias Error
- Huber Loss
- Mean Squared Logarithmic Error Loss
Choosing the right loss function for a regression problem depends on the specific requirements of the problem and the type of data being used. In general, MSE and MAE are a good starting point for most regression problems.
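A plain-NumPy sketch of a few of these (the Huber delta of 1.0 is just a common default, not a prescribed value):
```python
import numpy as np

def mae(y_true, y_pred):                      # Mean Absolute Error / L1
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):                      # Mean Squared Error / L2
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):                     # Root Mean Squared Error
    return np.sqrt(mse(y_true, y_pred))

def huber(y_true, y_pred, delta=1.0):         # quadratic near zero, linear for large errors
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small, 0.5 * err ** 2, delta * (np.abs(err) - 0.5 * delta)))
```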
What are some Classification Loss Functions?
- Cross-Entropy Loss (Binary/Categorical)
- Hinge Loss
- Squared Hinge Loss
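A rough sketch of these (binary cross-entropy works on predicted probabilities with labels in {0, 1}; the hinge losses work on raw scores with labels in {-1, +1}):
```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true in {0, 1}, p_pred = predicted probability of class 1
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

def hinge(y_true, scores):
    # y_true in {-1, +1}, scores = raw (unsquashed) model outputs
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def squared_hinge(y_true, scores):
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores) ** 2)
```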
What are different types of Multi-Class Classification Loss Functions?
- Binary Cross-Entropy (BCE)
- Categorical Cross-Entropy (CCE)
- Hinge Loss
- Kullback Leibler Divergence Loss
- Focal Loss
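A sketch of categorical cross-entropy and focal loss on a probability matrix (rows are examples, columns are classes; gamma=2.0 is the commonly quoted focusing value and is only an assumption here):
```python
import numpy as np

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    # y_onehot: (n, k) one-hot targets, probs: (n, k) predicted class probabilities
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

def focal_loss(y_onehot, probs, gamma=2.0, eps=1e-12):
    # Down-weights easy examples: (1 - p_t)^gamma scales the usual log term.
    probs = np.clip(probs, eps, 1.0)
    p_t = np.sum(y_onehot * probs, axis=1)    # probability of the true class
    return -np.mean((1.0 - p_t) ** gamma * np.log(p_t))
```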
Probability vs likelihood
Probability describes how likely an outcome is given fixed model parameters; likelihood treats the observed data as fixed and asks how plausible different parameter values are.
Maximum likelihood seeks to find the optimum values for the parameters by maximizing a likelihood function derived from the training data.
https://www.youtube.com/watch?v=pYxNSUDSFH4&ab_channel=StatQuestwithJoshStarmer
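A small numeric sketch of that idea, assuming Gaussian data with known variance (my simplification): scan candidate means, keep the one with the highest log-likelihood, and note that it lands on the sample mean.
```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=500)   # "training data"

def gaussian_log_likelihood(data, mu, sigma=1.0):
    # Sum of log N(x | mu, sigma) over all observations.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2))

candidates = np.linspace(0.0, 6.0, 601)
log_liks = [gaussian_log_likelihood(data, mu) for mu in candidates]
best_mu = candidates[int(np.argmax(log_liks))]

print(best_mu, data.mean())   # the maximizer sits at (essentially) the sample mean
```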
Why do we use log?
Because the log turns a product of many probabilities into a sum, which is numerically more stable and easier to differentiate; and because log is monotonic, maximizing the log-likelihood gives the same parameters as maximizing the likelihood itself.
What is logarithm? | Math, Statistics for data science, machine learning
What is information theory, entropy, cross-entropy, KL Divergence?
Information Theory:
Very well explained in the beginning of this video
Entropy:
It is a measure of how uncertain the events are. It tells you how unpredictable that probability distribution is. The formula for entropy is:
\(H(p) = -\sum_i p_i \log_2(p_i)\)
It gives the average amount of information you get from one sample drawn from a given probability distribution \(p\).
https://www.youtube.com/watch?v=YtebGVx-Fxw&ab_channel=StatQuestwithJoshStarmer
https://www.youtube.com/watch?v=IPkRVpXtbdY&ab_channel=mfschulte222
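A quick numeric check of the formula (a fair coin is maximally unpredictable, a heavily biased one is not):
```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))    # 1.0 bit   (fair coin)
print(entropy([0.9, 0.1]))    # ~0.469 bits (biased coin, more predictable)
```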
Cross Entropy
Cross entropy measures the average amount of information needed to encode samples drawn from \(p\) when using a code optimized for \(q\):
\(H(p, q) = -\sum_i p_i \log_2(q_i)\)
https://www.youtube.com/watch?v=tRsSi_sqXjI&ab_channel=Udacity
https://www.youtube.com/watch?v=bLb_Kp5Q9cw&ab_channel=MattYedlin
https://www.youtube.com/watch?v=TIL0BU6917o&ab_channel=DrJuanKlopper
KL Divergence
-
The amount by which the cross entropy exceeds the entropy is called the relative entropy, more commonly known as the KL Divergence.
-
Cross Entropy = Entropy + KL Divergence
\(D_{KL}(p||q) = H(p, q) - H(p)\)
-
It is a measure of the information lost when \(q\) is used to approximate \(p\).
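A small numeric check of that identity on two hand-picked distributions:
```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # approximation of p

entropy       = -np.sum(p * np.log2(p))
cross_entropy = -np.sum(p * np.log2(q))
kl_divergence =  np.sum(p * np.log2(p / q))

print(cross_entropy, entropy + kl_divergence)   # the two numbers match
```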
How are entropy and probability related?
Rare (low-probability) events are more surprising and carry more information (\(-\log p\) is large), so a distribution that spreads probability over many outcomes has high entropy, while one concentrated on a single outcome has entropy near zero.
https://www.analyticsvidhya.com/blog/2020/11/entropy-a-key-concept-for-all-data-science-beginners/
Categorical Cross Entropy Loss vs Sparse Categorical Cross Entropy Loss
The main difference is that categorical cross-entropy expects the targets as one-hot encoded vectors, whereas sparse categorical cross-entropy expects them as integer class indices.
Sparse categorical cross-entropy performs the same cross-entropy calculation of error without requiring the target variable to be one-hot encoded before training, which also helps in multi-class problems where one-hot targets for many classes would strain memory.
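A minimal sketch of the difference in target format (plain NumPy rather than any specific framework's loss classes); both paths give the same number:
```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])           # predicted class probabilities

onehot_targets = np.array([[1, 0, 0],
                           [0, 1, 0]])        # categorical cross-entropy targets
int_targets = np.array([0, 1])                # sparse categorical cross-entropy targets

cce  = -np.mean(np.sum(onehot_targets * np.log(probs), axis=1))
scce = -np.mean(np.log(probs[np.arange(len(int_targets)), int_targets]))

print(cce, scce)   # identical values; only the target encoding differs
```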