Back-Propagation: Definition, Neural Network Training, and Petroleum Applications

Back-propagation (also written as backpropagation or abbreviated as backprop) is the algorithm that trains artificial neural networks by computing the gradient of the network's prediction error with respect to every trainable weight and bias, propagating those gradients layer by layer from the output back through the hidden layers toward the input, and then adjusting each parameter in the direction that reduces the error. Popularised by Rumelhart, Hinton, and Williams in 1986, back-propagation is an application of the chain rule of calculus: the error gradient at any layer is the product of the local gradient at that layer and the error gradient flowing back from the layer nearer the output. This recursive calculation makes it computationally feasible to train networks with dozens of layers and millions of parameters, because each weight's gradient can be derived from quantities already computed in the forward pass (where inputs flow toward the output) and in the backward sweep, without rerunning the forward pass. In petroleum geoscience and engineering, back-propagation is the training engine behind most neural networks applied to wireline log interpretation, seismic facies classification, production decline analysis, and formation lithology prediction. The algorithm does not define what a network learns; it provides the mechanism by which the network fits whatever patterns the training data contains. A network trained to predict porosity from gamma-ray and density logs uses back-propagation to minimise the difference between its porosity outputs and the core-measured porosity values in the training dataset, updating its internal weight matrices until the prediction error falls below a specified threshold or stops improving.
Because back-propagation operates on differentiable functions, the choice of activation function in each neuron matters. Sigmoid and tanh activations were standard through the early 2000s but suffered from the vanishing gradient problem in deep networks, where gradients diminish exponentially as they are propagated back through many layers. The rectified linear unit (ReLU) activation, which passes positive values unchanged and zeros negative values, largely mitigated vanishing gradients and became the default activation in deep geoscience networks after approximately 2012. Modern oilfield machine learning pipelines combine back-propagation with the Adam optimiser, mini-batch gradient descent, dropout regularisation, and batch normalisation to train neural networks that, with proper validation, generalise from well-log training data to untested wells across a play area.
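
The vanishing gradient effect described above can be illustrated with a few lines of NumPy. This is a minimal sketch under simplifying assumptions, not a full network: it multiplies only the activation derivatives along a single path through 20 layers and ignores the weight matrices, and the pre-activation value of 0.5 at every layer is a hypothetical choice made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

def relu_derivative(z):
    return (np.asarray(z) > 0).astype(float)  # 1 for positive z, 0 otherwise

def gradient_scale(activation_derivative, pre_activations):
    """Product of the local activation derivatives along one path through the layers."""
    return float(np.prod(activation_derivative(pre_activations)))

# Hypothetical pre-activation value at each of 20 layers (all positive).
depth = 20
z = np.full(depth, 0.5)

sigmoid_scale = gradient_scale(sigmoid_derivative, z)
relu_scale = gradient_scale(relu_derivative, z)

print(f"sigmoid path scale after {depth} layers: {sigmoid_scale:.3e}")
print(f"ReLU path scale after {depth} layers:    {relu_scale:.1f}")
```

Under these assumptions the sigmoid path shrinks the gradient by many orders of magnitude, while the ReLU path leaves it unchanged as long as every unit on the path is active.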

Key Takeaways

  • Chain-rule gradient computation through every layer: Back-propagation works by applying the chain rule of calculus to compute how much each weight in the network contributed to the total prediction error. Starting from the output layer, where the prediction error (loss) is calculated by comparing the network's output to the known target value, back-propagation computes the partial derivative of the loss with respect to each output weight. It then passes the error signal back through those weights and multiplies it by the derivative of the activation function at the preceding layer to get the gradient for the weights in the second-to-last layer, and continues this sweep back to the input layer. Each weight receives a gradient that tells the optimiser whether increasing that weight would raise or lower the loss, and by how much. The optimiser (typically stochastic gradient descent or Adam) then, in its simplest form, subtracts the gradient scaled by the learning rate (typically 0.0001 to 0.01) from each weight, nudging the entire network toward lower prediction error. This process repeats for thousands to millions of iterations over the training dataset until the loss converges.
  • Forward pass and backward pass as complementary halves of training: Every training iteration consists of two passes through the network. The forward pass runs the input data (for example, seven wireline log values at a given depth sample) through each layer in sequence, applying weight matrix multiplication and activation functions, to produce a prediction at the output (for example, a porosity value or a facies class probability). The forward pass also stores the intermediate activation values at every layer, because these are needed during the backward pass to compute the local gradients. The backward pass then sweeps from output to input, computing the gradient for every weight, after which the optimiser applies the weight updates. The two passes together constitute one training step, and a complete sweep through the entire training dataset is called an epoch. Most oilfield neural-network training runs last 200 to 1,000 epochs, with the loss monitored on a held-out validation set to detect overfitting before it becomes severe enough to degrade predictions on new wells.
  • Vanishing gradient, ReLU, and deep network stability: The original formulation of back-propagation using sigmoid or tanh activations suffered from vanishing gradients in deep networks. Because the derivative of a sigmoid never exceeds 0.25 (and the tanh derivative falls toward zero as the unit saturates), multiplying many such derivatives together during the backward sweep produces gradients that shrink exponentially toward zero, so the weights in early layers receive essentially no update signal and fail to learn. This limited practical neural networks to two or three hidden layers for most of the 1990s and 2000s. The adoption of the ReLU activation (f(x) = max(0, x), derivative = 1 for positive x, 0 for negative x) in the early 2010s largely solved this: gradients in the positive-activation region pass through unchanged rather than being compressed, allowing gradients to flow back through 10, 20, or 50 layers without vanishing. Variants including Leaky ReLU, ELU, and GELU address the problem of dying ReLU neurons (units permanently stuck at zero output) and are common in modern oilfield neural network architectures.
  • Petroleum geoscience applications: log interpretation and seismic facies: Back-propagation-trained networks are used in the Western Canada Sedimentary Basin (WCSB) for three primary tasks. The first is petrophysical property prediction: predicting porosity, permeability, water saturation, or total organic carbon (TOC) from wireline log suites in wells where core data is sparse. The network is trained on the wells where core measurements and logs both exist, learns the complex nonlinear relationships between logs and rock properties that conventional linear or empirical equations miss, and then predicts those properties in wells with logs but no core. The second is lithofacies classification: assigning each depth sample to a lithology class (for example, Montney A, B, C, D, or C and D mixed) based on the log signature, trained on wells with whole-core descriptions or image logs. The third is seismic facies classification: assigning 3D seismic voxels to geological facies categories based on amplitude, AVO gradient, instantaneous frequency, and coherence attributes, with training labels derived from wells tied to the seismic volume.
  • Overfitting, regularisation, and validation discipline: The most common failure mode of back-propagation-trained networks in petroleum geoscience is overfitting: the network memorises the training data, including its noise and measurement errors, rather than learning the underlying geological relationships. An overfitted network produces excellent metrics on the training wells and poor predictions on new wells. Overfitting is controlled through regularisation techniques that back-propagation accommodates naturally. L2 (weight decay) regularisation adds a penalty proportional to the sum of squared weights to the loss function, discouraging any single weight from becoming too large. Dropout randomly zeros a fraction (typically 20 to 50 percent) of neuron outputs during training, forcing the network to learn redundant representations. Early stopping halts training when the validation loss stops decreasing even as the training loss continues to fall. For oilfield applications where training wells typically number 5 to 30 (a small dataset by machine learning standards), these techniques are not optional refinements but essential practice, and results must always be validated on wells withheld from training before any operational decisions are based on network predictions.
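
The forward pass, backward pass, and weight update described in the takeaways above can be sketched in NumPy as follows. This is an illustrative sketch rather than a production implementation: the data are random stand-ins for log inputs and a core-measured target, the layer sizes are arbitrary, and the function names (`forward`, `backward`, `mse`) are hypothetical. A finite-difference check on one weight confirms that the chain-rule gradient matches a direct numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8 depth samples, 3 log inputs, 1 regression target.
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# One hidden layer (tanh) and a linear output layer.
W1 = rng.normal(scale=0.5, size=(3, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros((1, 1))

def forward(X, W1, b1, W2, b2):
    """Forward pass; returns the stored hidden activations and the prediction."""
    a1 = np.tanh(X @ W1 + b1)
    y_hat = a1 @ W2 + b2
    return a1, y_hat

def mse(y_hat, y):
    return float(np.mean((y_hat - y) ** 2))

def backward(X, y, a1, y_hat, W2):
    """Backward pass: chain rule from the loss back to every weight and bias."""
    d_yhat = 2.0 * (y_hat - y) / y_hat.size   # dLoss/dy_hat for MSE
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_a1 = d_yhat @ W2.T                      # error signal passed back through W2
    d_z1 = d_a1 * (1.0 - a1 ** 2)             # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2

a1, y_hat = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, y, a1, y_hat, W2)

# Finite-difference check on a single weight.
eps = 1e-6
W1_plus = W1.copy();  W1_plus[0, 0] += eps
W1_minus = W1.copy(); W1_minus[0, 0] -= eps
numeric = (mse(forward(X, W1_plus, b1, W2, b2)[1], y)
           - mse(forward(X, W1_minus, b1, W2, b2)[1], y)) / (2 * eps)

# One plain gradient descent step with a hypothetical learning rate.
lr = 0.01
loss_before = mse(y_hat, y)
W1_new, b1_new = W1 - lr * dW1, b1 - lr * db1
W2_new, b2_new = W2 - lr * dW2, b2 - lr * db2
loss_after = mse(forward(X, W1_new, b1_new, W2_new, b2_new)[1], y)

print("analytic vs numeric gradient:", dW1[0, 0], numeric)
print("loss before / after one step:", loss_before, loss_after)
```

Note that the backward pass reuses `a1` from the forward pass, which is why the forward pass must store its intermediate activations.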

Mathematical Framework and Optimiser Choices

The loss function chosen for back-propagation training defines what error the network minimises. For regression tasks such as porosity or permeability prediction, mean squared error (MSE) and mean absolute error (MAE) are the standard choices: MSE penalises large individual errors more heavily because of the squaring operation, while MAE penalises each error in proportion to its magnitude and is more robust to outliers, which in log data often correspond to bad-hole intervals flagged by a high caliper reading. For classification tasks such as lithofacies assignment, categorical cross-entropy is used, which measures the divergence between the network's predicted class probabilities and the one-hot encoded true class labels. The choice of loss function directly shapes the gradient signals that back-propagation sends through the network, so selecting a loss function appropriate to the prediction task and the statistical characteristics of the training data is one of the most consequential decisions in network design.
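
The sketch below shows how each loss determines the gradient that back-propagation starts from. It is a minimal illustration with hypothetical values: the porosity numbers (including one deliberately large outlier error) and the class logits are invented for the example, and the function names are not taken from any particular library.

```python
import numpy as np

def mse_loss_and_grad(y_hat, y):
    diff = y_hat - y
    return float(np.mean(diff ** 2)), 2.0 * diff / diff.size

def mae_loss_and_grad(y_hat, y):
    diff = y_hat - y
    return float(np.mean(np.abs(diff))), np.sign(diff) / diff.size

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss_and_grad(logits, one_hot):
    """Categorical cross-entropy on softmax outputs; gradient is w.r.t. the logits."""
    probs = softmax(logits)
    n = logits.shape[0]
    loss = float(-np.sum(one_hot * np.log(probs + 1e-12)) / n)
    return loss, (probs - one_hot) / n

# Hypothetical porosity fractions; the last prediction is an outlier (error of 0.20).
y = np.array([0.10, 0.12, 0.08, 0.11])
y_hat = np.array([0.11, 0.12, 0.09, 0.31])

mse_loss, mse_grad = mse_loss_and_grad(y_hat, y)
mae_loss, mae_grad = mae_loss_and_grad(y_hat, y)
print("MSE gradient:", mse_grad)  # outlier gradient is much larger than the others
print("MAE gradient:", mae_grad)  # every non-zero error gets the same magnitude

# Hypothetical three-class facies example with the true class at index 0.
logits = np.array([[2.0, 0.5, -1.0]])
one_hot = np.array([[1.0, 0.0, 0.0]])
ce_loss, ce_grad = cross_entropy_loss_and_grad(logits, one_hot)
print("cross-entropy loss and gradient:", ce_loss, ce_grad)
```

The MSE gradient grows with the size of the error, so the outlier dominates the update, whereas the MAE gradient has constant magnitude, which is the source of its robustness.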

The stochastic gradient descent (SGD) optimiser and its variants determine how the gradients computed by back-propagation are translated into weight updates. Vanilla SGD updates each weight by subtracting the product of the learning rate and the gradient, computed on a mini-batch of training samples, typically 16 to 128 samples in size. The Adam optimiser, which has become the default in most oilfield neural-network implementations, extends SGD by maintaining a per-weight running average of both the gradient and the squared gradient. Dividing the first average by the square root of the second gives each weight its own adaptive step size: parameters that receive infrequent updates take larger steps (helpful for weights connected to rare geological events), and parameters that receive frequent large-gradient updates take smaller steps (damping oscillation around the optimum). For most well-log and seismic attribute problems with several hundred to several thousand depth samples per well and 10 to 30 wells in the training set, Adam with a learning rate of 0.001 and a mini-batch size of 32 typically converges in 300 to 600 epochs without manual tuning of the learning rate schedule.
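
The difference between the two update rules can be sketched as follows. The Adam step uses the commonly cited default hyperparameters (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8); the two-weight example and its gradient values are hypothetical and chosen only to show the per-weight scaling.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Vanilla SGD: step is proportional to the raw gradient."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages, t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical weights whose gradients differ by four orders of magnitude.
w0 = np.array([1.0, 1.0])
grad = np.array([10.0, 0.001])

w_sgd = sgd_step(w0, grad)
w_adam, m, v = adam_step(w0, grad, m=np.zeros(2), v=np.zeros(2), t=1)

print("SGD step sizes: ", w0 - w_sgd)    # differ by a factor of 10,000
print("Adam step sizes:", w0 - w_adam)   # both close to the learning rate
```

On the first step Adam moves both weights by roughly the learning rate regardless of gradient magnitude, which illustrates why it needs less per-problem tuning than plain SGD.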

Architecture choices interact with back-propagation efficiency. Fully connected (dense) networks are the most common architecture for depth-sample-wise log property prediction, where each depth sample is treated independently and the spatial context of adjacent depth samples is captured by using a window of log values centred on the sample as the input vector. Recurrent architectures (LSTM, GRU) handle the sequential nature of log data more explicitly by feeding hidden state from one depth sample to the next, but the back-propagation-through-time procedure required to train recurrent networks adds complexity that is rarely justified for Montney or Duvernay log interpretation, where the key geological signals span 0.1 to 5 m depth intervals, well within the receptive field of a standard window-based dense network. Convolutional neural networks (CNNs) applied to 2D or 3D seismic attribute volumes are trained by the same back-propagation algorithm, with the gradient for each convolutional kernel weight accumulated over every spatial position where that weight is applied. Because the kernel weights are shared spatially, a CNN has far fewer trainable parameters than a dense network of comparable depth, which enables effective seismic facies mapping at the 3D volume scale.
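
The window-based input construction for a dense network can be sketched as below. The helper name `make_windows`, the window half-width, and the toy log array are assumptions made for illustration; real log suites would also need normalisation and handling of the samples near the top and bottom of the logged interval, which this sketch simply drops.

```python
import numpy as np

def make_windows(logs, half_width):
    """Build one flattened input vector per depth sample from a centred window.

    logs: array of shape (n_depth, n_logs).
    Returns an array of shape (n_depth - 2 * half_width, (2 * half_width + 1) * n_logs).
    Row i is centred on depth index i + half_width; edge samples without a
    full window are omitted.
    """
    n_depth, n_logs = logs.shape
    width = 2 * half_width + 1
    return np.stack([logs[i:i + width].ravel()
                     for i in range(n_depth - width + 1)])

# Toy example: 10 depth samples of 3 logs, window of 5 samples (half_width = 2).
logs = np.arange(30, dtype=float).reshape(10, 3)
windows = make_windows(logs, half_width=2)

print(windows.shape)   # (6, 15)
print(windows[0])      # values from depth indices 0 to 4, centred on index 2
```

Each row can then be fed to a dense network as an independent training sample, so ordinary back-propagation applies without any recurrent machinery.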

Interpretability of back-propagation-trained networks is an active concern in petroleum applications because a network that cannot explain its predictions provides limited value to geoscientists who need to understand and justify reservoir models. Gradient-based attribution methods, including saliency maps and integrated gradients, reuse the back-propagation machinery to compute the sensitivity of a network's output prediction to each input log value, producing a per-input importance score that reveals which logs are driving the prediction at each depth. In Montney lithofacies classification, such attribution maps typically show that the photo-electric factor (Pe), density (RHOB), and neutron porosity (NPHI) logs carry the highest attribution for distinguishing Montney A silica-rich siltstone from Montney B carbonate-cemented intervals. This is geologically consistent with the known mineralogical differences between the sub-members and supports the view that the network is learning real geology rather than noise.
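
A basic saliency computation, the simplest of the gradient-based attribution methods mentioned above, can be sketched by back-propagating from the output to the inputs instead of to the weights. Everything specific here is an assumption for illustration: the network has random, untrained weights, a tanh hidden layer is used so that the finite-difference check is well behaved, and the three input values are hypothetical normalised Pe, RHOB, and NPHI readings. Integrated gradients is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tiny network: 3 log inputs -> 5 hidden units (tanh) -> 1 output.
W1 = rng.normal(size=(3, 5)); b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1)); b2 = np.zeros(1)

def predict(x):
    h = np.tanh(x @ W1 + b1)
    return float((h @ W2 + b2)[0])

def saliency(x):
    """Gradient of the output with respect to each input, via the chain rule."""
    h = np.tanh(x @ W1 + b1)
    d_h = W2[:, 0]               # d output / d hidden activations
    d_z1 = d_h * (1.0 - h ** 2)  # back through the tanh non-linearity
    return W1 @ d_z1             # d output / d inputs

# Hypothetical normalised log values at one depth sample, in the order Pe, RHOB, NPHI.
log_names = ["PE", "RHOB", "NPHI"]
x = np.array([0.6, -0.3, 0.2])

sal = saliency(x)

# Finite-difference check of each input sensitivity.
eps = 1e-5
numeric = np.array([
    (predict(x + eps * np.eye(3)[i]) - predict(x - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

for name, s in zip(log_names, sal):
    print(f"{name}: d(output)/d(input) = {s:+.4f}")
```

The magnitude of each entry indicates how strongly the prediction responds to a small change in that log at this depth; repeating the calculation down the well yields the per-depth attribution map.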