Neural Networks

Strip away the mystique and a neural network is a function: numbers go in, numbers come out, and a pile of tunable parameters in the middle decides the mapping. “Deep” just means there are many layers between input and output.

The basic unit takes several inputs, multiplies each by a weight, sums them, adds a bias, and passes the result through an activation function.

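[Figure: a single neuron. Inputs x1, x2, x3 are multiplied by weights w1, w2, w3, combined into a weighted sum plus bias, and passed through an activation f() to produce the output.]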

The weights and bias are the parameters — the numbers training adjusts. Alone, a neuron is just weighted-sum-then-squash. The power comes from wiring many of them together.
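The single-neuron computation described above can be sketched in a few lines of plain Python. The function name `neuron` and the example numbers are illustrative assumptions, not values from a trained model; ReLU is used as the activation only because it is the simplest to read.

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)

def relu(z):
    return max(0.0, z)

# Illustrative values only; in a real network, training sets the weights and bias.
output = neuron(
    inputs=[1.0, 2.0, 3.0],
    weights=[0.5, -0.25, 0.25],
    bias=0.25,
    activation=relu,
)
# 1.0*0.5 + 2.0*(-0.25) + 3.0*0.25 = 0.75, plus bias 0.25 = 1.0, and relu(1.0) = 1.0
```

Changing the bias to -2.0 pushes the sum below zero, and ReLU then returns 0.0: the same inputs, a different mapping, purely because a parameter moved.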

Neurons are organized into layers, and the output of one layer feeds the next:

  • Input layer — your raw data as numbers (pixel values, token IDs).
  • Hidden layers — the “deep” middle. Each layer transforms the previous layer’s output into a more abstract representation.
  • Output layer — the final answer: one number for regression, a probability per class for classification.
[Figure: Input (raw data) → Hidden (simple features) → Hidden (richer features) → Output (prediction)]

The hierarchy is the point. On images, early layers detect edges, middle layers assemble them into shapes, later layers recognize objects. Nobody programs this — the layers discover it during training. That’s representation learning.

Without activation functions, stacking layers is pointless: a chain of pure weighted sums collapses into a single weighted sum, and the network can only draw straight lines. The activation function injects non-linearity, letting the network model curves, interactions, and complex boundaries.
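The collapse can be checked numerically. The sketch below assumes NumPy and uses random, made-up layer sizes; it merges two activation-free layers into one and confirms the outputs match, then shows that inserting a ReLU breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                       # 4 examples, 3 features each
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked layers with no activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...equal a single layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = x @ W_merged + b_merged
layers_collapse = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the merged single layer no longer matches.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
relu_breaks_collapse = not np.allclose(with_relu, one_layer)
```

The algebra behind it is just distributing the second multiplication: (x @ W1 + b1) @ W2 + b2 = x @ (W1 @ W2) + (b1 @ W2 + b2).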

Function   Behavior                                       Where used
ReLU       max(0, x): zero out negatives                  Default for hidden layers
GELU       A smooth ReLU variant                          Hidden layers in transformers/LLMs
Sigmoid    Squashes to 0–1                                Binary classification output
Softmax    Turns scores into a probability distribution   Multi-class output

ReLU’s popularity is mostly practical: it’s trivial to compute and its gradient doesn’t vanish for positive inputs, so deep networks train faster.
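For reference, the four functions in the table can be written out as plain-Python sketches. These are scalar versions using only the standard `math` module, meant for reading rather than speed; a framework would provide vectorized and numerically hardened equivalents. The GELU shown is the exact form (x times the standard normal CDF); many libraries also offer a faster approximation.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact form: x multiplied by the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    # Subtracting the max first avoids overflow in exp() and leaves the result unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # three probabilities that sum to 1, largest score wins
```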

Making a prediction is the forward pass — data flows input → hidden → output:

# A tiny network, conceptually. W = weight matrix, b = bias vector.
# Assumes W1..W3, b1..b3 (NumPy arrays) and a relu() helper are defined elsewhere.
def forward(x):
    h1 = relu(x @ W1 + b1)   # input -> hidden layer 1
    h2 = relu(h1 @ W2 + b2)  # hidden layer 1 -> hidden layer 2
    return h2 @ W3 + b3      # hidden layer 2 -> output

Each @ is a matrix multiplication. That’s the whole secret of why GPUs matter: a neural network is mostly giant matrix multiplications, and GPUs do those in massively parallel fashion. An LLM is this same pattern — just with billions of parameters and a smarter layer design.
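To see the shapes move through the network, here is a runnable version of the same forward pass. It assumes NumPy; the layer sizes (4 inputs, two hidden layers of 8, 3 outputs) and the random weights are hypothetical stand-ins for what training would produce.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hypothetical layer sizes: 4 input features -> 8 -> 8 -> 3 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x):
    h1 = relu(x @ W1 + b1)   # (batch, 4) @ (4, 8) -> (batch, 8)
    h2 = relu(h1 @ W2 + b2)  # (batch, 8) @ (8, 8) -> (batch, 8)
    return h2 @ W3 + b3      # (batch, 8) @ (8, 3) -> (batch, 3)

batch = rng.normal(size=(5, 4))  # 5 examples, 4 features each
outputs = forward(batch)         # shape (5, 3): one row of 3 scores per example
```

Passing a whole batch at once is the usual design choice: the same three matrix multiplications handle 5 examples or 5,000, which is exactly the kind of work a GPU parallelizes well.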

A model’s parameter count is the total number of weights and biases. It is a rough proxy for capacity:

  • Small networks: thousands to millions of parameters.
  • Modern LLMs: billions to hundreds of billions.

More parameters means more capacity to learn, and more data, compute, and memory required to train and run. When you see “7B” or “70B” next to a model name, that’s billions of parameters, and it gives a first estimate of the GPU memory you’ll need to serve it: at 16-bit precision each parameter takes 2 bytes, so a 7B model needs about 14 GB for its weights alone. See AI Infrastructure.
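Both numbers are simple arithmetic. The sketch below counts parameters for a fully connected network and estimates the memory its weights occupy. The helper names are hypothetical, and the memory figure assumes 2 bytes per parameter (16-bit precision) and covers the weights only, not activations or caches.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone; 2 bytes per parameter assumes 16-bit precision."""
    return n_params * bytes_per_param / 1e9

small = count_parameters([4, 8, 8, 3])  # (4*8+8) + (8*8+8) + (8*3+3) = 139
seven_b = weight_memory_gb(7e9)         # 14.0 GB for a 7B model at 16-bit precision
```

Halving the precision halves the estimate, which is why `bytes_per_param` is a parameter rather than a constant.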

A neural network is a function built from layers of neurons; each neuron does a weighted sum plus an activation. Activation functions add the non-linearity that makes depth useful. The forward pass is how a prediction is computed — a chain of matrix multiplications, which is why GPUs are essential. Parameter count is the model’s capacity dial, and it sets the hardware bill.