Neural Networks

Strip away the mystique and a neural network is a function: numbers go in, numbers come out, and a pile of tunable parameters in the middle decides the mapping. “Deep” just means there are many layers between input and output.

The neuron

The basic unit takes several inputs, multiplies each by a weight, sums them, adds a bias, and passes the result through an activation function.

The weights and bias are the parameters — the numbers training adjusts. Alone, a neuron is just weighted-sum-then-squash. The power comes from wiring many of them together.
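As a minimal sketch of the description above, a single neuron fits in a few lines of plain Python. The names `neuron` and `sigmoid`, the choice of sigmoid as the squashing function, and the sample numbers are all illustrative assumptions rather than any library's API.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Illustrative values only: three inputs, three weights, one bias.
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
```

Training would adjust `weights` and `bias`; the structure of the computation stays the same.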

Layers

Neurons are organized into layers, and the output of one layer feeds the next:

Input layer — your raw data as numbers (pixel values, token IDs).
Hidden layers — the “deep” middle. Each layer transforms the previous layer’s output into a more abstract representation.
Output layer — the final answer: one number for regression, a probability per class for classification.

The hierarchy is the point. On images, early layers detect edges, middle layers assemble them into shapes, later layers recognize objects. Nobody programs this — the layers discover it during training. That’s representation learning.

Activation functions

Without activation functions, stacking layers is pointless: a chain of pure weighted sums collapses into a single weighted sum, and the network can only draw straight lines. The activation function injects non-linearity, letting the network model curves, interactions, and complex boundaries.
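The collapse can be checked numerically. The sketch below assumes two small layers with made-up weights and no biases (with biases the same argument holds, the merged layer just gains a merged bias). The helper names `matmul` and `relu_rows` are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu_rows(M):
    """Apply max(0, v) to every entry of a matrix."""
    return [[max(0.0, v) for v in row] for row in M]

# Hypothetical weights for two layers with no activation and no bias.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 0.0], [1.0, 1.0]]
x = [[2.0, 5.0]]  # one input row with two features

two_layers = matmul(matmul(x, W1), W2)   # apply layer 1, then layer 2
one_layer = matmul(x, matmul(W1, W2))    # apply a single pre-merged layer

# With a ReLU between the layers, the merged matrix no longer reproduces the result.
with_relu = matmul(relu_rows(matmul(x, W1)), W2)
```

Here `two_layers` and `one_layer` are identical, so the second layer added nothing. `with_relu` differs because the activation zeroed a negative intermediate value before the second layer saw it.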

Function	Behavior	Where used
ReLU	`max(0, x)` — zero out negatives	Default for hidden layers
GELU	A smooth ReLU variant	Hidden layers in transformers/LLMs
Sigmoid	Squashes to 0–1	Binary classification output
Softmax	Turns scores into a probability distribution	Multi-class output
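The four functions in the table can be sketched with only the standard library. These are scalar (or plain-list) versions for illustration; real frameworks apply them elementwise to whole tensors. The GELU shown is the exact form based on the normal CDF, and the max-subtraction in `softmax` is a common numerical-stability trick, not something the table above prescribes.

```python
import math

def relu(x):
    """Zero out negatives, pass positives through unchanged."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max so exp() cannot overflow on large scores
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative scores for three classes
```

The largest score gets the largest probability, and the ordering of the scores is preserved.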

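ReLU’s popularity is mostly practical: it’s trivial to compute, and its gradient is exactly 1 for any positive input, so the activation itself does not shrink the training signal as it passes back through many layers. Saturating functions such as sigmoid do shrink it, which is why deep ReLU networks tend to train faster.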
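A rough way to see the gradient argument is to multiply activation slopes across layers. This is a deliberately simplified sketch: it ignores the weight matrices that also appear in the chain rule, and `depth = 10` is an arbitrary illustrative choice.

```python
import math

def sigmoid_grad(x):
    """Derivative of sigmoid; its maximum value is 0.25, reached at x == 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 10  # hypothetical number of stacked layers

# Best case for sigmoid: every input sits at 0, where the slope is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth

# ReLU with positive inputs at every layer: the slope stays at 1.
relu_chain = relu_grad(1.0) ** depth
```

Even in its best case the sigmoid chain shrinks by a factor of four per layer, while the ReLU chain does not shrink at all on the positive side.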

The forward pass

Making a prediction is the forward pass — data flows input → hidden → output:

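# A tiny network, conceptually. W = weight matrix, b = bias vector.
import numpy as np

def relu(z):
    return np.maximum(0, z)   # zero out negatives, elementwise

def forward(x, W1, b1, W2, b2, W3, b3):
    h1 = relu(x  @ W1 + b1)   # input  -> hidden layer 1
    h2 = relu(h1 @ W2 + b2)   # hidden -> hidden layer 2
    return     h2 @ W3 + b3   # hidden -> output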

Each @ is a matrix multiplication. That’s the whole secret of why GPUs matter: a neural network is mostly giant matrix multiplications, and GPUs do those in massively parallel fashion. An LLM is this same pattern — just with billions of parameters and a smarter layer design.
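To make the parallelism claim concrete, here is what one matrix multiplication does when written out as loops. This is a plain-Python sketch with a hypothetical helper name `matmul` and made-up numbers; real libraries dispatch to optimized kernels rather than looping like this.

```python
def matmul(A, B):
    """Plain-Python matrix multiply: A is n x k, B is k x m, result is n x m."""
    n, k, m = len(A), len(B), len(B[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):          # each row of the input
        for j in range(m):      # each output neuron
            # One output entry is one dot product. It depends on no other
            # entry, so all n * m of them could be computed at the same time.
            out[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return out

x = [[1.0, 2.0, 3.0]]                       # one example, three features
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three inputs -> two neurons
h = matmul(x, W)
```

Because no output entry waits on another, hardware with thousands of arithmetic units can work on many entries simultaneously.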

Parameters and scale

A model’s parameter count is the total number of weights and biases. It is a rough proxy for capacity:

Small networks: thousands to millions of parameters.
Modern LLMs: billions to hundreds of billions.
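For a plain fully connected network, the count follows directly from the layer widths. The sketch below assumes every layer is dense with one bias per neuron; the function name `count_parameters` and the example layer sizes are illustrative, and other layer types (convolutions, attention) count differently.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # one weight per input-output connection
        total += n_out          # one bias per neuron in the receiving layer
    return total

# Hypothetical network: 784 inputs, two hidden layers, 10 outputs.
small = count_parameters([784, 128, 64, 10])
```

This example comes to roughly 109 thousand parameters, which sits in the "small networks" range above.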

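More parameters means more capacity to learn, and also more data, compute, and memory to train and run. When you see “7B” or “70B” next to a model name, that’s billions of parameters, and it gives a first estimate of the GPU memory you’ll need to serve it: the weights alone take roughly the parameter count times the bytes stored per parameter, before activations and other runtime overhead. See AI Infrastructure.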
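The memory arithmetic can be sketched as a back-of-the-envelope estimate. It covers the weights only; activations, caches, and framework overhead come on top, so real requirements are higher. The function name `weight_memory_gb` is hypothetical, and the defaults assume 16-bit weights (2 bytes each) and 1 GB = 10^9 bytes.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

mem_7b = weight_memory_gb(7e9)     # 7B parameters at 16-bit precision
mem_70b = weight_memory_gb(70e9)   # 70B parameters at 16-bit precision

# Storing each weight in 4 bits (half a byte) shrinks the footprint fourfold.
mem_70b_4bit = weight_memory_gb(70e9, bytes_per_parameter=0.5)
```

Under these assumptions a 7B model needs about 14 GB for its weights and a 70B model about 140 GB, which is why the larger one typically has to be split across several GPUs or stored at lower precision.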

Key takeaways

A neural network is a function built from layers of neurons; each neuron does a weighted sum plus an activation. Activation functions add the non-linearity that makes depth useful. The forward pass is how a prediction is computed — a chain of matrix multiplications, which is why GPUs are essential. Parameter count is the model’s capacity dial, and it sets the hardware bill.