Neural Networks
Strip away the mystique and a neural network is a function: numbers go in, numbers come out, and a pile of tunable parameters in the middle decides the mapping. “Deep” just means there are many layers between input and output.
The neuron
The basic unit takes several inputs, multiplies each by a weight, sums them, adds a bias, and passes the result through an activation function.
The weights and bias are the parameters — the numbers training adjusts. Alone, a neuron is just weighted-sum-then-squash. The power comes from wiring many of them together.
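A minimal sketch of that weighted-sum-then-squash step in plain Python. The function names, the choice of sigmoid as the activation, and the input values are all illustrative assumptions, not part of any particular library.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values only: three inputs, three weights, one bias.
# Weighted sum: 1.0*0.5 + 2.0*(-0.25) + 3.0*0.1 + 0.2 = 0.5
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(output)  # sigmoid(0.5), roughly 0.622
```

Training would nudge the three weights and the bias; the structure of the computation stays the same.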
Layers
Neurons are organized into layers, and the output of one layer feeds the next:
- Input layer — your raw data as numbers (pixel values, token IDs).
- Hidden layers — the “deep” middle. Each layer transforms the previous layer’s output into a more abstract representation.
- Output layer — the final answer: one number for regression, a probability per class for classification.
The hierarchy is the point. On images, early layers detect edges, middle layers assemble them into shapes, later layers recognize objects. Nobody programs this — the layers discover it during training. That’s representation learning.
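One way to picture a single layer is as several neurons reading the same inputs at once. The sketch below uses NumPy with made-up weights (each column of `W` is assumed to hold one neuron's weights) and checks that one matrix multiplication gives the same numbers as computing each neuron separately. The activation is left out to keep the comparison simple.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # one example with 3 input features

# Hypothetical weights: each column is one neuron's weight vector.
W = np.array([[0.5, -1.0],
              [0.0,  1.0],
              [1.0,  0.5]])
b = np.array([0.1, -0.2])      # one bias per neuron

# The whole layer in a single matrix multiplication.
layer_out = x @ W + b

# The same result, computed one neuron at a time.
per_neuron = np.array([x @ W[:, j] + b[j] for j in range(W.shape[1])])

print(layer_out)   # two values, one per neuron
```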
Activation functions
Without activation functions, stacking layers is pointless: a chain of pure weighted sums collapses into a single weighted sum, and the network can only draw straight lines. The activation function injects non-linearity, letting the network model curves, interactions, and complex boundaries.
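The collapse is easy to check numerically. The sketch below uses NumPy with random weights (the shapes and the seed are arbitrary assumptions) to show that two stacked layers with no activation equal one layer with merged weights, and that inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # 5 examples, 3 features each

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked layers with no activation in between...
stacked = (x @ W1 + b1) @ W2 + b2

# ...are the same as one layer with merged weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b

# A ReLU between the layers means no single merged layer reproduces the output.
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2

print(np.allclose(stacked, single))    # the linear stack collapses
print(np.allclose(with_relu, single))  # the non-linear stack does not
```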
| Function | Behavior | Where used |
|---|---|---|
| ReLU | max(0, x) — zero out negatives | Default for hidden layers |
| GELU | A smooth ReLU variant | Hidden layers in transformers/LLMs |
| Sigmoid | Squashes to 0–1 | Binary classification output |
| Softmax | Turns scores into a probability distribution | Multi-class output |
ReLU’s popularity is mostly practical: it’s trivial to compute and its gradient doesn’t vanish for positive inputs, so deep networks train faster.
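For reference, the four functions in the table can be written in a few lines of plain Python. This is a sketch for scalar inputs (and a list of scores for softmax), not how a framework implements them. GELU is written in its exact form using the Gaussian CDF, and softmax subtracts the maximum score first, a common trick to avoid overflow.

```python
import math

def relu(x: float) -> float:
    """max(0, x): negatives become zero, positives pass through."""
    return max(0.0, x)

def gelu(x: float) -> float:
    """x times the standard normal CDF at x: a smooth ReLU-like curve."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(softmax([1.0, 2.0, 3.0]))
```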
The forward pass
Making a prediction is called the forward pass. Data flows input → hidden → output:

```python
# A tiny network, conceptually. W = weight matrix, b = bias vector.
def forward(x):
    h1 = relu(x @ W1 + b1)   # input -> hidden layer 1
    h2 = relu(h1 @ W2 + b2)  # hidden -> hidden layer 2
    return h2 @ W3 + b3      # hidden -> output
```

Each @ is a matrix multiplication. That’s the whole secret of why GPUs matter: a neural network is mostly giant matrix multiplications, and GPUs do those in massively parallel fashion. An LLM is this same pattern, just with billions of parameters and a smarter layer design.
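A runnable version of the same forward pass, as a sketch using NumPy. The layer sizes, the random initialization, and the `params` list are assumptions made up for illustration; a trained network would load learned weights instead.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 8 hidden -> 8 hidden -> 3 outputs.
sizes = [4, 8, 8, 3]
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def forward(x, params):
    """ReLU after every hidden layer, no activation on the output layer."""
    *hidden, (W_out, b_out) = params
    for W, b in hidden:
        x = relu(x @ W + b)
    return x @ W_out + b_out

batch = rng.normal(size=(2, 4))  # 2 examples, 4 features each
out = forward(batch, params)
print(out.shape)                 # (2, 3): one row of 3 outputs per example
```

Passing a whole batch at once is what turns the per-example arithmetic into the large matrix multiplications that parallel hardware handles well.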
Parameters and scale
A model’s parameter count is the total number of weights and biases. It is a rough proxy for capacity:
- Small networks: thousands to millions of parameters.
- Modern LLMs: billions to hundreds of billions.
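More parameters means more capacity to learn, and more data, compute, and memory required to train and run. When you see “7B” or “70B” next to a model name, that’s billions of parameters, and it sets a floor on the GPU memory you’ll need to serve it: the weights alone take roughly the parameter count times the bytes per parameter, so a 7B model stored at 16-bit precision needs about 14 GB before any runtime overhead. See AI Infrastructure.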
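A back-of-the-envelope sketch of both ideas: counting the parameters of a small fully connected network, and estimating the memory needed just to hold a model's weights. The helper names are hypothetical, and the 2 bytes per parameter assumes 16-bit weights. Real serving needs extra memory on top for activations and other runtime state, so treat the result as a lower bound.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (assumes 16-bit weights by default)."""
    return num_params * bytes_per_param / 1e9

# A tiny network, 4 -> 8 -> 8 -> 3: (4*8+8) + (8*8+8) + (8*3+3) = 139
small = count_parameters([4, 8, 8, 3])
print(small)                    # 139

print(weight_memory_gb(7e9))    # 14.0  (a "7B" model at 16-bit)
print(weight_memory_gb(70e9))   # 140.0 (a "70B" model at 16-bit)
```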
Key takeaways
A neural network is a function built from layers of neurons; each neuron does a weighted sum plus an activation. Activation functions add the non-linearity that makes depth useful. The forward pass is how a prediction is computed: a chain of matrix multiplications, which is why GPUs are essential. Parameter count is the model’s capacity dial, and it sets the hardware bill.