Skylake is an Intel processor microarchitecture released in 2015.
The Intel Processor
In 1978 Intel committed to backward compatibility for every subsequent processor. At that time they had the 8086, a 16-bit processor. Because of backward compatibility, the instruction set has usually only grown. Intel uses geographic locations as code names because place names cannot be sued over. If we want new code to run on old processors, we need to pass specific compiler flags (restricting the instruction set).
The microarchitecture is just the implementation of the architecture (the instruction set).
- We already covered the details of the CPU cache.
Types of operations
- Vector (SIMD) instructions (see Central Processing Unit).
- Graphics processors and similar devices.
- MMX was Intel's first SIMD extension; Sandy Bridge (with AVX) doubled the vector width for floating point.
- AVX-512 (the most recent, on Skylake): 16-way single precision and 8-way double precision.
- FMA (Fused Multiply-Add): a·b + c is computed as a single operation, which is also more precise because there is a single rounding instead of two.
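The single-rounding advantage of FMA can be demonstrated without special hardware. The sketch below emulates an exact fused multiply-add with `fractions.Fraction` (the emulation is my own illustration, not how FMA is implemented) and compares it with the ordinary two-rounding sequence:

```python
from fractions import Fraction

def fma_exact(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly,
    then round only once (to the nearest double)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# a*a = 2**54 + 2**28 + 1 needs 55 significant bits, so it cannot
# be stored exactly in a 53-bit double mantissa and must be rounded.
a = 2.0**27 + 1.0
c = -(2.0**54)

two_roundings = a * a + c   # round a*a first, then round the addition
one_rounding = fma_exact(a, a, c)

print(two_roundings)  # 268435456.0 -> the +1 was lost when a*a rounded
print(one_rounding)   # 268435457.0 -> the exact result survives
```

With two roundings the low-order bit of the product is lost before the cancellation in the addition; the fused version keeps it.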
Out of Order Execution
The Skylake microarchitecture has a pool of 224 instructions (the reorder buffer): instructions already loaded into the pool are executed out of order when possible.
Superscalar processors
Superscalar processors are a class of CPUs that can execute multiple instructions per clock cycle by leveraging multiple execution units. Unlike scalar processors, which handle one instruction at a time, superscalar architectures use instruction-level parallelism (ILP) to dispatch and execute several independent instructions simultaneously. This is achieved through techniques like out-of-order execution, register renaming, and branch prediction, allowing efficient utilization of CPU resources.
Modern processors, including those based on Intel’s Skylake and AMD’s Zen architectures, are heavily superscalar, often issuing four or more instructions per cycle to maximize performance. Intel processors have been superscalar since the Pentium Pro. However, achieving high throughput depends on software optimizations that expose parallelism and minimize pipeline stalls. These processors are also very expensive.
- Each port can execute instructions in parallel in the same clock cycle.
- The ideal efficiency takes this into account, as if many instructions started in parallel on the same port (which is clearly impossible in reality).

Example of a superscalar processor.
When an instruction has a throughput of 1 (one result per cycle) behind a single port, the unit is fully pipelined. Division, by contrast, is not fully pipelined.
Floating point registers
They are 256-bit registers.
Scalar (non-vector) double-precision FP code uses the bottom quarter of the register. This explains why throughput and latency are usually the same for vector and scalar operations: using vectors is just parallel execution over the same register.
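The lane counts quoted above (and the AVX-512 figures earlier) all follow from dividing the register width by the element width. A trivial sketch of that arithmetic:

```python
def lanes(register_bits, element_bits):
    """Number of elements a vector register processes in parallel."""
    return register_bits // element_bits

# 256-bit AVX registers, as described here:
print(lanes(256, 64))   # 4-way double precision
print(lanes(256, 32))   # 8-way single precision

# 512-bit AVX-512 registers:
print(lanes(512, 64))   # 8-way double precision
print(lanes(512, 32))   # 16-way single precision

# Scalar double-precision code uses one 64-bit lane,
# i.e. the bottom quarter of a 256-bit register:
print(64 / 256)         # 0.25
```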
Intel Tick-Tock Model
Intel’s Tick-Tock model was a processor development strategy introduced in 2007 to maintain Moore’s Law by ensuring continuous performance improvements. It followed a two-phase cadence, alternating between technological advancements in process technology and microarchitecture upgrades:
- Tick Phase – Shrinking the fabrication process:
- The existing microarchitecture was shrunk to a smaller manufacturing node (e.g., from 45nm to 32nm).
- This led to increased efficiency, lower power consumption, and slight performance gains.
- Example: The transition from the 65nm Core 2 (Merom) to the 45nm Core 2 (Penryn).
- Eventually this phase stopped working: we could no longer produce smaller chips, as packing more transistors became more and more difficult.
- Tock Phase – Introducing a new micro-architecture:
- A major redesign of the CPU architecture was implemented while maintaining the same manufacturing process.
- This phase led to significant performance improvements, better instruction sets, and architectural innovations.
- Example: The move from the 45nm Core 2 (Penryn) to the 45nm Nehalem, which introduced Hyper-Threading and an integrated memory controller.
This predictable cycle allowed Intel to continuously improve processor performance while reducing development risks. However, as semiconductor manufacturing faced increasing challenges in scaling, the Tick-Tock model was phased out around 2016 (with Skylake), replaced by the Process-Architecture-Optimization strategy, a three-phase cadence (tick, tock, optimization) that adds an extra step to extend the lifespan of each process node. We reached 7 nanometers recently (Intel 7 is actually a 10nm process, but claims the performance of 7nm chips).
Operation intensity
Operation intensity is defined as
$$ \frac{\text{Operations}}{\text{Data}} $$
- High operation intensity means that we have a lot of operations to do on a small amount of data. Techniques to make it faster are:
- Reduce cache misses
- Keep floating point units busy
- Instruction level parallelism
- Vectorization
- Low operation intensity means that we have a small number of operations to do on a large amount of data. Techniques to make it faster are:
- Compress Data
- Reduce data movement
- Reduce cache misses
Usually the numerator is measured in flops and the data in bytes.
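As a worked illustration of the ratio, here is a back-of-the-envelope calculation for two kernels under a simplified model I am assuming (every operand moves through memory exactly once, with 8-byte doubles):

```python
def operational_intensity(flops, bytes_moved):
    """Operation intensity = flops / bytes of data moved."""
    return flops / bytes_moved

# Dot product of two length-n double vectors: each element is used
# once, so the flop count grows no faster than the data.
n = 1_000_000
dot_flops = 2 * n          # one multiply + one add per element
dot_bytes = 2 * n * 8      # two input vectors of 8-byte doubles
print(operational_intensity(dot_flops, dot_bytes))   # 0.125 -> low

# Naive m x m matrix multiply: each element is reused m times,
# so flops grow cubically while data grows only quadratically.
m = 1_000
mm_flops = 2 * m**3        # a multiply-add for each of m^3 triples
mm_bytes = 3 * m**2 * 8    # read A and B, write C, once each
print(operational_intensity(mm_flops, mm_bytes))     # ~83.3 -> high
```

The dot product sits in the low-intensity regime (data movement dominates), while matrix multiply sits in the high-intensity regime (keeping the floating-point units busy is what matters).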