Core i7 Architecture
To optimize performance and reduce power consumption, Nehalem takes power management to new heights for Intel. Not only can it run on just one underclocked core when the computer is lightly loaded, it can also automatically overclock anywhere from one to four cores when more performance is needed! Intel hopes to use the same Nehalem design to address the needs of servers, mobile computing, and desktop/workstation computers.
You read that right: Core i7 chips automatically overclock themselves to a certain degree when needed. Mind you, the overclock is not much at this time - just one multiplier step when all four cores are busy - but the chip can potentially overclock fewer cores even higher, as long as the processor as a whole stays within its rated TDP envelope.
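The turbo behavior can be sketched as a tiny model. Only the one-step boost with all four cores busy and the TDP condition come from the text above; the function name, the two-step boost for a single busy core, and the base multiplier used in the usage note are illustrative assumptions, not Intel's published turbo table.

```python
def turbo_multiplier(base_multiplier: int, active_cores: int, within_tdp: bool) -> int:
    """Toy model of Nehalem-style turbo: return the multiplier to run at."""
    if not 1 <= active_cores <= 4:
        raise ValueError("active_cores must be between 1 and 4")
    if not within_tdp:
        # No headroom left in the TDP envelope: stay at the rated multiplier.
        return base_multiplier
    # Assumption for illustration: a single busy core gets two extra steps,
    # any other number of busy cores gets one extra step.
    extra_steps = 2 if active_cores == 1 else 1
    return base_multiplier + extra_steps
```

With a hypothetical base multiplier of 20, `turbo_multiplier(20, 4, True)` gives 21 and `turbo_multiplier(20, 1, True)` gives 22, while any call with `within_tdp=False` stays at 20.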

The Core i7 keeps all of the performance features of the Core 2:
- Wide Dynamic Execution - 4-wide decode/rename/retire
- Advanced Digital Media Boost - which consists of 128-bit SSE instructions executed in one cycle
- Intel HD Boost - the new SSE4.1 instructions
- Smart Memory Access - consisting of memory disambiguation and hardware prefetching
- Advanced Smart Cache - low-latency, high-bandwidth shared L2 cache
and adds new advances of its own:
- new SSE4.2 instructions adding string handling and CRC32 calculations
- improved locking support
- an additional level in the cache hierarchy
- improved looping and streaming support
- better branch prediction
- improved virtualization support with faster transitions into / out of virtual machines
- simultaneous multi-threading - keeping the execution units busy and improving performance
- new TLB hierarchy - adds a 512-entry second-level TLB for small pages
- fast 16-byte unaligned access - basically eliminating the unaligned access speed penalty
- faster synchronization primitives - improving multi-threaded performance
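To make the CRC32 item in the list above concrete, here is a minimal software sketch of the checksum that the SSE4.2 CRC32 instruction accelerates: CRC-32C, which uses the Castagnoli polynomial. It is a bit-at-a-time reference in plain Python, not the hardware path, and the name `crc32c` is chosen here purely for illustration.

```python
# Reflected form of the Castagnoli polynomial used by CRC-32C.
CRC32C_POLY_REFLECTED = 0x82F63B78

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C: slow, but easy to check against a hardware result."""
    crc = 0xFFFFFFFF                      # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):                # process one bit at a time
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # standard final inversion
```

The standard check string `b"123456789"` yields `0xE3069283`. The hardware instruction folds in a whole byte, word, or larger operand at once instead of looping over individual bits.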
Each core has 32KB of instruction cache, 32KB of data cache, and a private unified 256KB L2 cache - and the four cores share a massive 8MB of L3 cache.
At the "front end" of a core, the 4-instruction-wide decoder is followed by a "macro fusion" unit and a loop stream detector.
The macro fusion unit can combine a TEST or CMP instruction followed by a conditional branch into a single operation, thus improving throughput and effectively executing more instructions per unit time.
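As a rough illustration of what macro fusion buys, the toy pass below walks a list of instruction mnemonics and merges each TEST or CMP that is immediately followed by a conditional jump. It works on bare mnemonic strings only; the function name and the "starts with J but is not JMP" test for a conditional branch are simplifying assumptions, and real hardware applies further conditions before fusing a pair.

```python
FUSIBLE_FIRST = {"CMP", "TEST"}

def fuse_compare_branch(instructions: list[str]) -> list[str]:
    """Merge CMP/TEST + conditional-jump pairs into single fused operations."""
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        # Simplified check: any J* mnemonic other than the unconditional JMP.
        is_cond_jump = nxt is not None and nxt.startswith("J") and nxt != "JMP"
        if op in FUSIBLE_FIRST and is_cond_jump:
            fused.append(f"{op}+{nxt}")   # the pair travels as one operation
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused
```

For example, `["MOV", "CMP", "JNE", "ADD", "TEST", "JZ"]` shrinks from six entries to four, which is the sense in which the core gets more instructions done per unit time.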
Once the loop stream detector spots a loop, unneeded gates can be disabled: there is no need to keep fetching and decoding the same instructions repeatedly, nor to predict the loop's branches. This leads to higher performance and lower power consumption. Nehalem also improves on branch prediction by going to a multi-level approach.
At the "Execution Unit", Nehalem uses a "unified reservation station" to schedule work among six execution units, so it can potentially execute six operations per clock cycle:
- One Load from memory
- One Store Address
- One Store Data
- Three "Computational Operations" such as math, logic or branch operations
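The per-cycle limits listed above imply a simple best-case throughput bound, sketched below. It assumes every operation is independent and always ready to issue, so it ignores data dependencies, cache misses, and all front-end limits; the names `PORT_CAPACITY` and `min_cycles` are made up for this example.

```python
import math

# Operations of each kind that can start per clock, per the list above.
PORT_CAPACITY = {"load": 1, "store_address": 1, "store_data": 1, "compute": 3}

def min_cycles(op_counts: dict[str, int]) -> int:
    """Best-case cycle count for a mix of independent operations."""
    unknown = set(op_counts) - set(PORT_CAPACITY)
    if unknown:
        raise ValueError(f"unknown operation kinds: {sorted(unknown)}")
    # The busiest kind of operation sets the lower bound.
    return max(
        math.ceil(op_counts.get(kind, 0) / capacity)
        for kind, capacity in PORT_CAPACITY.items()
    )
```

A mix of 4 loads, 2 store addresses, 2 store data operations, and 9 computational operations needs at least 4 cycles in this model, because the single load slot is the bottleneck even though the nine computational operations would fit in three.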
While Penryn (Core 2) already has a similar Reservation Station scheme, Nehalem significantly improves on it.
To accommodate SMT, Nehalem has 36 reservation stations instead of Penryn's 32, raises the number of load buffers from 32 to 48, and increases store buffers from Penryn's 20 to 32.
These changes also help Simultaneous Multi-Threading keep busy more of the execution units that would otherwise sit idle, which increases performance instead of wasting power.
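For a sense of scale, the relative growth of each structure can be worked out directly from the Penryn and Nehalem figures quoted above. The snippet only does that arithmetic; the variable and function names are arbitrary.

```python
# (Penryn, Nehalem) entry counts as quoted in the text.
BUFFERS = {
    "reservation stations": (32, 36),
    "load buffers": (32, 48),
    "store buffers": (20, 32),
}

def percent_increase(old: int, new: int) -> float:
    """Growth from old to new, expressed as a percentage of old."""
    return (new - old) * 100 / old

increases = {name: percent_increase(old, new) for name, (old, new) in BUFFERS.items()}
for name, pct in increases.items():
    print(f"{name}: +{pct:g}%")
```

This works out to 12.5% more reservation stations, 50% more load buffers, and 60% more store buffers, so the memory-side buffers grew far more than the scheduler itself.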