VLIW vs x86: the rise and fall of Itanium

Hello, regular readers and those who just dropped by :) It’s me again, with another article in the series about architectures, microarchitectures, processors, instruction sets, and all that good stuff.

In this series of articles, we’ll dive back down to earth, to the level of transistor technologies, and break down the VLIW (Very Long Instruction Word) architecture. We’ll talk about its predecessors, immerse ourselves in the spirit of the 1980s–1990s, learn how Itanium became “Itanic,” and see how this architecture lived, lives, and will continue to live. Oh, and yes: there will even be a PlayStation 2.

Warning: long read ahead! Thermoses of hot (and other warming) drinks are highly recommended.

Itanium — one step away from greatness

In 2003, Microsoft CEO Steve Ballmer enthusiastically presented the first generation of HP Integrity systems supporting Windows.

Considering how technology evolves and what products and companies dominate the market today, it might seem like processor architectures are a rather boring field where everything has already settled.

CISC (x86) and RISC (ARM, and to some extent RISC-V) have taken their positions, divided the market, and barely compete with each other — aside from some minor clashes in the laptop segment. And no one really expects anything new.

But it wasn’t always like this.

The spirit of the era: mid-1980s to early 1990s

The world was doing “fine”: the fall of the Berlin Wall, the end of the Cold War, the Maastricht Treaty, a technological boom, American pop culture and cinema producing future classics like Back to the Future, The Terminator, and Pulp Fiction, along with music from Michael Jackson and Nirvana.

Across Europe, there was massive political and cultural transformation, followed by an explosion of new creative movements — music, cinema, and art evolving rapidly.

The internet appeared — technologies were connecting the world, and culture was spreading at incredible speed.

It was the era of globalization.

In the world of technology and processor architectures, intense debates were raging between supporters of CISC and RISC.

Technically, the concept of VLIW (Very Long Instruction Word) originated in the 1970s within academic circles, but its popularization began precisely during this chaotic period of the 1980s–1990s.

Now, briefly, its essence (we’ll dissect it in detail later).

VLIW is an architecture in which several operations are packed into a single long instruction word that the processor issues as one unit, ideally in one cycle.

All instructions in such a bundle are executed in parallel, which allows for significantly increased performance in tasks with a high degree of parallelism.

When implemented correctly, it enables extremely efficient utilization of processor resources.
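To make the "one long word, one cycle" idea concrete, here is a minimal toy model. Everything in it is an assumption made for illustration: the register names, the three-slot bundle, and the tuple format have nothing to do with any real VLIW encoding. The only point it shows is that every operation in a bundle reads the register state from before the bundle, and all results are committed together.

```python
# Toy model of a VLIW bundle: a list of operations that issue together.
# Register names, bundle width and tuple layout are illustrative assumptions.

def execute_bundle(regs, bundle):
    """Execute every operation of one bundle in a single simulated cycle."""
    results = []
    # Read phase: each operation sees the register values from BEFORE the bundle.
    for dest, op, a, b in bundle:
        if op == "nop":
            continue
        if op == "add":
            results.append((dest, regs[a] + regs[b]))
        elif op == "mul":
            results.append((dest, regs[a] * regs[b]))
        else:
            raise ValueError(f"unknown operation: {op}")
    # Write phase: all results are committed together at the end of the cycle.
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


regs = {"r1": 2, "r2": 3, "r3": 4, "r4": 5, "r5": 0, "r6": 0, "r7": 0}
bundle = [
    ("r5", "add", "r1", "r2"),  # slot 0
    ("r6", "mul", "r3", "r4"),  # slot 1
    ("r7", "add", "r1", "r4"),  # slot 2
]
regs_after = execute_bundle(regs, bundle)
```

The three operations do not depend on each other, so they can share a bundle. If slot 2 needed the value produced by slot 0, it would have to wait for the next bundle.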

As is tradition, I’ll give credit to a researcher well-known in narrow circles (but not to the general public) — Josh Fisher from Yale University.

He was a pioneer in VLIW architectures, known for his work in parallel instruction processing and performance optimization of processors using VLIW.

He contributed to the development and popularization of approaches for applying VLIW in high-performance computing and worked on compiler optimization challenges.

More about Fisher:

In 2003, the IEEE Computer Society and the Association for Computing Machinery awarded him the Eckert–Mauchly Award in recognition of 25 years of contributions to instruction-level parallelism, groundbreaking work in VLIW architectures, and the development of the Trace Scheduling compilation method.

The first commercial implementation of VLIW appeared in 1987 — the TRACE processors from Multiflow Computer, a company founded by Fisher.

TRACE systems could execute up to 28 operations per cycle, but they turned out to be extremely complex and expensive due to their niche nature (low production volume).

So, VLIW is not just another architecture / microarchitecture / ISA (Instruction Set Architecture); it is a concept that could have completely reshaped the world of computing.

But it didn’t — at least not yet.

In processor design, there are dozens of architectures that never make headlines or capture markets.

These are usually highly specialized technologies that didn’t become mainstream but still influenced the industry — shifting its direction, if you will.

But as I’ve said in previous articles, having superior technology alone is not enough to make all the money in the world — and there are countless examples:

The technically superior Betamax video format from Sony lost to the simpler VHS format from JVC (technology needs not only to be better, but also more accessible, supported by an ecosystem, and aligned with market demand).

Windows Phone arrived too late, entered an overly competitive environment, and suffered from poor developer support (the ecosystem turned out to be more important than the technology).

In the context of VLIW, it’s logical to mention another failed technology — Itanium processors (sarcastically nicknamed “Itanic”) based on the IA-64 architecture.

How the Itanic sank — an EPIC fail

The IA-64 architecture is based on the VLIW concept with a modification called EPIC (Explicitly Parallel Instruction Computing).

At its core, this means explicitly parallel execution of machine instructions — where the compiler decides which instructions the processor should execute in parallel.

Let me remind you that in classic superscalar architectures (x86, ARM), the processor itself manages instruction dependencies at runtime — this requires complex hardware logic, which increases power consumption and makes further scaling more difficult.

Simply put, the main idea of EPIC is to shift the responsibility for determining parallelism from the processor to the compiler.

In other words, the program itself must tell the processor which parts of the code can be executed simultaneously (in parallel).
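Here is a rough sketch of what "the compiler decides" can look like. It is a deliberately naive greedy scheduler, not the algorithm of any real IA-64 compiler. The assumptions: each instruction is just a destination name plus a list of source names, every destination is written only once, every operation takes one cycle, and a bundle has three slots.

```python
# Naive greedy scheduler: packs instructions into fixed-width bundles.
# Assumptions: unique destinations, one-cycle latency, 3 slots per bundle.

def schedule(instrs, width=3):
    """instrs is a list of (dest, [sources]); returns a list of bundles."""
    bundles = []
    ready_at = {}  # value name -> index of the first bundle that may read it
    for dest, sources in instrs:
        # An instruction cannot issue before all of its inputs are available.
        slot = max((ready_at.get(s, 0) for s in sources), default=0)
        while True:
            if slot == len(bundles):
                bundles.append([])
            if len(bundles[slot]) < width:
                break
            slot += 1
        bundles[slot].append(dest)
        ready_at[dest] = slot + 1  # result is visible from the next bundle on
    return bundles


def utilization(bundles, width=3):
    """Fraction of issue slots that hold real work instead of padding."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


independent = [("a", ["x", "y"]), ("b", ["x", "z"]), ("c", ["y", "z"])]
chain = [("a", ["x"]), ("b", ["a"]), ("c", ["b"])]

packed_independent = schedule(independent)  # one full bundle
packed_chain = schedule(chain)              # three bundles, one op each
```

Three independent operations fill one bundle completely. A chain of three dependent operations needs three bundles with two empty slots each, so only a third of the issue slots do useful work.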

In theory, this should have delivered high performance, but in practice, compilers turned out to be far less effective than hoped at analyzing complex dependencies in real-world applications.

And the range of suitable workloads was limited.

Here’s the issue: the compiler cannot always guarantee that the processor will receive a continuous stream of instructions to process.

As a result, EPIC-based processors also faced problems with branching.

EPIC addresses this with a mechanism called predicated execution, which relies on predicate registers.

Here is how it works.

Instead of executing only one branch of instructions, the processor processes multiple branches simultaneously.

As soon as it becomes clear which branch is correct, execution of the other branch is stopped and its results are discarded.

Expert note: For example, if the result of operation A determines whether B or E should be executed, the processor sends instructions A, B, C, D, E, F, and G into the pipeline at the same time.

When the result of A becomes available, it turns out that execution should continue with E.

At that moment, execution of B, C, and D is halted, while the results of E, F, and G remain in progress.

If the correct path had been B, C, D, the processor would have continued executing them without losing time.

This is how EPIC saves time by eliminating the penalties of branch misprediction.
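The idea can be sketched in a few lines. This is a toy model under my own assumptions: the function name, the pair of complementary predicates, and the two one-operation arms are made up for illustration and are not real IA-64 code. The compare plays the role of operation A, and both arms are issued regardless of its outcome.

```python
# Toy model of predicated execution: both arms are issued, and only the
# operations whose predicate is true are allowed to commit their result.

def run_predicated(x):
    # "A": a compare sets two complementary predicates.
    p_then = x > 0
    p_else = not p_then

    # Every operation carries the predicate that guards it.
    ops = [
        (p_then, "result", lambda: x * 2),  # "then" arm
        (p_else, "result", lambda: -x),     # "else" arm
    ]

    state = {}
    issued = 0
    for predicate, dest, compute in ops:
        issued += 1            # the operation occupies an issue slot either way
        value = compute()
        if predicate:          # only a true predicate lets the result commit
            state[dest] = value
    return state["result"], issued


positive_case = run_predicated(5)    # "then" arm commits
negative_case = run_predicated(-3)   # "else" arm commits
```

In both calls two operations are issued and exactly one result survives. There is no jump to mispredict, but the discarded arm still took up an issue slot.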

Processors such as those based on x86, on the other hand, use branch prediction.

If the prediction is wrong, the pipeline is flushed and execution restarts from the correct branch, resulting in significant time loss.
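A back-of-the-envelope comparison helps show the trade-off. All the numbers below (issue width, misprediction rate, flush penalty, arm lengths) are assumptions picked for round arithmetic, not measurements of any real processor, and the model pretends that operations pack into issue slots perfectly.

```python
import math

# Rough cost model, in cycles. Every constant is an illustrative assumption.

def cycles_with_prediction(arm_ops, width, miss_rate, flush_penalty):
    """Average cost when only the predicted arm runs and misses cause a flush."""
    return math.ceil(arm_ops / width) + miss_rate * flush_penalty


def cycles_with_predication(then_ops, else_ops, width):
    """Cost when both arms are issued and one of them is thrown away."""
    return math.ceil((then_ops + else_ops) / width)


WIDTH = 6            # operations issued per cycle (assumed)
MISS_RATE = 0.1      # fraction of branches predicted wrongly (assumed)
FLUSH_PENALTY = 10   # cycles lost on a pipeline flush (assumed)

short_predicted = cycles_with_prediction(3, WIDTH, MISS_RATE, FLUSH_PENALTY)
short_predicated = cycles_with_predication(3, 3, WIDTH)

long_predicted = cycles_with_prediction(30, WIDTH, MISS_RATE, FLUSH_PENALTY)
long_predicated = cycles_with_predication(30, 30, WIDTH)
```

With these numbers, two short arms of 3 operations cost about 2 cycles on average with prediction and 1 cycle with predication. Two long arms of 30 operations cost about 6 cycles with prediction and 10 with predication, because the discarded arm is no longer cheap. So under this model, executing both paths pays off mainly when the arms are short.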

Sometimes instruction execution is impossible because the processor lacks the necessary data—it has to fetch it from cache, RAM, or persistent storage.

To reduce waiting time, the EPIC architecture uses speculation: the compiler indicates in advance which data might be needed, and the processor loads it ahead of time.

This helps minimize memory access latency, but there are downsides: if conditions change in the meantime, the processor must revalidate the data before using it to avoid errors.

In x86, speculation is handled in hardware: the processor itself predicts which data will be needed and loads it—a complex and power-hungry mechanism.
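The "load early, revalidate before use" pattern can be sketched as follows. The model is loosely inspired by how I understand IA-64 advanced loads and their tracking table to work, but it is heavily simplified: the class, its method names, and the idea of tracking entries by address alone are assumptions for illustration.

```python
# Toy model of compiler-directed data speculation.
# The class and method names are hypothetical; real hardware differs in detail.

class ToyMachine:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.tracked = set()   # addresses whose early-loaded value is still valid
        self.reloads = 0       # how many times speculation had to be repaired

    def advanced_load(self, addr):
        """Load early, and remember that this address is being speculated on."""
        self.tracked.add(addr)
        return self.memory[addr]

    def store(self, addr, value):
        """A store to a tracked address invalidates the earlier speculative load."""
        self.memory[addr] = value
        self.tracked.discard(addr)

    def check_load(self, addr, early_value):
        """Revalidate before use: keep the early value or reload from memory."""
        if addr in self.tracked:
            return early_value
        self.reloads += 1
        return self.memory[addr]


m = ToyMachine({0x10: 7, 0x20: 1})

early = m.advanced_load(0x10)
m.store(0x20, 99)                    # unrelated store, speculation survives
safe = m.check_load(0x10, early)

early = m.advanced_load(0x10)
m.store(0x10, 42)                    # conflicting store, speculation is stale
repaired = m.check_load(0x10, early)
```

In the first case the early value is used as is. In the second, the check notices the conflicting store and reloads, which is the price paid for moving the load ahead of the store.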

Here’s another important feature of EPIC: the compiler prepares bundles of long instructions that are already optimized for parallel execution.

It determines which instructions can be executed simultaneously without interfering with each other and embeds dependency information between instruction groups.

These bundles may also include additional instructions related to speculation mechanisms.
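For a feel of what such a bundle looks like at the bit level: as far as I know, an IA-64 bundle is 128 bits wide and consists of a 5-bit template followed by three 41-bit instruction slots, with the template telling the hardware which kinds of execution units the slots need and where the boundaries between instruction groups fall. The sketch below only packs and unpacks those fields. The slot payloads and the template value are arbitrary numbers chosen for the example, not real instruction encodings.

```python
# Packing three instruction slots and a template into one 128-bit bundle.
# Field widths follow the IA-64 layout as I understand it (5 + 3 * 41 = 128);
# the example values below are arbitrary, not real opcodes or templates.

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Combine a template and three slot payloads into a single integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three slots")
    word = template
    for i, payload in enumerate(slots):
        if not 0 <= payload < (1 << SLOT_BITS):
            raise ValueError("slot payload does not fit in 41 bits")
        word |= payload << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into its template and three slots."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


example_slots = [0x123456789, 0x0ABCDEF01, 0x1FFFFFFFFFF]
bundle_word = pack_bundle(0b10000, example_slots)
decoded_template, decoded_slots = unpack_bundle(bundle_word)
```

The round trip returns the same template and payloads, and the packed value never needs more than 128 bits.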

This approach is fundamentally different from x86 architecture, where complex hardware modules are used to analyze instructions, manage dependencies, and perform branch prediction.

This is essentially a confrontation between software and hardware—the compiler versus the processor’s physical implementation.

Below you can take a look at the Intel Itanium (Merced).

By modern standards, the processor looks unusual—essentially a mezzanine board with a 418-pin microprocessor socket (PAC418).

The actual die is mounted on a board inside this bulky package.

A few photos of the Intel Itanium Merced processor

Itanium 733 MHz chip with the heat spreader removed

The first generation of Itanium (Merced) was released to the market in the summer of 2001 with a price of up to $4000.

These were 180 nm chips with 25 million transistors, a TDP of 116–130 W, and core frequencies of 733 and 800 MHz.

They arrived after 10 years of development and behind schedule: the launch had originally been planned for 1997.

Model       | Clock Speed | L2 Cache | L3 Cache | FSB     | Cores | Threads/Core | Voltage    | TDP (W) | Socket
Itanium 733 | 733 MHz     | 96 KB    | 2 MB     | 133 MHz | 1     | 1            | 1.25–1.6 V | 116     | PAC418
Itanium 733 | 733 MHz     | 96 KB    | 4 MB     | 133 MHz | 1     | 1            | 1.25–1.6 V | 130     | PAC418
Itanium 800 | 800 MHz     | 96 KB    | 2 MB     | 133 MHz | 1     | 1            | 1.25–1.6 V | 116     | PAC418
Itanium 800 | 800 MHz     | 96 KB    | 4 MB     | 133 MHz | 1     | 1            | 1.25–1.6 V | 130     | PAC418

Because of these delays, first-generation Itanium processors lagged significantly behind their contemporaries in both the RISC and CISC camps. An inefficient memory subsystem, immature compilers, and poor x86 emulation made things worse.

Even the Pentium 4 Willamette, released in November 2000, had 42 million transistors and clock speeds of 1.4–1.5 GHz (despite being a stopgap product designed as a quick response to AMD’s Athlon Thunderbird, which had a relatively high TDP of 75 W—but that’s a different story).

Sales of Itanium Merced were modest—only a few thousand units.

Pentium 4 Willamette 1.5 GHz processor in retail packaging

But despite the questionable start, Intel and HP (who got into this together) positioned Itanium chips as the future of servers—they were supposed to mark the transition to 64-bit computing and replace the ubiquitous x86.

Vendors promised that by the second generation, McKinley, everything would become significantly better.

And overall, that did turn out to be the case: in their second iteration, Itanium processors became competitive in terms of performance per watt.

Support for the IA-64 architecture and its software ecosystem gradually improved, which helped increase sales.

There were even opinions and expectations that Itanium could dominate the market (starting with the server segment), but something still went wrong. As we remember, AMD introduced its Opteron processors based on the x86-64 architecture. Meanwhile, Itanium systems were expensive, consumed a huge amount of power, ran hot, and required serious cooling.

x86 emulation was terrible (so bad that its very existence was arguably more of a drawback, as numerous articles have pointed out), and the market as a whole simply wasn’t ready to transition to a VLIW-based architecture.

In terms of sales, Itanium couldn’t even surpass Power ISA and SPARC — architectures you may not have even heard of.

As a result, Itanium became a niche product (in the high-performance enterprise server segment) all the way until its eventual demise, while x86 continued to dominate.

Itanium Kittson

It’s hardly surprising that technical issues, a weak ecosystem, and incompatibility with mainstream technologies tend to scare off customers.

But here’s what’s interesting: the agony dragged on all the way until July 2021, when shipments ended with the quad-core and octa-core Itanium 9700 Kittson processors.

I could go into detail about the history of Itanium, diving deep into IA-64 and the architectural nuances, but this article is ultimately about VLIW.

Still, the Itanium story matters, and it takes up a significant portion of the article, because it demonstrates the viability of the very long instruction word concept: the technology itself is solid and, in principle, capable of competing with x86 and ARM.

But not every flaw and limitation can be patched over with large budgets and strong brands like Intel and HP.

High-performance HP Integrity rx8640 server (up to 16 Itanium processors, 17U form factor, weighing up to 171.4 kg in maximum configuration)

Here’s a simple thought experiment: swap VLIW and x86 at the dawn of the latter. What would we get?

A world where VLIW accumulated a strong legacy codebase and optimizations, gained support from multiple major vendors and third-party developers, ran natively with all relevant and in-demand software (both enterprise and likely consumer), along with all the other advantages of being a first mover.

And it’s unclear what position x86 and ARM would be in today.

If my story about Itanium caught your interest — drop a comment.

If I see enough interest, I’ll put together a proper deep-dive long read — exactly the way you like it :)

But for now, we still have a lot more interesting ground to cover about VLIW:

  • what solutions exist besides Itanium (I know that you know :D);

  • what this VLIW thing actually is;

  • and why (and most importantly, what exactly) it has that’s “longer.”
