Itanium - any comments?

11-09-2000, 02:02 AM
Conrad Song

http://developer.intel.com/design/ia..._ovw/index.htm

Somehow I don't know of any server microprocessor out there today that ISN'T excessively complex. Itanium is a beast. Not the same animal as POWER4, but a beast all the same.
11-09-2000, 03:05 AM
Arcadian

Quote:

Originally posted by Bash:
Concerning Itanium complexity...

I guess that from what I've heard the Itanium shouldn't be an excessively complex chip. This is primarily due to its EPIC architecture -- because the instruction scheduling is done in the compiler, the Itanium doesn't need to include the excessively complicated instruction scheduling circuitry. I can't remember the number any more, but I think the P3 instruction scheduler takes a sizeable percentage of the chip area.

So the question of the day is where all the transistors in the Itanium get used. One candidate is the register file. I believe the Itanium has 128 usable registers. What's that, about 10x the number of registers available in the x86 architecture? Also, I've heard the first-generation Itanium maintains backward compatibility with x86 binaries by having some kind of instruction translation hardware built in -- this must take a bit of space too. I suppose if they have any free transistors after that we can just pray for more functional units.

-Bash

Thanks for responding, Bash.

I have two comments. First, it's hard to compare the number of registers in IA-64 vs IA-32, because IA-32 has a good number of renaming registers. Just how many is something that Intel chooses not to disclose. However, I bet it's around 100.

Second, the Itanium has IA-32 compatibility through the use of an entire IA-32 core right on the die. Yes, the die combines everything in IA-64 PLUS an IA-32 core for compatibility. The Itanium can switch between the two using an instruction that 64-bit Windows knows how to use.
11-09-2000, 03:36 PM
Humus

Quote:

Originally posted by Arcadian:
Thanks for responding, Bash.

I have two comments. First, it's hard to compare the number of registers in IA-64 vs IA-32, because IA-32 has a good number of renaming registers. Just how many is something that Intel chooses not to disclose. However, I bet it's around 100.

Second, the Itanium has IA-32 compatibility through the use of an entire IA-32 core right on the die. Yes, the die combines everything in IA-64 PLUS an IA-32 core for compatibility. The Itanium can switch between the two using an instruction that 64-bit Windows knows how to use.

Around 100??? You kidding? I'd bet it's below 20.
I don't have exact numbers, but I think the Athlon only has 17 renaming registers, for the simple reason that more registers would be useless. I mean, how many false dependencies can occur, and how long do you think it takes to get them resolved?
11-09-2000, 04:36 PM
Bash

Quote:

Originally posted by Arcadian:
Thanks for responding, Bash.

I have two comments. First, it's hard to compare the number of registers in IA-64 vs IA-32, because IA-32 has a good number of renaming registers. Just how many is something that Intel chooses not to disclose. However, I bet it's around 100...

I know that the core of the P3 has a great many registers available for renaming, but there's a big difference between renaming registers and actual user-accessible registers. When you compile code for the x86, you have to continually load variables from memory because you don't have nearly enough registers to store all of them. With 128 user-accessible registers you'll be able to compile code much more efficiently. Of course, this is a necessity for the Itanium since the compiler is doing all of the instruction scheduling too.

Oh, about the transistor count in the 3rd or 4th message of this thread... 320 MILLION??? Is that counting a 16 MB L2 cache or something? I bet the core is around 32 million, which is approx. the same as the current P3 / Athlon.

-Bash

[This message has been edited by Bash (edited November 09, 2000).]
11-09-2000, 05:07 PM
jaywallen

Quote:

Also, this comment is for jaywallen. What parts of the architecture do you look forward to the most?

Sorry, I was so busy being a smartass that I forgot to actually say something about the architecture.

I'm a pragmatist, and I like the new architecture for what it promises me. My background is in physics with a specialty in medical imaging. So the thing I look forward to is finally having a platform for mere mortals that has the memory bandwidth and addressability, as well as the pure raw hunk to do things like, say, Fourier transform-based high resolution motion-corrected image reconstructions in near-real-time. Present-day solutions to image reconstruction needs are barely able to provide us with useful resolution and specificity in the output. Furthermore, a great deal of the detector event discrimination chores must be handled by the detection apparatus itself. The application of discrimination algorithms at the detectors makes the detection apparatus less efficient, but, worse yet, presupposes that the twits who designed the algorithms are not causing the apparatus to discard useful information! I'd rather grab all of the interaction events I can detect in raw form, then analyze them and apply discrimination techniques at the back end, where multiple alternative algorithms can be imposed. This means that the only absolute limitation for future analysis of medical images will be the actual absolute limitations of the detectors used in the imaging procedures. And subsequently developed analytic techniques will be applicable to the saved raw data, meaning that revised analyses and readings of those analyses may yield important new data from studies performed previously. As a scientist, I hate discarding data before it has been analyzed just because I was too stupid to be able to analyze it properly at the moment it was gathered!

Regards,
Jim
11-09-2000, 10:07 PM
Arcadian

Quote:

Originally posted by Humus:
Around 100??? You kidding? I'd bet it's below 20.
I don't have exact numbers, but I think the Athlon only has 17 renaming registers, for the simple reason that more registers would be useless. I mean, how many false dependencies can occur, and how long do you think it takes to get them resolved?

Well, more than renaming registers, there are registers for each pipeline, and lots of hidden registers for many purposes. Overall, I'd say the P6 architecture has well over 100 registers. However, it's interesting that only 8 are user-accessible.

Also, this is for Bash: I just wanted to remind you that I mentioned registers because we were comparing die sizes. I realize that user-accessible registers mean more for performance. Thanks for the clarification, though.
11-09-2000, 10:15 PM
Arcadian

Quote:

Originally posted by jaywallen:
My background is in physics with a specialty in medical imaging. So the thing I look forward to is finally having a platform for mere mortals that has the memory bandwidth and addressability, as well as the pure raw hunk to do things like, say, Fourier transform-based high resolution motion-corrected image reconstructions in near-real-time.

Regards,
Jim

Thanks for the detail, Jim. I think you'll find that Itanium has the raw floating point power to match anything else out there at similar clock speeds, including Alpha. I hear that the floating point engine can perform 8 single precision or 4 double precision floating point calculations at the same time. Compare this to SSE2, which can only do 4 single precision or 2 double precision floating point calculations simultaneously. And SSE2 is not available for every program, while IA-64 will be taken advantage of from the beginning.

Also, the Itanium's chipset, called the 460GX, uses memory interleaving to give very high bandwidth. I think it will be more than enough to provide plenty of data to the power-hungry Itaniums. I'd really like to see some Itanium chips in action.
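
For anyone who wants to put numbers on the width comparison above, here is a rough back-of-the-envelope sketch. The per-cycle widths are the ones quoted in the post (hearsay, not confirmed figures); the function names and the 800 MHz clock in the usage note are made up purely for illustration.

```python
# Per-cycle floating point widths as quoted in the post above (unverified).
WIDTHS = {
    "Itanium": {"single": 8, "double": 4},
    "SSE2": {"single": 4, "double": 2},
}


def peak_results_per_second(results_per_cycle, clock_hz):
    """Theoretical peak: every cycle retires a full-width operation."""
    return results_per_cycle * clock_hz


def width_ratio(precision):
    """How many times wider the quoted Itanium figure is than the SSE2 one."""
    return WIDTHS["Itanium"][precision] / WIDTHS["SSE2"][precision]
```

With a hypothetical 800 MHz clock, `peak_results_per_second(WIDTHS["Itanium"]["double"], 800_000_000)` gives a ceiling of 3.2 billion double precision results per second. Real code will land well below any such peak, so treat this as an upper bound only.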
11-10-2000, 01:31 AM
Humus

Quote:

Originally posted by Arcadian:
Well, more than renaming registers, there are registers for each pipeline, and lots of hidden registers for many purposes. Overall, I'd say the P6 architecture has well over 100 registers. However, it's interesting that only 8 are user-accessible.

Sure, if you count all registers it's gonna be a huge amount ... but you said you thought it was over 100 renaming registers, which is very unlikely.
11-10-2000, 02:29 AM
Arcadian

Quote:

Originally posted by Humus:
Sure, if you count all registers it's gonna be a huge amount ... but you said you thought it was over 100 renaming registers, which is very unlikely.

With so many instructions that can be simultaneously in flight on the Pentium III, I would be surprised if there weren't a lot of renaming registers to support all the dependencies. But, since Intel isn't revealing how many they use, it's probably senseless to argue.

So Humus, do you have any other comments on Itanium?
11-10-2000, 08:58 AM
Humus

Quote:

Originally posted by Arcadian:
With so many instructions that can be simultaneously in flight on the Pentium III, I would be surprised if there weren't a lot of renaming registers to support all the dependencies. But, since Intel isn't revealing how many they use, it's probably senseless to argue.

So Humus, do you have any other comments on Itanium?

There are a lot of instructions in flight in the Athlon too, and from what I understand there's no gain from having more than 17 renaming registers, and I doubt the situation is significantly different on the P3 ...
But it's not an important issue ...

About the Itanium, one kickass feature of this lilly processor is register indexing (or whatever it's called). For the first time you can put small arrays into registers, pretty cool if you ask me ..
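
The feature described above is probably what Intel's documentation calls register rotation: as I understand it, a loop can rotate a block of registers so that a value written under one register name shows up under the next name on the following iteration. Below is a toy Python model of that idea, not a description of the real hardware. The class name, the 8-entry size, and the `rotate()` method are all assumptions made for illustration.

```python
class RotatingRegisterFile:
    """Toy model: logical register names map onto physical slots via a base."""

    def __init__(self, size=8):
        self.phys = [0] * size   # physical storage, initially zero
        self.base = 0            # rotation base (hypothetical name)

    def _index(self, logical):
        return (logical + self.base) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # Decrementing the base makes old logical N visible as logical N+1.
        self.base = (self.base - 1) % len(self.phys)


def moving_sum3(samples):
    """3-tap moving sum that keeps its sliding window entirely in 'registers'."""
    regs = RotatingRegisterFile(size=8)
    out = []
    for s in samples:
        regs.write(0, s)                                   # newest sample
        out.append(regs.read(0) + regs.read(1) + regs.read(2))
        regs.rotate()                                      # shift the window
    return out
```

Calling `moving_sum3([1, 2, 3, 4, 5])` returns `[1, 3, 6, 9, 12]`. No data is copied between registers; only the name-to-slot mapping moves, which is what makes a small array in registers practical.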
11-10-2000, 11:23 AM
Conrad Song

Quote:

Originally posted by Bash:

...When you compile code for the x86, you have to continually load variables from memory because you don't have nearly enough registers to store all of them. With 128 user-accessible registers you'll be able to compile code much more efficiently. Of course, this is a necessity for the Itanium since the compiler is doing all of the instruction scheduling too.

-Bash

[This message has been edited by Bash (edited November 09, 2000).]

It gets better. Stack mapping to registers can occur on registers 32-127. This allows you to allocate and call subroutines by passing parameters by register-stack. This results in a unified subroutine calling model but with the speed of passing by register. Let's examine this further:

The typical IA-32 subroutine calling model involves decrementing ESP and moving the parameters onto the SS:[ESP] memory area. The called subroutine retrieves them from SS:[ESP]. The passing and retrieval of subroutine parameters are all load/store instructions. Passing by register is extremely difficult to do because of the few general registers in IA-32 and the difficulty of supporting it in a precompiled object or library. In summary, passing by register on IA-32 is nearly nonexistent.

In IA-64, the "default" passing model is on the register stack. IA-64 instructions are provided to allocate and free register stack space, which automatically spills to and fills from the memory stack and rotates as needed. Well, if you think that fills/spills are loads/stores, you're right. But if you analyze the stack frame depth of object-oriented code, in particular, you'll find that it stays within 1-2 levels of the current level most of the time. And because most methods are not parameter heavy, chances are that spills/fills are infrequent. So effectively, registers 32-127 become a register cache for the run-time stack.

Better still, because this is the only parameter passing model, these benefits are gained across precompiled objects and libraries without special treatment. Big win.
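
To make the "register cache for the run-time stack" argument concrete, here is a minimal simulation sketch. It assumes 96 stacked registers (r32-r127, as in the post) and a simplified policy where the oldest frames spill first and a caller's frame is refilled on return. The function name, the trace format, and that policy are illustrative assumptions, not the actual hardware behaviour.

```python
PHYS_STACKED_REGS = 96  # r32-r127, per the post above


def simulate_register_stack(trace, phys=PHYS_STACKED_REGS):
    """Count registers spilled to / filled from memory for a call trace.

    trace is a list of ('call', frame_size) and ('ret',) events.
    Returns a (spilled, filled) tuple of register counts.
    """
    frames = []    # size of each live frame, innermost last
    resident = []  # how many registers of each frame are still on-chip
    spilled = filled = 0
    for event in trace:
        if event[0] == "call":
            size = event[1]
            if size > phys:
                raise ValueError("frame larger than the register stack")
            frames.append(size)
            resident.append(size)
            # Evict from the oldest frames until everything fits again.
            overflow = sum(resident) - phys
            i = 0
            while overflow > 0:
                take = min(resident[i], overflow)
                resident[i] -= take
                spilled += take
                overflow -= take
                i += 1
        else:  # 'ret'
            frames.pop()
            resident.pop()
            if frames:
                # Bring back whatever part of the caller's frame was evicted.
                filled += frames[-1] - resident[-1]
                resident[-1] = frames[-1]
    return spilled, filled
```

A trace that bounces between depth 1 and 2 with 8-register frames never touches memory, while 20 nested calls of the same size spill and later refill 64 registers. That matches the claim above: shallow, parameter-light call patterns stay entirely in registers, and only deep recursion pays the load/store cost.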
11-10-2000, 01:36 PM
mosier

Quote:

Originally posted by Arcadian:
jtshaw, I have heard that there are already over 400 apps compiled in native IA-64. This includes a lot of scientific apps as well.

One professor from a university (I forget which one) runs scientific programs and said that there was a routine that he used to run on the supercomputers of two years ago that took a day to complete and get data. On a 4 processor Itanium system (he has one of the Pilot release systems), this same task took less than 1/2 hour. Granted that computers have matured over the last two years, but to get a 48x improvement over a supercomputer seemed impressive to me!

I think there will be a lot of software apps available for Itanium by launch (I think I read somewhere that launch was in Q1 2001). I also think Itanium will give UltraSPARC III a run for its money.

Also, this comment is for jaywallen. What parts of the architecture do you look forward to the most?

I know a Prof (more of a research Ph.D.) who is running the "pilot" release now. It is with some PHAST technology via P&G. Personally, I would say the computers are unbelievable when doing projects that can actually utilize the multi-processor IA-64. I have used the computer for different apps, and a program developed for rendering the images for the consortium runs probably 10 times as fast as it did before. Now the bottleneck is the hardware that the program is rendering for. The inability to run the old apps is going to be the downfall for the near future, but scrapping the old and starting anew can only be a good thing.

Now if that weakest link in hardware could be taken care of....
11-10-2000, 02:29 PM
Arcadian

Quote:

Originally posted by mosier:
I know a Prof (more of a research Ph.D.) who is running the "pilot" release now. It is with some PHAST technology via P&G. Personally, I would say the computers are unbelievable when doing projects that can actually utilize the multi-processor IA-64. I have used the computer for different apps, and a program developed for rendering the images for the consortium runs probably 10 times as fast as it did before. Now the bottleneck is the hardware that the program is rendering for. The inability to run the old apps is going to be the downfall for the near future, but scrapping the old and starting anew can only be a good thing.

Now if that weakest link in hardware could be taken care of....

Wow, thanks for the info, Mosier. It's nice to hear that rendering software gets so much of an improvement. I have several questions, though.

1) What system were you using previously that ran 10 times slower?

2) Which bottleneck in hardware were you referring to?

3) What rendering program were you using before, and which are you using now? I am curious what Itanium does and does not support.

4) What is PHAST from P&G?

5) How many Itanium processors is this professor running?

Thanks again... I appreciate any comments.
11-10-2000, 08:56 PM
Moridin

Quote:

Originally posted by Humus:
There are a lot of instructions in flight in the Athlon too, and from what I understand there's no gain from having more than 17 renaming registers, and I doubt the situation is significantly different on the P3 ...
But it's not an important issue ...

The number of rename registers is directly related to the number of pipeline stages. The more stages, the more rename registers you need. The Athlon has 72 and the P4 has 128. I don’t know how many the PIII has, but I suspect it is about the same as the Athlon since the working parts of the pipelines are about the same length.
11-10-2000, 09:05 PM
Moridin

Quote:

Originally posted by Conrad Song:
http://developer.intel.com/design/ia..._ovw/index.htm

Somehow I don't know of any server microprocessor out there today that ISN'T excessively complex. Itanium is a beast. Not the same animal as POWER4, but a beast all the same.

The only reason I am concerned with Itanium complexity is that EPIC should have resulted in a simpler design, not a more complex one. Rumors are that McKinley is half the size of Itanium and has more on-chip cache as well. So most likely this is the result of a poor first design for IA-64, but I want to see something more definite before I decide.

No kidding, the Power 4 is a beast. I bet each Power 4 module consumes 1500 W or more. A fully configured 64-processor Power 4 could consume 10 kW or more.
