L1 or L2 - which is more important.


KR02
10-15-2000, 02:15 PM
If you had a CPU with a 64k L1 cache and 64k L2 cache, would it be as fast, faster, or slower than a CPU with 128k of L1 cache? Which one would be more expensive?



------------------
Why Drink and Drive when you can Smoke and Fly?

sww
10-15-2000, 02:25 PM
Well, expense-wise there is no difference, as they take up the same amount of silicon. What is as important as the size of the cache is the width of the data bus, and when comparing L1 and L2 cache sizes, remember that some chips are set up so that the caches can share information (wasteful) and some are set up so that they cannot (more efficient).

Humus
10-15-2000, 05:01 PM
The one with 128kb L1 cache will be slightly faster. L1 takes more die space than L2 but operates faster. The reason there are both L1 and L2 cache is because you get more L2 for the same amount of die space.

Wintermute
10-15-2000, 07:36 PM
So does L1 have greater bandwidth? What is the difference anyway?

------------------
From Blade Runner: Holden> The tortoise lies on its back, its belly baking in the hot sun...it's thrashing its legs, trying to turn itself over but it can't; not without your help. But you're not helping. Why is that Leon?

Phoenix
10-15-2000, 08:26 PM
You might search the archives for a great post that Arcadian had on the differences between the two and how they work. Just search for L1, L2 and the name Arcadian and I'm sure you'll find it.

Arcadian
10-15-2000, 08:42 PM
Originally posted by Phoenix:
You might search the archives for a great post that Arcadian had on the differences between the two and how they work. Just search for L1, L2 and the name Arcadian and I'm sure you'll find it.

Ask and you shall receive :)
http://www.sharkyforums.com/ubb/Forum2/HTML/000178.html

KR02
10-15-2000, 09:50 PM
Arcadian - that piece was VERY informative. From what I understand so far, L1 cache is (physically) larger, but has "no" latency. Therefore, L1 cache is more important and faster, but a large L1 cache would be hard to implement on the die, so L2 cache, which is physically smaller, is used. Now, what about L3?

BTW, if I made any mistakes, please correct me. I am still learning. I love working with computers and want to learn more.


Phoenix
10-15-2000, 10:54 PM
Originally posted by KR02:
Arcadian - that piece was VERY informative. From what I understand so far, L1 cache is (physically) larger, but has "no" latency. Therefore, L1 cache is more important and faster, but a large L1 cache would be hard to implement on the die, so L2 cache, which is physically smaller, is used. Now, what about L3?

BTW, if I made any mistakes, please correct me. I am still learning. I love working with computers and want to learn more.


Well, there is some latency for L1 cache (as the post said, 2-3 cycles); it's just the first place the CPU goes when it's trying to load something from memory into a register. I also have a question, though: as far as x86 and Intel go, how many registers do they have, and what is the function of each register (e.g. reserved for the O/S)? I am assuming that register 0 contains "0".

Arcadian
10-15-2000, 10:57 PM
Originally posted by KR02:
Arcadian - that piece was VERY informative. From what I understand so far, L1 cache is (physically) larger, but has "no" latency. Therefore, L1 cache is more important and faster, but a large L1 cache would be hard to implement on the die, so L2 cache, which is physically smaller, is used. Now, what about L3?

BTW, if I made any mistakes, please correct me. I am still learning. I love working with computers and want to learn more.


It really depends on the implementation of the cache. Typical SRAM (Static RAM) caches use a 6 transistor design. If you know what a latch looks like, it is very similar. Think of two NOT gates (2 transistors each) with the output of one going into the input of the other. That is your memory cell. Any logic level you put in that cell will reverse itself twice, so you always end up with the negative of a negative, or the original value. Then you need two more access transistors, one on each side of the cell, to connect it to the bit lines for reading and writing. That's your six transistor SRAM cell.
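To make the "negative of a negative" idea concrete, here is a minimal Python sketch of the two cross-coupled NOT gates. It models logic levels only, not transistors or timing, and the names (inverter, LatchCell, settle) are made up for illustration. The write method simply stands in for whatever circuitry forces a new value into the cell.

```python
def inverter(level: int) -> int:
    """A NOT gate on a logic level: 0 becomes 1, 1 becomes 0."""
    return 1 - level


class LatchCell:
    """Toy logic-only model of two cross-coupled NOT gates (hypothetical names)."""

    def __init__(self, value: int = 0) -> None:
        self.q = value                 # output of one inverter
        self.q_bar = inverter(value)   # output of the other inverter

    def write(self, value: int) -> None:
        # Stand-in for the access circuitry forcing both nodes to a new state.
        self.q = value
        self.q_bar = inverter(value)

    def settle(self) -> None:
        # Each inverter drives the other's input, so the value is
        # inverted twice and comes back unchanged.
        self.q_bar = inverter(self.q)
        self.q = inverter(self.q_bar)

    def read(self) -> int:
        return self.q


cell = LatchCell()
cell.write(1)
for _ in range(5):
    cell.settle()   # the loop keeps reinforcing the stored bit
```

However many times settle runs, read keeps returning the last value written, which is the whole point of the feedback loop.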

So physically, you will start with blocks like that. You want these blocks to be optimized for high speed if you want an L1 cache, and for small size if you want an L2 or L3 cache. Like I said in my earlier post, L1 cache is right next to the processor, and you want it small and fast to decrease the time to access data (otherwise known as latency).

Spatial and temporal locality are another thing I mentioned previously, and you want a large cache to make sure you have data that's local. You get the best of both worlds if you have a small L1 (for speed) and a large L2 (for locality).
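A rough way to see the "best of both worlds" point is to work out an average access time for a cache hierarchy. The sketch below is only an illustration: the function name and every latency and hit rate in it are made-up assumptions, not measurements of any real CPU.

```python
def average_access_time(levels, memory_latency):
    """Average cycles per access for a simple cache hierarchy.

    levels: list of (hit_latency_cycles, hit_rate) pairs, L1 first.
    memory_latency: cycles for an access that misses every cache level.
    Each level's latency is paid by every access that reaches it.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_latency, hit_rate in levels:
        total += reach * hit_latency
        reach *= 1.0 - hit_rate      # only the misses go one level further out
    return total + reach * memory_latency


# Hypothetical numbers, chosen only to illustrate the trade-off.
small_l1_plus_l2 = average_access_time([(3, 0.90), (10, 0.80)], memory_latency=100)
one_big_slow_l1 = average_access_time([(6, 0.95)], memory_latency=100)
```

With these invented figures the small fast L1 backed by a larger L2 averages roughly 6 cycles, against roughly 11 for the single larger but slower cache. Different hit rates or latencies can easily flip the result, which is why the answer depends so much on the implementation.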

Forget for a moment AMD's design, which uses an exclusive cache, a method that has not yet been proven to be better than an inclusive cache. Like I said earlier, inclusive caches are effectively smaller because you repeat data, but average access time is decreased, because you are more likely to have valuable data in your low-latency cache (L1). Exclusive caches give you more cache, but more isn't necessarily better, because you worsen your average access time. Worst-case access time is better, though, since you don't have to copy data. Don't let AMD marketing fool you when they say they have a larger cache, because what they make up in size they lose in latency. (Remember, it's better to improve average access time than worst-case access time.)

I also wanted to clear up that L1 cache does incur latency. Anything that is not a register incurs latency, but it's the job of the designer to minimize this. The Pentium 4 has a data cache that's 8KB. This may seem small compared to Athlon's 64KB L1 data cache, but it's much faster, and the designers accepted that as a trade off. In the end, a smaller L1 may end up benefiting the Pentium 4 much more.

L3 cache, since you have asked, is simply another level that can be placed between the L2 and memory. For Foster, the server version of the Pentium 4, there will be an L3 cache, because the designers no doubt found that keeping L2 small will keep latency small, and then there can be an L3 for locality. (To contrast, the current Pentium III Xeon has a very large L2 cache that is 1MB or 2MB.) I don't know whether this cache will be on die or off, but given the already large size of the Pentium 4, I'm guessing it will be off die. This is OK, though. We already know that the L3 will be high latency because of the size, so what's a little more in keeping it off die? There will already be a tiny 8KB L1 and a small 256KB L2, which will contain a lot of data. The L3 is just a bonus that will increase locality, and lessen the slow accesses to memory.

Hope this clears some stuff up for you. Continue to ask questions, KR02.

Arcadian
10-15-2000, 11:25 PM
Originally posted by Phoenix:
Well, there is some latency for L1 cache (as the post said, 2-3 cycles); it's just the first place the CPU goes when it's trying to load something from memory into a register. I also have a question, though: as far as x86 and Intel go, how many registers do they have, and what is the function of each register (e.g. reserved for the O/S)? I am assuming that register 0 contains "0".

In the MIPS processor, register 0 contains "0". Several RISC designs use this convention. Intel (and thus x86) does not.

There are 8 GPRs (General Purpose Registers) in x86. Intel's Software Developer's Manual, Volume 1, which can be found online if you are interested (see section 3.6.1 in particular),
http://developer.intel.com/design/pentiumii/manuals/243190.htm

contains recommendations for how these registers are used, but in fact they can all be used as accumulators. The exception to this is that sometimes Windows or other operating systems have special uses for these registers, as defined in the manual.

One thing you should know is that these are the registers that programs can access through the assembly code. (Realize that all machine code is assembly code, too.) There are many more registers called Renaming Registers. These registers are there to prevent false dependencies. For example, consider the following pseudocode.

ADD R1 + R2, and put in R3
ADD R1 + R3, and put in R4

In this example, the second instruction cannot begin until the first one is finished. In other words, R3 is an operand in the second instruction, but it is the result of the first instruction, so you need to finish instruction 1 before you begin instruction 2. In terms of performance, this sucks :)

In order to fix this, Intel came up with register renaming. I don't know how many Renaming Registers there are, but let's say there are 64 (I think there are more than this, but Intel considers it top secret info that I don't have access to ;) ). The above code goes through a process of pseudorandomly assigning new registers to the instructions, so the above code becomes something like this.

ADD R19 + R42, and put in R26
ADD R44 + R51, and put in R39

Since Intel's P6 architecture is superscalar and out of order, the first instruction can be passed through one pipeline, and other instructions farther down in the code can run, while the second instruction has access to the superscalar pipeline. When R26 gets data, it can be forwarded to R51, and there is no time lost while both instructions complete. This is all done very esoterically by advanced hardware that keeps track of all the dependencies so that the CPU is always running and doesn't have to wait. In addition, there is no penalty for having to deal with the small number of GPRs in the x86 architecture. Without Register Renaming, you would run out of registers (there are only 8, remember) trying to get other instructions in the pipeline.

So to recap your original question, there are 8 registers in x86, though there are more if you consider the renaming registers.

Phoenix
10-16-2000, 12:27 AM
Thanks Arcadian. I just started programming in MIPS assembly, but wanted some information about how x86 works in comparison to RISC processors. I should have been a tad bit more clear: I knew that you could use any register that you want, just that certain registers are suggested for certain uses. I just try not to use these other registers, as for what I'm doing the 10 temp. registers are enough for me.

Arcadian
10-16-2000, 02:17 AM
Originally posted by Phoenix:
Thanks Arcadian. I just started programming in MIPS assembly, but wanted some information about how x86 works in comparison to RISC processors. I should have been a tad bit more clear: I knew that you could use any register that you want, just that certain registers are suggested for certain uses. I just try not to use these other registers, as for what I'm doing the 10 temp. registers are enough for me.

Well the best place to look is the Software Developers Manual that I posted a link to earlier. MIPS tends to assign a lot of its registers to special functions. Basic MIPS has 32 registers, as you know, but only some are accumulators, and some are temp registers. x86 basically has 8 accumulators and some other special purpose registers. Refer to the manual for specifics... I think you'll find a lot of information there about x86 you may really enjoy digesting.

Moridin
10-16-2000, 10:58 AM
Originally posted by Phoenix:
Thanks Arcadian. I just started programming in MIPS assembly, but wanted some information about how x86 works in comparison to RISC processors. I should have been a tad bit more clear: I knew that you could use any register that you want, just that certain registers are suggested for certain uses. I just try not to use these other registers, as for what I'm doing the 10 temp. registers are enough for me.


I believe two of the x86 registers are reserved for specific uses. One is the stack pointer; the other is the program counter. (I'm not 100% sure about the program counter; it may be a separate entity altogether.) This leaves you with 6 working registers.

**EDIT**
I decided to look this up. The program counter is separate. The stack pointer is one of the general-purpose registers, leaving you with 7.


Here is what the developer guide has on register uses.


• EAX—Accumulator for operands and results data.
• EBX—Pointer to data in the DS segment.
• ECX—Counter for string and loop operations.
• EDX—I/O pointer.
• ESI—Pointer to data in the segment pointed to by the DS register; source pointer for string operations.
• EDI—Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations.
• ESP—Stack pointer (in the SS segment).
• EBP—Pointer to data on the stack (in the SS segment).



[This message has been edited by Moridin (edited October 16, 2000).]

Bash
10-16-2000, 11:19 AM
Originally posted by Arcadian:
Anything that is not a register incurs latency, but it's the job of the designer to minimize this.

Well, I suppose this IS true today, but go back to the days of the 80386-16. With 60ns RAM you could do 1-cycle accesses to main memory!! Woohoo!

-Bash

Moridin
10-16-2000, 11:35 AM
Arcadian

I think you are missing something in your explanation. The way you have written this makes it a real data dependency, not a false one. You should have put in the loads to make it clearer.

Consider the following code:

Load R1 (from memory)
Load R2
ADD R1 + R2, and put in R3
Move R3 to R5

Load R3
ADD R1 + R3, and put in R4
Store R4
Store R5

If you just look at the instructions themselves, (ADD R1 + R3, and put in R4) appears to be dependent on (ADD R1 + R2, and put in R3). The catch is that there is a load into R3 in between, so the value calculated in the first add is not used in the second.

If the value for R2 happened to be stored in main memory while the values for R1 and R3 were in L1, the entire block of code would stall for up to 100 cycles waiting for R2 to load from main memory. If you recognized that the dependency was false, you could complete the second half while you were waiting for R2 to load.

The other point that I think needs to be made is that in a pipelined processor, all of these instructions would be in some stage of completion in the pipeline at the same time. R3 however has 2 unrelated values for instructions in the pipeline and you have no idea which value will be required first.

This is where the re-name registers come in. You need to have a complete set of rename registers for every pipeline stage prior to the actual execution stage to accommodate these different values.
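As a rough illustration of how renaming separates the two unrelated values of R3 in the code above, here is a minimal Python sketch. It is a simplification under stated assumptions: the rename function, the tuple format for instructions, and the P0, P1, ... physical register names are all invented for this example, and the supply of physical registers is unlimited, whereas real hardware has a finite pool and recycles it.

```python
from itertools import count


def rename(program, arch_regs=("R1", "R2", "R3", "R4", "R5")):
    """Rename architectural registers to fresh physical ones (toy model).

    program: list of (op, dest, sources); dest is None for stores.
    Every write to a register gets a brand-new physical register, so two
    unrelated values that share an architectural name no longer collide.
    """
    fresh = count()  # hypothetical unlimited supply: P0, P1, P2, ...
    mapping = {reg: f"P{next(fresh)}" for reg in arch_regs}
    renamed = []
    for op, dest, sources in program:
        # Sources read whatever physical register currently holds each name.
        srcs = tuple(mapping[s] for s in sources)
        if dest is not None:
            mapping[dest] = f"P{next(fresh)}"  # new value, new physical register
            dest = mapping[dest]
        renamed.append((op, dest, srcs))
    return renamed


# The example sequence from the post above.
program = [
    ("LOAD", "R1", ()),
    ("LOAD", "R2", ()),
    ("ADD", "R3", ("R1", "R2")),
    ("MOVE", "R5", ("R3",)),
    ("LOAD", "R3", ()),
    ("ADD", "R4", ("R1", "R3")),
    ("STORE", None, ("R4",)),
    ("STORE", None, ("R5",)),
]

renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, the first ADD and the second LOAD write different physical registers, and the second ADD reads the one written by the LOAD. Nothing in the second half refers to the result of the first ADD any more, so it has no reason to wait for R2.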

Re-name registers are something required by every OOOE processor, so they were not really developed by Intel. I think Intel did have the first commercially available OOOE processor, but others were working on them as well. If memory serves me correctly, the P6 came out about a year before the 21264???

Arcadian
10-16-2000, 04:23 PM
Moridin, thanks for clearing that up. I seem to have been mistaken in regards to the function of renaming registers. I was sure what they were to some degree, but you were correct that my explanation was slightly off. Thanks for the detail you put into responding :D