WEBVTT

1
00:00:20.800 --> 00:00:21.540
<v Matt Godbolt>Hey, Ben.

2
00:00:21.540 --> 00:00:22.580
<v Ben Rady>Hey Matt.

3
00:00:22.580 --> 00:00:23.600
<v Matt Godbolt>How you doing?

4
00:00:23.600 --> 00:00:24.300
<v Ben Rady>Good.

5
00:00:24.300 --> 00:01:03.640
<v Matt Godbolt>So I've been thinking recently there's an awful lot of cool things that processors do that I don't think people are aware of except when maybe they come along and ruin their day because they hear things in the press like Spectre or Meltdown or other side channel attacks, these scary sounding things that probably aren't as, quite as scary as you think, but there's a lot of technology and a lot of very cool and clever development has gone into CPUs and we have a podcast and I love talking about this stuff. So I figured let's just write down, "Matt talks about CPUs" and then see what comes from that.

6
00:01:03.640 --> 00:01:09.120
<v Ben Rady>There's a whole bunch of things that CPU's do that I don't know about. So I feel like I'm going to learn a lot in this podcast.

7
00:01:09.120 --> 00:01:18.220
<v Matt Godbolt>So I, um, many years ago, actually I had a lightning talk, which was something aligned like a five things you didn't realize that your CPU does for you.

8
00:01:18.220 --> 00:01:20.260
<v Ben Rady>Oh yeah!

9
00:01:20.260 --> 00:01:32.260
<v Matt Godbolt>A, it was a fun one to do because, uh, it was for an audience of people that weren't necessarily low level engineers as indeed. I'm sure many of our listener are not necessarily interested in all the low level things

10
00:01:32.260 --> 00:01:32.820
<v Ben Rady>Both listeners.

11
00:01:32.820 --> 00:02:20.280
<v Matt Godbolt>Sorry, there are two listeners. I tried to come up with this format that was, I thought interesting because they were only five minutes long anyway, but just phrasing it in terms of your CPU, even if you have only the vaguest understanding of like assembly code, machine code or whatever is actually going on from the point at which you're typing your code to the point at which something useful happens in your computer, you've probably got this idea of, you know, instructions that move registers around like MOV EAX, 123, that kind of thing. Or if you're an ARM something similar or, you know, you've maybe done, uh, some, you know, RISC thing at university when you've been forced to take some kind of assembly course, and you kind of have a rough idea about the operations that a CPU can perform.

12
00:02:20.280 --> 00:02:56.980
<v Ben Rady>Yeah. That's certainly my mental model of like, if you say assembly language, I think, okay, there's an op code identified by probably an integer for each of these instructions and they take some number of arguments, not totally unlike a function call, but they take something of arguments and they are things like, you know, uh, move this data from memory into this register; take this register and this register and add them together and put the result in this register. And that's sort of my, you know, very basic mental model of how assembly works and all of the basic operations of the CPU are represented by those commands.

13
00:02:56.980 --> 00:03:47.580
<v Matt Godbolt>So, yeah, I think that's a really useful way of thinking about the way the computer works. You mentioned, you know, like the, the, the arguments are like the parameters to the instruction as if it's like a function call. And I think the only thing I add to that when I kind of mentally model it is like the registers are global variables, right? At the end of the day, the CPU only has global variables, which is pretty horrible and pretty nasty. But so then one of those five things I was talking about that you didn't know your CPU does is that your CPU is kind of a JIT compiler. It doesn't run that either. It only has the illusion, right? The instruction set architecture, the ISA, the machine code that you're giving to it is an API boundary and provided the CPU appears to do what you would expect it to do when you do MOV EAX, 1234, it is up to the CPU how it achieves that goal, right?

14
00:03:47.580 --> 00:04:35.200
<v Matt Godbolt>It's a, it's a completely, um, uh, it's an abstraction layer, right? What goes on underneath is completely up to the CPU. Now you might reasonably think like I do, because my mental model is sort of sat back in the 1970s, 1980s of CPUs is that like, there is a little box that says EAX somewhere inside the PC. And when I do MOV EAX, 123 then the number 123 goes into that box, right? That's what happens, right? But that's not what's happening at all anymore. Unfortunately, things have gotten far more complicated. And like the first thing that a modern CPU will do is it will take that expression effectively that you give it this instruction and it will devolve it into smaller instructions called micro operations. And those micro operations are what the actual core inside is going to work on.

15
00:04:35.200 --> 00:05:52.640
<v Matt Godbolt>Now, in the case of that MOV instruction, we were just talking about it probably doesn't have anything more to do than the, that, operation. The operation is about as small as you're going to get. But when you're doing something like, um, move from memory, there are several things that need to happen. And while we're here, let's just talk about a pet peeve of mine. Why on earth is it move? It's always move. And it's MOV this, that there's no moving. It's not like the old value has gone away. It's, copy, right? When I say, move the value 123 into EAX, but then I haven't actually just deleted the value 123 from outside of the world, right? There's no longer, it's a copy. Put it's like move. Yeah, move. I couldn't even think of the right word without saying the word move, but anyway, so, but if you're, if you're say acting or memory and say the Intel instructions give you an awful lot of flexibility to say, do an add instruction where one of the operands is a register and the other operand is some complicated expression that says read from this memory address offset by this other, um, register multiplied by four. Because you're going to use it to index into like integers and each integer's four long. There's kind of three things going on there. One is the actual addition. Another one is the calculation of which memory address was it that you actually wanted to go and read from.

16
00:05:52.640 --> 00:06:46.120
<v Matt Godbolt>And I'm not talking about virtual memory here. I'm literally talking about the instruction was ESI plus a hundred plus EDI, times four, that's a complicated thing. And then third of all, there's the actual going to fetch the memory that is at that address. And so it those really are split into three operations and it's not just to make it simpler. Um, on, on the, uh, architecture underneath. The core thing that's happening is that those micro operations can be broken apart and then executed separately and individually, which means that if, for example, the memory unit is, is free, but the adding unit is busy doing something else. Then you can go off and fetch the memory or the memory calculation go and fetch that, and then sort of hold on to that temporary result. And then when the adder becomes free, you can just do the ADD. You don't have to wait for all three units to be ready together so that you can do the operation.

17
00:06:46.120 --> 00:06:57.500
<v Ben Rady>Interesting. So is this like an extension of Moore's law running out where it's like they couldn't make processors any faster, so they just started giving us more cores. Now they're like, well, what if we take the operations? And we do those in parallel too. Is that all this is?

18
00:06:57.500 --> 00:07:47.140
<v Matt Godbolt>It's exactly that. Except the sequence is the other way round, really, you know, as Moore's law as sort of reaching the end and as we sort of hit the sort of four gigahertz ish wall, instead of making CPU's more and more and more, uh, fast in terms of clock speed, the cunning-ness went along multiple dimensions and parallelism is something that silicon does. Like, it's hard to stop it from doing the parallelism. In fact, all of like chip design is trying to get rid of it from being parallel where you don't want it to be and handling the crazy race conditions. I mean, yeah. I mean, I've never done silicon, but in terms of like FPGA design stuff, it's always the timing. It's always the synchronization points. And if you think you've got problems when you've got like multiple threads accessing stuff, try imagine that you've got a 2 million individual gates that are all doing their own thing.

19
00:07:47.140 --> 00:08:35.060
<v Matt Godbolt>So, um, yeah, the sort of the first access, I mean, so back in the day, we're going to go back to like 6502 era then, um, it took multiple CPU cycles, multiple clock cycles to execute a single CPU instruction. So, you know, the, the classic fetch the instruction, decode the instruction, execute the instruction usually. And then sometimes they have a retire stage or write-back stage and a retire or something like that. Those are kind of like the pipeline stages that an instruction went through. And it went through them one at a time, which was cool. But the obvious observation is it's a pipeline. The fetch can, you can be fetching instructions one cycle at a time, and then you can be decoding then the next one and executing the one after, or sorry, the other way round, you know, the classic pipeline, which is fantastic.

20
00:08:35.060 --> 00:09:18.680
<v Matt Godbolt>And it meant that a lot of things that were taking five cycles to complete, if you imagine a five cycle pipeline, well for every individual instruction, it does indeed take five cycles to get from one end of the pipeline out the other, but every single clock clock of the tick, tick of the clock, you're doing one more piece of useful work, just like, um, uh, you know, a modern car production line. It might take two weeks for a car to be assembled from a bag of parts that are put in one end and come out of the other for any one car. So if you were to like spray paint that car red, and then it would have to wait five weeks, but every day, a whole new car rolls off the end of the production line. And so you've got this latency throughput trade off that's going on.

21
00:09:18.680 --> 00:10:12.920
<v Matt Godbolt>And so we were already taking steps in that direction, even like, so I said the 6502 grabs these things actually, as it happens, the 6502 had a very, very simple pipeline where on the end of the last take of a, uh, an instruction, it actually was fetching the next instruction, which had some interesting if you're ever going to emulate it had some interesting side effects, but we've, we've done that to death. Um, so then you're getting sort of parallelization out of a single stream of instructions, which is great. What tends to happen though in those cases is that that only works if you know what you're going to do, be doing ahead of time. So in the case of, um, uh, the analogy we have add with the car production line, it's like when you finally get to assembling the last step, you look at the, the, that part of the instructions and it says, oh no, make this one red.

22
00:10:12.920 --> 00:10:54.140
<v Matt Godbolt>And you're like, oh, but it's already been all the way through. And we're nearly at the end. Um, the analogy I'm trying to make there is that's like a branch that's taken or not taken. You've made a decision and it takes a few cycles for the decision making process to come to its conclusion and say, ah, actually we need to be doing something different here. Yeah. But the pipeline is full of the other instructions that would have happened if you had just carried on in a straight line. So at that point, what used to happen, would you have bubbles in the pipeline? So those, those fetched and decoded instructions that weren't going to be executed were just flushed away, thrown away. And then the whole thing was started again. And so it would take several clock cycles before useful work could be done again.

23
00:10:54.140 --> 00:11:49.580
<v Matt Godbolt>And that's what made branches expensive back in the day. That's cool. So what next happened was the concept of making a guess, Hey, we can make a guess at where the branch is going to go ahead of time. Then we can steer that decode the fetcher and decoder in a particular direction. And then as long as when we get to do the execute stage, we check our guess. If we were right, Hey, no harm, no foul. We carry on like that instructions that we fetched and decoded are ready to go. We don't have to have this bubble in the pipeline. And if you get it wrong, it's no worse than how we were before. We have to throw away the pipeline and flush it. Now, as it happens about the time that the branch predictors were coming in and becoming more and more important, it was actually the case that the pipelines were getting longer and longer and longer, um, for, for, for boring technical reasons to do with actually speeding up the, the, the clock speeds overall.

24
00:11:49.580 --> 00:12:39.520
<v Matt Godbolt>Um, obviously if you're making something go faster and faster and faster, it's better to have smaller and smaller pieces of work to do, which in this pipeline analogy means, um, having simpler and simpler steps in the pipeline, but potentially more of them to achieve the ultimate goal. So now we're looking at, um, latencies of, of the orders of tens of cycles before, between the fetch that picked up the instruction and the branch instruction actually running and determining, yes, we are going to take the branch or no, we're not going to take the branch. So the interesting side effect of having a branch predictor and we can talk for, to death about how they work and they are amazingly, amazingly good these days. Is that what the core, the sort of like the engine that's, that's churning out all of the actual work, the stream of micro operations.

25
00:12:39.520 --> 00:13:19.020
<v Matt Godbolt>Remember we've turned these, these instructions to micro operations. It's just given a sequence with no branches in it, effectively of instructions of work to do, and it can be filling that that could be a huge amount of work. It can be hundreds ahead of itself potentially, you know, with, with a long enough pipeline. But also what that gives you an opportunity to do is to sort of do, and this is where I I'm going to go back to that analogy I made of like a JIT compiler is sort of just in time compilation, it can do some analysis of the stream of instructions. And it can say this particular instruction that's later in the stream, doesn't depend on an earlier instruction.

26
00:13:19.020 --> 00:13:21.040
<v Ben Rady>And this all happening in the silicon? This is in the hardware?

27
00:13:21.040 --> 00:13:24.440
<v Matt Godbolt>Completely in hardware. Absolutely.

28
00:13:24.440 --> 00:13:25.200
<v Ben Rady>Wow.

29
00:13:25.200 --> 00:14:08.840
<v Matt Godbolt>Effectively a DAG is being generated of all the interdependencies between instructions. And instead of running the instructions strictly in the order that you wrote them down, it is running these micro operations when they are ready to be run. So in the case of say a divide: divide is a classical example of a really lengthy instructions. So I'm dividing two numbers together. It's going to take a good dozen to two dozen cycles before I get the integer result of that divide. Meanwhile, the next instruction was just like, yeah, MOV EAX, 12 because I'm setting up to do some other piece of work, right? Well that instruction can run. It can jump the queue because there's no observable side effects that the divide needs. And so

30
00:14:08.840 --> 00:14:11.500
<v Ben Rady>It's not like it's moving the result of the divide in that case?,

31
00:14:11.500 --> 00:14:11.560
<v Matt Godbolt>Correct.

32
00:14:11.560 --> 00:14:13.880
<v Ben Rady>Right. It's just, it's just moving some other unrelated things.

33
00:14:13.880 --> 00:14:26.560
<v Matt Godbolt>But even if the next instruction was moving, that that instruction would be blocked. And the, the CPU would be continuing to look down the stream of instructions for something that finally didn't depend on anything that was currently being busy.

34
00:14:26.560 --> 00:14:27.460
<v Ben Rady>Right.

35
00:14:27.460 --> 00:14:31.180
<v Matt Godbolt>And it's just, it's amazing to think that that can be done in silicon quickly.

36
00:14:31.180 --> 00:16:09.460
<v Matt Godbolt>But wait! There's more! One of the problems that you hit very quickly, if you're looking for more work to do, you're looking for these dependency chains, these DAG nodes that don't relate to each other so that you can get as much parallel work being done as possible. Is that very quickly you get bottlenecked on those registers, those global variables, you know, classic, if you imagine, again, going back to 6502, you basically only had one register. It was the, A register, you know, the X and Y registers were over there as well. And they were used for indexing, but let's assume that had something akin to only one register. And we issued that divide, right? So that divide happened and the result's going to go into the accumulator. And then the next instruction was to say, and store that result in some memory, location, and then the next instruction was: and now load a different value into the A register because I need to go and do some additions or what, at that point I have to stop because I can't write into that accumulator register because the divide's using it right now. So it just, we grind to a halt again. What happens as the instructions, their micro operations flow through from the, the, the decoding unit into this out of order system is that as well as looking at the DAG of dependencies between them, it's also rewriting those registers that you gave it, and it's using tons of internal, temporary register names. So in the example, I just gave that first divide. Um, that's going to write into like A-sub-0 like with a little subscript zero, but as soon as I do MOV a value into A I've completely obliterated, what used to be there.

37
00:16:09.460 --> 00:16:41.680
<v Matt Godbolt>You know, I moved another value into it. And at that point, the, the, um, the CPU goes well, that's a different A from the last A. So I last used A-sub-0. Now I'm gonna use A-sub-1. This is a trick that compilers will do themselves, right. As compilers are looking at stuff, there'll be like, well, I can reuse this, or I can not reuse that, but this is happening in silicon. So now we've got A-sub-0 and A-sub-1, and now suddenly they're two independent instruction flows again, and now they can be parallelized. And this is again, parallelized within a single thread of execution on the CPU. And that's so cool, right?

38
00:16:41.680 --> 00:17:35.120
<v Ben Rady>Is it fair to say that this is just like another classic example of the problems to so many things in computing is another layer of indirection. It's sort of like, well, we have these backward compatibility reasons why we can't just change all of these instructions. Right, right. But we don't want the hardware to look like this anymore. Right. So now what are we going to do? Okay. Well, we'll just pretend like we're executing these instructions and underneath the covers, we can do whatever we want. And like that effect is, you know, sort of like a classic programmer trick, right. Like I have some API that the guy before me wrote and I hate it, but I'm stuck with it because that's what everybody uses. And so I'm going to change the implementation of it to do something different that's much faster, or much more efficient or whatever, or, you know, fits the architecture that I have, or is cheaper. And that's kind of, so now assembly, it's just basically like an API at this point, and we're just using it to drive all these instructions.

39
00:17:35.120 --> 00:18:28.080
<v Matt Godbolt>And as you say, it sort of has to be backwards compatible back to like 1970s era code, because we're stuck with it. So that's a really interesting observation and it's one that Intel themselves made and they decided, well, what would it look like? And, um, I apologize for those who probably know a lot more about this than me, but like they decided to investigate, what would it look like if we actually switched to something which was more amenable to easy execution on the CPU, but it doesn't require all this trickery. And they came up with the Itanium, which was a very long instruction word chip with thousands of registers that you could actually act as (or hundreds of registers that you could access). And by very long instruction word, that means each instruction I'm putting little air quotes here is actually a bundle of multiple instructions that will run concurrently, as like a packet.,

40
00:18:28.080 --> 00:18:35.940
<v Ben Rady>Oh so they exposed, they, they peeled back the abstraction layer and they said, what did we just let you run? What is essentially our microcode directly?

41
00:18:35.940 --> 00:18:47.420
<v Matt Godbolt>Kind of, yes. Yeah, exactly. And they said, let's, let's invest in smart compilers that can work out how to rearrange and pack the code that you've written into these instructions.

42
00:18:47.420 --> 00:18:48.300
<v Ben Rady>Yeah.

43
00:18:48.300 --> 00:19:30.220
<v Matt Godbolt>But it turned out to be a dead end, either the compilers just weren't up to the task and given how good compilers are I suspect that's not it or else what we don't get with that approach is any kind of dynamism. Right. I, there's a reason why I kind of use a JIT as an example, because a JIT has extra knowledge at the time. It has the dynamic environment. Like if you look at how the JRE works and I'm switching out, analogies left, right and center here, but like, it's very hard to statically decide whether or not something should be done one way or another, but it could be trivial to make that decision when you have dynamic information in front of you, like which of these two branches is more likely to be taken.

44
00:19:30.220 --> 00:20:24.500
<v Matt Godbolt>I don't know. Is it the same way that I went the last 80 times? Yes. Well then let's, this is now the more optimal route to take. Whereas staring at code ahead of time, you've only got so much information. And especially when you add complexities, like, you know, I alluded to the divide taking forever. Sometimes those divides are dependent on the data that's being put into them. And so you've got like a data dependency. Like even if you're dividing two 64 bit numbers, it can take fewer cycles if the numbers are smaller or bigger, like actual magnitude of the numbers. And so there's a huge amount of dynamism that comes. Another thing that's very hard to predict is memory latency. We've got so many layers of caching now because RAM has never really kept up with the speed of the CPU. So you don't know if something's going to be in cache or out of cache. And so statically trying to predict that and kind of padding your VLIW code it's really, it's almost, I'd say it's impossible.

45
00:20:24.500 --> 00:20:46.440
<v Ben Rady>Got it. So is it, is it the case that sort of like breaking the abstraction here has, has then constrained what you can do underneath the abstraction, uh, and, and therefore made it less efficient because now you're, you're making promises that may, you're trying to make promises that may not hold up in the dynamic environment of actually running the code?

46
00:20:46.440 --> 00:20:47.480
<v Matt Godbolt>Right

47
00:20:47.480 --> 00:21:15.020
<v Matt Godbolt>Like you're, you're sort of implying that certain things will be fast or certain things will be slow or certain things will be more memory efficient when they're not necessarily, because it just depends. But if you've, if you've pulled back that abstraction, now that the compilers, in this case, what are sort of the things producing this now sort of like, can't really, they're, they're just sort of too smart for their own good. Is that a reasonable way, to think about it? Or am I oversimplifying?

48
00:21:15.020 --> 00:22:09.040
<v Matt Godbolt>I think so! Whereas in a software JIT situation, which can also take advantage of this, the information, and like I said, you know, like you can work these, you can observe the pattern of code and kind of make a decision to say, well, I'm going to re-optimize this particular area of code. Um, I don't think that even a software JIT that's producing code for say Itanium could observe enough information at the sort of micro level to take full advantage of this. You know, like the nowadays as the CPU's have hundreds of, um, out of order buffers. So they can be like two or 300 instructions ahead of themselves. So if there's that one slow memory load that you're stuck with, as long as it can find enough information - I'm sorry - enough useful work to do that can be predicted, which has a lot of ifs.

49
00:22:09.040 --> 00:22:56.360
<v Matt Godbolt>But again, all these things are getting very clever. Then that one slow load isn't going to block you from doing too much work. You're going to be able to tear through. And in fact, there's another sort of cunning trick that ... so switching away back from this Itanium, um, pre pre uh, pre-compiled solution to the problem. Um, so thinking about what a single CPU actually looks like, we've got the, uh, the front end, which is this thing that's going to do the fetch, the branch prediction. It's going to fetch, it's going to decode and generate a bunch of micro operations. And then there's going to be some kind of magical renaming, which is the thing I alluded to, where it rewrites, what registers you're actually using to be internal ones you can't even address. So I can't even say the name of these they're totally internal to the CPU.

50
00:22:56.360 --> 00:22:57.600
<v Ben Rady>Yeah, right - the temp directory of registers.

51
00:22:57.600 --> 00:23:47.800
<v Matt Godbolt>It's exactly that that's exactly what it is, but those get all thrown into a big pot of like, here's all the instructions I would ideally like you to run. And now the out of order core is going to just pick off the ones that can be run. And then finally they get completed in order again, there's like a big master list of like, well, this is where we're good to do. These things have actually happened. Anything below this mark is like, may have competed out of order, but we have to make it look like it happens in order. And that's where, for example, um, even if it's run ahead of itself, um, but there was a page fault or, you know, a SEGFAULT on an instruction, which maybe takes a while for it to discover, um, the SEGFAULT doesn't happen until the instruction that caused the SEGFAULT retires, and then it aborts everything that's not happened after it. So that's kind of how that magical time-traveling aspect happens.

52
00:23:47.800 --> 00:23:57.660
<v Ben Rady>Building a software emulation of this seems like years of work to me, it's just mind blowing that there's like physical hardware that is making all these kinds of decisions. That's just incredible.

53
00:23:57.660 --> 00:24:21.160
<v Matt Godbolt>It's crazy. It's really quite, it's a stunning achievement. Most interesting thing to me is that this information is hard to come by Intel deliberately. Doesn't try to tell you very much about what's going on underneath this layer, because they want to have reserved the right to change it quite reasonably, their API and contract sort of ends with you at the ISA at the machine code level. Right?

54
00:24:21.160 --> 00:24:27.800
<v Ben Rady>Right. Well, yeah, I do the same thing, but don't look under the covers because I might want to change my mind later,

55
00:24:27.800 --> 00:25:34.720
<v Matt Godbolt>But you know, Hyrum's law, the law that any observable side effect will be used by somebody and become relied on by somebody comes into play. But yeah, the interesting thing is all this information is, is inferred from some information that's in the manuals that you get from Intel, but just an awful lot of reverse engineering by a community of dedicated engineers who are using the performance counters and seeing when things tick up, you know, how, how often does a front end resteer? You know, Intel doesn't really tell me what that particular counter is measuring. Other than that, if it try and keep it as low as possible, but I can run my code and sort of see under what conditions that happens. And I can start building a theory. And, uh, the people like Travis Downs and Agner Fog and Peter Cordes. There's, there's a few really interesting people. I mean like Agner Fog, for example, he's a Danish professor of anthropology, which is like the most strange hobby to have, but I mean, who, who am I to throw stones in that respect? Uh, but yeah, it's really interesting that this is not as well understood even among the people who are really, really interested in, in it.

56
00:25:34.720 --> 00:25:38.940
<v Ben Rady>I mean, is it, are there intellectual property reasons why, I can imagine this is why they don't want to share too much?

57
00:25:38.940 --> 00:25:41.380
<v Matt Godbolt>There are security considerations...

58
00:25:41.380 --> 00:25:42.580
<v Ben Rady>Yeah that too, of course.

59
00:25:42.580 --> 00:26:29.360
<v Matt Godbolt>You know, so just to sort of... I've mentioned about, um, Spectre and Meltdown at the very beginning as being like a scary reason why it might, people might think of this. Well, um, th the sort of like the mechanism by which they attack is absolutely core to that thing I was talking about: you know, when a fault happens, it sort of erases everything that happened after the fault, but we know that several hundred instructions may have actually run based on speculating that the fault wouldn't happen. And then the fault did happen. Now, as I said, this retirement sequence happens strictly in order. So until, uh, an instruction has retired it's as if it didn't really happen so we can throw it away. So you're like, well, how can, how, how can possibly this be a problem, right? Although we may have done some work beyond the fault, um, it will get thrown away.

60
00:26:29.360 --> 00:27:21.940
<v Matt Godbolt>And so it's not visible to me. I can't see past it, right. That is unfortunately not true because also in this system is the CPU data and instruction caches. Those can't be undone, any changes that happen in them. And now I'm not talking about like, data that's changed inside the caches, because it won't actually have been committed. But if someone who has read from main memory at a particular address that will now be in the cache, that value in the cache, and now I can, after that fault has happened, I can go back and say, how quickly can I read this memory address? Oh, it only took three cycles. Therefore it must be in the L1 cache. And the only reason it must be in the L1 cache, is because my little targeted attack that I did over here caused the speculation system to get way ahead of itself and put in some information that that was indicated.

61
00:27:21.940 --> 00:27:59.780
<v Matt Godbolt>Now, I don't know what was in that information, but if I can contrive a situation in which the address of the read corresponds to a secret piece of information that I shouldn't otherwise have access to, I now have a way to sneak the values out by just doing timing on the cache. And this is a very big topic, and I'm really not doing it a proper service here, but that kind of mechanism there there's like the dirty secret is the undo buffer that we have inside the instructions for like, "whoops, I didn't mean to do those things", can't undo the side effects of things being read into the cache.

62
00:27:59.780 --> 00:28:28.960
<v Ben Rady>So this is, you have some, some data structure where the address in memory tells you something important about the data itself, you know, like a, uh, a hash or something like that, where it's like, oh, well, you know, this value gets written to index zero. This one gets written in the next one and next to so forth. If, if you know that, uh, yeah, I guess it's almost like a, um, uh, what is that called? The bloom filter?

63
00:28:28.960 --> 00:29:24.720
<v Matt Godbolt>There are some things, I mean, you can actually do a much more simple thing. Because if you, if you can control the code, that's going to run. And so for example, the canonical example of that is the safety mechanisms around a, say a Javascript JIT engine, right? You're doing an array access. Um, and, uh, you want to, you, you go out of your way to read within the bounds of the array over and over and over and over and over and over and over again. And now the branch predictor has, has learned that you never ever trip outside of that check that, you know, must be in the JIT somewhere that says, "make sure do they don't read outside of this array". Right? I've made myself a 256K array and I've sat there reading a million times out of it. And so now the CPU is essentially it's cold for it to deal with the exceptional case where inside the JS engine it'sgoing to throw an exception to say, out of bounds access, or it's going to return like the undefined or whatever it's going to do, but it has to do something different.

64
00:29:24.720 --> 00:30:03.960
<v Matt Godbolt>Now I happen to know that my buffer, um, at some giant offset from my buffer, if it were to actually try and read there, then I'd be reading some, some protected piece of memory that I shouldn't otherwise be able to access, right. Something outside of my sandbox. And so what I'm gonna do is I'm gonna read my million times inside, inside arrays, and then I'm going to suddenly read out of index. And then as soon as I've got that value out, which I won't never actually happened, but the speculation will get that far. Um, I will then take that value and immediately use it to read one of 256 different cache lines.

65
00:30:03.960 --> 00:30:04.660
<v Ben Rady>qOhhhhhh! I get it!

66
00:30:04.660 --> 00:30:15.060
<v Matt Godbolt>And then I hope that I can get that before the branch predictor goes, whoops, I didn't mean to go this way. And then it all gets undone. And then I, and then I get my undefined returned and I go, oh, cool.

67
00:30:15.060 --> 00:30:46.400
<v Ben Rady>So you're using the value that you read out, index it to do another read, right. Which now, which now will show up in your, uh, in your analysis essentially of like, oh, this, this one was really fast. I was like, I'm going to read some value that I don't know out of, out of memory. And let's say it's three, but you don't know that. Right. And then you're going to read whatever value you read's index into the memory. Right. And that turns out to be three. And now three is fast so the secret value was three, right?

68
00:30:46.400 --> 00:31:28.120
<v Matt Godbolt>Yeah exactly. And there are different ways of achieving and that's called a side ch annel attack. It's like a way of exfiltrating information that isn't through sort of the architectural state of the computer. So we've got a bit derailed there, not inasmuch as we have any railings at any of these things. But one thing I did want to talk about was going back to that big soup of, uh, out of order execution. Um, we had the front end, we can decode a bunch of stuff. We get microaggressions, they get thrown into this giant soup. They've been rewritten to use the internal registers. And now we're trying to use, um, you know, we've got an adder, we've got a multiplier, we've got a divider, we've got multiple of these things, right. There are called execution ports. So there's usually five between five and eight of them in different things.

69
00:31:28.120 --> 00:32:21.700
<v Matt Godbolt>And they can do different things. But the, the sort of the, the goal of the out of order of system is to try and fill up all these parts of the silicon with useful work to do. And as we've said, one of the ways that it can do that is to get way ahead of itself by predicting branches and starting to do loads of work that maybe is totally speculative. Another thing it can do is, um, rewrite the registers so that more seams of independent work can be found in a single stream of instructions. But there's one more clever trick that you can do for very little silicon cost. What if, instead of having a whole other CPU to make on the side, what if I just have another set of registers, not even the internal rewritten ones, just the external registers.

70
00:32:21.700 --> 00:33:20.960
<v Matt Godbolt>You know, I said at the beginning there isn't really a box of called EAX with 123 in. Well, there is, there is one canonical one, and that's in fact what that retirement stage is doing is it's writing the real value of EAX back into the actual real EAX register somewhere. So that anyway, right, but for the small cost of having another program counter and another set of all of the registers of which there are 16 or 32 of them, I can get another "thread" and I'm putting air quotes over this. So what if I were then to say to the front end, every other cycle read from either the first program counter or the second program counter. And when you read from the first program counter tag it to say, these are different registers from the other program counter. So I kind of have two sets of front end state. So now every other cycle I'm reading two completely independent streams of instructions from two different physical places in memory.

71
00:33:20.960 --> 00:33:56.690
<v Matt Godbolt>Once they've been through the rewriter, we know that they won't be able to talk to each other. So like the, the, the stream ones EAX is a different register, physical registered from stream two's EAX, but once they've been rewritten to internal register 1234, or 1235, whatever, it doesn't matter. At that point now we've got twice as much work to do that is guaranteed to be independent because they're from two completely different streams of execution, but I haven't got any more adders or multipliers or dividers or anything. I just throw that into the big soup in the middle of the out of order execution system.

72
00:33:56.690 --> 00:33:56.780
<v Ben Rady>Oh interesting!

73
00:33:56.780 --> 00:34:01.940
<v Matt Godbolt>And naturally we've got two non-overlapping DAGs that are being processed.

74
00:34:01.940 --> 00:34:03.200
<v Ben Rady>Right, right.

75
00:34:03.200 --> 00:34:20.960
<v Matt Godbolt>And provided. We just have the tiny one bit that says the ultimate result of this needs to go back to EAX in hyperthread - and I'm going to call it that now - hyperthread one or hyperthread two, we've got twice as many sources of independent work to do for very little cost.

76
00:34:20.960 --> 00:34:41.600
<v Ben Rady>So, so like very sort of naively here. If I had two programs, one of which was a very long series of adds and the other one was a very long series of subtracts. You could mix those two instructions together and know that they would never overlap and basically do them all in parallel because you know that they're never going to do two adds in the same register at the same time.

77
00:34:41.600 --> 00:35:34.100
<v Matt Godbolt>Correct. Now adds and subtracts happened to fall in the same category because they're exactly the same operation and when it comes down to it, but like if you said multiply and add; absolutely, or linked lists, one's reading through memory really fast or whatever, you know, so you get for "free" in inverted commas, uh, at least a very small cost of this little, extra architectural state. You get twice as much potential for, um, filling up those seven or eight real work ports, these execution ports with useful work that can allow you to make forward progress. And that's what hyper-threading was. So that's another way of getting parallelism. You know, we've got instruction level parallelism, which is itself the ability for the CPU to go out of order and say, well, I can run more than one instruction at a time from a single stream, then you've got hyperthreading, which says, well, I can, interleave more than one CPU's worth of instructions.

78
00:35:34.100 --> 00:36:19.560
<v Matt Godbolt>As long as I just do a tiny bit of accounting to make sure they don't mix up with each other, which has led again to more of the kind of scary, um, uh, problems that you have, uh, like Spectre and Meltdown type things. I forget which ones have, uh, specific to hyperthread siblings, but there are some, um, and then even just at the instructions themselves, um, we can use a single input, multiple data instructions where, you know, you can treat a single instruction as being, well, actually this is acting on eight independent ints at once, but that's a whole other topic, I think for another time, but that's another level of parallelism and that's before we even go to multi, you know, genuinely multithreading where they said we can't make the chip go any faster, but what we can do is we can put eight copies of the chip on a single die.

79
00:36:19.560 --> 00:36:23.560
<v Ben Rady>Where are we in the timeline? What year is it?

80
00:36:23.560 --> 00:36:24.380
<v Matt Godbolt>"Who's the president!?"

81
00:36:24.380 --> 00:36:29.780
<v Ben Rady>Yes. Yeah. We've talked about all these technologies up to this point. What year is it?

82
00:36:29.780 --> 00:37:18.860
<v Matt Godbolt>That's a great, um, it's a great question. So my sort of memory of where I started looking at certainly Intel CPU's was the mid late 90s. Um, and in fact, I was working for a company who were doing some work for Intel at the time. And we got, I remember the, the very first 233 megahertz Klamath got delivered to the, which was the name of the Pentium 2 got delivered to the office. And even though I was the new guy in the office, um, I was tasked with getting it up and running and working, which was, you know, painful, all these strange, um, drivers and things that needed to be installed and actually ultimately ended up inheriting it and taking it home, which was a whole other story. But, um, yeah, in our lunchtime Quake sessions I had by far the best ping, it was brilliant because no one else had a computer.

83
00:37:18.860 --> 00:38:03.200
<v Matt Godbolt>It was even this close. It was wonderful. It didn't make me much better at Quake, but it was a lot of fun at the time. But you know, that, that era, this was the last era where, um, you had any hope of really understanding or just getting help from Intel about how their CPUs worked. Like, I don't actually exactly remember when, but it was at least Pentium 1 where they were told us, rather explicitly there's a U pipe and a V pipe. And if your instructions were like this, they can go into the U pipe. And if they don't like this, they come into the V pipe and then you could sort of sit down and hand arrange these pairs of instructions. So this is very much before the more complicated stuff we've been talking about, um, around 2000 was when the Netburst CPU came out.

84
00:38:03.200 --> 00:38:57.580
<v Matt Godbolt>And that was one of the ones that, that, um, definitely used microcode by that point. So micro, micro operations, um, and, uh, rather interestingly that took a side step down, um, I mean, we didn't talk about micro operations themselves, but you know, the whole decoding thing is right. Quite complicated. And so there's caches for that as well. That took an interesting side step down an avenue of something called trace caches, which is sort of an alternative way of, um, kind of encoding both the branch prediction and, uh, the decode of the stream of instructions that happened when you went to that particular place. And so it was like, there was a cache that said, Hey, the last time you've got to address 1234 with this prediction of what branches might look like, here's just the pre-decoded, micro operations, half the work's already been done for them, just start executing those things and only just abort if you hit the wrong one, if you discover it on the way that you've made a mistake, you have to undo it and go back.

85
00:38:57.580 --> 00:39:46.620
<v Matt Godbolt>And the problem with that is it just didn't work as well as the engineers thought it was going to work. So, um, and at this point we're still in like 32 bit processors for what it's worth, uh, around the, this time though, AMD were working on a 64 bit extension and, uh, there was some kind of argy bargy back and forth between AMD and Intel about like, you know, who owned it and how they could do it. I think 2001 ish, they, they came to an agreement over a patent. And, um, uh, I think that was like a sort of unilateral patent agreement between the two. They wouldn't go after each other for all the things. And so the, um, in 2004, the first 64 bit processor came out and it was an AMD, which was quite a coup I mean, am I imagine lots of heads rolled inside Intel when, when they were beaten.

86
00:39:46.620 --> 00:40:28.580
<v Matt Godbolt>Uh, I think it was another couple of years before the C ore processor came out and that was 64 bit. So yeah, in the sort of those early two thousands is when, like the, most of the sort of micro operation stuff started to be, uh, to my knowledge anyway, um, more, uh, used, uh, you know, my, my real deep dive of this stuff starts in the sort of 2006 plus era of the core, which was, um, a separate offshoot. It was actually the original, that was a team in Israel who were doing the mobile version, the Pentium M and when the, thwhole, uh, trace cache that was inside the Netburst seemed to be a dead end. They stepped up to the plate and said, Hey, well, how about we take the mobile core? It takes my ideas from the trace cache, but move on in a different direction.

87
00:40:28.580 --> 00:41:17.000
<v Matt Godbolt>And that seemed to work out pretty well. Um, yeah, so, so two thousands, 2010 was the, the sort of Sandybridge era, which, um, which had reintroduced ironically something called the micro operation cache, which is different from the trace cache, but morally in the same sort of ballpark, but it didn't try to be as clever. It wasn't storing entire threads of execution. Like, Hey, if you get to this address, then this is what you're going to do until you discover you've made a mistake, which is actually genuinely more like what a JIT does, right. Nowadays a JIT will kind of say, Hey, this is this bit of the code we did this last time. Let's do that. But it was more like a traditional cache, uh, where, um, there's still logic that sort of pumping out which instruction should be next. Then it goes well, rather than go to memory and pull down and decode the instruction.

88
00:41:17.000 --> 00:42:06.100
<v Matt Godbolt>Let's just see if that memory address maps to somewhere that's in our micro operation cache. Oh, it does. Okay. That saves us going out to memory at all and, and doing the decode. So it's like a cache on the other side of the decode. Um, yes. Oh, sorry. I think I made it, I've got a note here. Actually, the, the, the, the 2003 was the AMD Opteron, which was the first 64 bit machine. And in 2004 was the Intel Nocona. So, which was the 64 bit. But, uh, and it would be rude not to mention ARM processors here as well. Um, uh, so the ARM processors started off, um, as I love to say is they were a co-processor to the BBC micro, which was a great fun thing. In fact, the ARM1 was, uh, uh, the sort of like the engineering sample for that was, was a coprocessor and it printed out, hello, I am arm or something with that.

89
00:42:06.100 --> 00:42:57.580
<v Matt Godbolt>And the cool little thing about that story is that, um, they had no idea what the power consumption was gonna be like. And after they'd realized it was booted and running, they realised they hadn't actually turned it on yet. It was running, but it was the IO pins. There was enough parasitic power coming from the IO pins from the device that was plugged into that the thing just was like working. So at that point, I think they realized they were onto kind of a low power thing, which obviously has been a theme ever since for them they're in every cell cell phone ever. Um, but you know, all the levels of sophistication that have happened in Intel have happened in ARM. I haven't followed them as closely, but, you know, the, the, the M1 amazing achievement and the M2, that's going to be coming soon from, from Apple who have licensed the arm ISA, but essentially have completely reimplemented it themselves with all the same kind of tricks and the same depth of pipeline and the same...

90
00:42:57.580 --> 00:43:02.480
<v Ben Rady>I didn't realize the M one came from ARM's legacy. That's interesting, right? Yeah. These

91
00:43:02.480 --> 00:43:51.960
<v Matt Godbolt>Are all, I mean, they have to pay arm for the ISA to license. The, essentially the encoding of the instructions is what they're paying for. I believe in like the idea of it, otherwise, you know, I'm sure they would have preferred to do everything themselves, but it's only so far you can go. But, um, yeah, it's, it's uh, and then the other thing about the M series is that they've gone for, uh, both a big and a small core. So they've actually gone from a heterogeneous core. So at the moment, like any Intel CPU, if you've got eight CPU's, they're all exactly the same. But in, in an M1, you've got the Firestorm core, which I think is the big one. And the Icestorm core, which is like a small one, and there's like four of each. And so there's some, some clever power management stuff that the operating system can do to say, well, let's just put these lower priority things on the lower power consumption CPU that isn't as fast, whereas this graphics editing code or whatever can run on the other one.

92
00:43:51.960 --> 00:44:47.540
<v Matt Godbolt>And I'm super vague on this, but I just know that it exists and it's, it's, it's coming, I think, to everywhere, you know? So, so, yeah, there's an awful lot of things that go on inside that little, little chip. And, you know, we haven't even talked about how the caches work, we, how the branch predictor works. I often joke that if we were to somehow come up with a way of getting the previous year's worth of lottery numbers, encoded as a sequence of jumps, we could probably use the branch predictor to predict the next set, and then I can retire because they are so good. They are just so clever. And the levels of which the engineers go to. And obviously this is into trade secret area. So reverse engineering, those branch predictors is, is, is challenging. And even just getting the idea so that the, the pipelines now are so big and weighty that the, um, like the old school idea about what a branch predictor is, is like, is this branch taken or is this not taken?

93
00:44:47.540 --> 00:45:34.100
<v Matt Godbolt>That's kind of what, how I've described it, but really even the, the, is it a branch, right? I've just pulled out 16 bytes of memory. And certainly for Intel, is there a branch in there and does it have a destination, like an unconditional branch, a jump? Um, if I have to wait five or six cycles for it to get down all the various pre-decode code steps and all those bits and pieces before it gets to the actual decode, it goes, oh, sugar, there's a jump, an unconditional jump to 100 here. Then I have to resteer, all those four or five steps beforehand. So there's a predictor for the destination of a branch or even if there exists a branch in just an arbitrary block of code. And so before anything happens at all, there's this sort of magic eight ball that the BPU shakes, and it goes, "there will be a branch in the next fetch".

94
00:45:34.100 --> 00:46:27.920
<v Matt Godbolt>It goes "brilliant. Okay. Well, the next one we're not going to get from where we think it's going to come from, we're going to get from like 100 and off we go". It's just so clever. So clever and so much, so much of this stuff that we take for granted, we just write code and it runs, and largely we can ignore it, which is a testament to, to, to, uh, where we are. But I have a cool little example where you can demonstrate the effects of the branch predictor, even in Python, right? You can write a relatively straightforward piece of code that runs. I'm not going to say a lot faster, but measurably reliably measurably faster one way than another. And they're identical workloads. It's just whether or not the branch is predictable. This is the heart of it. And that's like 10 layers deep inside the Python interpreter for heck's sake. So it does affect us all. Um, it's just interesting to know about, and very rarely does it actually make a difference, but it's cool to know that it's there.

95
00:46:27.920 --> 00:46:30.840
<v Ben Rady>Hmm. Well, I feel like I've learned a lot today.

96
00:46:30.840 --> 00:46:39.340
<v Matt Godbolt>It's been fun, and I mean, if people are interested in hearing more of this stuff, we can dig into something else along this line another time, or,

97
00:46:39.340 --> 00:46:46.000
<v Ben Rady>Yeah, this is, this is, there's like two or three other podcasts: this is really good stuff!

98
00:46:46.000 --> 00:47:23.040
<v Matt Godbolt>I'd like to shout out as well. There's a friend of mine's podcast who, uh, he and a friend are doing far, far deeper and far more authoritative, uh, podcasts on these kinds of subjects. So I, if you're interested in this stuff, um, I would written heartedly recommending going to TLB hit, which is https://tlbh.it/ and subscribing to JF Bastien's and Uh, and oh, I forget his co-host's name, but their podcast is brilliant. Um, but ours is also going to cover some of these topics, hopefully, maybe to just a different way into a different kind of audience.

99
00:47:23.040 --> 00:47:34.560
<v Ben Rady>Well, you definitely getting the, uh, getting the layman's perspective on some of this. Hopefully I'm doing that Larry King thing where I ask the questions that people are wondering in their heads, but we'll see.

100
00:47:34.560 --> 00:47:37.620
<v Matt Godbolt>Fantastic. Well, I guess until next time!

101
00:47:37.620 --> 00:47:40.620
<v Ben Rady>Until next time.