WEBVTT

1
00:00:20.580 --> 00:00:21.360
<v Matt Godbolt>Hi, Ben.

2
00:00:21.360 --> 00:00:22.520
<v Ben Rady>Hi, Matt.

3
00:00:22.520 --> 00:00:23.880
<v Matt Godbolt>How are you doing?

4
00:00:23.880 --> 00:00:25.220
<v Ben Rady>I'm doing pretty well.

5
00:00:25.220 --> 00:00:27.960
<v Matt Godbolt>We've just had a moment of exasperation, haven't we?

6
00:00:27.960 --> 00:00:32.660
<v Ben Rady>Yeah, we're probably talking over the intro. Probably not at this point, but maybe earlier.

7
00:00:32.660 --> 00:00:43.740
<v Matt Godbolt>Maybe earlier. I don't know. Yeah, that... it'll... I'll fix it in post, as I always try to do, or increasingly less. In fact, I try.

8
00:00:43.740 --> 00:00:58.200
<v Matt Godbolt>I think I've kidded myself that the charm is all about like the fact that a dog's barking in the background and there are children screaming across the road and... we lip smack and other things.

9
00:00:58.200 --> 00:00:59.950
<v Matt Godbolt>It's kind of ASMR actually, isn't it?

10
00:00:59.950 --> 00:01:00.060
<v Ben Rady>Yeah.

11
00:01:00.060 --> 00:01:14.520
<v Matt Godbolt>No. Yeah. ASMR. So recently there's been a lot of talk about the Dutch manufacturing company that makes the lithographic systems for chip manufacture. And they're called ASML.

12
00:01:14.520 --> 00:01:17.140
<v Matt Godbolt>And I just can't help but get the two things mixed up.

13
00:01:17.140 --> 00:01:17.200
<v Ben Rady>Ah, every time.

14
00:01:17.200 --> 00:01:23.040
<v Matt Godbolt>They're very different things, but that's not what we came here to talk about today. Is it?

15
00:01:23.040 --> 00:01:27.900
<v Ben Rady>It's not, it's not. I have an idea. I haven't... I have a thing.

16
00:01:27.900 --> 00:01:30.800
<v Matt Godbolt>You have an idea. Okay.

17
00:01:30.800 --> 00:01:31.820
<v Matt Godbolt>Tell me about your thing.

18
00:01:31.820 --> 00:01:32.180
<v Matt Godbolt>Whoa.

19
00:01:32.180 --> 00:01:34.320
<v Ben Rady>Let's say, careful now.

20
00:01:34.320 --> 00:01:41.280
<v Ben Rady>Let's say for the sake of argument, I have a program.

21
00:01:41.280 --> 00:01:43.640
<v Ben Rady>Stop it, phone.

22
00:01:43.640 --> 00:01:43.700
<v Matt Godbolt>There we are.

23
00:01:43.700 --> 00:01:43.920
<v Ben Rady>I have, exactly.

24
00:01:43.920 --> 00:01:49.780
<v Matt Godbolt>Some more charm. and More charm has just entered the room. His phone, Ben's phone, unmuted phone.

25
00:01:49.780 --> 00:01:56.600
<v Ben Rady>I have a program and I want it to run fast. How would I do that?

26
00:01:56.600 --> 00:02:07.080
<v Matt Godbolt>Oh, wow. That is a great question. And we only have around about 30 minutes, 40 minutes to talk about it. So I've got a feeling.

27
00:02:07.080 --> 00:02:10.140
<v Ben Rady>So immediately this is gonna be our first 10 part series.

28
00:02:10.140 --> 00:02:25.120
<v Matt Godbolt>Maybe so. I don't know. I don't know. This is one of those questions because I think obviously deliberately you have asked me a leading and vague question. Because what do you mean by, in fact, each of those words, right?

29
00:02:25.120 --> 00:02:29.070
<v Ben Rady>Can you please define each of the words that you just used in that sentence?

30
00:02:29.070 --> 00:02:29.210
<v Matt Godbolt>Yeah.

31
00:02:29.210 --> 00:02:29.340
<v Matt Godbolt>Right.

32
00:02:29.340 --> 00:02:29.740
<v Ben Rady>Yes, exactly.

33
00:02:29.740 --> 00:02:32.720
<v Matt Godbolt>My program, make my program fast.

34
00:02:32.720 --> 00:02:33.340
<v Ben Rady>Right.

35
00:02:33.340 --> 00:02:44.840
<v Matt Godbolt>All right. So I think make, we can, yeah, cause it to become my program. You have something that you wish to go fast. Right. So the only word we really need to define then is fast.

36
00:02:44.840 --> 00:02:48.620
<v Matt Godbolt>And I'm pretty sure you don't mean stopping eating for a period of time.

37
00:02:48.620 --> 00:02:54.420
<v Ben Rady>Right. Well, I think you could also define my program because in this hypothetical thing is I haven't written anything yet.

38
00:02:54.420 --> 00:02:54.700
<v Matt Godbolt>Oh,

39
00:02:54.700 --> 00:02:54.980
<v Matt Godbolt>oh

40
00:02:54.980 --> 00:02:55.500
<v Ben Rady>Right.

41
00:02:55.500 --> 00:03:07.760
<v Ben Rady>It's like, I want to write a program that doesn't exist, but I want the execution of that program to be fast for some definition of fast that we will define shortly.

42
00:03:07.760 --> 00:03:07.860
<v Matt Godbolt>okay.

43
00:03:07.860 --> 00:03:21.150
<v Matt Godbolt>Right. And so, yes, not no eating, not also unmoving, which is the weird other case of you know stuck fast, which is like one of those really weird auto-antonymic words, you know, like inflammable or something like that.

44
00:03:21.150 --> 00:03:21.500
<v Ben Rady>Right. Right. Uh-huh. Yes. Stuck fast. It's the opposite of this. Yeah. Uh-huh.

45
00:03:21.500 --> 00:03:22.440
<v Ben Rady>Right.

46
00:03:22.440 --> 00:03:26.400
<v Matt Godbolt>You mean quick, to go quickly. All right. So let's talk about what that could mean. Right.

47
00:03:26.400 --> 00:03:26.780
<v Ben Rady>Yes.

48
00:03:26.780 --> 00:03:39.540
<v Matt Godbolt>So the instant you say that and because of what we've been doing professionally for a long time and for what I've been doing most of my career, I think of low latency as fast.

49
00:03:39.540 --> 00:03:50.320
<v Matt Godbolt>Like if you if you decide you need to do a thing, if your program needs to do a thing and it discovers that that need, it should act upon it as quickly as possible having discovered that need.

50
00:03:50.320 --> 00:03:50.820
<v Ben Rady>Mm-hmm.

51
00:03:50.820 --> 00:04:01.380
<v Matt Godbolt>That is what I mean by sort of low latency. It reacts to a real-time thing as quickly as possible. And I think a lot of embedded systems work in this world. A lot of UI, in fact, is this, right?

52
00:04:01.380 --> 00:04:05.940
<v Matt Godbolt>If you said, hey, I want my, you know responsiveness in a UI is about latency, right?

53
00:04:05.940 --> 00:04:06.040
<v Ben Rady>Right.

54
00:04:06.040 --> 00:04:08.790
<v Matt Godbolt>But that's one form of fast.

55
00:04:08.790 --> 00:04:09.940
<v Ben Rady>Mm-hmm.

56
00:04:09.940 --> 00:04:21.200
<v Matt Godbolt>But there's a very real, very important other kind of fast, which is I am... I'm writing a distributed message passing system.

57
00:04:21.200 --> 00:04:27.100
<v Matt Godbolt>You know I'm... I've Mastodon, right? I've... and I'm dealing with an enormous volume of information.

58
00:04:27.100 --> 00:04:27.100
<v Ben Rady>Yeah.

59
00:04:27.100 --> 00:04:40.000
<v Matt Godbolt>And if I was slow at doing that, eventually I would back up and there would be I would never keep up with with the current... Well, it doesn't matter. Individual messages may take a few seconds to go through if that's not fast.

60
00:04:40.000 --> 00:04:46.280
<v Matt Godbolt>I mean, maybe that is fast for that scalar network, right? But you're talking about a throughput problem, you know, or bank clearing is another one.

61
00:04:46.280 --> 00:04:46.320
<v Ben Rady>Right.

62
00:04:46.320 --> 00:04:47.340
<v Matt Godbolt>They've kind of got a bit of latency.

63
00:04:47.340 --> 00:04:47.500
<v Ben Rady>Mm-hmm.

64
00:04:47.500 --> 00:04:58.160
<v Matt Godbolt>You beep your card when you're at the ah the coffee shop and it goes beep and sort of you've got it waiting six or seven seconds and that's a completely fine human amount of time. You know, it's an eternity for computers.

65
00:04:58.160 --> 00:05:07.730
<v Matt Godbolt>But on the other hand, there are... several hundred million people in that same five second period tapping their card on their reader in the same time.

66
00:05:07.730 --> 00:05:07.920
<v Ben Rady>Right, right.

67
00:05:07.920 --> 00:05:17.300
<v Matt Godbolt>And so there's a lot of things to go. So those things, so we've got latency and throughput. And are there any other aspects?

68
00:05:17.300 --> 00:05:25.980
<v Ben Rady>I think for the definition of this conversation, I think latency and throughput are two dimensions of fast which we should consider.

69
00:05:25.980 --> 00:05:27.440
<v Matt Godbolt>Okay. Okay.

70
00:05:27.440 --> 00:05:32.180
<v Ben Rady>I think you could maybe come up with some other ones, but again, we have 30 minutes.

71
00:05:32.180 --> 00:05:37.110
<v Matt Godbolt>Well, I mean, again, we can make this into more. but So so do you want which one do you want to start with?

72
00:05:37.110 --> 00:05:37.440
<v Ben Rady>That's true.

73
00:05:37.440 --> 00:05:39.440
<v Ben Rady>I think we should start with latency.

74
00:05:39.440 --> 00:06:00.020
<v Matt Godbolt>Okay. Okay. So then we get into this sort of amazing set of kind of questions that we get into when we're dealing with computers because how fast is fast now? Right, how quick is – like we just said, if I'm a human waiting at a sales kiosk,

75
00:06:00.020 --> 00:06:08.700
<v Matt Godbolt>If it takes me 30 seconds to pay for something, that is not fast. If it takes me three seconds, it's probably okay.

76
00:06:08.700 --> 00:06:09.360
<v Ben Rady>Mm-hmm.

77
00:06:09.360 --> 00:06:13.700
<v Matt Godbolt>If it takes 0.3 seconds, I won't notice that that anything happened, really.

78
00:06:13.700 --> 00:06:13.780
<v Ben Rady>Yeah.

79
00:06:13.780 --> 00:06:22.470
<v Matt Godbolt>There's a beep and I'm out of there. And any other order of magnitude less than that, like 0.03 seconds or whatever, is immaterial to me at that point, and and it doesn't matter.

80
00:06:22.470 --> 00:06:22.550
<v Ben Rady>Right.

81
00:06:22.550 --> 00:06:22.640
<v Ben Rady>Mm-hmm.

82
00:06:22.640 --> 00:06:35.280
<v Matt Godbolt>But you and I work in a world where at least some of the things that we have done or are adjacent to, we are talking about maybe seven or eight orders of magnitude faster than those three seconds.

83
00:06:35.280 --> 00:06:35.600
<v Ben Rady>Mm-hmm.

84
00:06:35.600 --> 00:06:39.800
<v Matt Godbolt>We're talking... Hundreds of nanoseconds, hundreds of microseconds.

85
00:06:39.800 --> 00:06:39.940
<v Ben Rady>Yeah.

86
00:06:39.940 --> 00:06:44.730
<v Matt Godbolt>And I have to remind myself that a nanosecond is a billionth of a second, right?

87
00:06:44.730 --> 00:06:44.820
<v Ben Rady>Yeah. Yeah.

88
00:06:44.820 --> 00:06:56.520
<v Matt Godbolt>You know, sometimes like we throw these words around. I mean, maybe this is just, again, you and I and the weird world that we live in. But like nanoseconds are kind of a completely normal measurement for a, you know, three gigahertz computer.

89
00:06:56.520 --> 00:06:56.800
<v Ben Rady>Right.

90
00:06:56.800 --> 00:07:07.560
<v Matt Godbolt>A few clock cycles is a few nanoseconds. That's completely normal. Yeah, that's reasonable. And, you know, like to, you know, good old Grace Hopper's, you know, light foot measurement, that a nanosecond is.

91
00:07:07.560 --> 00:07:07.560
<v Ben Rady>Yeah. Yeah. Uh-huh.

92
00:07:07.560 --> 00:07:20.880
<v Matt Godbolt>Here's a foot of copper wire. That's a nanosecond. You're like, oh, shucks, man. You know, that's the speed of light. It's the fastest thing that there is can only go a couple of feet in the time it takes me to add two 64-bit numbers together.

93
00:07:20.880 --> 00:07:24.380
<v Matt Godbolt>That is insane, frankly, that we're at that stage.

94
00:07:24.380 --> 00:07:36.620
<v Ben Rady>It is insane. You know, whenever I think about that, I just think about Grace's wire just twisted around on the inside of the CPU and all the places where it would need to go in order for the light to move in that way to add those two numbers.

95
00:07:36.620 --> 00:07:37.980
<v Ben Rady>And it just makes my brain melt.

96
00:07:37.980 --> 00:07:56.370
<v Matt Godbolt>It's, yeah, it is quite something. And the fact that, yeah, we've taken these Factorio factories and we've crushed them down into the size of my you know thumbnail and somehow snaking around in there are many, many, many, many hundreds of feet of cable.

97
00:07:56.370 --> 00:07:57.460
<v Ben Rady>Yeah. Yeah.

98
00:07:57.460 --> 00:08:05.200
<v Matt Godbolt>As you say, each individual thing. Electron is – well, they don't go all the way down. It's complicated, right?

99
00:08:05.200 --> 00:08:07.090
<v Matt Godbolt>Let's stop pretending that we can talk about that as well.

100
00:08:07.090 --> 00:08:07.440
<v Ben Rady>Yeah.

101
00:08:07.440 --> 00:08:28.360
<v Matt Godbolt>So, yeah, we've got many orders of magnitude here. So, you know, we could break it down into things that need to be measured in nanoseconds, in microseconds, in milliseconds, and seconds. And, you know, that's, I think, probably – after that, I think, you know, the latency of something – maybe, I mean, maybe you still can measure it in minutes, right? I mean, let's think about like you know publishing a video on YouTube.

102
00:08:28.360 --> 00:08:40.750
<v Matt Godbolt>I've just updated my video on YouTube and I click the button to publish it and it takes how many minutes to transcode it to all the different formats that YouTube does before it becomes visible. Maybe that is measured in minutes and maybe I do care about that latency.

103
00:08:40.750 --> 00:08:41.350
<v Ben Rady>Mm-hmm.

104
00:08:41.350 --> 00:08:52.740
<v Matt Godbolt>But I think the thing that I can confidently talk about, and maybe you as well, is the sub-second orders of magnitude latency.

105
00:08:52.740 --> 00:08:52.940
<v Ben Rady>Yeah, yeah, yeah.

106
00:08:52.940 --> 00:08:55.800
<v Matt Godbolt>And so let's talk about what can you do in that amount of time?

107
00:08:55.800 --> 00:08:57.960
<v Ben Rady>Right, right.

108
00:08:57.960 --> 00:09:14.400
<v Matt Godbolt>What, you know, that's a good starting point of like, only, you know, if you say I would like my program to be fast without even knowing what your program is, I've got some bounds on what you could reasonably expect to do in hundreds of nanoseconds, hundreds of microseconds, hundreds of milliseconds, right?

109
00:09:14.400 --> 00:09:14.530
<v Ben Rady>Mm-hmm.

110
00:09:14.530 --> 00:09:15.220
<v Ben Rady>Right, right, right, right, yeah.

111
00:09:15.220 --> 00:09:26.150
<v Matt Godbolt>On commodity hardware, if we're just talking about things like a regular PC or a regular, you know, so a decent server in a data center somewhere or a laptop or something like that, they're all within spitting distance of each other.

112
00:09:26.150 --> 00:09:26.320
<v Ben Rady>Right.

113
00:09:26.320 --> 00:09:33.460
<v Ben Rady>Yeah. So if we think of latency as sort of like the time it between an input is provided and an output is produced,

114
00:09:33.460 --> 00:09:33.900
<v Matt Godbolt>Yeah, sounds good.

115
00:09:33.900 --> 00:10:24.060
<v Ben Rady>then automatically down in the like tens of nanoseconds range, you're sort of like, what output could you possibly consume or what input could you possibly consume and what output could you possibly produce in that span of time, right? Because it's got to go into the computer and then come out. So like maybe a theoretical example here would be like a robotics application where you have some like active control device of a a system or or you know a set of servos or something like that where you're trying to like, you know, imagine like the, you know, like the robotic dogs type thing where you're just like, you know, taking like some information from like the sensor on ah on ah on a robot and then feeding it back and using that to adjust. Like that's already going to be up probably in the like low milliseconds range, probably.

116
00:10:24.060 --> 00:10:26.740
<v Ben Rady>I mean, I don't know. I'm not an expert in robotics, but

117
00:10:26.740 --> 00:10:37.220
<v Matt Godbolt>I'm not sure either. No, but I mean, you think about things that do like autobalance, or you know, drones, like a still that can't be milliseconds because it's already upside down by the time a millisecond of wind has caught it.

118
00:10:37.220 --> 00:10:37.220
<v Ben Rady>Yeah, yeah.

119
00:10:37.220 --> 00:10:37.220
<v Ben Rady>Yeah, yeah, yeah, yeah, yeah.

120
00:10:37.220 --> 00:10:54.880
<v Matt Godbolt>Right. So they must be a little faster than that. But, you know, maybe we're talking microseconds at that level. It seems not totally unreasonable to have some sensor update and put something on a bus internal to a system on a chip kind of thing that can then go, oh, wait a second, I need to speed up this rotor a little bit more to account for that.

121
00:10:54.880 --> 00:10:55.140
<v Ben Rady>Mm-hmm, mm-hmm.

122
00:10:55.140 --> 00:10:55.860
<v Matt Godbolt>Right. But you you're absolutely right.

123
00:10:55.860 --> 00:11:21.520
<v Ben Rady>Yeah. Cause a lot of those times those outputs can be sort of like targets, right? So they don't, it's not necessarily that it needs to go like all the way out to like a servo or something. It's just kind of like, okay, I've gotten this input and now I need to adjust, you know, my targeted angle here. Right. And there's maybe even some other external system that is like actually making the adjustments to the physical objects, but like you might want that target to update on the order of,

124
00:11:21.520 --> 00:11:24.020
<v Ben Rady>you know, tens of mics or hundreds of mics.

125
00:11:24.020 --> 00:11:24.820
<v Matt Godbolt>Right.

126
00:11:24.820 --> 00:11:31.420
<v Ben Rady>Just because when you do, you get sort of like a smoother, better prediction of like where the arm needs to be or something like that.

127
00:11:31.420 --> 00:11:35.590
<v Matt Godbolt>Right, right, right, right. And yeah, so I mean, let's to sort of break it down to consumer hardware.

128
00:11:35.590 --> 00:11:35.600
<v Ben Rady>Yeah.

129
00:11:35.600 --> 00:11:52.950
<v Matt Godbolt>It is difficult to get a message in and out of a computer in, you know, I mean, like you just do a ping, right? You ping a computer on your local network, and it's less than a millisecond, but it's, you know, still hundreds and hundreds of microseconds probably to go through commodity hardware.

130
00:11:52.950 --> 00:11:53.280
<v Ben Rady>Yeah.

131
00:11:53.280 --> 00:12:03.540
<v Matt Godbolt>And, you know, for the kind of weird stuff that you and I have been involved with, there is some very specialist hardware that happens, which we can't really talk about but on the far end into the nanosecond range of things. So I suppose if we're talking about

132
00:12:03.540 --> 00:12:03.540
<v Ben Rady>Mm-hmm.

133
00:12:03.540 --> 00:12:21.800
<v Matt Godbolt>Things that most folks could reasonably get, put their fingers on a keyboard and program, be it a little embedded system like an Arduino or an ARM-y type thing, or indeed a, you know, a program like I'm typing on a computer and I'm expecting, when having pressed a key press to see the key appear in the terminal, that kind of thing.

134
00:12:21.800 --> 00:12:22.770
<v Ben Rady>Mm-hmm.

135
00:12:22.770 --> 00:12:31.420
<v Matt Godbolt>we we've We've kind of picked now our range. Now we're talking probably hundreds of mics to low millis.

136
00:12:31.420 --> 00:12:33.500
<v Ben Rady>Mm-hmm. Mm-hmm. So I've got a program.

137
00:12:33.500 --> 00:12:34.220
<v Matt Godbolt>So, Yeah, so we're down.

138
00:12:34.220 --> 00:12:34.320
<v Ben Rady>It's going to...

139
00:12:34.320 --> 00:12:36.620
<v Matt Godbolt>Yeah, that's right. I forgot where we were now. yeah

140
00:12:36.620 --> 00:12:51.380
<v Ben Rady>It's gonna read some input delivered from outside the computer, and this is a commodity computer, you know, like, and then it's gonna produce some output that is also going to leave the computer, right?

141
00:12:51.380 --> 00:12:52.380
<v Matt Godbolt>Yep. Yeah.

142
00:12:52.380 --> 00:12:59.360
<v Ben Rady>And we're thinking that this is gonna have to occur, it's gonna take at least hundreds of mics for that to occur. Is that a fair statement?

143
00:12:59.360 --> 00:12:59.780
<v Matt Godbolt>I think so. Without without getting...

144
00:12:59.780 --> 00:13:03.340
<v Ben Rady>So that's like the upper bound of how fast we can make it for my definition of fast.

145
00:13:03.340 --> 00:13:16.160
<v Matt Godbolt>I think so. I mean, you know, there are exotic techniques and things that you can even use on commodity hardware that is a little bit more exotic, like DPDK, which is a sort of network bypass thing that you can fiddle with.

146
00:13:16.160 --> 00:13:30.660
<v Matt Godbolt>And if you put a second network card in your computer, you can go faster. But yeah, I think let's say, yeah, already, whatever you're doing, if you're responding to an external network event, you're doing any kind of reaction to it and sending it back out again, we're in the...

147
00:13:30.660 --> 00:13:30.660
<v Ben Rady>Yeah.

148
00:13:30.660 --> 00:14:07.500
<v Matt Godbolt>Microsecond region but probably hundreds of microsecond region would be my instinct and maybe I'm wrong, maybe I've forgotten how, you know, computers are these days. But, or at least like normal computers, right. In normal computers, I mean actually I'm... I was telling you before we started recording I'm surrounded with networking gear and computers and things and AI agents running things for me as I'm setting up. I'm actually upgrading my home network for up to like two and a half gig because that's actually something that is not totally unaffordable for the home these days now and so I'm like, hey, maybe my ideas about what commodity hardware is is out of date but...

149
00:14:07.500 --> 00:14:07.730
<v Ben Rady>Yeah.

150
00:14:07.730 --> 00:14:07.850
<v Ben Rady>Yeah.

151
00:14:07.850 --> 00:14:08.750
<v Ben Rady>No one knows how computers actually work.

152
00:14:08.750 --> 00:14:09.000
<v Ben Rady>Yeah.

153
00:14:09.000 --> 00:14:22.660
<v Ben Rady>Yeah. So my first question would be when I'm writing my program, how do I make sure, because my assumption here is that the input here is unpredictable. It's not like you're gonna start the program and immediately be provided the input.

154
00:14:22.660 --> 00:14:24.330
<v Ben Rady>The program is running,

155
00:14:24.330 --> 00:14:25.280
<v Matt Godbolt>Mm-hmm. Yeah.

156
00:14:25.280 --> 00:14:42.380
<v Ben Rady>There's some thing outside of the computer that creates this input, and then the program has to react and create an output, right? So my first question to you is how do I make sure that there's the minimum possible latency between the generation of that signal and my the code in my program running?

157
00:14:42.380 --> 00:14:42.380
<v Matt Godbolt>Yeah.

158
00:14:42.380 --> 00:14:45.800
<v Ben Rady>Something, something ePoll, right?

159
00:14:45.800 --> 00:14:59.180
<v Matt Godbolt>Well, that's... No, exactly right. No, well, yes and no. I mean, that goes into the DPDK kind of sort of thing I alluded to. So yeah, you know a traditional server program that is maybe blocked on a socket.

160
00:14:59.180 --> 00:15:11.800
<v Matt Godbolt>So you've got a socket, you've got, you know let's say UDP just to make things maybe easier. I don't know, right? You've got a blocking call to receive, which means that your process has gone to sleep

161
00:15:11.800 --> 00:15:12.240
<v Ben Rady>Mm-hmm.

162
00:15:12.240 --> 00:15:30.600
<v Matt Godbolt>Told the operating system, hey, anything interesting coming in on this port, this UDP port, I'd like to know, please. And then I've gone to sleep and I've been descheduled. And now I've just been chucked in the list of all the things that are currently, all the processes that are currently asleep and have nothing to do.

163
00:15:30.600 --> 00:16:01.140
<v Matt Godbolt>So that's where we are now. Packet comes in, goes through the standard kernel process of saying, well, who's this for? Oh, it's for port 9007. That is registered to this process. So we're going to write it into a buffer now rather than discard it, or rather we keep it probably around in a buffer that exists. First of all, is there space in that buffer? Yes, there is. Okay, this is something you can configure for each socket, how much space that you can have, which gets complicated if you're sharing multiple processes listening on the same.

164
00:16:01.140 --> 00:16:17.820
<v Matt Godbolt>It's, yeah. Anyway, and then we go, well, who cares about this, right? We've put it into the buffer now. We've copied the data. Who cares? Oh, there's a process. Okay, well, this process is waiting on this, so we should mark it as ready. This process is now ready to be scheduled. And then...

165
00:16:17.820 --> 00:16:30.300
<v Matt Godbolt>You're done. And then as part of the whole, you're done process, the kernel goes back to, well, I guess I'm now finished, but back to the user process. Now who, who should be run?

166
00:16:30.300 --> 00:17:01.850
<v Matt Godbolt>And maybe there was something else that was going on on that CPU when the packet came in and that other process is actually still higher priority and whatever queuing fairness there is, in which case it will merrily do whatever it was doing before until it runs out of the time slice that it had allocated it and then it goes, oh, someone else's go now, oh look this thing's ready. So but what I'm sort of getting at here is there is a lot of things that happen between that packet arriving in your, even in your kernel, and your process being woken up to do anything about it, and that's a hugely variable amount of work too.

167
00:17:01.850 --> 00:17:02.700
<v Ben Rady>Mm-hmm. Mm-hmm.

168
00:17:02.700 --> 00:17:22.620
<v Matt Godbolt>And you can definitely reduce the amount of work. So the amount of other things that the CPU or the kernel is trying to do on a CPU so that your process is more likely to be woken up. There are schedulers that let you set these things or priorities and stuff, but,

169
00:17:22.620 --> 00:17:37.620
<v Matt Godbolt>The way that I intuitively do these type of things is using some kind of, yeah, like we said, DPDK, which is this sort of networking layer that lets you use shared memory with the network card.

170
00:17:37.620 --> 00:17:47.540
<v Matt Godbolt>And then instead of the kernel being involved at all, you just keep continuously reading a place of memory that the network card will write directly to and say, hey, I've got a packet for you. And then you just pick it up.

171
00:17:47.540 --> 00:17:57.950
<v Matt Godbolt>And then essentially your latency goes down to how fast a PCIe bus transaction can notify that a piece of memory has changed and you pick it up and off you go.

172
00:17:57.950 --> 00:17:58.080
<v Ben Rady>Mm-hmm.

173
00:17:58.080 --> 00:18:19.180
<v Matt Godbolt>Right. But that is still fairly esoteric. So it seems like that's not a fair thing. But I guess it's the difference between... right, if the program part that we've not even started talking about now is itself going to take tens of milliseconds because of the innate amount of work that it has to do.

174
00:18:19.180 --> 00:18:32.550
<v Matt Godbolt>And so the upper bound is dominated by its processing time. Then maybe the effort and all the work to put in shaving tens of microseconds off on the like the interrupt handling time isn't worth it, right?

175
00:18:32.550 --> 00:18:32.660
<v Ben Rady>Mm-hmm. Mm-hmm.

176
00:18:32.660 --> 00:19:06.920
<v Matt Godbolt>It's a lot of engineering effort. It's a lot of complexity. It's a lot of non-standard nonsense. But if we wanted to make something go as fast as possible, those are the kind of things that I would take to doing is to take the kernel out of the equation and burn a CPU core just spinning, doing nothing other than saying, is there anything to do? How about now? How about now? How about now? In a tight loop and potentially reschedule everything else so they can't run on the same CPU as me. There are ways that you can do this with CPU sets with isolation.

177
00:19:06.920 --> 00:19:22.300
<v Matt Godbolt>Trickery, isolcpus in a kernel command. There's a lot of weird things you can do. Anyway, right. And then you can get it down to, in the limit, single digit microseconds between a packet arriving and you going, okay, I have work to do.

178
00:19:22.300 --> 00:19:46.020
<v Ben Rady>Okay. So if we assume that the work that needs to be done is relatively simple, right? You're going to do some adds, you're going to do some multiplies, you're going to try to structure your code so that you're really only doing adds and multiplies, and then you're going to produce some output that is purely a function of that input and whatever state you may have held in memory.

179
00:19:46.020 --> 00:20:05.100
<v Matt Godbolt>All right. So what you're talking about, what we've got is an ALU as a service. You know, we are, have you seen this? There's a guy who's written a microservices based CPU emulator where every single instruction is handled by a different microservice, each of which is written in a different language and the thing runs and it's so... but yeah, isn't it cool?

180
00:20:05.100 --> 00:20:07.520
<v Ben Rady>Oh my God, that's amazing. Man,

181
00:20:07.520 --> 00:20:10.680
<v Matt Godbolt>I love that people think of these things.

182
00:20:10.680 --> 00:20:13.760
<v Ben Rady>you think you were in microservice hell. You should.

183
00:20:13.760 --> 00:20:20.810
<v Matt Godbolt>Yeah, you've got like literally 256 different things and a RAM service, you know, memory service so that everyone can agree on this.

184
00:20:20.810 --> 00:20:21.180
<v Ben Rady>Oh, that's amazing.

185
00:20:21.180 --> 00:20:24.550
<v Matt Godbolt>It's cool. but So you're talking about something like that. Let's just say you know it is...

186
00:20:24.550 --> 00:20:24.600
<v Ben Rady>Yeah.

187
00:20:24.600 --> 00:20:36.590
<v Matt Godbolt>In fact, honestly, probably a decent... Unless you want to... I'm going to say these and you can tell me if where you're trying to steer it is right or not. But like something like a Redis memory cache is like one of these things.

188
00:20:36.590 --> 00:20:36.680
<v Ben Rady>Yeah. Right.

189
00:20:36.680 --> 00:20:39.440
<v Matt Godbolt>It's like something comes in, I look it up in RAM and I'm like, yes, I've got it.

190
00:20:39.440 --> 00:20:39.480
<v Ben Rady>Sure.

191
00:20:39.480 --> 00:20:46.390
<v Matt Godbolt>Here it is. It's like a small piece of data. You know, a key value store that is just, you know, like a cache, right?

192
00:20:46.390 --> 00:20:47.120
<v Ben Rady>Yeah.

193
00:20:47.120 --> 00:20:48.740
<v Matt Godbolt>You're not doing very much work.

194
00:20:48.740 --> 00:20:49.100
<v Ben Rady>Yeah.

195
00:20:49.100 --> 00:20:54.640
<v Matt Godbolt>It's not as as a banal as like I'm adding the three numbers you've given to me and giving them back to you.

196
00:20:54.640 --> 00:20:54.780
<v Ben Rady>Right.

197
00:20:54.780 --> 00:20:58.760
<v Matt Godbolt>It is, that you can imagine there is actual value in that, but it's still, you know.

198
00:20:58.760 --> 00:21:08.540
<v Ben Rady>Yeah. I'm trying to draw a line about the the other parts of the computer that you're going to be talking to. And what I'm trying to say by this is that you can't ignore memory, right?

199
00:21:08.540 --> 00:21:08.620
<v Matt Godbolt>I see.

200
00:21:08.620 --> 00:21:13.570
<v Ben Rady>It's not simply a function of the data that you've read in, but you don't have to go beyond that, right?

201
00:21:13.570 --> 00:21:13.900
<v Matt Godbolt>Yep.

202
00:21:13.900 --> 00:21:19.720
<v Matt Godbolt>So, okay. So let's so let's use like ah an in-memory cache, like a yeah Redis is like without the disk backing type stuff.

203
00:21:19.720 --> 00:21:20.320
<v Ben Rady>Yeah. Right, right. Mm-hmm.

204
00:21:20.320 --> 00:21:34.340
<v Matt Godbolt>Yeah. Yeah, yeah. That's perfect. Okay. So yeah, you're looking, I mean, at that point, there's a, without going into the extremely fun and interesting rabbit hole of how you actually do the lookup,

205
00:21:34.340 --> 00:21:45.340
<v Matt Godbolt>and what data structures you use. But that is probably the key thing is that like, you must consider the data structures you use. But the interesting thing is that with a latency lens,

206
00:21:45.340 --> 00:22:09.820
<v Matt Godbolt>You're not looking at data structures in quite the same way as you do when you are doing like CS, whatever. I don't know what numbers that you use, but you know when you're doing your data structures courses and you're like looking at big O notation, like, you know, big O of N, whatever, N log N for your, you know, so like a not unreasonable data structure for you

207
00:22:09.820 --> 00:22:11.700
<v Ben Rady>Right. Yeah. Right, right.

208
00:22:11.700 --> 00:22:20.350
<v Matt Godbolt>a key value store is something like a balanced red-black tree, right? It's log N to look into it, whatever.

209
00:22:20.350 --> 00:22:20.950
<v Ben Rady>Okay.

210
00:22:20.950 --> 00:22:33.610
<v Matt Godbolt>Another one is a B tree or a B plus tree or one of those kinds of trees, you know, hash maps are other ones as well. And, you know, obviously hash maps do have in theory O of one. So you'd probably want to pick that out of big O notation or whatever.

211
00:22:33.610 --> 00:22:34.420
<v Ben Rady>Right.

212
00:22:34.420 --> 00:22:46.960
<v Matt Godbolt>But the other ones seem reasonable as well. Like it's like, but, the dominating factor with something as straightforward as the thing you've just described, where we're literally just looking something up in the key value store, is going to be reading from RAM.

213
00:22:46.960 --> 00:23:07.270
<v Matt Godbolt>Now, if you're talking about something that's so small it fits into L1, L2, or L3, which are in the of the order of tens of Ks in the L1 case up to several megabytes at L3, and you know bigger than that on bigger service servers and whatnot, but like we're still talking like not all that much information.

214
00:23:07.270 --> 00:23:07.680
<v Ben Rady>Right.

215
00:23:07.680 --> 00:23:14.280
<v Ben Rady>can Can you describe for our listeners, he said, covering for his own ignorance, what L1, L2, and L3 actually are?

216
00:23:14.280 --> 00:23:36.440
<v Matt Godbolt>Oh, I'm so sorry. So there are layers of caches. Caches are expensive both in terms of the amount of space they take up on a chip and how fast, you know going back to our Grace Hopper, you know the bigger something is, the longer it takes to get to the other side of it.

217
00:23:36.440 --> 00:23:39.010
<v Matt Godbolt>So the way that CPUs

218
00:23:39.010 --> 00:23:39.560
<v Ben Rady>Mm-hmm.

219
00:23:39.560 --> 00:23:53.320
<v Matt Godbolt>prevent us from having to read from the very slow RAM, which we'll talk about in a second, is that we have on-chip caches, and they are layered so that you have a very small, incredibly fast level one cache, which is usually...

220
00:23:53.320 --> 00:24:09.560
<v Matt Godbolt>not very large, tens of kilobytes kind of area. Then you have a layer two cache, which is maybe a meg, maybe hundreds of K, that kind of area. And then a layer three cache, which is maybe tens of megs, maybe even a little more than that.

221
00:24:09.560 --> 00:24:20.940
<v Matt Godbolt>And each of them is slightly further away, like literally physically. They're still on the actual thumb size chip inside the machine. They're further away, but it takes longer to look through them to find the data.

222
00:24:20.940 --> 00:24:26.590
<v Matt Godbolt>And so if you can fit something in L1, it's kind of almost free.

223
00:24:26.590 --> 00:24:26.840
<v Ben Rady>Mm-hmm.

224
00:24:26.840 --> 00:24:54.460
<v Matt Godbolt>I'm going to put air quotes over this. We're talking somewhere in the region of, you know, five or six cycles to read from L1. And that includes all of the other things that are like the fixed cost of reading from memory includes doing the lookup for the logical address that you've decided to read from and then turning it into the actual physical address that it needs to go to the correct chip, you know, different processes think that address 2000 has different data in it, right?

225
00:24:54.460 --> 00:24:54.860
<v Ben Rady>Yeah, yeah, yeah.

226
00:24:54.860 --> 00:25:06.380
<v Matt Godbolt>And something has to do that translation. And that's not at this scale. When we're talking about, you know, a three gigahertz machine, a third of a nanosecond is a clock cycle. It's that that look up itself takes time.

227
00:25:06.380 --> 00:25:06.640
<v Ben Rady>Right, right.

228
00:25:06.640 --> 00:25:19.380
<v Matt Godbolt>And so, you know, we need to be doing something. So that's about the limit of it. And there's some other really complicated things that the poor memory system has to do because of the weird out-of-order engine. But we'll ignore that for now. So five or six cycles for like an L1.

229
00:25:19.380 --> 00:25:30.520
<v Matt Godbolt>Now I'm going to forget this off the top of my... should have brought this up beforehand, you know, but like tens of cycles, low tens of cycles for L2, maybe late tens, you know, 80, 90 cycles, maybe L3. No, I don't think it's that much now.

230
00:25:30.520 --> 00:25:57.530
<v Matt Godbolt>Someone's, so you know, order of magnitude-y things. It's, you know, still, probably only tens of nanoseconds but then you go out to RAM and it can be hundreds of nanoseconds if you actually need to go out to RAM. So those are the layers. The other thing that's interesting is that usually on chips the L1 and maybe the L2 are unique to the CPU core that you're running on so you just get a copy of your own and then that's yours and no one can do anything with it with the

231
00:25:57.530 --> 00:25:57.590
<v Ben Rady>Right.

232
00:25:57.590 --> 00:25:57.640
<v Ben Rady>Okay.

233
00:25:57.640 --> 00:26:08.760
<v Matt Godbolt>massive asterisk and footnote for older CPUs, but we maybe won't go there. And then the L3 typically is shared amongst the the other cores that are physically on the same chip as you.

234
00:26:08.760 --> 00:26:17.230
<v Matt Godbolt>And so there are some contentions and there are obviously conversations that have to happen between the chips to say, hey, I'm reading that. Oh, I was just writing to it. Oh, okay. Well, we need to make sure that we do this in the right order.

235
00:26:17.230 --> 00:26:17.300
<v Ben Rady>Right. Mm-hmm.

236
00:26:17.300 --> 00:26:38.800
<v Matt Godbolt>So there's a little bit more complexity there. But yeah, so if we're still talking in single, you know measurable, human, countable numbers of cycles, keeping it inside the L1 or L2 is great. L3 is okay. And if we have to go to RAM, we have to be a bit careful about things. And at that point, two things are important to note about the way caches work.

237
00:26:38.800 --> 00:26:52.590
<v Matt Godbolt>One is that the sort of the whole... the whole theory behind a cache is that having read a byte of memory, it's really, really likely you're going to read the bytes that are around that byte of memory.

238
00:26:52.590 --> 00:26:53.380
<v Ben Rady>Mm-hmm.

239
00:26:53.380 --> 00:27:32.890
<v Matt Godbolt>And so the fundamental unit of work from the memory system on the CPU to the caches and beyond is a sequence of bytes, which is usually 64 or maybe 128 bytes long. And that's kind of like the atom that gets shipped back and forth. And some chips even do things where you say, well, you ask for this block of 64 bytes, I'll just get the next one and maybe the one after that while we're going, because we might as well having just done that. And so this sort of temporal usage pattern is incredibly important for latency because if you're following a linked list, as you may be if you just have a naive hash map, suddenly you're jumping randomly around in memory and the poor system can't make a guess.

240
00:27:32.890 --> 00:27:33.220
<v Ben Rady>Right.

241
00:27:33.220 --> 00:27:43.920
<v Matt Godbolt>But if you are if you just have a damn array of everything and just go, well, it's expensive, it's an order N operation for me to search through these things, but I'm going to run through them linearly,

242
00:27:43.920 --> 00:27:58.040
<v Matt Godbolt>suddenly it may be more worthwhile from a cache standpoint, especially if a memory, a cache misses hundreds of cycles. Well, I can do a lot of work in a hundred cycles on an out-of-order execution machine that has 12 execution units, right?

243
00:27:58.040 --> 00:28:09.350
<v Matt Godbolt>So why don't just do tons of work and hope that that's quicker than going to RAM? So anyway, popping the stack all the way back, to keep the latency as low as possible, you're going to have to think about this.

244
00:28:09.350 --> 00:28:09.480
<v Ben Rady>Yeah. Mm-hmm.

245
00:28:09.480 --> 00:28:36.440
<v Matt Godbolt>And so you might want to make some concessions where the distribution of your latencies needs to be taken into account. Now, if you're going to say like, well, most of the time, I think I'm going to be... this like 80% of the stuff we're reading is hot. Like it's going to be commonly used again, in which case I'm going to use a data structure, which is fine.

246
00:28:36.440 --> 00:28:43.940
<v Matt Godbolt>And that 80% fits inside. It doesn't matter what I'm doing, whether it's a balanced tree or whatever, it fits inside L3, let's say.

247
00:28:43.940 --> 00:28:44.140
<v Ben Rady>Right.

248
00:28:44.140 --> 00:29:04.700
<v Matt Godbolt>But the 20% that's outside of it is just outside. It's going to RAM. We're just going to have to take the hit every time. And the fact that it's incredibly expensive because we're using a balanced tree. So I guess now we turn around and say, like, it's not just how fast to go. It's like, what is the shape of your distribution? Is it better to have the average case fast and the bad case awful?

249
00:29:04.700 --> 00:29:05.060
<v Ben Rady>Right.

250
00:29:05.060 --> 00:29:25.000
<v Matt Godbolt>Or is it better to bound the bad case and say, well, actually, I'm going to use a linear search every single time. And I'm going to say that even the fast one where the linear search, you know, it would have been faster just to check, I don't know, the first one or pack them in a different way. But, you know, I'm paying the cost to do 16 compares,

251
00:29:25.000 --> 00:29:26.060
<v Ben Rady>Yeah, yeah, yeah.

252
00:29:26.060 --> 00:29:29.080
<v Matt Godbolt>Every time and just go, well, that costs me a little bit more than one.

253
00:29:29.080 --> 00:29:29.120
<v Ben Rady>Right.

254
00:29:29.120 --> 00:29:36.030
<v Matt Godbolt>Of course it does, but it makes the worst case go away, but at the lowest.

255
00:29:36.030 --> 00:29:36.160
<v Ben Rady>Right.

256
00:29:36.160 --> 00:29:42.750
<v Matt Godbolt>So yeah, sorry. I've just gibbered around in a big, as I've sort of like realized the enormity of the question you've asked. Yeah. Yeah.

257
00:29:42.750 --> 00:29:51.110
<v Ben Rady>that was That was the intention of the question. But yes, whenever you're talking about latency, you need to not, it's not like, oh, it's 10 microseconds of latency.

258
00:29:51.110 --> 00:29:51.240
<v Matt Godbolt>Yeah.

259
00:29:51.240 --> 00:30:01.420
<v Ben Rady>It's no, it's 10 microseconds of latency in what case? What percentage of the time? You know, oftentimes you talk about like P95, P99, P99.999 numbers.

260
00:30:01.420 --> 00:30:01.420
<v Matt Godbolt>Yeah.

261
00:30:01.420 --> 00:30:11.860
<v Ben Rady>Of like what percentage of the responses are going to be less than some particular target time. And you may have multiple layers of that, right?

262
00:30:11.860 --> 00:30:22.780
<v Ben Rady>Like you may have a maximum threshold that you need and like a P99.99 and then like a P95 and then like a median, right?

263
00:30:22.780 --> 00:30:40.600
<v Ben Rady>And you might want to like structure things you know differently based on whatever those targets might be. So in our case, sort of this hypothetical, you know, robot dog type plugged into a PC perhaps that's receiving the signal.

264
00:30:40.600 --> 00:30:49.840
<v Ben Rady>Maybe what we'd want to say here as one fork of the tree to go down is actually the thing that I care about is the worst case latency, right?

265
00:30:49.840 --> 00:30:54.580
<v Ben Rady>I want the... because if the latency gets too high, the dog is going to fall over and that will be bad, right?

266
00:30:54.580 --> 00:30:54.900
<v Matt Godbolt>Right.

267
00:30:54.900 --> 00:30:59.110
<v Matt Godbolt>That's a very common thing for like, yeah, control boards and things.

268
00:30:59.110 --> 00:30:59.280
<v Ben Rady>Mm-hmm.

269
00:30:59.280 --> 00:31:12.040
<v Matt Godbolt>And you know this is why in those situations, typically embedded systems who have that kind of very strong reliability requirements will use not regular operating systems.

270
00:31:12.040 --> 00:31:19.470
<v Matt Godbolt>They're using you know real-time operating systems that can give you hard deadlines for things happening within a certain amount of time and or do no preemption at all.

271
00:31:19.470 --> 00:31:19.560
<v Ben Rady>Mm-hmm.

272
00:31:19.560 --> 00:31:30.300
<v Matt Godbolt>So it's not like a case that your your process will have its CPU time taken away from It's like, you know, it's your responsibility to to keep running until you get to the end and then yield to the next guy so that you can...

273
00:31:30.300 --> 00:31:40.940
<v Matt Godbolt>But and this is a huge problem. As computers have gotten more complicated, it becomes harder and harder to reason about what an upper bound could be.

274
00:31:40.940 --> 00:31:45.060
<v Matt Godbolt>You know, picking up my 6502 in its box over here.

275
00:31:45.060 --> 00:31:45.680
<v Ben Rady>Right. It's a little easier.

276
00:31:45.680 --> 00:31:52.490
<v Matt Godbolt>You know, it's a little easier. You know, it's it's like I can tell you exactly how long any single thing is going to take all the time.

277
00:31:52.490 --> 00:31:52.540
<v Ben Rady>Yeah.

278
00:31:52.540 --> 00:31:56.340
<v Matt Godbolt>It's completely deterministic and it's well documented.

279
00:31:56.340 --> 00:31:56.740
<v Ben Rady>Right.

280
00:31:56.740 --> 00:32:02.860
<v Matt Godbolt>But we have to reverse engineer what the hell an Intel chip is doing in order to understand...

281
00:32:02.860 --> 00:32:02.860
<v Ben Rady>Essentially.

282
00:32:02.860 --> 00:32:20.690
<v Matt Godbolt>Sort of maybe what the worst case is in that particular situation, right? You know, hey, what happens if, you know, we are trying, we have an L3 cache miss at the same time as another CPU core, like a physically distant core has an L3 cache, both go for it at the same time.

283
00:32:20.690 --> 00:32:20.880
<v Ben Rady>Mm-hmm.

284
00:32:20.880 --> 00:32:45.750
<v Matt Godbolt>How does that work? Is there some really weird edge case where, you know, in the fabric, it times out after some amount of time, I don't know, and it's hard to find out, but if I need to give you an actual, like, someone will get run over if the brakes do not come on when I put my foot on the brake, then maybe I can't, I certainly can't think about the world the same way, but I certainly can't use certain, like, CPUs and designs and operating systems like that.

285
00:32:45.750 --> 00:32:45.810
<v Ben Rady>Mm-hmm.

286
00:32:45.810 --> 00:32:45.860
<v Ben Rady>Mm-hmm.

287
00:32:45.860 --> 00:32:46.600
<v Matt Godbolt>So that's, but, it's interesting.

288
00:32:46.600 --> 00:33:13.380
<v Ben Rady>Yeah. So if I come to you and I'm asking this question of how do I make this program fast? And you go through all of these follow-up questions and what it turns out that I'm asking for here is how do I make sure that the round trip latency of this program that I have running on a just regular commodity PC is not going to ever be higher than 10 microseconds?

289
00:33:13.380 --> 00:33:14.980
<v Ben Rady>The answer is you can't.

290
00:33:14.980 --> 00:33:29.460
<v Matt Godbolt>I think that's a fair... Yeah, that the nearest people I can think of that have to deal with that specific problem are people who make digital audio workstations for you know consumer PCs because they they have...

291
00:33:29.460 --> 00:33:40.680
<v Matt Godbolt>A genuine latency bound because if you're playing a keyboard and there's a one second delay, that's not, between you're pressing the button and hearing the noise change.

292
00:33:40.680 --> 00:33:40.790
<v Ben Rady>Yeah, right.

293
00:33:40.790 --> 00:33:40.960
<v Ben Rady>Yeah. Yeah. Yeah.

294
00:33:40.960 --> 00:33:42.880
<v Matt Godbolt>That is not good for anybody.

295
00:33:42.880 --> 00:33:43.380
<v Ben Rady>Yeah.

296
00:33:43.380 --> 00:34:06.400
<v Matt Godbolt>And so the only way that you can get that is to make everything as low latency as possible. But if you go too far and you don't have enough buffering between you and the set of audio that's coming out, then if you, for whatever reason, can't keep up, you're going to get a massive pop or a crack or a click as suddenly you have to catch up and just drop a whole bunch of samples that you were generating.

297
00:34:06.400 --> 00:34:06.580
<v Ben Rady>Yeah, yeah, yeah.

298
00:34:06.580 --> 00:34:20.360
<v Matt Godbolt>So that's, I mean, I don't know how they do it, honestly. In fact, I do know people who have worked on them and like we should perhaps consider getting them on to talk about this kind of stuff because it's a fascinating problem.

299
00:34:20.360 --> 00:34:23.700
<v Matt Godbolt>But yeah, anyway, but back to your...

300
00:34:23.700 --> 00:34:29.560
<v Ben Rady>Yeah. So the result of this, you can't, so now it's like, okay, yeah, right.

301
00:34:29.560 --> 00:34:31.220
<v Matt Godbolt>Yeah, you can't, yeah. Or it's really, really, really hard.

302
00:34:31.220 --> 00:34:40.980
<v Ben Rady>It's like, you're putting constraints on this problem that you can't really solve for. For example, you'd probably be better off using some specialized hardware rather than trying to do this on a commodity machine.

303
00:34:40.980 --> 00:34:47.380
<v Ben Rady>If you truly have a constraint that, you know, of, of let's just say 10 microseconds or something like that.

304
00:34:47.380 --> 00:34:54.090
<v Matt Godbolt>This is to an extent, I mean, obviously, microcontrollers have been around longer than computers because they've been in things or that's the original design.

305
00:34:54.090 --> 00:34:54.140
<v Ben Rady>Mm-hmm.

306
00:34:54.140 --> 00:35:05.190
<v Matt Godbolt>But that's that is why the brake disc controller in my car is not controlled by the one PC that is like running the the AV display in the middle panel, right?

307
00:35:05.190 --> 00:35:05.250
<v Ben Rady>Right.

308
00:35:05.250 --> 00:35:05.300
<v Ben Rady>Yes.

309
00:35:05.300 --> 00:35:15.200
<v Matt Godbolt>It is like a very specialist piece of software running on a very specialist operating system, which probably isn't an operating system by anyone else's definition of operating system.

310
00:35:15.200 --> 00:35:15.650
<v Ben Rady>Right.

311
00:35:15.650 --> 00:35:15.760
<v Ben Rady>Right. Right.

312
00:35:15.760 --> 00:35:22.620
<v Matt Godbolt>It's just a library you link against and then sets up the interrupt vector and hands you a few nice abstractions, good luck.

313
00:35:22.620 --> 00:35:23.540
<v Ben Rady>Yeah. Yeah.

314
00:35:23.540 --> 00:35:47.840
<v Matt Godbolt>But that's why it's doing nothing but monitoring the disc because then finally you can potentially say, all right, we can upper bound this stuff. We do know what the maximum latency could be, and it is, and I can sit and work it out on paper because it's an ARM, this is an in-order ARM and it only has an L1 cache and even if it's whatever, you know, there are things that you can say about it and you can bound it that way.

315
00:35:47.840 --> 00:35:49.260
<v Ben Rady>Mm-hmm. Mm-hmm.

316
00:35:49.260 --> 00:36:00.980
<v Ben Rady>Okay. So I've got, we've got our inputs. Maybe we've had some discussions about our constraints here and we're like, all right, well, we can't do this to a maximum level of 10 microseconds.

317
00:36:00.980 --> 00:36:01.100
<v Matt Godbolt>Mm-hmm.

318
00:36:01.100 --> 00:36:07.800
<v Ben Rady>That's insane. We could maybe do something where our like P95's, maybe less than that, more than that.

319
00:36:07.800 --> 00:36:11.370
<v Matt Godbolt>Let's just say, I mean, yeah, you and I, we don't know.

320
00:36:11.370 --> 00:36:13.740
<v Ben Rady>Let's just something like that. Yeah. And then,

321
00:36:13.740 --> 00:36:15.260
<v Matt Godbolt>We're making this up as we go along as if it wasn't...

322
00:36:15.260 --> 00:36:20.010
<v Ben Rady>Yeah, and then and then our maximum is something much more reasonable like a second, right?

323
00:36:20.010 --> 00:36:20.760
<v Matt Godbolt>Okay, yeah.

324
00:36:20.760 --> 00:36:32.240
<v Ben Rady>It's like, all right, it's never going to take more than a second. If it does, then we have a bug and we should fix it, right? As opposed to, yeah, that's how the system works, right?

325
00:36:32.240 --> 00:36:49.080
<v Ben Rady>So now we've gotten our message from the outside world. We've gone and looked up the value that we've cached. Hopefully that's an L1 cache and hopefully we've done some smart things to make sure that it's there. But you know again, this is why we have these ranges of latency.

326
00:36:49.080 --> 00:36:58.160
<v Ben Rady>How do we write out the response? Do we have the same sort of operating system packet issue when we're trying to write this thing back out?

327
00:36:58.160 --> 00:37:05.060
<v Matt Godbolt>I mean, yes. If we're talking about standard, like, I mean, we keep changing what it is we're doing. Is it a dog that's falling over? Is it a UDP packet going into a...

328
00:37:05.060 --> 00:37:05.280
<v Ben Rady>Yeah, yeah, yeah.

329
00:37:05.280 --> 00:37:19.100
<v Matt Godbolt>Yeah. But, you know, like, if we're going to send a packet back out again, then yes, there's a similar dance on the way out through a traditional operating system where you write to a bit of memory that you can read and write in user space, and then you say, hey, send it, please.

330
00:37:19.100 --> 00:37:19.220
<v Ben Rady>Mm-hmm.

331
00:37:19.220 --> 00:37:30.320
<v Matt Godbolt>And and the kernel goes, well, I need to send it. I can't hand this bit of memory to... the network card because this bit of memory is just wherever you decided to put it.

332
00:37:30.320 --> 00:37:46.910
<v Matt Godbolt>So maybe on the stack, whatever. And certainly the DMA engine for your network card does not have carte blanche to read any piece of memory anywhere in the whole system for all the obvious reasons of like separation of concerns and to do with some of the ways that they work internally.

333
00:37:46.910 --> 00:37:47.200
<v Ben Rady>Mm-hmm.

334
00:37:47.200 --> 00:38:10.100
<v Matt Godbolt>So the kernel has to copy it somewhere else in a buffer where it knows it could hand the address of that to a network card. And then it returns, potentially returns back to you and says, yes, it's sent, even though it isn't. It's sat in a buffer. Time will pass. Eventually, the kernel will decide to schedule sending it. I mean, probably it will actually send it immediately, all things considered, unless the network card's already busy doing something else. You know, there's these...

335
00:38:10.100 --> 00:38:26.660
<v Matt Godbolt>And then the network card will be notified through some mechanism. And that's relatively expensive and requires a ping across the network. Sorry, I say a network. It is a network. The PCIe bus itself, or however your network card is plugged into your computer, is effectively yet another network.

336
00:38:26.660 --> 00:38:26.740
<v Ben Rady>Mm-hmm.

337
00:38:26.740 --> 00:38:37.220
<v Matt Godbolt>And there's another hop and there are... complicated things going on there. So the message is addressed to it. The network card goes, oh, you say you have a packet for me. How quaint, how lovely. Okay, I'll go and get the address from the ring buffer.

338
00:38:37.220 --> 00:38:41.000
<v Matt Godbolt>Here it is. Okay, and I'll start streaming it out. And so that process can be

339
00:38:41.000 --> 00:38:41.440
<v Ben Rady>Mm-hmm.

340
00:38:41.440 --> 00:39:00.270
<v Matt Godbolt>relatively latent too. And so again, you're less likely to suffer from the scheduling randomness that the kernel does, because it's not like something asynchronous happened and now the kernel has to pick you being amongst all the things that it's next to do.

341
00:39:00.270 --> 00:39:00.480
<v Ben Rady>Mm-hmm.

342
00:39:00.480 --> 00:39:04.170
<v Matt Godbolt>It's like, now you've given it a piece of work and it's very obvious what it should be doing with that work.

343
00:39:04.170 --> 00:39:04.580
<v Ben Rady>Uh-huh.

344
00:39:04.580 --> 00:39:14.690
<v Matt Godbolt>But there still is some redundant copying of data. There is still some messaging to an asynchronous process, which is the network card itself.

345
00:39:14.690 --> 00:39:14.860
<v Ben Rady>Again?

346
00:39:14.860 --> 00:39:25.240
<v Matt Godbolt>The network card can be like, yes, sure, fine. And then it could be busy doing something for all we know, right? You know, I don't know what what magic network cards are doing. Maybe it's, you know, a virtual network card that doesn't even exist as a physical thing.

347
00:39:25.240 --> 00:39:40.640
<v Matt Godbolt>And really it's pretending to be a network card, but in fact it is, in a data center on a shared piece of infrastructure. And what you're really doing is talking to the hypervisor, which then goes, oh, cool, a packet from one of my many virtual machines.

348
00:39:40.640 --> 00:39:41.250
<v Ben Rady>Hypervised. Yeah, right.

349
00:39:41.250 --> 00:39:42.480
<v Matt Godbolt>Yeah, hypervised children, right.

350
00:39:42.480 --> 00:39:54.340
<v Ben Rady>So I'm kind of, I was maybe teeing this up a little bit, but this is like somebody comes along and says, hey, Ben, you made this program to control this robot and you did it in a very silly way.

351
00:39:54.340 --> 00:39:54.340
<v Matt Godbolt>Yeah.

352
00:39:54.340 --> 00:39:58.060
<v Ben Rady>Why are you using a UDP network? That is not necessary.

353
00:39:58.060 --> 00:39:58.360
<v Matt Godbolt>Right.

354
00:39:58.360 --> 00:40:12.960
<v Ben Rady>Are there other devices that one could use, I mean, like a USB-C connection or something else that would be available to, you know, just a regular consumer, commercial computer, something.

355
00:40:12.960 --> 00:40:31.890
<v Matt Godbolt>Yeah, that's it. I don't know too much about the ins and outs of how the kernel interacts with the USB. I know that the underlying USB protocol and the thing you talk with, the hub or whoever you're talking to on the end of it is sort of a negotiated thing. You can ask for, you negotiate bandwidth, you negotiate...

356
00:40:31.890 --> 00:40:45.660
<v Matt Godbolt>I don't think you negotiate latency. I think you can negotiate isochronous, that is repeating things. So like if you've got a webcam plugged in, it can book, hey, I need to send this much data just all the time.

357
00:40:45.660 --> 00:41:04.240
<v Matt Godbolt>And it gets sort of time on the USB bus to be able to send packets over that data. But I don't know about, like, latency. And obviously things like you know USB keyboards and things and mice. And you know gamers obviously use USB mice and whatever, but we're still talking almost human level levels of latency there, you know millisecond, sub-millisecond, I'm sure.

358
00:41:04.240 --> 00:41:04.280
<v Ben Rady>Right,

359
00:41:04.280 --> 00:41:04.360
<v Ben Rady>Right, right.

360
00:41:04.360 --> 00:41:21.200
<v Matt Godbolt>So yeah. I don't really know. My best guess I would be able to have is not to say something like USB, because again, there are some layers of indirection between you and the actual device, even if you're talking the USB protocol through the kernel somehow.

361
00:41:21.200 --> 00:41:36.330
<v Matt Godbolt>But if you had like either, you know, GPIO on a, which is general purpose IO pins on one of the more microcontrollery type systems where you can literally just say, if I write to this memory address, it's not really a memory address.

362
00:41:36.330 --> 00:41:36.560
<v Ben Rady>Mm-hmm.

363
00:41:36.560 --> 00:41:40.580
<v Matt Godbolt>It is whatever data I put there is the highs and lows of these eight pins, right?

364
00:41:40.580 --> 00:41:40.720
<v Ben Rady>Yeah, yeah, yeah. Right.

365
00:41:40.720 --> 00:41:56.020
<v Matt Godbolt>And at that point, you're like, yeah, you're off to the races, right? Or, you know, say a serial interface, which although serial is incredibly slow and awful, it is also something where you're kind of directly attached to and you can say, like, I'm just driving it myself. Now, I'm sure that's no longer true. Yeah.

366
00:41:56.020 --> 00:42:07.540
<v Matt Godbolt>Now I say it out loud, I'm sure that it's virtualized through some mechanism in the kernel. And so it's no longer the case if you open /dev/tty0 that like if you get the OK, you are the only person talking to that now.

367
00:42:07.540 --> 00:42:08.060
<v Ben Rady>Mm-hmm.

368
00:42:08.060 --> 00:42:15.840
<v Matt Godbolt>And still, it would would suffer from that. If you read from it, then you would block and the kernel would put you to sleep until a byte came in.

369
00:42:15.840 --> 00:42:15.940
<v Ben Rady>Right.

370
00:42:15.940 --> 00:42:34.060
<v Matt Godbolt>So even that doesn't make sense. So GPIO is the nearest thing we have to the sit in a tight loop and wait kind of approach that we were talking about before, you know, for both input and output, you could say like, hey, you know, I'm doing a quiz show, whoever presses the button first kind of thing.

371
00:42:34.060 --> 00:42:38.560
<v Matt Godbolt>And there are eight buttons and they go into the eight bits inside my controller.

372
00:42:38.560 --> 00:42:38.560
<v Ben Rady>Yeah, yeah, yeah, yeah.

373
00:42:38.560 --> 00:42:49.160
<v Matt Godbolt>And I just sit in a tight loop reading from it. Obviously there are better hardware ways of doing that, but I'm just sort of thinking out loud. Yeah. Yeah, I don't know. Did you have an idea?

374
00:42:49.160 --> 00:42:54.940
<v Matt Godbolt>You've been driving this in a direction, and I'm wondering if you have a solution in mind that I have not got to.

375
00:42:54.940 --> 00:42:55.980
<v Ben Rady>No, no, no, I mean

376
00:42:55.980 --> 00:42:58.900
<v Matt Godbolt>By the way, have I got the job? Or oh...

377
00:42:58.900 --> 00:43:02.480
<v Matt Godbolt>is it Is it too late?

378
00:43:02.480 --> 00:43:05.680
<v Ben Rady>Well, you passed to the next round, but I'm going to have to, yeah, I know.

379
00:43:05.680 --> 00:43:08.430
<v Matt Godbolt>Oh, okay. We'll have to see. We're now 40 minutes in, so yeah.

380
00:43:08.430 --> 00:43:09.620
<v Ben Rady>Do you have any questions for me?

381
00:43:09.620 --> 00:43:13.400
<v Matt Godbolt>Do I not have questions for you?

382
00:43:13.400 --> 00:43:15.720
<v Matt Godbolt>We've done far too many interviews, haven't we?

383
00:43:15.720 --> 00:43:18.130
<v Ben Rady>Yeah, yeah, yeah, yeah, yeah.

384
00:43:18.130 --> 00:43:18.740
<v Matt Godbolt>Yeah.

385
00:43:18.740 --> 00:43:34.020
<v Ben Rady>I mean, I think that, I mean, we made the full kind of round trip, and I think that's good. I think the other, and if there is a part two to this, I think the part two should be, okay, now what happens when it's the actual like calculation that is the part that needs to quote unquote go fast, right?

386
00:43:34.020 --> 00:43:34.500
<v Matt Godbolt>Oh, yeah, yeah, yeah, yeah.

387
00:43:34.500 --> 00:43:47.520
<v Ben Rady>Like like the IO no longer dominates as it did in this example. And it's basically just this exercise. And how do you take this signal, make sure that you don't accidentally make it slow when it's being processed in the CPU and then get a thing back out.

388
00:43:47.520 --> 00:43:49.080
<v Matt Godbolt>That's an interesting one. Yeah, we should.

389
00:43:49.080 --> 00:43:52.990
<v Ben Rady>Now it's, okay, this is gonna take a material amount of time and we're trying to make it as fast as we can.

390
00:43:52.990 --> 00:43:53.460
<v Matt Godbolt>All right.

391
00:43:53.460 --> 00:44:03.380
<v Matt Godbolt>Yeah, which is probably what most folks think of when you say, can I make my code go fast, is the code part, not all the other stuff around the outside, but often that can dominate. Yeah, no, that would be an interesting one.

392
00:44:03.380 --> 00:44:03.600
<v Ben Rady>Right.

393
00:44:03.600 --> 00:44:13.110
<v Matt Godbolt>So yeah, okay, we'll... well and then we've got to talk about throughput at some point, and maybe that will leg into throughput, because, you know, I think there is part of it, but we'll see.

394
00:44:13.110 --> 00:44:13.240
<v Ben Rady>Yes. Yeah.

395
00:44:13.240 --> 00:44:13.300
<v Ben Rady>Mm-hmm.

396
00:44:13.300 --> 00:44:13.970
<v Matt Godbolt>Well, all right.

397
00:44:13.970 --> 00:44:14.020
<v Ben Rady>Maybe.

398
00:44:14.020 --> 00:44:23.120
<v Matt Godbolt>We rather mysteriously then, and now you and I have to look each other in the eye and go, we will remember that the next episode we record will be the continuation of this one.

399
00:44:23.120 --> 00:44:23.200
<v Ben Rady>Yeah.

400
00:44:23.200 --> 00:44:25.430
<v Matt Godbolt>And we won't just leave our poor audience hanging.

401
00:44:25.430 --> 00:44:25.800
<v Ben Rady>Yes.

402
00:44:25.800 --> 00:44:26.100
<v Ben Rady>Right.

403
00:44:26.100 --> 00:44:31.890
<v Matt Godbolt>So I apologize in advance if we, if we do in fact, but I think this has been very interesting.

404
00:44:31.890 --> 00:44:32.240
<v Ben Rady>Right.

405
00:44:32.240 --> 00:44:33.520
<v Ben Rady>Yeah, that's good.

406
00:44:33.520 --> 00:44:38.260
<v Matt Godbolt>Hopefully our listener agrees and we will, I will see you next time, my friend.

407
00:44:38.260 --> 00:44:45.670
<v Ben Rady>Next time.

408
00:44:45.670 --> 00:44:47.670
<v Matt Godbolt>And cut.