WEBVTT

1
00:00:18.970 --> 00:00:20.160
<v Matt Godbolt>Hey, Ben.

2
00:00:20.160 --> 00:00:20.980
<v Ben Rady>Hey, Matt.

3
00:00:20.980 --> 00:00:31.300
<v Matt Godbolt>I am not convinced my levels are right here, so I apologize if the audio is awful on this. um Something isn't going well here.

4
00:00:31.300 --> 00:00:33.360
<v Matt Godbolt>I think it's too quiet. I mean, how does it sound to you?

5
00:00:33.360 --> 00:00:33.740
<v Ben Rady>You think it's too quiet?

6
00:00:33.740 --> 00:00:35.580
<v Ben Rady>It sounds good to me.

7
00:00:35.580 --> 00:00:39.610
<v Matt Godbolt>But I've turned the gain up here and it looks really... Okay, well, we'll go with that. Anyway, hi!

8
00:00:39.610 --> 00:00:40.000
<v Ben Rady>yeah hello.

9
00:00:40.000 --> 00:00:53.460
<v Matt Godbolt>How are you doing, friend? This is a catastrophe for editing Matt. During the opening jingle, we were taking the mickey out of editing Matt because he's a jerk.

10
00:00:53.460 --> 00:00:55.760
<v Matt Godbolt>But that's fine because he's not here.

11
00:00:55.760 --> 00:00:56.300
<v Ben Rady>Yeah.

12
00:00:56.300 --> 00:01:02.450
<v Ben Rady>I usually find that yesterday Ben is a jerk and tomorrow Ben is the most responsible person on the face of the earth.

13
00:01:02.450 --> 00:01:03.230
<v Matt Godbolt>That's... Yeah.

14
00:01:03.230 --> 00:01:06.900
<v Ben Rady>He's going to take care of everything because I'm not doing anything today.

15
00:01:06.900 --> 00:01:18.950
<v Matt Godbolt>Yeah, you know that thing? Yeah, that's 100% how I see the world too. Yeah. So we had started this, as we often do, by chatting in a Google Meet before this.

16
00:01:18.950 --> 00:01:36.060
<v Matt Godbolt>And then we were like, let's just record. Let's do it. And then we had to switch out. And then unfortunately, because we're now using a separate recording system, that involves fiddling around with web browser settings and plugging in different microphones and everything. And now I can't even remember what it was we were talking about.

17
00:01:36.060 --> 00:01:42.440
<v Ben Rady>Uh, so when I just cut over to this, I was of course in the middle of writing a test, 'cause, you know, what else would I be doing?

18
00:01:42.440 --> 00:01:43.960
<v Matt Godbolt>Of course.

19
00:01:43.960 --> 00:01:55.630
<v Ben Rady>And, uh, I was saying how one of the things that I love to do is test my metrics. And you had made a ah very interesting point about performance sensitive code and this technique.

20
00:01:55.630 --> 00:01:56.060
<v Matt Godbolt>That's right.

21
00:01:56.060 --> 00:01:57.660
<v Ben Rady>So do you, do you recall what you said?

22
00:01:57.660 --> 00:02:03.300
<v Matt Godbolt>I do now. Thank you for being the responsible adult and remembering from one minute to the next what on earth is going on.

23
00:02:03.300 --> 00:02:05.320
<v Ben Rady>Tomorrow Ben has manifested today.

24
00:02:05.320 --> 00:02:19.000
<v Matt Godbolt>Yeah, it seems so. It seems so. So yeah, you had said that you were writing tests for the metrics in your code. And that's a fantastic thing to do because if you're relying on those metrics, you probably want to make sure they're right.

25
00:02:19.000 --> 00:02:20.080
<v Ben Rady>Mm-hmm. Mm-hmm.

26
00:02:20.080 --> 00:02:44.080
<v Matt Godbolt>And then I'd said for some areas of performant code, so certainly in C++ land, sometimes there isn't a seam for you to put ah like ah an interface or some kind of like a measuring point in your ah your regular code to write a test around. So like the classic example I can think of is if you've got a high performance like cache,

27
00:02:44.080 --> 00:02:55.620
<v Matt Godbolt>of information that is like a software cache, just to be clear, then you wanna be able to test whether or not you hit the cache or not. But that's not, the cache is there to be transparent, right?

28
00:02:55.620 --> 00:03:00.830
<v Matt Godbolt>It's either there or it's not there or whatever, or you know it fetches a value or it may be maybe it's more like a memoization cache.

29
00:03:00.830 --> 00:03:00.960
<v Ben Rady>right right right

30
00:03:00.960 --> 00:03:05.130
<v Matt Godbolt>So you know it either computes a value or it returns the value that you did before.

31
00:03:05.130 --> 00:03:05.540
<v Ben Rady>Mm-hm

32
00:03:05.540 --> 00:03:13.680
<v Matt Godbolt>And the caller doesn't need to care about that. And you don't want to have to break your interface just to write a test. And you don't want to have to pass in ah a listener

33
00:03:13.680 --> 00:03:13.680
<v Ben Rady>right

34
00:03:13.680 --> 00:03:22.170
<v Matt Godbolt>class that says, hey, on cache result, because that's, you know, not good for the design of your system. It's maybe not good for the performance.

35
00:03:22.170 --> 00:03:22.540
<v Ben Rady>ah huh

36
00:03:22.540 --> 00:03:34.170
<v Matt Godbolt>But you probably do want metrics about how often your cache is being hit. And if you're writing performant code, you've probably written a really relatively performant ah metric system.

37
00:03:34.170 --> 00:03:34.660
<v Ben Rady>Right, right. Yeah.

38
00:03:34.660 --> 00:03:44.780
<v Matt Godbolt>And then it becomes a natural way of measuring whether your code is testing the things that you're wanting. Am I getting a cache hit? Am I getting a cache miss? by looking at the metrics.
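
NOTE
A minimal sketch of the idea in cues 26 to 38, not code from the episode: a memoizing cache whose get() returns only the value, with hit/miss counts exposed through a metrics side channel that a test can assert on. SquareCache, METRICS, and the metric names are hypothetical stand-ins; a real system would use its own metrics client.
```python
from collections import Counter
#
# Hypothetical process-wide metrics registry (assumption for illustration).
METRICS = Counter()
#
class SquareCache:
    """Memoizing cache: callers get a value back, never a hit/miss status."""
    def __init__(self):
        self._values = {}
    def get(self, key):
        if key in self._values:
            METRICS["cache.hit"] += 1
            return self._values[key]
        METRICS["cache.miss"] += 1
        self._values[key] = key * key  # stand-in for an expensive computation
        return self._values[key]
#
def test_second_lookup_hits_cache():
    METRICS.clear()
    cache = SquareCache()
    assert cache.get(7) == 49
    assert cache.get(7) == 49
    # The caller-facing interface is unchanged; the metrics show what happened inside.
    assert METRICS["cache.miss"] == 1
    assert METRICS["cache.hit"] == 1
#
test_second_lookup_hits_cache()
```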

39
00:03:44.780 --> 00:03:45.420
<v Ben Rady>Yes.

40
00:03:45.420 --> 00:03:50.140
<v Matt Godbolt>And so that was where we were when we said, let's record this. And now that's the end of the podcast.

41
00:03:50.140 --> 00:03:50.440
<v Ben Rady>Yeah, yeah, yeah. and

42
00:03:50.440 --> 00:03:51.520
<v Matt Godbolt>Thank you for listening, everybody.

43
00:03:51.520 --> 00:04:07.940
<v Ben Rady>ah Thanks, everybody. Outro music plays. um No, and I mean, I think that this is a a great specific example of a general thing, which I think we've talked about on the on the podcast before, of like, what does it mean to have testable code, right?

44
00:04:07.940 --> 00:04:09.800
<v Ben Rady>What does it mean to have code that is testable?

45
00:04:09.800 --> 00:04:10.200
<v Matt Godbolt>Yes.

46
00:04:10.200 --> 00:04:23.120
<v Ben Rady>And there is a sort of a premise that is baked into all of this and it's woven into like test-driven development and a bunch of other things, which is if you build software that has a nice interface for some definition of nice,

47
00:04:23.120 --> 00:04:33.700
<v Ben Rady>it will be easy to test and writing the tests helps you create that nice interface. And this is a specific example of that because the thing that's nice about this is the observability.

48
00:04:33.700 --> 00:04:35.800
<v Ben Rady>We want to have code that is observable.

49
00:04:35.800 --> 00:04:36.120
<v Matt Godbolt>yes

50
00:04:36.120 --> 00:04:55.320
<v Ben Rady>We want to have code where we can know what it's doing. And the test in this case is giving you a very specific thing: I need to know if I hit the cache. You need to know that for the test and you need to know that when you're running your software. And that is the same problem. It is the exact same problem.

51
00:04:55.320 --> 00:05:16.410
<v Matt Godbolt>Yeah, but I suppose specifically in this instance, like a ah not totally unreasonable API to your ah whatever this thing, cached thing is, is that you return like a tuple of the thing that you got out of the cache and some status object that said, um did I get it from the cache? Was it a hit? Was it a miss?

52
00:05:16.410 --> 00:05:16.840
<v Ben Rady>And so the tests are there to do that.

53
00:05:16.840 --> 00:05:17.540
<v Ben Rady>Sure, yeah.

54
00:05:17.540 --> 00:05:19.740
<v Matt Godbolt>what you know that That's a reasonable interface, in which case now...

55
00:05:19.740 --> 00:05:20.680
<v Ben Rady>You can also do it that way, yeah.

56
00:05:20.680 --> 00:05:41.130
<v Matt Godbolt>Well, that's my point, right? That is a reasonable way to write this system. But very specifically, oftentimes, if you're in a very high performance kind of piece of code, then by writing the interface that way, you have pessimized the case where you don't care if it was in the cache or not, which is the very common case.

57
00:05:41.130 --> 00:05:41.880
<v Ben Rady>Mm-hmm.

58
00:05:41.880 --> 00:05:50.420
<v Matt Godbolt>And you're forcing everything through this. But the metrics are something that are a sort of a side channel that are something you still care about and are performant.

59
00:05:50.420 --> 00:05:50.640
<v Ben Rady>Mm-hmm.

60
00:05:50.640 --> 00:06:01.860
<v Matt Godbolt>And you're now using them to test the inner workings of something that should be transparent. And you, in fact, want it to be transparent. And I think there's sort of... ah yeah maybe you're saying those are the same things.

61
00:06:01.860 --> 00:06:12.390
<v Matt Godbolt>I've just found it to be an interesting way of saying, there's some internal workings of a class that I would like to be able to test, but I don't really want to expose it to the outside world.

62
00:06:12.390 --> 00:06:13.140
<v Ben Rady>Mm-hmm. Mm-hmm. Mm-hmm.

63
00:06:13.140 --> 00:06:31.640
<v Matt Godbolt>And I can't expose it to the outside world, either because the performance characteristics would be different, or... well, I can't directly expose it to this world. And so the metrics represent an indirect way of me accessing interesting things that happened in my class.

64
00:06:31.640 --> 00:06:38.340
<v Ben Rady>Yeah, I kind of get what you're saying there, but I think it all hinges on the definition of outside world, right?

65
00:06:38.340 --> 00:06:38.720
<v Matt Godbolt>Right. Okay.

66
00:06:38.720 --> 00:06:47.750
<v Ben Rady>Like the caller of the code is – let's live in ah a multiverse for a second here. The caller of the code is one world.

67
00:06:47.750 --> 00:06:48.100
<v Matt Godbolt>Sure.

68
00:06:48.100 --> 00:06:55.300
<v Ben Rady>But another world is sort of you as an operator of the software. And the tests can stand in for both of those things.

69
00:06:55.300 --> 00:06:55.580
<v Matt Godbolt>Of course.

70
00:06:55.580 --> 00:07:05.640
<v Ben Rady>You don't have to do them all as one thing, right? The caller of the code can be like, yeah, I asked for this value. I got this value back. I'm not getting a tuple that indicates whether it was a cache hit or a cache miss or...

71
00:07:05.640 --> 00:07:05.640
<v Matt Godbolt>Because I don't care.

72
00:07:05.640 --> 00:07:09.850
<v Ben Rady>some sort of other metadata that rides along with it because I don't actually care about that.

73
00:07:09.850 --> 00:07:10.080
<v Matt Godbolt>yeah. yeah

74
00:07:10.080 --> 00:07:38.720
<v Ben Rady>Right. And I shouldn't have to care about that just for the purposes of testing to make sure that my caching system works. But you as an operator of that software, you as somebody who is going to be, you know, watching it run and making sure that the performance is good and making sure that the changes that you've made have taken effect as you would have expected, need another dimension into the multiverse of ways to see what is going on. And the tests can stand in for that too.

75
00:07:38.720 --> 00:07:50.380
<v Ben Rady>Um, There is another actual aspect of this. I was just ranting to you earlier this week about this, actually, when we were at lunch, there is another aspect of this um that I think holds here.

76
00:07:50.380 --> 00:07:50.380
<v Matt Godbolt>Yeah. Right.

77
00:07:50.380 --> 00:07:58.510
<v Ben Rady>And that is logging. Usually what people do is they have a logging system and they just dump things into the logs.

78
00:07:58.510 --> 00:07:58.620
<v Matt Godbolt>right

79
00:07:58.620 --> 00:08:11.360
<v Ben Rady>You know, it's like, oh, I've got this variable here. I'll log this out or whatever it is. Right. And you know, it's the really unfortunate case when it's like, okay, yeah, we have log statements in the code for when the terrible thing happens.

80
00:08:11.360 --> 00:08:28.430
<v Ben Rady>And then you go and you look in the logs because the terrible thing has happened. And what you see is like the magic value is "%s". Because whatever logging thing that you set up didn't actually capture the value that you wanted because you thought it was templated and then it wasn't, or whatever it is that happened.

81
00:08:28.430 --> 00:08:28.610
<v Matt Godbolt>ah

82
00:08:28.610 --> 00:08:28.980
<v Matt Godbolt>So sad.

83
00:08:28.980 --> 00:08:33.520
<v Ben Rady>And like the one in a million moment has come and gone and you're never going to see it again, right?

84
00:08:33.520 --> 00:08:33.980
<v Matt Godbolt>Yeah.

85
00:08:33.980 --> 00:08:34.400
<v Ben Rady>And so

86
00:08:34.400 --> 00:08:37.380
<v Matt Godbolt>Are you about to write tests for logging is what you're about to say, isn't it?

87
00:08:37.380 --> 00:08:51.820
<v Ben Rady>That is exactly what I'm saying. I think that one of the benefits of structured logging is that you can approach it in the exact same way that we're talking about approaching these metrics, right?

88
00:08:51.820 --> 00:09:02.360
<v Matt Godbolt>Right. They are similar sounding things in this instance. It's just a different way of structured logging. And one is a counter and the other one is maybe a sequence of events that you've logged.

89
00:09:02.360 --> 00:09:02.680
<v Ben Rady>So the...

90
00:09:02.680 --> 00:09:04.380
<v Ben Rady>Exactly, exactly right.

91
00:09:04.380 --> 00:09:04.540
<v Matt Godbolt>Yeah.

92
00:09:04.540 --> 00:09:11.090
<v Ben Rady>but it it is But it is exactly this thing of the tests are not just standing in for like the caller of the code as they usually do.

93
00:09:11.090 --> 00:09:11.520
<v Matt Godbolt>Right.

94
00:09:11.520 --> 00:09:30.120
<v Ben Rady>but they are standing in for the the sort of tomorrow you, ah who's a very responsible person and wants to know what their metrics are and what their logs are and wants to make sure that they're correct. And then you can also use both of those dimensions of kind of observability to understand what your code is doing and verify that it is correct, right?

95
00:09:30.120 --> 00:09:33.020
<v Ben Rady>The the tests can operate on both of those dimensions at the same time.

96
00:09:33.020 --> 00:09:33.250
<v Matt Godbolt>Yeah. Right.

97
00:09:33.250 --> 00:09:33.360
<v Ben Rady>Mm-hmm.

98
00:09:33.360 --> 00:09:49.100
<v Matt Godbolt>I mean, who among us hasn't written that warning statement like "this is weird" And then, you know, your test coverage says, hey, you never hit the "this is weird" log line. And you're like, oh, I should write a test for it. But realistically speaking, what am I going to do? All it does is log, "this is weird".

99
00:09:49.100 --> 00:09:49.600
<v Ben Rady>Right, right.

100
00:09:49.600 --> 00:10:05.000
<v Matt Godbolt>And you know, I'm sure you've done this before. Even with most logging systems, or certainly ones that I've interacted with, you have a test fixture that can capture the log. So you can write it, but your assertion is something weak, like assert "this is weird" in captured.log.

101
00:10:05.000 --> 00:10:05.860
<v Ben Rady>Yeah.

102
00:10:05.860 --> 00:10:06.580
<v Ben Rady>Right, right.

103
00:10:06.580 --> 00:10:18.230
<v Matt Godbolt>And that's better than a kick in the teeth, but it is not ideal. And what you're saying is with a more... you know, certainly in terms of the textual mapping, it makes your test quite brittle.

104
00:10:18.230 --> 00:10:18.600
<v Ben Rady>Right.

105
00:10:18.600 --> 00:10:28.940
<v Matt Godbolt>But if you can have a structured log, so like I think we have talked about structured logging before, but do you want to just give us a quick recap of what you think of or what right now, in the middle of it all, what you think of as structured logging?

106
00:10:28.940 --> 00:10:42.280
<v Ben Rady>Yeah. Yeah. I mean, and I, and I grant that people have, have differing takes on this and I think you can do it in different ways, but I, I think that if I were to try to summarize all of the different approaches that I've seen that have been called structured logging, it is kind of, you alluded to it earlier.

107
00:10:42.280 --> 00:10:53.640
<v Ben Rady>It is treating your logs as a stream of events, right? um Sometimes multiple streams of events. Like you can think of like the info logs as one stream and the error logs is a separate stream and the warning logs, another stream.

108
00:10:53.640 --> 00:11:06.420
<v Ben Rady>Or you can mush them all together and have a heterogeneous thing. But the the basic idea is that you are going to not ah think of your logs as I'm just puking some text out to standard error or standard out.

109
00:11:06.420 --> 00:11:23.680
<v Ben Rady>It is, no, there's a stream of events that is coming out of my system. And I can turn those into human readable logs if I want, but I can turn them into whatever I want because I'm a wizard and I have programming skills and I can transform a stream of events into anything.

110
00:11:23.680 --> 00:11:35.490
<v Ben Rady>And so it solves a number of of kind of problems. and And one of them is this sort of case of like making sure that you are actually capturing the information in your logs that you think you are.

111
00:11:35.490 --> 00:11:35.800
<v Matt Godbolt>Right.

112
00:11:35.800 --> 00:11:42.940
<v Ben Rady>Another one is this sort of case of like, well, how do i make sure that we are responding to this situation in which I want to do nothing?

113
00:11:42.940 --> 00:11:55.000
<v Ben Rady>And in fact, the thing that sort of kicked off this whole conversation 10 minutes ago was me writing a test for a situation where I was skipping a trade that I wanted to ignore intentionally because it was being replayed, right?

114
00:11:55.000 --> 00:11:57.800
<v Ben Rady>Like it was it was like, oh, I want to make sure this is idempotent.

115
00:11:57.800 --> 00:11:58.120
<v Matt Godbolt>Right.

116
00:11:58.120 --> 00:12:16.950
<v Ben Rady>We've seen this trade already. I don't want to publish it again. So the correct action is to do nothing, right? Now, in that case, I was making an assertion about a metric, but you could easily imagine that that could also be a log statement, and testing that nothing has happened is a very important thing to be able to do, right?

117
00:12:16.950 --> 00:12:17.380
<v Matt Godbolt>Yes.

118
00:12:17.380 --> 00:12:29.620
<v Matt Godbolt>and and And more importantly, discriminating between the nothing has happened because I processed the event correctly and determined that nothing should happen compared to you didn't call the process event function at all in your test.

119
00:12:29.620 --> 00:12:29.980
<v Ben Rady>right Yep, exactly.

120
00:12:29.980 --> 00:12:56.240
<v Matt Godbolt>Therefore, nothing happened. Right. Yeah. So you can distinguish them when the nothing that happened is actually something that did happen. The something was I bumped a metric saying ignored_events++, or I logged "warning: this event was skipped because it's a replay" or whatever it is that you've done. Yeah, that makes a lot of sense there. It certainly lets you sleep at night a bit more comfortably because, you know, again,
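
NOTE
A minimal sketch of testing that "nothing happened" for the right reason, as described in cues 113 to 120; it is an illustration, not the code Ben was writing. TradePublisher, its fields, and the trade id "T1" are hypothetical; only the ignored_events counter name comes from the conversation.
```python
from collections import Counter
#
class TradePublisher:
    """Publishes each trade id once; replays are skipped and counted."""
    def __init__(self):
        self.metrics = Counter()  # hypothetical in-process metrics
        self.published = []       # stand-in for a downstream sink
        self._seen = set()
    def process(self, trade_id):
        if trade_id in self._seen:
            # Doing nothing is the correct action, so record that we chose it.
            self.metrics["ignored_events"] += 1
            return
        self._seen.add(trade_id)
        self.published.append(trade_id)
#
publisher = TradePublisher()
publisher.process("T1")
publisher.process("T1")  # replay of the same trade
# Nothing new was published...
assert publisher.published == ["T1"]
# ...and the metric proves process() ran and decided to skip, rather than never being called.
assert publisher.metrics["ignored_events"] == 1
```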

121
00:12:56.240 --> 00:13:02.740
<v Matt Godbolt>How many times have we written tests where you realize this test is passing and then like scratching your head like wait, it's not being run, is it?

122
00:13:02.740 --> 00:13:04.990
<v Matt Godbolt>That's where I've misspelled test as "tset".

123
00:13:04.990 --> 00:13:05.370
<v Ben Rady>Yeah, right.

124
00:13:05.370 --> 00:13:05.740
<v Ben Rady>Right. Yes.

125
00:13:05.740 --> 00:13:11.500
<v Matt Godbolt>And now my my my system that looks for only the word test is not actually running any of these files at all. Right.

126
00:13:11.500 --> 00:13:18.740
<v Ben Rady>Right, right. The test where it's like, you can comment out all of the code that you thought you were testing and the test still passed because there's, there's no assertion in it.

127
00:13:18.740 --> 00:13:18.880
<v Matt Godbolt>Yeah.

128
00:13:18.880 --> 00:13:24.960
<v Ben Rady>Right. It's just like run some code and hope an exception doesn't happen. Right. Like those are, those are very unfulfilling tests.

129
00:13:24.960 --> 00:13:25.440
<v Matt Godbolt>Right.

130
00:13:25.440 --> 00:13:25.440
<v Ben Rady>And so.

131
00:13:25.440 --> 00:13:38.720
<v Matt Godbolt>This gives you a way of measuring some of the, some of those types of events and or quantifying them and saying that this, yeah, gathering confidence that actually you are doing the thing that you thought that you were doing.

132
00:13:38.720 --> 00:13:39.700
<v Ben Rady>Yeah, yeah, yeah, yeah.

133
00:13:39.700 --> 00:14:00.360
<v Ben Rady>Another thing that you can do with structured logging, which has another sort of flavor of this is, you know, you you have these moments sometimes where you're you're you're trying to test something and you're like, part of me just wants to like reach into the center of this class and pull out this state. But I don't want to really do that because that's going to break the encapsulation of the class, right? Like,

134
00:14:00.360 --> 00:14:09.310
<v Ben Rady>you know, I want to be able to refactor this code. I want to be able to change things, certain things about this code without having to change the tests, because that's what refactoring is.

135
00:14:09.310 --> 00:14:09.940
<v Matt Godbolt>Thank you.

136
00:14:09.940 --> 00:14:38.640
<v Ben Rady>ah And I don't want to reach into the guts of this class, because that'll make my test less valuable and make it so that I can't refactor. But I really want to know like what this value is. And so one of the things that you can do with structured logging, which I think is really interesting, is it gives you a conduit to sort of more carefully and selectively pull pieces of information out of the internals of a class in a way that doesn't expose all of the guts. It just sort of exposes like the one little piece of information that you want.

137
00:14:38.640 --> 00:14:57.880
<v Ben Rady>And the example of this is like, you're gonna have a log statement that says like the queue size is five, right? Well, it's like, I don't wanna reach into the guts of the class and check to see what the queue size is. But like in the instances where it's important to log what the queue size is, I can use that as a way to confirm my suspicions about what it should be, right?

138
00:14:57.880 --> 00:15:19.620
<v Ben Rady>And you can go another level deep with this if you want to. And I have, and I don't know if it's generally a good idea, but I think it's an interesting thing to talk about, which is when you have structured logs and you can find a way to do object serialization in those structured logs in a way that's not totally insane or sometimes just mildly insane,

139
00:15:19.620 --> 00:15:27.200
<v Ben Rady>You can have complete objects that come out of there and go into your logging system and can be reconstituted later.

140
00:15:27.200 --> 00:15:28.140
<v Matt Godbolt>Right. Mm-hmm. Mm-hmm.

141
00:15:28.140 --> 00:15:38.580
<v Ben Rady>And the one place where I think I have seen this done the least insane is with exceptions, right? Like you have, ah you know, part of your logging system where if an exception occurs,

142
00:15:38.580 --> 00:15:57.000
<v Ben Rady>You have a reasonably high confidence serialization system that allows you to capture that exception, maybe with some special cases in there and make sure it's not too big or contains like a reference to like a, you know, ephemeral resource or some other thing like that, but you have some confidence where you can turn it into something.

143
00:15:57.000 --> 00:16:22.720
<v Ben Rady>And then when you're troubleshooting that error later, you can reconstitute it. And I think that is a more obvious way to do this kind of thing. But I could also see situations in which that that structured logging allows you to sort of, in a in a less brittle way, in a less encapsulation violating way, check to make sure that the internal state of things is what you expect it to be without creating direct dependencies from the tests into the internal parts of the code.

144
00:16:22.720 --> 00:16:30.810
<v Matt Godbolt>And I think that's a special case of what I was talking about right at the beginning, which is to say, you know, again, the the internal state in this instance is whether the cache was hit or not.

145
00:16:30.810 --> 00:16:30.920
<v Ben Rady>Mm-hmm.

146
00:16:30.920 --> 00:16:42.170
<v Matt Godbolt>And it's just a way of exposing that internal state without either putting it in the face of the caller or having to add a whole metric subsystem specifically into that cache to say, did the last thing, all those kinds of things.

147
00:16:42.170 --> 00:16:42.200
<v Ben Rady>Yeah.

148
00:16:42.200 --> 00:16:42.260
<v Ben Rady>Yeah.

149
00:16:42.260 --> 00:16:55.090
<v Matt Godbolt>So it's a really nice way of, yeah, like, kind of side channel attacking the internal state of your, you know, and slightly better than, you know, like having the, um ah the other sort of, I guess, is it an anti-pattern?

150
00:16:55.090 --> 00:16:55.500
<v Ben Rady>hu Yeah. yeah

151
00:16:55.500 --> 00:17:06.400
<v Matt Godbolt>Let me see what you think. You know, how many times have you written something that's like, you know, getCacheForTesting, a function called that, which is, you know, like you look at it and you say, this uses the same, um, other functions.

152
00:17:06.400 --> 00:17:06.400
<v Ben Rady>Yeah, yeah.

153
00:17:06.400 --> 00:17:17.730
<v Matt Godbolt>this uses the same functionality as the real test function, sorry, the real cache function, but it it it does return that tuple with all this extra information about it.

154
00:17:17.730 --> 00:17:17.740
<v Ben Rady>Yeah, yeah.

155
00:17:17.740 --> 00:17:49.580
<v Matt Godbolt>And you look at it and you say, I hope that I don't have a bug in the untested cache get function that isn't in my, you know, for-testing cache function. And you kind of look at it and you go, it's three lines, I think it's fine. Or sometimes you can implement one in terms of the other and hope, fingers crossed, that the optimizer throws away the fact that in your non-test version you always discard that kind of side channel, and therefore it all nets out. That's a nice way of doing it, but um

156
00:17:49.580 --> 00:17:50.080
<v Ben Rady>Yeah, yeah. Yeah.

157
00:17:50.080 --> 00:18:00.500
<v Matt Godbolt>Yeah, so that that, yeah, do you think anytime, i mean, I certainly think of it, anytime I write a test that has "xxForTesting" in it, i I do die inside a little bit, but sometimes it's a necessary evil if I haven't got this.

158
00:18:00.500 --> 00:18:11.630
<v Ben Rady>Yeah, it's it's not great, but if I have to choose between adding a little bit of extra complexity to my code and not being confident that it works, I'm going to go with a little complexity is worth knowing that it actually works.

159
00:18:11.630 --> 00:18:12.020
<v Matt Godbolt>Right.

160
00:18:12.020 --> 00:18:25.440
<v Ben Rady>But if there's a way to do both of those things at the same time, or do it in a way where that sort of surface area of the "for testing" is not only smaller, but also useful for other things, then I think that's a better way to do it.

161
00:18:25.440 --> 00:18:30.060
<v Matt Godbolt>Right. In which case it should, it loses the "for testing" at that point. Right. It just becomes, yeah, it is just like, Hey, this is a, ah yeah.

162
00:18:30.060 --> 00:18:30.360
<v Ben Rady>Yeah.

163
00:18:30.360 --> 00:18:32.280
<v Matt Godbolt>A window into this class that is useful.

164
00:18:32.280 --> 00:18:32.920
<v Ben Rady>Yeah, exactly.

165
00:18:32.920 --> 00:18:36.310
<v Matt Godbolt>Yeah. And metrics exemplify that, metrics and structured logs.

166
00:18:36.310 --> 00:18:36.520
<v Ben Rady>Mm-hmm.

167
00:18:36.520 --> 00:18:37.360
<v Matt Godbolt>Yeah. Yeah.

168
00:18:37.360 --> 00:18:37.680
<v Ben Rady>Mm-hmm. Mm-hmm.

169
00:18:37.680 --> 00:18:38.840
<v Matt Godbolt>No, that's cool.

170
00:18:38.840 --> 00:18:39.500
<v Ben Rady>Yeah.

171
00:18:39.500 --> 00:18:48.250
<v Matt Godbolt>Um, well, that's kind of all we had. I mean, i was going to say, that's what we had planned, but we had no plans. We were just talking and they were like, we should probably record this.

172
00:18:48.250 --> 00:18:49.500
<v Ben Rady>Yeah, we had zero plan.

173
00:18:49.500 --> 00:18:52.760
<v Matt Godbolt>Uh, so here we are. Um,

174
00:18:52.760 --> 00:18:56.120
<v Ben Rady>yeah We could talk about metrics some more. i have lots of ah ideas on metrics and good ways to use metrics.

175
00:18:56.120 --> 00:19:01.070
<v Matt Godbolt>Well, let's do that. Let's do that then. Yeah, I didn't want it to like peter out awkwardly here as it was.

176
00:19:01.070 --> 00:19:01.200
<v Ben Rady>so

177
00:19:01.200 --> 00:19:09.840
<v Ben Rady>No, so one one one thing that I debate a lot is the sort of, i would say the difference between push and pull metrics.

178
00:19:09.840 --> 00:19:10.320
<v Matt Godbolt>Yeah.

179
00:19:10.320 --> 00:19:13.710
<v Ben Rady>So let's contrast two systems in particular as examples here.

180
00:19:13.710 --> 00:19:14.040
<v Matt Godbolt>Hmm.

181
00:19:14.040 --> 00:19:30.080
<v Ben Rady>So one of them that is ah ah kind of top of mind for me recently, actually, is ah a system like StatsD, right? The way StatsD works is ah you have a centralized metrics collection service.

182
00:19:30.080 --> 00:19:30.080
<v Matt Godbolt>Hmm.

183
00:19:30.080 --> 00:20:00.880
<v Ben Rady>And you create, and there's clients that do this for you, but just describing how the protocol works. When you have like a metric, like a counter that you want to increment, or maybe a gauge that it's like, yeah, the disk is like 96%, then you create a very small human readable text snippet, which is like, I think it's like the metric name and then a pipe and then a value and then a pipe and then like the type, or whether it's a gauge or a counter or something like that. I think that's roughly the StatsD thing.

184
00:20:00.880 --> 00:20:08.440
<v Ben Rady>And then you put that in a datagram and you send that datagram off to your central collection server and you have no idea whether it got there, but

185
00:20:08.440 --> 00:20:15.040
<v Matt Godbolt>right And you mean like literally a network packet, a single network fire-and-forget network packet: UDP.

186
00:20:15.040 --> 00:20:25.760
<v Ben Rady>correct. Yep. Yes. Yes. UDP datagram just goes, whoop. And ah the idea is that this is really useful for metrics where you don't want to block the sender, right?
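
NOTE
A minimal sketch of the push model from cues 181 to 186. In the StatsD line format the metric name and value are separated by a colon and the type follows a pipe (name:value|type, with c for counter and g for gauge), slightly different from Ben's from-memory description. The metric names, the helper functions, and the local receiver socket standing in for the collection server are all assumptions for illustration.
```python
import socket
#
def statsd_line(name, value, kind):
    """Format one StatsD metric line, e.g. 'disk.used_pct:96|g'."""
    return f"{name}:{value}|{kind}"
#
def send_metric(sock, address, name, value, kind):
    """Fire-and-forget: one UDP datagram, no acknowledgement, no retry."""
    sock.sendto(statsd_line(name, value, kind).encode("ascii"), address)
#
# Local stand-in for the central collection server so the sketch runs on its own.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
receiver.settimeout(2.0)
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_metric(sender, receiver.getsockname(), "trades.ignored", 1, "c")  # counter
send_metric(sender, receiver.getsockname(), "disk.used_pct", 96, "g")  # gauge
received = [receiver.recv(1024).decode("ascii") for _ in range(2)]
sender.close()
receiver.close()
```
The sender never learns whether a datagram arrived, which is the trade-off discussed here: it cannot block on the collector, and a lost sample is tolerated.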

187
00:20:25.760 --> 00:20:29.090
<v Ben Rady>Like you don't want the sender to be like, I'm waiting to send this metric somewhere.

188
00:20:29.090 --> 00:20:29.820
<v Matt Godbolt>Right.

189
00:20:29.820 --> 00:20:35.300
<v Ben Rady>um But if the if it doesn't get to where it's going, it's maybe not the end of the world, right?

190
00:20:35.300 --> 00:20:36.420
<v Matt Godbolt>Right.

191
00:20:36.420 --> 00:20:42.580
<v Ben Rady>um So that's that is sort of one style. And there are other ways to you know maybe make that a little bit more reliable.

192
00:20:42.580 --> 00:20:43.030
<v Matt Godbolt>Yep. Yep.

193
00:20:43.030 --> 00:20:56.040
<v Ben Rady>And, you know, certainly if you use gauges and things like that more frequently than counters, you can get pretty reliable success out of that. But one of the great advantages of that is that the receiver doesn't need to know that the senders exist.

194
00:20:56.040 --> 00:21:04.270
<v Ben Rady>You can have a situation where it's like a new system comes up and it starts publishing its metrics and the receiver is just like oh, I guess i have a new thing that I need to worry about.

195
00:21:04.270 --> 00:21:04.310
<v Matt Godbolt>yep

196
00:21:04.310 --> 00:21:04.400
<v Ben Rady>Cool.

197
00:21:04.400 --> 00:21:08.760
<v Matt Godbolt>Right, it just receives a datagram from someone else and goes, new client, fantastic, right.

198
00:21:08.760 --> 00:21:08.860
<v Ben Rady>Right.

199
00:21:08.860 --> 00:21:18.110
<v Matt Godbolt>And then there's exactly one piece of configuration, which is in all of the clients: where the aggregator is, where the one receiver is. Got it.

200
00:21:18.110 --> 00:21:18.180
<v Ben Rady>Yep.

201
00:21:18.180 --> 00:21:18.260
<v Ben Rady>Yes.

202
00:21:18.260 --> 00:21:18.400
<v Ben Rady>Yes. Yes.

203
00:21:18.400 --> 00:21:22.380
<v Matt Godbolt>Okay, so that's the, presumably that's the push case, you're pushing out

204
00:21:22.380 --> 00:21:31.900
<v Ben Rady>Yeah. And then you have systems like Prometheus, where the way Prometheus works is you've got an endpoint. I think it's usually an HTTP endpoint. I think it has to be an HTTP endpoint, actually.

205
00:21:31.900 --> 00:21:46.030
<v Ben Rady>Could be wrong about that. um But you've got some endpoint that's in your program that is being monitored, right, that is being observed. And the Prometheus kind of scraper reaches out to you on some periodic basis and says, like, give me your metrics, right?

206
00:21:46.030 --> 00:21:46.320
<v Matt Godbolt>mmhm

207
00:21:46.320 --> 00:21:59.060
<v Ben Rady>And so internally, you can have a thing where it's not blocking the hot loop of any part of your execution. It's just sort of stashing the metrics in memory, to be available the next time it comes around.
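
NOTE
A sketch of the pull model described above, using only the Python standard library rather than a real Prometheus client. The hot path only touches memory, and a small HTTP endpoint hands out a snapshot when the scraper asks. The /metrics path and the "name value" line layout follow Prometheus conventions; the names increment, render, MetricsHandler and serve are made up for illustration.
```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
#
_lock = threading.Lock()
_counters: dict[str, int] = {}
#
def increment(name: str, amount: int = 1) -> None:
    """Hot path: update memory only, no network I/O."""
    with _lock:
        _counters[name] = _counters.get(name, 0) + amount
#
def render() -> str:
    """Snapshot the counters as 'name value' lines, one per metric."""
    with _lock:
        return "".join(f"{k} {v}\n" for k, v in sorted(_counters.items()))
#
class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # keep the sketch quiet
#
def serve(port: int = 0) -> HTTPServer:
    """Start the scrape endpoint on a background thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```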

208
00:21:59.060 --> 00:22:04.920
<v Ben Rady>But it's just taking this sort of like periodic snapshot of what is going on with with the metrics, right?

209
00:22:04.920 --> 00:22:05.360
<v Matt Godbolt>Right. Right.

210
00:22:05.360 --> 00:22:12.380
<v Ben Rady>Now, I'm not even talking about like the actual metric collection internally, because there's like a billion different ways to do that.

211
00:22:12.380 --> 00:22:12.660
<v Matt Godbolt>right

212
00:22:12.660 --> 00:22:19.690
<v Ben Rady>I'm kind of just talking about like, okay, assume you have a program that's got application level metrics. How does it get to somewhere else other than that machine?

213
00:22:19.690 --> 00:22:20.140
<v Matt Godbolt>Right.

214
00:22:20.140 --> 00:22:25.380
<v Ben Rady>And I think these are the sort of two basic ways that I've seen people do it.

215
00:22:25.380 --> 00:22:26.260
<v Matt Godbolt>Right.

216
00:22:26.260 --> 00:22:39.640
<v Matt Godbolt>Absolutely. Push and pull. I mean, we've talked, I think, about various UDP-based systems before. I mean, we had one several companies ago that I know you worked on, which was a metric collection system that was more of the UDP datagram-based thing.

217
00:22:39.640 --> 00:22:51.410
<v Matt Godbolt>Obviously, StatsD is an example of that. It has a lot of benefits. You mentioned the configuration is straightforward. It's non-blocking, for some definition of non-blocking, in the publisher.

218
00:22:51.410 --> 00:22:51.900
<v Ben Rady>Yeah, yeah, right.

219
00:22:51.900 --> 00:22:55.280
<v Matt Godbolt>I mean, sending a UDP datagram is kind of a heavyweight activity in some...

220
00:22:55.280 --> 00:22:55.280
<v Ben Rady>Yeah.

221
00:22:55.280 --> 00:23:19.680
<v Matt Godbolt>worlds, but it's straightforward, relatively speaking, and certainly the StatsD format is very straightforward. So you blast it off. Obviously the drawbacks are it might not get there, which reminds me of a joke, which I'd tell you, but, you know, it's about UDP. I don't think you'd get it.

222
00:23:19.680 --> 00:23:21.700
<v Ben Rady>Mm-hmm. Mm-hmm.

223
00:23:21.700 --> 00:23:31.290
<v Matt Godbolt>It might not get there. If the collector is down or misconfigured, you'll never know. You're just sending it out into the ether, literally.

224
00:23:31.290 --> 00:23:31.880
<v Ben Rady>Right.

225
00:23:31.880 --> 00:23:39.420
<v Matt Godbolt>And there could be a bottleneck if you're generating a ton of statistics back to back.

226
00:23:39.420 --> 00:23:39.440
<v Ben Rady>Yeah.

227
00:23:39.440 --> 00:24:05.940
<v Matt Godbolt>If you try and update your counter on every single update, then you're sending a blast of relatively heavyweight packets at a machine, and that machine has to be able to deal with all of that data. And in fact, you might back up trying to send it. So those are the drawbacks, but it's very, very appealing, because also if you're a very short-lived application, if you're like a command line client, you might not live long enough to be scraped by a different system.

228
00:24:05.940 --> 00:24:07.310
<v Ben Rady>Right, right, right, right, right.

229
00:24:07.310 --> 00:24:08.140
<v Matt Godbolt>Right, that's that.

230
00:24:08.140 --> 00:24:21.590
<v Matt Godbolt>Then let's talk about the pull-based systems. And let me just read that back to you. So in this instance, somehow some centralized system has to know about all of the places that have metrics.

231
00:24:21.590 --> 00:24:22.180
<v Ben Rady>Yeah.

232
00:24:22.180 --> 00:24:32.880
<v Matt Godbolt>And then it is responsible for connecting to them in turn or however, and saying, give me a snapshot of your metrics, please, over HTTP or TCP or something like that.

233
00:24:32.880 --> 00:24:33.300
<v Ben Rady>Mm-hmm.

234
00:24:33.300 --> 00:24:53.040
<v Matt Godbolt>So obviously the pro points there are: the collection system is responsible for the period on which it is collecting these statistics. So it could be like, well, I can do it once a second or once a minute or once an hour. It doesn't matter, as long as I can configure that in one place.

235
00:24:53.040 --> 00:25:02.800
<v Matt Godbolt>And you're not being swamped by millions of intermediate values because you only care about it on the cadence that you care about. Yeah.

236
00:25:02.800 --> 00:25:05.360
<v Matt Godbolt>The drawback is how do you find all your clients?

237
00:25:05.360 --> 00:25:05.640
<v Ben Rady>Right.

238
00:25:05.640 --> 00:25:15.620
<v Matt Godbolt>That sounds relatively complex. Now I've got another problem. So yeah, okay. I've just read those back to you, but obviously you brought this subject up because I believe you probably have opinions, and I'd be interested in your opinions on those things.

239
00:25:15.620 --> 00:25:26.020
<v Ben Rady>I do have opinions. I do want to make the point, though, by the way, about sending the datagrams: you don't have to do that in process. Just as with Prometheus, where you're going to store your metrics in memory and then they get scraped.

240
00:25:26.020 --> 00:25:31.580
<v Ben Rady>You can also store your metrics in memory and then send them out with some cadence over UDP, right? Like you can do them inline.
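
NOTE
One way to sketch the buffered push described above: counters accumulate in memory and are flushed as StatsD-style lines on a fixed cadence, so the hot path never sends a datagram. The class name BufferedCounters and its methods are hypothetical, and the send callable is supplied by the caller (for example, a function that writes one UDP datagram per line).
```python
import threading
from typing import Callable
#
class BufferedCounters:
    """Accumulate counters in memory and flush them as StatsD-style lines."""
    def __init__(self, send: Callable[[str], None]):
        self._send = send
        self._lock = threading.Lock()
        self._counts: dict[str, int] = {}
    def increment(self, name: str, amount: int = 1) -> None:
        with self._lock:
            self._counts[name] = self._counts.get(name, 0) + amount
    def flush(self) -> None:
        # Swap the buffer out under the lock, then send outside it.
        with self._lock:
            pending, self._counts = self._counts, {}
        for name, total in sorted(pending.items()):
            self._send(f"{name}:{total}|c")
    def run_every(self, seconds: float, stop: threading.Event) -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(seconds):
            self.flush()
        self.flush()  # final flush on shutdown
```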

241
00:25:31.580 --> 00:25:32.340
<v Matt Godbolt>That makes sense, yeah.

242
00:25:32.340 --> 00:25:34.040
<v Ben Rady>You don't have to, right?

243
00:25:34.040 --> 00:25:35.420
<v Matt Godbolt>ah Yeah, that makes sense. Yeah.

244
00:25:35.420 --> 00:26:01.360
<v Ben Rady>Yeah. But I am a huge fan. One of the scary bedtime stories that finance dads tell their kids is the story of Knight Capital, and how a trading firm lost, you know, hundreds of millions of dollars in 45 minutes, something like that.

245
00:26:01.360 --> 00:26:01.720
<v Matt Godbolt>Oh.

246
00:26:01.720 --> 00:26:02.360
<v Matt Godbolt>Yes.

247
00:26:02.360 --> 00:26:16.600
<v Ben Rady>And it's a terrible story. And it's funny because we used to work, you used to work, I work with somebody who actually is very familiar with this process, who was directly involved with some of the companies that cleaned up afterwards. Anyway.

248
00:26:16.600 --> 00:26:18.140
<v Matt Godbolt>Very familiar.

249
00:26:18.140 --> 00:26:25.910
<v Ben Rady>And ah it's funny how much of this has turned into sort of like lore and, you know, it's been, you kind you know,

250
00:26:25.910 --> 00:26:26.500
<v Matt Godbolt>Folklore, yeah.

251
00:26:26.500 --> 00:26:41.920
<v Ben Rady>a kind of, you know, the game of telephone has been played many times, but it is nonetheless true that one of the problems that happened there is that they had software running that they did not realize was running, right?

252
00:26:41.920 --> 00:26:44.980
<v Ben Rady>They didn't realize that it was doing what it was doing, right?

253
00:26:44.980 --> 00:26:45.340
<v Matt Godbolt>Yeah.

254
00:26:45.340 --> 00:27:00.340
<v Ben Rady>And I generally feel like I sleep better at night knowing that there's a central server, everything that is running is at least trying to publish to that central server.

255
00:27:00.340 --> 00:27:16.300
<v Ben Rady>And if something comes up unexpectedly, there's at least a chance, probably a very good chance, that those messages will suddenly appear on that central server and it will have the the ability at least to detect that something is running that should not be running.

256
00:27:16.300 --> 00:27:16.750
<v Ben Rady>Right.

257
00:27:16.750 --> 00:27:17.160
<v Matt Godbolt>Mm-hmm.

258
00:27:17.160 --> 00:27:29.640
<v Ben Rady>um You can kind of do a little hybrid of both of these things. If you want, you can have like, you know, the central server then reach back out to the sending clients.

259
00:27:29.640 --> 00:27:40.460
<v Ben Rady>It can even give them like an aggregated ACK where it's like, yeah, I've received 300 messages from you in the last minute or something. Like, just so you know, I'm actually receiving your messages. You can do things like that.

260
00:27:40.460 --> 00:28:03.220
<v Ben Rady>But um the thing that really makes me sleep well at night with a lot of these systems is having a way so that if someone were to start a piece of software like on their desktop or in some test server or somewhere else, it would at least try to tell someone about it as opposed to, well, it hasn't been added to the central configurations.

261
00:28:03.220 --> 00:28:04.380
<v Ben Rady>So there's no way we could ever know.

262
00:28:04.380 --> 00:28:16.750
<v Matt Godbolt>Got it. Yeah. I mean, there are different ways of solving that problem. Obviously one way, because, you know, again, if you try and reach out to a server, but it doesn't come back to you, you still have this problem, right?

263
00:28:16.750 --> 00:28:16.800
<v Ben Rady>There are.

264
00:28:16.800 --> 00:28:16.820
<v Ben Rady>Mm-hmm.

265
00:28:16.820 --> 00:28:29.130
<v Matt Godbolt>You know, and in the finance worlds that we're talking about, we have very strict network segregation, which means that you might not be able to send the ping to the central servers to say like, Hey, I'm a production machine.

266
00:28:29.130 --> 00:28:29.460
<v Ben Rady>Mm-hmm.

267
00:28:29.460 --> 00:28:31.650
<v Matt Godbolt>So there are issues of that nature.

268
00:28:31.650 --> 00:28:31.940
<v Ben Rady>Yep.

269
00:28:31.940 --> 00:28:37.800
<v Matt Godbolt>And so I feel like there's always an incomplete part to this. There's always a bit of a blind spot here.

270
00:28:37.800 --> 00:28:37.800
<v Ben Rady>Yep.

271
00:28:37.800 --> 00:28:37.800
<v Ben Rady>Yeah.

272
00:28:37.800 --> 00:28:55.300
<v Matt Godbolt>But in general, a service discovery mechanism that's robust to these is useful whether you're pushing information to a centralized server or being scraped by some centralized server.

273
00:28:55.300 --> 00:29:06.340
<v Matt Godbolt>And that seems to me the thing here: where you're saying that if you're sending these periodic metric pings to some system, you could notice that something was alive and doing something unexpected.

274
00:29:06.340 --> 00:29:15.780
<v Matt Godbolt>um That's kind of begging the question of like, why are you using your metric system to determine the liveness of software? Why don't we have a software liveness indicator?

275
00:29:15.780 --> 00:29:17.640
<v Matt Godbolt>Maybe you are talking about that as well here, but that's, you know,

276
00:29:17.640 --> 00:29:17.640
<v Ben Rady>Oh, sure.

277
00:29:17.640 --> 00:29:31.230
<v Ben Rady>Yeah, I mean, I'm kind of like, I'm talking about this in the context where where everything is already broken, right? It's sort of like both of these systems work great when everything works great, right? And it's like, when they break, what are some of the different ways in which they break?

278
00:29:31.230 --> 00:29:31.270
<v Matt Godbolt>Right.

279
00:29:31.270 --> 00:29:44.260
<v Ben Rady>And you're absolutely right that network partitioning is one way in which the push-based model, you know, the StatsD model, doesn't save you, because it's like you have a test server that's configured and running in prod and it can't reach the test network.

280
00:29:44.260 --> 00:30:02.770
<v Matt Godbolt>But then there's another potential solution here, which is: if we don't use the fire-and-forget, single UDP datagram thing and you have instead the TCP connection, then obviously you get the positive confirmation that you are talking to the central server.

281
00:30:02.770 --> 00:30:02.860
<v Ben Rady>Mm hmm.

282
00:30:02.860 --> 00:30:02.960
<v Ben Rady>Yeah. Yeah.

283
00:30:02.960 --> 00:30:07.120
<v Matt Godbolt>You get your ticket from it that says, yes, you're okay to run or whatever, those kinds of things.

284
00:30:07.120 --> 00:30:07.240
<v Ben Rady>Yeah, ye yeah,

285
00:30:07.240 --> 00:30:08.020
<v Matt Godbolt>But then you are solving-

286
00:30:08.020 --> 00:30:18.620
<v Matt Godbolt>But then you are sort of solving a similar problem to, and excuse the dog, you are solving similar problems to, and now I can't even remember what the thing's called.

287
00:30:18.620 --> 00:30:22.430
<v Matt Godbolt>What is it we used for service discovery at the old company?

288
00:30:22.430 --> 00:30:23.800
<v Ben Rady>Consul.

289
00:30:23.800 --> 00:30:25.350
<v Matt Godbolt>Consul. Yeah, which is, you know,

290
00:30:25.350 --> 00:30:26.800
<v Ben Rady>Yeah.

291
00:30:26.800 --> 00:30:44.510
<v Matt Godbolt>Chubby, in Google terms, I think is the equivalent. And, you know, it's a centralized lock manager, but it's sort of a small amount of shared state between things. Now obviously that's still opt-in, and you still have to be part of the Consul cluster, or your system has to be registered with the Consul cluster, in order for it to be noticed.

292
00:30:44.510 --> 00:30:44.540
<v Ben Rady>Mm-hmm.

293
00:30:44.540 --> 00:30:52.440
<v Matt Godbolt>But that's what it's supposed to be. That's one of the things it's meant to be there for: to say, like, hey, find me all the things that say that they are metrics producers, or

294
00:30:52.440 --> 00:30:52.440
<v Ben Rady>Mm-hmm.

295
00:30:52.440 --> 00:31:08.600
<v Matt Godbolt>everything that says I'm a web server, or that kind of thing. And so that feels like a good solution. But just like my network partition example, you can still break it, because if you're not in the Consul cluster, then you're in a partitioned world of your own, right?

296
00:31:08.600 --> 00:31:11.350
<v Matt Godbolt>And so, yeah, there's not an easy solution to any of these things.

297
00:31:11.350 --> 00:31:11.380
<v Ben Rady>Yeah, yeah.

298
00:31:11.380 --> 00:31:23.380
<v Matt Godbolt>But I do wonder if conflating metrics gathering with this is a good thing. You know, you just mentioned in passing that this is a useful thing to be able to do.

299
00:31:23.380 --> 00:31:25.400
<v Matt Godbolt>It certainly is a surprise if you get a...

300
00:31:25.400 --> 00:31:33.760
<v Ben Rady>Yeah, this is one of those things where it's like, this is not a real solution to the problem that you're talking about. It's like, we have A and B.

301
00:31:33.760 --> 00:31:33.760
<v Matt Godbolt>Yeah.

302
00:31:33.760 --> 00:31:38.330
<v Ben Rady>We're trying to choose between A and B. And I'm like, I think I like A better than B. And I was like, why?

303
00:31:38.330 --> 00:31:38.440
<v Matt Godbolt>Yeah.

304
00:31:38.440 --> 00:31:46.660
<v Ben Rady>Well, it's like, well, because in certain situations, it'll solve this problem. It's like, well, but in other situations, it won't. It's like, yeah, but that's not why we're talking about A and B. I'm just trying to pick between two options.

305
00:31:46.660 --> 00:31:46.840
<v Matt Godbolt>Yeah.

306
00:31:46.840 --> 00:31:46.980
<v Ben Rady>Right.

307
00:31:46.980 --> 00:31:48.380
<v Matt Godbolt>No, that's really interesting. Yeah, yeah. and

308
00:31:48.380 --> 00:31:49.300
<v Ben Rady>So it's just it.

309
00:31:49.300 --> 00:32:00.180
<v Matt Godbolt>Yeah, no, no, I got it. And I mean, ultimately, it's almost like, what if you were doing metrics gathering with the hybrid solution where, you know, instead of proactively

310
00:32:00.180 --> 00:32:00.180
<v Ben Rady>Mm-hmm.

311
00:32:00.180 --> 00:32:24.180
<v Matt Godbolt>being scraped, you just connect into the central server and then it asks you. So it's still push and pull: you connected into it, and it knows that you exist, and therefore service discovery is, if you telnet to port 8000 of the central machine, then we care about what information you have. But you get scraped by it saying, okay, give me what you got.

312
00:32:24.180 --> 00:32:24.530
<v Ben Rady>Mm-hmm. Yeah.

313
00:32:24.530 --> 00:32:24.880
<v Ben Rady>Yeah. yeah

314
00:32:24.880 --> 00:32:25.060
<v Ben Rady>Mm-hmm.

315
00:32:25.060 --> 00:32:29.610
<v Matt Godbolt>But obviously that doesn't work over HTTP, which obviously has convenience methods.

316
00:32:29.610 --> 00:32:29.840
<v Ben Rady>Mm-hmm.

317
00:32:29.840 --> 00:32:44.860
<v Matt Godbolt>Certainly when I'm a developer, it's useful to be able to hit my own web server. And in fact, some of the tests I wrote ah involved scraping back over the HTTP port to check that I was actually exposing the metrics that I thought I was exposing when I was writing my own Prometheus endpoint.

318
00:32:44.860 --> 00:32:45.420
<v Ben Rady>Yeah.

319
00:32:45.420 --> 00:32:46.910
<v Matt Godbolt>So I think, yeah.

320
00:32:46.910 --> 00:32:47.580
<v Ben Rady>Mm-hmm.

321
00:32:47.580 --> 00:33:07.110
<v Matt Godbolt>Yeah, and to say that the Knight Capital legend was purely that, and not that you did, but there were so many other aspects to that. It was very much the Swiss cheese, and eventually all the holes lined up and one of the things got through.

322
00:33:07.110 --> 00:33:07.500
<v Matt Godbolt>um

323
00:33:07.500 --> 00:33:08.270
<v Ben Rady>Mm-hmm.

324
00:33:08.270 --> 00:33:10.720
<v Matt Godbolt>But yes, metrics. Very much like metrics.

325
00:33:10.720 --> 00:33:30.990
<v Ben Rady>Yeah. Well, so the real thing here, sort of bringing this back to observability in general a little bit, is, I think, I mean, and I do this in the systems that I have: what you probably want to do in a system that has discovered that it is no longer observable is to stop.

326
00:33:30.990 --> 00:33:31.400
<v Matt Godbolt>Yeah.

327
00:33:31.400 --> 00:33:36.160
<v Ben Rady>Because it's sort of like the last gasp of like, someone pay attention to me.

328
00:33:36.160 --> 00:33:36.270
<v Matt Godbolt>Yeah.

329
00:33:36.270 --> 00:33:36.280
<v Matt Godbolt>Yep.

330
00:33:36.280 --> 00:33:51.410
<v Ben Rady>Right? um And so you want to do that in multiple situations. You want to probably have something like that at startup, like registering with some sort of central discovery service or sending out some sort of message saying like, hey, I'm starting up.

331
00:33:51.410 --> 00:33:51.660
<v Matt Godbolt>Yep.

332
00:33:51.660 --> 00:34:00.940
<v Ben Rady>And if you don't have a way to acknowledge that someone heard you, be like, okay, well, then I guess I'll stop then. Like having some mechanism to do that is a great sort of safety mechanism.
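
NOTE
A sketch of the startup check described above, under the assumption that the central service replies to a hello datagram with any datagram as an acknowledgement. That little protocol, the payload, and the names announce_startup and start_or_stop are invented for illustration; a real deployment might register with a discovery service instead.
```python
import socket
import sys
#
def announce_startup(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send a hello datagram and wait briefly for any reply as an acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"hello", (host, port))
            sock.recvfrom(1024)
            return True
        except OSError:  # covers timeouts and refused connections
            return False
#
def start_or_stop(host: str, port: int) -> None:
    """Refuse to keep running if nobody heard the startup announcement."""
    if not announce_startup(host, port):
        sys.exit("nobody acknowledged startup; refusing to run unobserved")
```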

333
00:34:00.940 --> 00:34:01.380
<v Matt Godbolt>Mm-hmm.

334
00:34:01.380 --> 00:34:05.700
<v Matt Godbolt>Along with heartbeating, to make sure that everyone on both ends is still actually there.

335
00:34:05.700 --> 00:34:05.820
<v Ben Rady>Heartbeating.

336
00:34:05.820 --> 00:34:08.140
<v Matt Godbolt>And like, are you still there? And I don't just mean TCP-level stuff.

337
00:34:08.140 --> 00:34:08.260
<v Ben Rady>Yep.

338
00:34:08.260 --> 00:34:10.480
<v Matt Godbolt>I mean, actual application level.

339
00:34:10.480 --> 00:34:10.660
<v Ben Rady>Yeah. Application level heartbeats.

340
00:34:10.660 --> 00:34:11.350
<v Matt Godbolt>Like, are you there?

341
00:34:11.350 --> 00:34:11.480
<v Ben Rady>Yes.

342
00:34:11.480 --> 00:34:15.140
<v Matt Godbolt>Yes. Okay. I'm back. Yes. And just that kind of stuff. That's always a good thing.

343
00:34:15.140 --> 00:34:36.110
<v Ben Rady>Yeah, yeah. And then one of the more interesting ones, and I've had some debates with people about this, but I still think this is the way that I do it, is if you have a system that encounters a fault, so going back to our sort of structured logging, like I've logged an error or an exception, and I try to send that somewhere to notify somebody, right?

344
00:34:36.110 --> 00:34:36.560
<v Matt Godbolt>Yeah. Yep.

345
00:34:36.560 --> 00:34:39.980
<v Ben Rady>What happens when /that/ fails?

346
00:34:39.980 --> 00:34:40.340
<v Matt Godbolt>yeah

347
00:34:40.340 --> 00:34:52.840
<v Ben Rady>I think the right thing to do is a certain amount of retries, like, keep retrying, but if you retry for some period of time, eventually you probably just want the system to stop.

348
00:34:52.840 --> 00:35:06.250
<v Ben Rady>Now, that's not universally true for every single system. There are things where it's like, no, this just needs to keep trucking, even if it's having failures. But all other things being equal, my base argument is if you have a system that has an error, fine.

349
00:35:06.250 --> 00:35:06.620
<v Matt Godbolt>yeah

350
00:35:06.620 --> 00:35:16.440
<v Ben Rady>Errors happen. If you have a system that has an error and tries to report its error and it can't, OK, it should keep retrying. But at a certain point, it should just exit.
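
NOTE
A minimal sketch of the retry-then-exit policy described above. The function name report_or_exit, the retry count, and the exponential backoff are assumptions for illustration; the report callable stands in for whatever sends the error to a central system and is expected to raise OSError on failure.
```python
import sys
import time
from typing import Callable
#
def report_or_exit(report: Callable[[], None], attempts: int = 5, delay: float = 1.0) -> None:
    """Try to report an error; if every attempt fails, stop the process."""
    for attempt in range(attempts):
        try:
            report()
            return
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    # Reporting never succeeded, so nobody can see this process: stop.
    sys.exit("could not report error; exiting because we are unobservable")
```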

351
00:35:16.440 --> 00:35:27.900
<v Matt Godbolt>I would not disagree with you on that. I mean, just to sort of like remind the the the listener, though, that, you know, you and I come from a world of finance where there's a lot of regulatory stuff around.

352
00:35:27.900 --> 00:35:27.920
<v Ben Rady>Yeah.

353
00:35:27.920 --> 00:35:38.250
<v Matt Godbolt>If we can't log what we're doing, you know, again, Knight Capital type stuff, if we can't tell somebody that something is up, then the best course of action is to stop doing anything further.

354
00:35:38.250 --> 00:35:38.400
<v Ben Rady>Mm-hmm.

355
00:35:38.400 --> 00:35:40.460
<v Matt Godbolt>Log everything you can to disk and then kill the process.

356
00:35:40.460 --> 00:35:40.460
<v Ben Rady>Mm-hmm.

357
00:35:40.460 --> 00:36:03.890
<v Matt Godbolt>and be done with it, and hope that that gets someone's attention, right? Why are we not trading anymore? Oh, it turns out the process self-destructed. Why is that? Well, there's been a network split and it can't tell us that the position's out. You know, those kinds of things. And those are more defensible. But yeah, if my pacemaker can't log an error, then maybe I don't want it to stop. But, you know, obviously there are

358
00:36:03.890 --> 00:36:04.180
<v Ben Rady>Right.

359
00:36:04.180 --> 00:36:11.940
<v Ben Rady>Yeah. I don't want my home wifi router to turn off because it can't send logs to some place that I don't care about the logs for. Right.

360
00:36:11.940 --> 00:36:12.080
<v Matt Godbolt>yeah Exactly.

361
00:36:12.080 --> 00:36:25.860
<v Matt Godbolt>So there are ways. But I think, as a sensible... and even within the finance industry, I think, you know, I've worked on desks where it is okay to not be up and running.

362
00:36:25.860 --> 00:36:26.080
<v Ben Rady>Yeah.

363
00:36:26.080 --> 00:36:26.900
<v Matt Godbolt>Like, it's not great.

364
00:36:26.900 --> 00:36:27.180
<v Ben Rady>Yeah.

365
00:36:27.180 --> 00:36:37.880
<v Matt Godbolt>You know, there are going to be some very long meetings where you have to explain yourself, but if you're not on and trading, the only thing is an opportunity cost.

366
00:36:37.880 --> 00:36:43.320
<v Matt Godbolt>You know, you weren't able to make money or whatever, and there are manual ways of trading out of positions and those kinds of things.

367
00:36:43.320 --> 00:36:43.400
<v Ben Rady>yeah yeah

368
00:36:43.400 --> 00:36:58.600
<v Matt Godbolt>But if you have obligations to an exchange or downstream clients, then maybe you have to limp on and say, look, it's better for us to continue to be able to provide this service, albeit disrupted,

369
00:36:58.600 --> 00:37:09.170
<v Matt Godbolt>But I've never worked on a situation like that. So I'm always down with, yes. Literally my C++ exception handling stuff is like, log everything you can to disk and then kill -9 myself.

370
00:37:09.170 --> 00:37:09.290
<v Ben Rady>yeah

371
00:37:09.290 --> 00:37:09.420
<v Ben Rady>so

372
00:37:09.420 --> 00:37:19.570
<v Matt Godbolt>Like, you know, there's no way we can carry on after this point here, right? We are done and dusted. I don't care if the destructors don't run properly. Just kill the process at this point.

373
00:37:19.570 --> 00:37:20.080
<v Ben Rady>huh

374
00:37:20.080 --> 00:37:21.500
<v Matt Godbolt>And that's always okay. yeah

375
00:37:21.500 --> 00:37:26.280
<v Ben Rady>Yeah. I tell you though, ah just to tie this back to testing, because why not? That's where we started.

376
00:37:26.280 --> 00:37:26.880
<v Matt Godbolt>Why not?

377
00:37:26.880 --> 00:37:33.400
<v Ben Rady>The one piece of code I've never really come up with a great way to test is the code that kills the program.

378
00:37:33.400 --> 00:37:47.450
<v Matt Godbolt>So there is, at least in a C++ framework I'm familiar with, there is a death test. And it works by forking the process and then communicating between the two processes to make sure that this actually kills the process.

379
00:37:47.450 --> 00:37:47.580
<v Ben Rady>Oh.

380
00:37:47.580 --> 00:37:55.260
<v Matt Godbolt>Now, unfortunately, Unix being as complicated as it is, there's signal handling and there's like child-parent relations and you can still not always get it right.

381
00:37:55.260 --> 00:37:55.380
<v Ben Rady>That's clever.

382
00:37:55.380 --> 00:38:01.910
<v Matt Godbolt>But it's not a bad way of saying this should abort the process, right? Literally kill the process and be done with it.

383
00:38:01.910 --> 00:38:02.260
<v Ben Rady>Mm-hmm.

384
00:38:02.260 --> 00:38:16.310
<v Matt Godbolt>And you go, well, okay, I'll fork myself here. No snickering in the back. And the child process will do that. And then the parent process monitors to make sure that that's what happens, through some, you know, Unix domain thing.
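
NOTE
A sketch of the fork-based idea described above, written in Python rather than C++ and not taken from any particular test framework. Instead of talking over a Unix domain socket, the parent here simply inspects the child's wait status, which is a simplification. It only works on Unix-like systems, and the names dies_with_sigkill and kill_self are hypothetical.
```python
import os
import signal
from typing import Callable
#
def dies_with_sigkill(action: Callable[[], None]) -> bool:
    """Run action in a forked child and report whether SIGKILL ended the child."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            os._exit(0)  # action returned or raised, so the child survived it
    _, status = os.waitpid(pid, 0)
    return os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
#
def kill_self() -> None:
    """Hypothetical last-resort handler: the equivalent of kill -9 on ourselves."""
    os.kill(os.getpid(), signal.SIGKILL)
```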

385
00:38:16.310 --> 00:38:16.380
<v Ben Rady>Interesting.

386
00:38:16.380 --> 00:38:19.040
<v Matt Godbolt>So you can write tests for these things.

387
00:38:19.040 --> 00:38:22.180
<v Matt Godbolt>There's never an excuse not to write a test for something, he says.

388
00:38:22.180 --> 00:38:22.180
<v Ben Rady>Yeah.

389
00:38:22.180 --> 00:38:31.660
<v Matt Godbolt>Very well aware that I've just spent the last two weeks writing very limitedly tested code, but that's a whole other story for another time.

390
00:38:31.660 --> 00:38:33.200
<v Ben Rady>Yeah. Yeah. yeah yeah

391
00:38:33.200 --> 00:38:34.680
<v Matt Godbolt>All right, friend.

392
00:38:34.680 --> 00:38:35.960
<v Ben Rady>Yeah. That's probably a good place to call it.

393
00:38:35.960 --> 00:38:36.080
<v Matt Godbolt>I think we should call it.

394
00:38:36.080 --> 00:38:36.620
<v Ben Rady>Right.

395
00:38:36.620 --> 00:38:43.700
<v Matt Godbolt>Yeah, this expanded from a, I have an idea, to 40 minutes worth of conversation, which is how it should be.

396
00:38:43.700 --> 00:38:44.460
<v Ben Rady>Yeah.

397
00:38:44.460 --> 00:38:45.310
<v Matt Godbolt>And I've enjoyed it.

398
00:38:45.310 --> 00:38:45.960
<v Ben Rady>Right.

399
00:38:45.960 --> 00:38:53.740
<v Matt Godbolt>But metrics are more useful than you might think. And you should keep them. And structured logging is always a choice too.

400
00:38:53.740 --> 00:38:59.020
<v Ben Rady>Yeah, it's a choice. That's for sure.

401
00:38:59.020 --> 00:39:00.340
<v Matt Godbolt>All right, friend.

402
00:39:00.340 --> 00:39:00.500
<v Ben Rady>Cool.

403
00:39:00.500 --> 00:39:06.890
<v Matt Godbolt>Until next time.

404
00:39:06.890 --> 00:39:09.890
<v Ben Rady>Until next time.