
Virtual Db2 User Group | November 2023

Gaining Insights from Db2 Statistics and Accounting Data

Todd Havekost
Senior z/OS Performance Consultant,
IntelliMagic

[00:00:00] – Amanda Hendley

Today’s session is Gaining Insights from Db2 Statistics and Accounting Data and Todd is joining us. He is a Senior z/OS Performance Consultant for IntelliMagic. He’s got 45 years of IT experience with a primary focus on system and subsystem performance across the z platform. He joined IntelliMagic in 2015 after experiencing firsthand what IntelliMagic Vision did for z/OS infrastructures. He spends his days helping customers get value from their own SMF data to improve availability and performance. You’ve probably heard from Todd before because he is such a well-known speaker and advisor. I know he’s given presentations at CMG and Share across the world. So, Todd, thank you for joining us today.

[00:00:54] – Todd Havekost

All right, great. Thank you, Amanda. Let me get my screen sharing going. All right. Sorry. There we go. One more button. There we go. Okay. Can you see my screen?

[00:01:13] – Amanda Hendley

It looks great.

[00:01:15] – Todd Havekost

Okay, great. All right, well, thanks everybody for joining today, and I’m always eager to learn myself and help others gain insights from SMF data in general. And today, of course, we’re focused in this session on Db2 statistics and accounting data. So, after touching really briefly on some challenges that are often encountered trying to gain insights from SMF data, then the focus of the session is just what the title says: What kind of insights can we gain from Db2 statistics data and from Db2 accounting data. All right, well, we’re extremely fortunate on the z platform in general, and then, of course, specifically with Db2 because they produce a vast set of metrics that have the potential to provide great operational value and insights into how Db2 and all the other z/OS technologies are operating in the environment, in your environment. But unfortunately, sometimes sites face challenges as they seek to realize that potential value. And those can include SMF data primarily being used in a reactive manner. It’s not until a post-mortem following an outage that it’s discovered that a key metric had started showing signs of stress days before the outage actually occurred.

[00:03:48] – Todd Havekost

Another challenge is siloed tooling across disciplines, with each infrastructure team using specialized tooling unique to their area that can often be difficult to learn, and it certainly creates barriers to cross-team collaboration and learning. Static reporting solutions require deciding in advance how you want to define and present the data, and so that can be a challenge when you’re trying to do detailed analysis. Another is a challenge that certainly comes into play in our topic today with Db2 accounting data: the ability to focus the analysis despite massive data volumes. And oftentimes we will hear from people who’ve got so much data that finding what we want to find is like trying to find a needle in a haystack. And in some cases, customers are so overwhelmed with the data volumes that they basically give up and look for ways either to reduce the volume of the data that they collect or don’t even collect it at all. And then a final challenge that we hear from most sites is declining expertise with retirements of tenured staff, and the skills formerly involved in writing programs custom to that site are becoming increasingly scarce. And a lot of the common approaches and tooling today create significant barriers to entry for people that are newer to the platform, or steep learning curves to learn the tooling, and limited ability to visualize the data. So anyway, those are some of the challenges, and as we talk today, we’ll see how those can potentially be overcome.

[00:04:40] – Todd Havekost

All right, so we’ll start out with statistics data. Db2 statistics data is reported by interval at the Db2 subsystem or member level, and for buffer pool metrics, also at the buffer pool level. And so as we look at statistics here, we’ll talk about the kinds of insights we can gain from assessments of key metrics, focus specifically on buffer pool metrics, and then talk a little bit about insights that we can gain from integrating Db2 IFCID 199 data at the buffer pool and database level with another type of SMF data: SMF 42 I/O performance data. And then finally, some insights from other statistics metrics. Okay, so let’s talk a little bit about assessments of key metrics, and especially with respect to industry best practice thresholds. So, the value of automated assessments is to call your attention to areas that may warrant some additional investigation.

[00:05:55] – Todd Havekost

There are just too many components and too many metrics to succeed with a manual approach. So some kind of proactive approach can enable you to say, “hey, I need to start looking at this area” before it turns into a major outage. All right, so today, as we try to see what kind of insights we can gain from Db2 SMF data, I’ll be using IntelliMagic Vision as the tool to explore the data. And again, this isn’t a product feature presentation. It’s really focused on what we can learn and what kinds of insights we can gain from the data. So here’s one implementation of automated assessments, with assessments across the top, grouped in various categories. So I’m going to start with buffer pool health. It’s an easy one to conceptualize. Good Db2 performance, of course, relies on good buffer pool performance and hit ratios and so on. So here are about a dozen metrics that are being assessed for each member of each data sharing group, and since these are buffer pool metrics, for each buffer pool as well. And so we can see some exceptions for one particular data sharing group.

[00:07:17] – Todd Havekost

Db2 buffers are organized by size. So it will initially look at it by buffer size, and of course, at all the sites I’ve ever seen, the highest volume of activity is going on in the 4K buffers. And so here the exceptions are there as well. And so then looking at the buffer pools that make up the 4K pools here in this environment, we can see there are exceptions for several different buffer pools. The icons here are larger for buffer pool 20, indicating that there are more intervals that have exceptions for that particular buffer pool. So here are the metrics viewed across the time interval that’s selected here; it’s a 24-hour day. We’ll talk about the different metrics that are looked at in buffer pool tuning in just a minute here, so I won’t get into that. But anyway, here’s an example: page residency time is generating some exceptions based on the thresholds that are in place, here a warning at 90 seconds and an exception at 60 seconds. And for much of the day, the residency time in this buffer pool is below 60 seconds. And so it can be helpful to put this in context: what kind of levels of activity are going on while this is happening.
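(For illustration, here is a minimal Python sketch of this style of automated threshold assessment, using the 90-second warning and 60-second exception levels just mentioned. The record layout and field names are assumptions for the example, not actual SMF or product field names.)

```python
# Minimal sketch of assessing page residency time against best-practice thresholds.
# Lower residency time is worse, so the exception threshold is the lower value.
def assess_residency(intervals, warn_sec=90, exception_sec=60):
    """intervals: iterable of dicts like
    {"member": "DB2A", "bpool": "BP20", "residency_sec": 45.0}"""
    findings = []
    for rec in intervals:
        r = rec["residency_sec"]
        if r < exception_sec:
            level = "exception"
        elif r < warn_sec:
            level = "warning"
        else:
            continue
        findings.append((rec["member"], rec["bpool"], level, r))
    return findings

sample = [
    {"member": "DB2A", "bpool": "BP20", "residency_sec": 42.0},
    {"member": "DB2A", "bpool": "BP3", "residency_sec": 75.0},
    {"member": "DB2B", "bpool": "BP20", "residency_sec": 180.0},
]
for member, bpool, level, value in assess_residency(sample):
    print(f"{member} {bpool}: {level} (residency {value:.0f}s)")
```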

[00:08:46] – Todd Havekost

And so let me just put get pages on the secondary axis, and we can see, as we would expect, pretty much an inverse relationship. When the volume of get page activity is high, the residency time is lower, and that relationship holds for most of the day, though it doesn’t quite hold up in the late evening. And then too, this is the profile for this particular interval. You always want to know, is this the common behavior or is this an anomaly? So if we compare it with the prior day, we can see it’s a pretty common profile here for this environment. All right, so now let’s dive into buffer pool metrics specifically. That’s a primary focus for Db2 performance and CPU efficiency: you want to maximize the frequency that get pages are satisfied from data that already resides in a buffer, through various prefetch mechanisms, and avoid I/Os that are synchronous with the unit of work. Those I/Os introduce delays into response times, and they consume more CPU than a memory access would. So Db2 statistics data provides really rich metrics for managing and tuning buffer pools.
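(As an aside, the random get page hit ratio referenced throughout this discussion can be derived from the buffer pool counters roughly as shown in this minimal sketch; the variable names are illustrative, not actual SMF field names.)

```python
# Fraction of random get pages satisfied from the buffer pool, i.e. without a
# synchronous read I/O. Inputs are per-interval counters for one buffer pool.
def random_hit_ratio(random_get_pages: float, random_sync_reads: float) -> float:
    if random_get_pages == 0:
        return 1.0  # no demand in the interval, treat as fully satisfied
    return (random_get_pages - random_sync_reads) / random_get_pages

# Example: 1,200,000 random get pages and 600,000 sync reads in an interval
# gives roughly the 50% hit ratio mentioned later for buffer pool 20.
print(f"{random_hit_ratio(1_200_000, 600_000):.0%}")
```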

[00:10:16] – Todd Havekost

So the amount of available memory, or real storage in z/OS terminology, on z hardware has increased dramatically in recent generations. It was not too long ago that 1 TB was the max; now on a z16, you can have up to 40 terabytes. Now, the software support hasn’t grown as rapidly, but since z/OS 2.2, a single image can support up to four terabytes. So that’s still a good size number. And even if your site can’t afford 40 terabytes, odds are your processors have significantly more memory today than they did a few years ago. So how do you take advantage of that memory? Well, across the z/OS infrastructure, Db2 is the biggest exploiter, and in Db2 releases and Db2 function levels, Db2 keeps delivering new ways to leverage z/OS real storage to improve performance and CPU efficiency. And within that, Db2 buffer pools represent the single biggest opportunity to take advantage of that additional memory. So that’s why it’s important to have great visibility into your vital buffer pool metrics. All right, so just as a starter, visibility into the buffer pool sizes is provided from the statistics data.

[00:11:49] – Todd Havekost

And so here’s the total size of the 4K, 8K, 16K, and 32K pools, and then for each specific pool, the relative size. And then of course, we can look at that by member as well. Typically in a data sharing environment, unless a member has a unique function, you’ll have common levels of buffer pool sizes across the data sharing group. So different Db2 industry experts focus on different metrics to identify which buffer pools are candidates for memory. So here are three of the methodologies that are commonly put out there: total read I/O rate, page residency time, and random buffer hit ratio. So let’s do a little bit of exploring around the total read I/O rate, and the concepts will apply to the other two metrics there as well. All right, so looking at I/Os again for this particular high volume data sharing group, again, looking at it by buffer pool, we can see that buffer pool 20 here has the most total I/Os. We’ve got two types of sync reads here, random and sequential. Then we’ve got the three types of prefetch I/Os there. And then in addition, also in this view, we’ve got the write I/Os.

[00:13:44] – Todd Havekost

So let’s customize this, because we’re trying to focus on read I/O rates here. So let’s remove those write variables, and then let’s update the title here to indicate that we’ve now just got reads. So as we zoom into the reads, we can see that buffer pool 20 still has by far the highest I/O rate per second. So if I’m using that as my criteria, then I’m going to start there in terms of focus for additional analysis. All right, so a rule of thumb that’s mentioned in presentations is that buffer pools that exceed 1,000 read I/Os a second would be good candidates. So let’s look at buffer pool 20 here, and now let’s look at it across the members. So when we do that, this is for the entire day interval, you can see that each member averages more than 1,000 read I/Os a second. So they would all be prime candidates for additional analysis here. If we take maybe this member that’s got the highest read I/O rate, let’s just zero in on that one and view it over time; we can see that even by type, the random sync reads themselves are over 2,000 a second.
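(Here is a minimal sketch of that 1,000-read-I/Os-per-second rule of thumb applied to per-pool counters summed over an interval. The pool names, counter names, and values are made up for illustration.)

```python
# Sum sync and prefetch read I/Os per buffer pool over an interval and flag
# pools whose read I/O rate exceeds the 1,000/sec rule of thumb.
INTERVAL_SECONDS = 3600  # assume hourly statistics intervals

pools = {
    "BP20": {"sync_random": 7_500_000, "sync_seq": 800_000,
             "seq_prefetch": 1_200_000, "dyn_prefetch": 2_000_000, "list_prefetch": 300_000},
    "BP3":  {"sync_random": 1_100_000, "sync_seq": 200_000,
             "seq_prefetch": 400_000, "dyn_prefetch": 500_000, "list_prefetch": 100_000},
}

for pool, counters in pools.items():
    read_rate = sum(counters.values()) / INTERVAL_SECONDS
    flag = "candidate for more memory" if read_rate > 1000 else "ok"
    print(f"{pool}: {read_rate:,.0f} read I/Os/sec -> {flag}")
```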

[00:15:47] – Todd Havekost

And again, if we’re looking at the cumulative amount of read I/Os, make it an area chart instead, we’re commonly north of 3,000 read I/Os a second for that member. So again, it certainly fits the criteria for additional tuning. So here I’ve just collected some of the views that we’ve looked at. The first two on the top here are from when we were looking at read I/O rates that identified buffer pool 20, and looked at buffer pool 20 by member. Page residency time is another criterion. So again, in this site, buffer pools 3 and 20 are the lowest, especially when you look at day shift, and when you compare them over time, 20 has got the lowest residency time. So for that criterion it rises up as well. And then finally, in terms of get page hit ratios, buffer pool 20 has got the lowest, around 50%. And here that is, and also as compared to residency time. So anyway, by all three criteria in this environment, buffer pool 20 would be the place to start here, right? But no matter what buffer pool tuning methodology you use, and no matter what key metrics drive your analysis, your process will be greatly enhanced by having easy accessibility to those metrics, both for your initial analysis and then when you’re going to want to come back periodically to revisit it.
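(To tie the three criteria together, here is a minimal sketch of ranking buffer pools as candidates for additional memory by read I/O rate, residency time, and random hit ratio. The pools, values, and the simple composite score are purely illustrative, not a published methodology.)

```python
# Rank buffer pools as tuning candidates: higher read I/O rate, lower residency
# time, and lower random hit ratio all make a pool a better candidate.
pools = {
    # pool: (read I/Os per second, residency seconds, random hit ratio)
    "BP20": (3000, 45, 0.50),
    "BP3":  (1200, 60, 0.72),
    "BP7":  (400, 300, 0.95),
}

def candidate_score(stats):
    read_rate, residency, hit_ratio = stats
    # Normalize each criterion so that a bigger score means a stronger candidate.
    return (read_rate / 1000) + (60 / max(residency, 1)) + (1 - hit_ratio)

for pool, stats in sorted(pools.items(), key=lambda kv: candidate_score(kv[1]), reverse=True):
    print(pool, round(candidate_score(stats), 2))
```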

[00:17:26] – Todd Havekost

So some way of collecting the views that you want to look at, and then changing the time intervals to reflect the new time interval, would greatly enhance the buffer tuning methodology process in your environment. Okay? Next, I said I was going to talk a little bit about insights that the Db2 IFCID 199 data provide into buffer hit ratios and buffer efficiency, because they capture buffer pool statistics at the data set level. And then in some of the examples here, too, I’m integrating that with SMF 42, which is data set I/O performance data. So that can provide some additional insights as well. So when a get page can’t be satisfied from a buffer, you’ve got a sync read I/O. Being able to view the IFCID 199 data and the SMF 42 data gives you response time and cache characteristics of those I/Os at the database and buffer pool level. So let’s look a little bit at some of the insights that we can gain from those metrics. All right, so in this view, across the top here, we’ve got metrics that are by buffer pool, and along the bottom, they’re by Db2 database.

[00:19:00] – Todd Havekost

So just starting here at the top, here again is buffer pool 20 that we’ve been seeing all the time today; it’s doing way more I/Os than any of the other buffer pools. And here with the SMF 42 data, now we can have that broken out, whether it’s a cache hit (that’s green), a cache miss (yellow), or whether it’s actually a sequential I/O, and that’s in the red there. And then also, the SMF 42 data fills in the response times for those I/Os by the common components for disk response times, right? IOSQ, pend, disconnect, and connect. And so those are there as well. So, for example, here we can see that the great majority of the I/Os are cache hits. And then we’ve got that first buffer pool over here. When we look at the disconnect time, right, which corresponds to a cache miss, we can see that it’s a very low value, which corresponds to the fact that we get mostly cache hits. If we look at that over time, we can see that disconnect time is particularly low during the day and gets a little bit larger at night.

[00:20:21] – Todd Havekost

And so that would maybe raise the question, how does the volume of cache misses correlate with that? So when we pull that up, we see the relationship that we would expect: when the cache misses go up, the disconnect time goes up, right? But still really good response time for that buffer pool. All right, so that was looking at the buffer pool level. Now let’s look at the database level, because IFCID 199 gives us both of those, and these are sorted by level of I/O activity. So for this top I/O activity database, it’s overwhelmingly cache hits, which, again, corresponds to a tiny disconnect time. And then the second database here has much more sequential I/O. So when we go back and look at that over here, now, for that database, the connect time is much larger, right, because sequential I/Os are going to be transferring much more data per I/O. So, again, going back here, let’s look a little bit more closely at that database that’s doing a lot of sequential I/O. And initially here, I’m going to look at that across time.

[00:22:09] – Todd Havekost

And again, we’re focusing here on the sequential I/Os, so let me just zero in on those. And then within that database, we might be interested in how the activity is distributed across the page sets in that database. So let me zero in on the sequential I/Os here. There we go. And so we look at the top page sets by number of sequential I/Os, and we can see a very different profile, right? This page set is doing the great majority of its sequential activity during prime shift, whereas this other page set is doing its activity at different times. So even within the database, we’ve got page sets that have very different time-of-day profiles. One other way to leverage this integration of the IFCID 199 and SMF 42 data is with a view of overall get page efficiency. So, again, if I’ve got a get page, I’ve either got a buffer hit, or if it’s not a buffer hit and I have to go do the I/O, then is it a cache hit on disk or is it a cache miss?

[00:23:45] – Todd Havekost

Right? So that’s a way that can be really helpful to identify how efficient my get page operations are. Right. So here’s again, breaking it down by overall buffer size, 4K. So there are a couple of ways I might want to look at this. One is by buffer pool. So now I can compare my buffer pools and see which ones are doing the most I/Os, which ones have the lowest hit percentages. But again, that might not correspond to actual I/Os. So I can look below that and see in this case, it’s about 1,000 I/Os a second. And we’ll see other buffer pools can certainly have higher rates, which we’ve seen earlier today. So, anyway, we can look at the efficiency by buffer pool, or we can look at the efficiency at a database level. When we do that, we can see here, here’s that high activity database we’ve seen earlier; you’ve got 18,000-plus I/Os a second going on. But the I/Os that are happening are pretty much all cache hits. So that’s real helpful. And then again, just to correlate the data, if I take those cache hits and look at what buffer pool they’re coming from, we can all guess the answer.
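(Here is a minimal sketch of that get page efficiency breakdown, combining a buffer-level counter with disk cache hit counts. The counter names and numbers are illustrative only, not actual IFCID 199 or SMF 42 field names.)

```python
# A get page is either a buffer hit, or it drives a read I/O that is either a
# disk cache hit or a disk cache miss; express each bucket as a percentage.
def getpage_efficiency(get_pages, read_ios, cache_hits):
    buffer_hits = get_pages - read_ios
    cache_misses = read_ios - cache_hits
    total = get_pages or 1
    return {
        "buffer hit %": 100 * buffer_hits / total,
        "disk cache hit %": 100 * cache_hits / total,
        "disk cache miss %": 100 * cache_misses / total,
    }

# Example: 100,000 get pages, 8,000 of which needed an I/O, 7,500 of those hit cache.
for label, pct in getpage_efficiency(100_000, 8_000, 7_500).items():
    print(f"{label}: {pct:.1f}")
```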

[00:25:20] – Todd Havekost

The high volume buffer pool there, again, is going to be buffer pool 20. All right. So anyway, those are just some examples of the kinds of insights we can gain from IFCID 199 data and combining or integrating it with SMF 42 data. So here we saw the response time over time for that buffer pool and the cache misses at different times of the day. For this database that had a lot of sequential I/Os, we could see how the profile differed by page set. Here were response time components and cache hits and misses at the database level. And then we looked at efficiency by buffer pool and by database. All right, so we’ve focused so far on a lot of buffer pool information. There are lots and lots of other metrics available in the Db2 statistics data. They can be grouped into the categories you see on the right, the two right-hand columns here, some of the major categories. And we’ll talk about just a couple of things that are listed on the left here. We’ll do flyovers of those. Again, there are just so many opportunities here that we’ll only have time to touch on a couple.

[00:26:55] – Todd Havekost

Commit activity is captured in the statistics, and it also captures commit types. So here again, for this high volume data sharing group, there’s logging activity captured in the statistics data. For this high volume data sharing group, we may be interested in looking at the profile over time. So I’ve got a week’s worth of data in this demo database. If I look at logging throughput by time of day, profiled day by day, I can see we’ve got a really consistent profile with much higher logging volumes in the 8:00 to 11:00 PM time frame in the evening. Statistics data also captures information about SQL statements by type. So if I view that across time, here are my counts of fetches, opens, closes, selects, and so on. Statistics data has metrics on CPU use by type and by address space. So let me look here first. So here are the four major types of Db2 address spaces, and also the type of CPU it is, right? So the green is zIIP time. So we can see an overwhelming percentage of the CPU time that the database address space is using is running on zIIPs here, and about half for the DIST address space, for DDF, and so on.

[00:29:45] – Todd Havekost

And then there’s a lot of information about prefetch activity in the statistics data. Just a couple of quick looks there at some of the metrics: we can see the number of prefetch requests that are going on, and then also the number of I/Os that are involved with those requests. Again, there are the three types here, right, sequential, dynamic, and list. And so if we want to look specifically at pages per prefetch, then of course, not surprisingly, there are many more pages per sequential prefetch than the other types. So again, we only had time to touch on a few of the many other types of metrics available in the statistics data. Here are examples of locking and IRLM data, P-LOCK, global lock, and false lock contention percentages, and then some of the other metrics that we had an opportunity to look at. So again, you can gain a lot of insights from exploring your own data, and I’m sure you already do, and maybe this has suggested some other areas that you can explore out of your Db2 statistics data. Okay, so that’s statistics. Now we’re going to look at accounting. Accounting data is captured in SMF 101 records, which provide thread- or transaction-level data.
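(A minimal sketch of the pages-per-prefetch derivation for the three prefetch types; the counters shown are illustrative values, not actual statistics field names. The same idea applies whether you divide by prefetch I/Os or by prefetch requests.)

```python
# Derive "pages per prefetch" for each prefetch type from per-interval counters:
# pages read via prefetch divided by the prefetch I/Os that read them.
prefetch = {
    "sequential": {"pages_read": 6_400_000, "ios": 200_000},
    "dynamic":    {"pages_read": 1_500_000, "ios": 150_000},
    "list":       {"pages_read":   240_000, "ios":  40_000},
}

for ptype, counters in prefetch.items():
    pages_per_io = counters["pages_read"] / counters["ios"] if counters["ios"] else 0
    print(f"{ptype}: {pages_per_io:.1f} pages per prefetch I/O")
```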

[00:30:31] – Todd Havekost

And at many sites that leads to a massive amount of data, and so that makes having capabilities to focus your analysis on specific aspects of that data very valuable. So, I’ve organized this agenda around four potential ways to navigate through your accounting data and zero in on what you want to analyze: connection type, correlation ID, authorization ID, and plan or package name. So here are some of the types of metrics that we’re going to encounter in the accounting data. And again, we’ve already touched on several of these in the statistics section. But the key extension here is that the accounting data makes these available at additional levels of detail below that of the Db2 member. As suggested in the previous agenda slide, Db2 uses class one, two, and three terminology for classifying the components of elapsed time. So let’s just briefly level set here, as we’ll use this terminology throughout the rest of the session. Class one time includes application and Db2 elapsed time, beginning from when the Db2 thread is created. Class two time breaks out elapsed time within Db2; it includes CPU, both general purpose and zIIP, and wait times.

[00:32:10] – Todd Havekost

And then class three breaks out those wait times into approximately 30 components that are captured in the accounting data. All right, so let’s begin exploring some of this data in this environment, in which, as we’ve pretty much already noticed, there’s one data sharing group that has the overwhelming amount of the activity. So I’m going to just set a global filter here, so that going forward, we will just be looking at accounting data from that data sharing group. All right, so the first thing we’ll look at is analysis by connection type. That’s a very common starting point for subsetting the Db2 accounting data, because the characteristics of work coming from various Db2 callers often differ dramatically from one another. So here are some initial views by connection type. And again, as we look across the top here, here are the types of metrics we saw in the previous slide, and that we’ve also seen in the statistics data, but now available at lower levels, including, to start out with, by connection type. So for example, for this data sharing group, there’s a pretty similar volume of SQL statements between the work coming in from CICS and the work coming in from IMS batch. Let’s just look one level below that.
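(Here is a minimal sketch of how the class 1/2/3 terminology fits together for a single accounting record. The numbers are made up, only a few of the roughly 30 class 3 wait components are shown, and real records also contain time not accounted for in these buckets.)

```python
# Class 1 covers elapsed time from thread creation (application + Db2);
# class 2 is the in-Db2 portion (CPU plus wait); class 3 breaks the in-Db2
# wait time into individual components.
class2_cpu_sec = 0.004          # in-Db2 CPU (general purpose + zIIP)
class3_waits_sec = {            # a few of the ~30 wait components
    "sync I/O wait": 0.006,
    "lock/latch wait": 0.001,
    "log write wait": 0.0005,
}

# Simplification: real records can also include "not accounted" time.
class2_elapsed_sec = class2_cpu_sec + sum(class3_waits_sec.values())
application_time_sec = 0.010    # time spent outside Db2, in the application
class1_elapsed_sec = application_time_sec + class2_elapsed_sec

print(f"class 2 elapsed (in Db2): {class2_elapsed_sec * 1000:.1f} ms")
print(f"class 1 elapsed (total):  {class1_elapsed_sec * 1000:.1f} ms")
```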

[00:34:03] – Todd Havekost

For the CICS work, it follows a really typical online profile, whereas the IMS batch work is very much skewed towards evening batch-type times. We jump over now to look at log throughput. Again, there’s not much difference in logging volumes between these three top connection types. But if we look at this over time, we can see, again, very different time-of-day profiles. There’s a lot of logging activity driven by the two utilities here in the early morning hours. During the prime day, it’s mainly coming from CICS. And in the evening here, IMS batch is driving almost all the logging volume. All right, so elapsed time profiles per commit often reflect big differences between Db2 callers. We certainly see that in this example: the elapsed time per commit for the work coming in from CICS is much, much lower than that coming in from IMS batch. And we’re going to look at that in a little bit more detail in the next section, when we see some of the insights that are available by correlation ID. So one more example here at the connection type level: let’s look at the various types of prefetch activity.

[00:35:42] – Todd Havekost

For most of the caller types, dynamic prefetch accounts for most of the activity. But here, the DDF work also drives a good amount of sequential prefetch. So let’s just look at that over time. When we view the time-of-day profile, we can see that during the day, sequential prefetch is the majority, and then in the other hours, dynamic prefetch makes up the majority. All right, let’s look in a little bit more detail at the sequential prefetch here. So let me just isolate on that and look at that over time during the week. And then let me broaden out the time interval. Instead of just one day here, let me broaden it out to the entire week. When we do that, I see I’ve got this huge spike of prefetch activity coming from Db2 call attach here on the 30th. And so we’re going to just register that for right now. And then when we look at additional levels of detail, we’ll come back and see what we can learn about what workload was driving that big surge in sequential prefetch at that time.

[00:37:17] – Todd Havekost

Okay, so from that initial categorization of Db2 work by connection type, analysis often proceeds along one of three primary next steps. And so first, let’s look at correlation ID. All right, the contents of the correlation ID field differ based on the connection type. For CICS, it’s a transaction ID. For call attach work, it’s a job name. For DDF, it’s the distributed data management external name for the client. For IMS, it’s the program specification block, kind of an application-oriented name. And for TSO, it’s the logon ID. All right, so let’s begin exploring what we can learn from the correlation ID. So let’s start out by looking at CPU consumption, and here’s how it compares across the different caller types; we’ve got a significant amount of work coming in, CPU-wise, from CICS, from IMS batch, from DDF, and from IMS online. So let’s analyze the general purpose CPU use; that’s in the red here. Let’s turn this into a time chart, and let me tweak the title here to reflect the fact that I’m now just looking at general purpose CPU, all right? And I’m going to again extend this, past this one day, to the entire week.
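(A minimal sketch of the correlation ID mapping just described, expressed as a simple lookup; the connection-type strings are illustrative rather than the exact values found in the accounting records.)

```python
# What the correlation ID field means, by connection type, per the discussion above.
CORRELATION_MEANING = {
    "CICS": "CICS transaction ID",
    "CALL ATTACH": "batch job name",
    "DDF": "DDM external name of the client",
    "IMS": "PSB (program specification block) name",
    "TSO": "TSO logon ID",
}

def describe(connection_type: str, correlation_id: str) -> str:
    meaning = CORRELATION_MEANING.get(connection_type.upper(), "correlation ID")
    return f"{correlation_id} ({meaning})"

print(describe("CICS", "PAY1"))         # hypothetical transaction ID
print(describe("CALL ATTACH", "NIGHTJOB"))  # hypothetical job name
```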

[00:39:19] – Todd Havekost

So when I do that, I can see what my CPU profile is. I’ve got an online profile coming from CICS that’s pretty repeatable. The other big driver of CPU within Db2 is IMS batch, and that’s pretty consistent in the evening hours, except I’ve got this one outlier here that has consumed a huge amount of CPU. So let me focus in on that. I want to look at the IMS batch work only. And now that I’ve got that, I want to look at it by correlation name, which again, as I said earlier, for IMS is a PSB name, kind of an application-oriented construct. And I’m going to then zero in on the top five CPU-consuming PSBs here. And when I do that, I see that I’ve got this particular PSB that was driving, for that hour, almost eight CPs’ worth of work. So that helps me understand the application that was driving that big spike in CPU at that particular time. All right, now going back to the CPU, but now looking at it from CICS. So correlation name in Db2 captures the CICS transaction ID.

[00:41:07] – Todd Havekost

So that’s useful in many different ways. So here we see the comparative profiles for CICS transactions, for their work that’s both within Db2 and then the overall, which includes the class one time here from thread creation. So there’s a lot of activity going on within CICS too, as well as within Db2. So now let me look at elapsed time profiles again for CICS. And again I’m going to look at it by correlation ID, so that’s going to give me the CICS transaction. So we can see here I’ve got very different elapsed time profiles per commit for the work coming from different CICS transactions. The common major elements are CPU time, general purpose CPU time within Db2, and then sync I/O wait, right? I got a miss on a get page and I’ve got to go do the I/O synchronous to the unit of work. Now I haven’t mentioned it up to now, but the accounting data also captures the Db2 member name, just like the statistics data does. So we’re going to see where that’s going to come in real handy for us later.

[00:42:39] – Todd Havekost

Right here, we’re just looking at how the elapsed time profile compares across the members in this data sharing group, and we see that it’s very comparable, which is certainly what we would expect. All right, so let’s look at the time-of-day profile for one particular transaction. Again, this is the elapsed time profile within Db2, the class two and three time. So we can see here again, a significant portion is general purpose CPU time and sync I/O wait. And then for this transaction, there are some intervals where there’s some global contention for L-locks. So the area chart lets me see it from a cumulative point of view, certainly a valid way to look at it. Or if I want to compare the components against one another, I can look at this as a line chart. And when I do that, I see that the CPU time is relatively consistent across the time span. But the sync I/O wait time gets quite a bit higher in the evening, when there’s likely a lot more contention for the buffers from batch. And then here’s the global lock contention there as well. And of course, we can always compare it to some other time interval to see again if that’s the usual profile or not.

[00:44:10] – Todd Havekost

And in this case, I think we’ll see that it’s a very similar profile day to day. All right. Another thing we can do is take advantage of the fact that Db2 accounting data captures the CICS transaction ID, and we can integrate data. So across the bottom here’s the accounting data for that correlation ID, for this transaction, and across the top, this is CICS transaction data, SMF 110 subtype 1. So, I can do some comparing between the two. For example, the CPU that CICS reports is all-in; it includes the Db2 time. So here we can see how the Db2 CPU time compares to that total value. It looks like about 10% of the time is spent within Db2. Also, we could compare profiles: the per-commit profile of elapsed time within Db2 to the much longer milliseconds per transaction for the total transactions. All right, so now that we’re looking at correlation ID: earlier we identified this spike in sequential prefetch activity that was driven by Db2 call attach. So we said, when we were looking at correlation IDs, that for the call attach connection type, we’re going to get the job name.
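(A minimal sketch of the comparison just described: the CPU time CICS reports in the SMF 110 data includes the in-Db2 CPU captured in the accounting data, so the ratio shows the share of the transaction’s CPU spent inside Db2. The two inputs are illustrative per-transaction averages, not real field names or measured values.)

```python
# Compare average in-Db2 (class 2) CPU per commit against the all-in CPU that
# CICS reports per transaction, to estimate the in-Db2 share of CPU.
cics_total_cpu_ms = 2.40   # average CPU per transaction from the CICS 110 data
db2_class2_cpu_ms = 0.25   # average in-Db2 CPU per commit from the 101 data

share = db2_class2_cpu_ms / cics_total_cpu_ms
print(f"About {share:.0%} of this transaction's CPU is spent within Db2")
```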

[00:45:48] – Todd Havekost

So here we have the job names that were driving the most sequential prefetch activity. Again, let’s focus on that and remove the other two. And let’s reduce the noise to just the top five. And we’re focused here on sequential prefetches. And let’s look at that over time. And when we do that, we see that yes, at that very time, we had a huge amount of sequential prefetch activity, and it was these jobs driving that activity at that time. Now, we’ve been talking about the great insights that we can gain from the additional levels of detail in the Db2 accounting data. But the accounting data for batch jobs has a significant limitation, and that is that Db2 does not produce interval accounting data. Those records are written only at thread termination. So, for long running work, you often get a single set of metrics at end of job and have no idea of the level of activity that was going on across the life of the job. So, in this case, again, we’ve identified these three jobs that were responsible, but we’re getting data at the very end of the job. And is it possible that we could get some information about the duration and activity that was going on during the execution of these jobs?

[00:47:23] – Todd Havekost

And so if we’re thinking about this, well, Db2 statistics data does have interval-based data, so can we leverage that to help fill in the gaps here? Well, we’ve talked about the fact that the accounting data also captures the Db2 member name. So, let’s see if we can take advantage of that. So here again, here’s this big spike. And so let me drill into this by Db2 member name. And again, I care about the sequential here, and I want to view it over time. And sure enough, all that sequential prefetch activity was taking place in two Db2 members, right, HE and H5. Okay? So that’s what the accounting data can tell me. But again, I only get an observation at the end of the job; I don’t get interval data. So now let me go back over to the statistics data and look at prefetch activity. And this is all the prefetch, but I care about sequential here. So let me look at, actually, I’ll look at the requests, sequential prefetch requests, that’s what we’ve been talking about. And I want to look at those by member. So I’m in statistics data, and I get interval data.

[00:48:55] – Todd Havekost

And now again, I want to go back and look at the data on that day when I had that huge spike, right? And sure enough, here are HE and H5, and they’re running at really high levels for many hours. In fact, it looks like maybe even prior to midnight here. So let me just broaden the interval and reach back into the later hours on the prior day, and then I can trim this one down a little bit, maybe stop around noon. And when I do that, sure enough, it looks like that activity geared up just shortly before midnight and carried on at one level or another all the way till close to 10:00 AM. So that’s an example where we can combine the data: the accounting data told us it was from Db2 call attach, the accounting data told us what jobs it was, and the accounting data told us which two members it was. But then we went over to the statistics data to see what the profile was during the multi-hour duration of those jobs, what was going on in those two members. So we can take advantage of the fact that those two types of data complement one another.
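(A minimal sketch of that complementary use of the two record types: accounting records, written at thread termination, identify the jobs and members of interest, and per-member statistics intervals supply the activity profile over the life of those jobs. All field names, job names, and values are placeholders; the member names HE and H5 are carried over from the discussion.)

```python
# Use accounting (end-of-job) records to pick the members of interest, then
# list the interval statistics for just those members to see the time profile.
from datetime import datetime

accounting = [  # one record per job, written at thread termination
    {"job": "SEQJOB1", "member": "HE", "end": datetime(2023, 10, 30, 9, 50)},
    {"job": "SEQJOB2", "member": "H5", "end": datetime(2023, 10, 30, 9, 55)},
]

statistics = [  # per-member, per-interval sequential prefetch requests
    {"member": "HE", "interval": datetime(2023, 10, 29, 23, 0), "seq_prefetch": 950_000},
    {"member": "HE", "interval": datetime(2023, 10, 30, 4, 0), "seq_prefetch": 1_200_000},
    {"member": "H5", "interval": datetime(2023, 10, 30, 8, 0), "seq_prefetch": 800_000},
]

members_of_interest = {rec["member"] for rec in accounting}
for stat in statistics:
    if stat["member"] in members_of_interest:
        print(stat["member"], stat["interval"], f"{stat['seq_prefetch']:,} seq prefetch requests")
```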

[00:50:24] – Todd Havekost

Okay, I want to talk a little bit about another way to explore accounting data, through the Auth ID. It’s captured for all the connection types, but it can be particularly helpful for insights into DDF work. DDF work is often analyzed based on Auth ID, connecting back to where the work is coming from. So let me pick a report set now that’s geared specifically to DDF. One of the challenges with DDF can be that you’ve got work coming in from wherever; it can be the wild, wild west. So sometimes it helps just having awareness of what my CPU profile is: did I have some runaway work? In this case, I don’t have runaway work, but I do have kind of an interesting spike in CPU that’s happening around midnight each day. So if I go back here and again, just look at the data by Auth ID; for example, this Auth ID, the second one here, follows a time-of-day, online profile. So it’s likely part of some business process. But I do want to see who’s driving those midnight spikes, right? So again, I’m going to zero in on the general purpose CPU.

[00:51:50] – Todd Havekost

I want to look at it over time, and let me just look at the top five or so, again to reduce the noise. And then let me broaden this out across the entire week. And I see now that it was this particular Auth ID; apparently there’s some kind of process that runs every day at midnight that consumes a very significant amount of CPU. So now that I’m armed with an Auth ID, I can go back and tie that, typically, to an application or to where that work is coming from. All right, and then one other common way to slice the accounting data is by plan or package name. So an example of that typically comes into play for online work, or at least commonly does, like CICS or IMS. So I’m going to look here at CICS online work and look at the data now by plan, look at CPU by plan. So again, I can see plan names typically align with business applications. So again, let me zero in on the general purpose CPU time and look at that over time. And again, let me zero in on the top five, get a little clearer picture, and then finally let me broaden this out to look at the entire week.

[00:53:41] – Todd Havekost

When I do that, for this top CPU-consuming plan, I see it’s got an online profile to be sure, but then it’s got these spikes at 8:00 PM most of the days. And so if I want to understand that further, let me isolate on that. And now let me look at that data by, how do I want to look at it, right? What can I leverage from what we’ve seen already today? I can look at it by correlation ID, which is going to be the transaction ID. So now when I do that, I can see that, sure enough, this transaction in green here is the one that has some executions in that 8:00 PM time frame that are driving a lot of CPU. I can also do this by package as well, if you’re generating package data. So here’s an example of the type of report we’ve been looking at, by package collection ID. So again, just recapping what we’ve seen out of the accounting data: we can see again by connection type, we saw SQL statements and log throughput. We saw this big spike in CPU, and correlation ID helped us narrow that down to an IMS PSB. Correlation ID also, for CICS work, gives me transaction IDs, and I can see what components are driving elapsed time there.

[00:56:00] – Todd Havekost

In the DDF world, I’m often going to be looking at Auth IDs, and so I had these spikes in CPU and I could isolate that to an Auth ID. And then for plan data, I looked at the profile here and saw the transaction that was driving that 8:00 PM time frame. Okay, so again, those are the types of metrics we look at. We talked at the beginning about some of the challenges in gaining insights from SMF data; let me just briefly touch on that as we close. The data being primarily used in a reactive manner: one way to gain more help there is to have some kind of proactive assessments going on that can identify potential risks, to direct your attention so you can work those, ideally, before they turn into outages. Siloed tooling across disciplines: we mainly worked out of Db2 data today; we touched briefly on CICS data and SMF 42 data, but having a common interface into data types across the platform promotes collaboration. There are lots of other examples we could have talked about there. When you’re exploring massive data like the accounting data, there are so many different paths.

[00:56:56] – Todd Havekost

The ability to dynamically navigate based on the current view can be extremely helpful there. Again, there are huge data volumes with Db2 accounting data; the ability to narrow the views like we looked at today, right, by connection type, by correlation ID, by Auth ID, by plan or package name, can really get me focused in on the subset of data that I want to see. And then finally, the expertise brain drain: intuitive visibility and the ability to dynamically create reports can help address that. All right, I’ll pause here. I think we’re right at the hour point. I’ll open it up for any questions.

Upcoming Virtual Db2 Meetings

May 21, 2024

Virtual Db2 User Group Meeting

Can Db2 for z/OS be hacked?
Emil Kotrc, IBM Champion, Software Architect
Broadcom Software

Register here

July 16, 2024

Virtual Db2 User Group Meeting

Dale McInnis
IBM

Register Here