In this blog post we’re going to revisit how SQL Server on Linux responds to external memory pressure. This is a very long post, and it ends with me not knowing exactly what’s going on…but the journey is pretty fun…let’s go!
On Windows-based SQL Server systems we’ve become accustomed to the OS signaling to SQL Server that there’s a memory shortage. When signaled, SQL Server will kindly start shrinking its memory caches, such as the buffer pool, to maintain overall system stability and usability. Well, that story turns out to be different in SQL Server on Linux. Last year I wrote a very similar post for SQL Server 2017 RTM on the way it reacted to external memory pressure; check that out here! It was quite a dramatic story: you could literally cause SQL Server to swap nearly its entire process address space out to disk! Now, let’s look at how SQL Server on Linux responds to external memory pressure in SQL Server 2019 CTP2.
The Scenario
Our SQL Server on Linux system is a CentOS 7.5 VM running SQL Server 2019 (CTP2). There is 12GB of physical memory, and 80% of it is dedicated to SQL Server via mssql-conf, about 9.2GB, the default configuration. I’ve also set max server memory to 8GB for the SQL Server instance.
So let’s start up SQL Server on Linux 2019 CTP2 and query SQL Server’s cache sizes when there’s no data loaded. I’m going to look in sys.dm_os_memory_clerks for memory allocation accounting information. In Figure 1 you can get a sense of the memory distribution across the system’s caches. Special shoutout to the new “Azure Data Studio” I’m using for these demos.
Now, I’m going to load an 8GB table into the buffer pool, maxing out SQL Server’s max server memory setting. Then we’ll check the allocations again. In Figure 2, you can see the buffer pool is the dominant consumer of memory, and that our database is the dominant consumer inside the buffer pool. So far so good, eh?
Getting Memory Information From Linux
We can use tools like ps, top and htop to look at our virtual and physical memory allocations. We can also look in the /proc virtual file system for our process and check out the status file. In there we’ll find the point-in-time status of the process, including the types of memory allocations for the process. We’ll get granular data on the virtual memory allocations as well as the resident set size of the process. Here are the interesting values in the status file we’re going to focus on today.
VmSize – total current virtual address space of the process
VmRSS – amount of physical memory currently allocated to the process
VmSwap – amount of virtual memory currently paged out to the swap file (disk)
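These fields can also be pulled out of the status file programmatically. Here’s a minimal Python sketch; the field names are as documented in proc(5), and I’m reading my own process as a stand-in for the sqlservr PID:

```python
import os

def read_status(pid, fields=("VmSize", "VmRSS", "VmSwap")):
    """Return the requested /proc/<pid>/status fields, in kB."""
    out = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in fields:
                out[key] = int(rest.split()[0])  # values are reported in kB
    return out

# Inspect our own process; substitute the sqlservr PID on a real system
print(read_status(os.getpid()))
```

The same values top and htop display come from this file, so this is handy for scripting a monitor.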
We use the command pidof sqlservr to find the process ID of our SQL on Linux process; in our case it’s 14689. When running SQL Server on Linux there are two processes for sqlservr; we’re going to look at the one with the higher process ID, since the one with the lower PID is the watchdog process.
To access the status information for our process, we look in /proc/14689/status.
Below is the output from the process’ status file, filtered for the values we’re interested in. You can see about 16.4GB of virtual memory (VmSize), 7.9GB of which is in actual RAM (VmRSS), and 0MB in the swap file (VmSwap). Wait a minute, 16GB for VmSize? Yep, 16GB for VmSize. That’s the reserved allocation for the whole process. More on that in a second…
VmSize: 17165472 kB
VmRSS: 8579016 kB
VmSwap: 0 kB
Now, if we look back at my previous post on SQL Server 2017 RTM, the memory distribution was a bit different. There we saw about 10.16GB of memory in VmSize.
VmSize: 10661380 kB
VmRSS: 8201240 kB
VmSwap: 0 kB
Phantom Memory Allocation?
Now, I hardly ever post something when I don’t have an answer for a specific behavior, but in this case I don’t. My thinking on this is that the 16GB is the reserved allocation for the whole process. Linux uses a demand paging allocation system, so the majority of the process address space is just a reservation and isn’t backed by actual pages. In other words, it’s not *really* consuming physical memory. But what I do see as relevant is that the process appears to carry that allocation forward: as I loaded memory in, the VmSize went from 8GB (the VmSize when SQL Server starts) to 16GB (after I loaded the 8GB table). I’ve reported this to the product team and I’m awaiting an answer on the ~8GB discrepancy between the two tests.
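Demand paging is easy to demonstrate: mapping a large anonymous region grows VmSize by the full reservation, while VmRSS barely moves because no physical pages are allocated until they’re touched. A quick Python sketch of that idea (the 256MB size is arbitrary, scaled down from the GB-range numbers above):

```python
import mmap

def status_kb(field):
    """Read one kB-valued field from /proc/self/status."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])

SIZE = 256 * 1024 * 1024  # a 256MB reservation

vsz_before, rss_before = status_kb("VmSize"), status_kb("VmRSS")
region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
vsz_after, rss_after = status_kb("VmSize"), status_kb("VmRSS")

# VmSize grows by the whole 256MB; VmRSS stays nearly flat until pages are written
```

This matches what we see for sqlservr: a big VmSize is a reservation, not memory the process is actually consuming.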
Let’s Add Some Pressure
Using a small C program I wrote, I use calloc to allocate a large section of contiguous memory, then I have the program continually write data into the memory buffer to ensure those pages stay in physical memory. Using this program, let’s allocate 7GB of RAM. I’m choosing this value because it’s going to cause my system to exceed its physical memory but not exceed the amount of virtual memory in my system when including the swap file’s size, which is 5.8GB on my system. Things get really dicey if you exceed physical plus virtual memory size: processes will start getting killed. Microsoft documents a case here. And just so you know how important adding a monitor for external memory pressure is: when I allocated 7.5GB and exceeded physical + virtual memory, Linux killed both my memory allocation program and SQL Server to preserve the stability of the system. oomkiller is something to look out for!
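My pressure tool is written in C, but the idea is simple enough to sketch in Python: reserve a buffer, then write to every page so the pages are faulted in and stay resident. This is an illustration only, with the size scaled way down from the 7GB used in the test:

```python
import mmap

PAGE = 4096

def allocate_and_touch(size_bytes):
    """Allocate a zero-filled buffer and write to every page to keep it resident."""
    buf = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for off in range(0, size_bytes, PAGE):
        buf[off] = 0xFF  # fault the page in
    return buf

def vm_rss_kb():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

before = vm_rss_kb()
pressure = allocate_and_touch(64 * 1024 * 1024)  # 64MB here; 7GB in the actual test
after = vm_rss_kb()
```

Unlike the reservation-only example earlier, touching every page forces the kernel to back the whole buffer with physical memory, which is what creates the pressure.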
Now, let’s check out the memory accounting information at /proc/PID/status for my memory allocator program. In the output below you can see we have nearly 7GB of virtual memory (VmSize) allocated, nearly the same amount of memory resident in physical memory (VmRSS), and no data in the swap file.
VmSize: 7344248 kB
VmRSS: 7340312 kB
VmSwap: 0 kB
Now, in SQL Server 2017, having that 7GB program running would cause Linux to need to make room in physical memory for this process. Linux does this by swapping least recently used pages from memory out to disk. So under external memory pressure, let’s check the SQL Server process’ memory allocations according to Linux. In the output below we see we still have a VmSize of around 10GB, but our VmRSS value has decreased dramatically. In fact, our VmRSS is now only 2.95GB, and VmSwap has risen to 5.44GB. Wow, that’s a significant portion of the SQL Server process swapped to disk.
VmSize: 10700328 kB
VmRSS: 3098456 kB
RssAnon: 3096308 kB
VmData: 10442840 kB
VmSwap: 5711856 kB
In SQL Server 2019, there’s a different outcome! In the data below we see our 16GB VmSize, which won’t change much, due to the virtual address space of the process. With that large external process running, SQL Server reduced VmRSS from 7.9GB (from Table 1) to 2.8GB, placing about 4.68GB into the swap file. That doesn’t sound much better, does it? I thought SQL Server was going to respond to the external memory pressure…let’s keep digging and ask SQL Server what it thinks about this.
VmSize: 17159552 kB
VmRSS: 2938684 kB
VmData: 16806460 kB
VmSwap: 4911120 kB
What Does SQL Server Think?
In SQL Server on Linux 2017, we saw a large slice of the process address space swap out, and those pages could come from SQL Server’s buffer pool or really any part of the SQL Server process address space, because the Linux kernel is an equal opportunity swapper…so anything in the caches managed by SQL Server didn’t know whether that memory was resident in physical memory or was on disk. In Figure 3, we see the top memory consumers after external memory pressure; our buffer pool is still fully populated with our table data.
Ok, let’s examine what happens in SQL Server 2019 with external memory pressure. Looking at the memory distribution, we can see that the memory clerk for SQL Server’s buffer pool reduced its memory footprint from 7,227MB (the value from back in Figure 2, after the buffer pool was initially loaded with our large table). We then applied external memory pressure with the 7GB process, and we see the buffer pool reduced to 3,502MB, a reduction of 3,725MB. Cool, it looks like we dumped our buffer pool to relieve the system. But not the whole buffer pool; there’s still 3,502MB in memory. Yet the SQL Server 2019 on Linux process still swapped out nearly that much data (Table 5)…what’s happening? I still don’t know. Let’s keep digging.
Now, in SQL Server 2017 on Linux, Resource Monitor didn’t exist…let’s investigate what we have in SQL Server 2019. Using the query from Amit Banerjee and Sudarshan Narasimhan in this post here, let’s see what actually happened according to the newly implemented Resource Monitor for SQL Server 2019 on Linux.
In the output of this query, we can see that System Indicator is 2 at times of external memory pressure. This is a signal from Resource Monitor to SQL Server to dump its caches to preserve the health of the overall system. We certainly observed that in the output in Figure 4. But what we also found is that Linux is placing 4.68GB of memory into the swap file while SQL Server is holding 3,502MB of memory in the buffer pool, even though the VmRSS of the process is 2.8GB (Table 5). So the question remains: why is SQL Server swapping so much of its memory out to disk and holding on to so much memory in the buffer pool even though it’s seeing the low memory signal from Resource Monitor? This observation is further supported by the fact that SQL_CommitedMemory_KB stays relatively stable; in the light of the external memory pressure, the expectation is that it would reduce. So it’s as if it’s not quite finishing the job of deallocating the memory. So let’s ask Linux what it thinks about this.
Resource Monitor on Linux – How I Think It Works…
If you’ve read this blog before, you know I like to use strace to understand how applications interact with the OS via system calls. Well, let’s use that method to try to figure out what happens when SQL Server starts dumping its caches under external memory pressure.
In the last section, we learned that Resource Monitor can track low memory conditions in Linux. In this section, let’s find out how it interacts with the Linux memory management system to deallocate memory.
In the output below (Table 6) you can see the aggregated system calls from when I’m applying external memory pressure to SQL Server on Linux. From that output, it looks like SQL Server on Linux uses memory mapped files, which are a file access technique where files are mapped directly into a process’ memory address space. The process can then access the file via direct memory addressing rather than using system calls. This type of access actually lends itself very well to the internal file structures used in SQL Server data files.
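From user space, memory-mapped file access looks like this: the file is mapped once, then read with ordinary memory operations instead of read() calls. A small Python illustration (the temp file is just a stand-in; SQL Server maps its own data files):

```python
import mmap
import os
import tempfile

# Create a small file to stand in for a data file
fd, path = tempfile.mkstemp()
os.write(fd, b"page-oriented data file contents")

# Map it and access it with slicing: direct memory addressing,
# no read() system call per access
with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    header = m[:4]

os.close(fd)
os.remove(path)
```

Because the mapped pages live in the page cache, the kernel can manage them with the same machinery it uses for any other memory, which is where madvise comes in next.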
During external memory pressure, madvise is one of the most frequently called system calls. This system call is used by applications to advise (hint) the kernel on what to do with ranges of memory under certain conditions. In the previous section, we observed that Resource Monitor can sense the external memory pressure and signals SQL Server to clear its caches, and this is likely the implementation of how those caches are dumped.
Under memory pressure, the madvise system call is called with the MADV_DONTNEED flag for large address ranges in SQL Server’s process address space, the memory mapped file. The MADV_DONTNEED flag tells the kernel that the application does not expect to access these pages in the near future and that they can be discarded, and thus freed, rather than swapped out to disk. See the kernel source code here for how this works.
madvise(0x69274a000, 8192, MADV_DONTNEED) = 0
My theory is that this is how Resource Monitor signals to the OS that these pages are no longer needed and can be deallocated: due to the semantics of the madvise system call, all of those pages are marked as don’t-need. So we see the reduction in the sizes reported by SQL Server’s memory clerks. But what I find interesting is the amount of memory that’s still hitting swap when we look at the memory accounting information for the individual processes and the IO profile of the system during external memory pressure. It seems like the kernel or the application is holding onto pages, and they’re hitting swap rather than being deallocated.
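The MADV_DONTNEED semantics are easy to verify from user space: for a private anonymous mapping, a dirtied page that’s marked don’t-need comes back as a fresh zero-filled page on the next access, i.e. the contents are discarded rather than written to swap. A Python sketch of that behavior (mmap.madvise requires Python 3.8+ on Linux):

```python
import mmap

# Private anonymous mapping, one page
m = mmap.mmap(-1, 4096, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
m[0:4] = b"data"               # fault the page in and dirty it

m.madvise(mmap.MADV_DONTNEED)  # tell the kernel we don't need this range

# For private anonymous memory the kernel is free to drop the page;
# the next access gets a zero-fill-on-demand page instead
recovered = m[0:4]
```

This is exactly why a cache dump via madvise should show up as freed memory rather than swap traffic, which makes the swap activity we observed all the more puzzling.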
Well, that’s my theory, and I’m trying to bring the pieces together in this emerging technology since it’s still in CTP. So if you have some insight into what’s happening at this layer, I’d love to hear from you 🙂
strace -p PID -f -c
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
75.98 575.483472 12188 47217 11425 futex
8.24 62.370591 275976 226 epoll_wait
7.73 58.578205 2253008 26 14 restart_syscall
3.44 26.050226 2170852 12 io_getevents
1.60 12.096475 549840 22 nanosleep
1.55 11.726022 266501 44 44 rt_sigtimedwait
0.71 5.387356 14 389517 clock_gettime
0.49 3.688034 24 154405 madvise
0.12 0.944458 803 1176 munmap
0.07 0.495154 343 1445 mmap
0.06 0.444399 7796 57 open
0.01 0.062301 1093 57 read
0.00 0.013327 579 23 23 stat
0.00 0.009003 46 196 gettid
0.00 0.005058 43 119 sched_setaffinity
0.00 0.004188 68 62 clone
0.00 0.004018 65 62 set_robust_list
0.00 0.003902 36 109 rt_sigprocmask
0.00 0.003194 56 57 fstat
0.00 0.002914 47 62 sched_getaffinity
0.00 0.002331 38 62 sigaltstack
0.00 0.001805 29 62 arch_prctl
0.00 0.001575 28 57 close
0.00 0.001182 394 3 io_submit
0.00 0.000672 42 16 sched_yield
0.00 0.000506 22 23 rt_sigaction
0.00 0.000011 11 1 fdatasync
------ ----------- ----------- --------- --------- ----------------
100.00 757.380379 595118 11506 total
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
71.52 569.206916 33673 16904 4444 futex
12.46 99.164048 819538 121 epoll_wait
6.90 54.920907 2387866 23 17 restart_syscall
2.73 21.715336 1084 20033 io_getevents
2.15 17.123567 1902619 9 nanosleep
2.07 16.494124 868112 19 19 rt_sigtimedwait
1.77 14.122895 1070 13202 io_submit
0.22 1.780524 218 8180 munmap
0.10 0.769185 93 8262 mmap
0.06 0.502749 125 4026 clock_gettime
0.00 0.006171 386 16 sched_setaffinity
0.00 0.005702 62 92 read
0.00 0.004128 197 21 fstat
0.00 0.003996 118 34 sched_yield
0.00 0.003961 165 24 gettid
0.00 0.003112 346 9 arch_prctl
0.00 0.002615 291 9 sched_getaffinity
0.00 0.002528 84 30 rt_sigprocmask
0.00 0.002045 85 24 close
0.00 0.001929 92 21 open
0.00 0.001924 214 9 clone
0.00 0.001752 195 9 set_robust_list
0.00 0.001647 97 17 madvise
0.00 0.001198 599 2 fdatasync
0.00 0.000989 110 9 sigaltstack
0.00 0.000632 57 11 11 stat
0.00 0.000446 45 10 rt_sigaction
0.00 0.000351 18 19 write
0.00 0.000316 11 29 epoll_ctl
0.00 0.000310 44 7 writev
0.00 0.000159 11 15 7 readv
0.00 0.000051 26 2 socket
0.00 0.000021 11 2 setsockopt
0.00 0.000014 4 4 getcpu
0.00 0.000009 3 3 getsockname
0.00 0.000009 5 2 getpeername
0.00 0.000004 4 1 accept
0.00 0.000004 2 2 fcntl
------ ----------- ----------- --------- --------- ----------------
100.00 795.846274 71212 4498 total