Sunday, November 21, 2010

Solaris 10 Container Resource Management using Dynamic Resource Pool & FSS

Let’s think of a physical Solaris server with 2 CPUs which needs to host:

  1. One Database server, which needs a fully dedicated CPU, and
  2. Two Web servers, which are more flexible and can share CPUs.

To accomplish these different levels of isolation we use a Solaris Container technology called Dynamic Resource Pools that enables CPU resources to be dedicated to specific applications. In this example, the database server needs a separate resource pool, while the Web servers can share another.

Log in to the global zone as the superuser and activate the resource pool facility:

#pooladm -e

Check the current status of resource pools using:

#pooladm

Save the current configuration to the default file (/etc/pooladm.conf) using:

#pooladm -s

Check the number of processors available to the system:

#psrinfo

Create a processor set containing one CPU:

#poolcfg -c 'create pset db-pset (uint pset.min=1; uint pset.max=1)'

Update the configuration:

#pooladm -c

Check the configuration:

#pooladm

Create a pool for the Database zone:

#poolcfg -c 'create pool dbpool'

Link the pool to the processor set:

#poolcfg -c 'associate pool dbpool (pset db-pset)'

Commit the change:

#pooladm -c

Check the configuration:

#pooladm

***************************************************

Set the pool when creating the non-global zone for the database, or assign it to an existing zone, with:

zonecfg:dbzone>set pool=dbpool
zonecfg:dbzone>verify

Now, if you set the default pool for the other two web server zones, they will use the default pool, which is associated with the default processor set and its one CPU, while the Database zone gets dedicated use of the other CPU through the processor set “db-pset”.

zonecfg:webzone1>set pool=pool_default

zonecfg:webzone2>set pool=pool_default

Verify the binding by temporarily taking one of the processors (ID 0 or 1) offline:

#psradm -f <processor_id>

Fair Share Scheduler:

While the two Web servers are capable of sharing the remaining CPUs on the system, they each need a minimum guarantee of CPU resources that will be available to them. This is made possible by another Solaris Container technology called the Fair Share Scheduler (FSS). This technology enables CPU resources to be allocated proportionally to applications. That is, each application gets assigned a number of the available “shares” of the total CPU.

Enable FSS for pool_default:

#poolcfg -c 'modify pool pool_default (string pool.scheduler="FSS")'

Commit:

#pooladm -c

Move all the processes in the default pool and its associated zones under FSS, either by rebooting the system or by running:

#priocntl -s -c FSS -i class TS

#priocntl -s -c FSS -i pid 1

(valid classes are RT for real-time, TS for time-sharing, IA for interactive, FSS for fair-share, or FX for fixed-priority)

Remember, the two Web servers share the CPU resources of the default pool with each other as well as the global zone, so you need to specify how those resources should be shared using the Fair Share Scheduler (FSS).

With FSS, the relative importance of applications is expressed by allocating CPU resources based on shares: a portion of the system's CPU resources assigned to an application. The larger the number of shares assigned to an application, the more CPU it receives from the FSS software relative to other applications. The number of shares an application holds is not absolute; what matters is how many shares it has relative to other applications, and whether those applications compete with it for CPU resources. That is what makes the allocation dynamic.

Assign three shares to the first web server container (more priority):

zonecfg:Web-zone1> add rctl
zonecfg:Web-zone1:rctl> set name=zone.cpu-shares
zonecfg:Web-zone1:rctl> add value (priv=privileged,limit=3,action=none)
zonecfg:Web-zone1:rctl> end
zonecfg:Web-zone1> exit

Assign two shares to the second web server container (less priority):

zonecfg:Web-zone2> add rctl
zonecfg:Web-zone2:rctl> set name=zone.cpu-shares
zonecfg:Web-zone2:rctl> add value (priv=privileged,limit=2,action=none)
zonecfg:Web-zone2:rctl> end
zonecfg:Web-zone2> exit

You now have three containers: one with a fixed amount of CPU, and two dynamically sharing CPU with the global zone.

Now the DB server will run on its own guaranteed CPU, protected from the other applications on the system, while the Web servers share the remaining CPU. To clarify the FSS share usage: the first Web server holds three of the six total shares (3 for web-zone1, 2 for web-zone2, 1 for the global zone), entitling it to 0.5 CPU worth of the remaining 1 CPU (1*3/6 = 0.5); web-zone2 holds two of the six shares, giving it 0.33 CPU worth (1*2/6); and the global zone gets the remaining one share, worth 0.17 CPU (1*1/6).

As a side note, Oracle acknowledges this type of Solaris Container as a valid license boundary when an Oracle database is involved in such a setup. In their terminology these are known as Capped Containers, built from a combination of Dynamic Resource Pools and Solaris Zones, where the number of CPUs in the pool determines the size of the license. Keep this in mind when using this technology with Oracle Database or any other Oracle product.

Root password hacking when neither CD-ROM nor JumpStart is available

There are many articles on the web about "Solaris root password recovery". The
usual procedure is to boot the machine from a CD-ROM in single-user mode, mount
the root file system, and then edit the /etc/shadow file. One of my sysadmin
friends wanted to know if there was a way to do this without a CD-ROM, hence
this exercise of booting the machine with kmdb.
The first thing I did was boot the machine in single-user mode with kmdb loaded:
ok boot kmdb -s
Resetting ...
Sun Serverblade1 (UltraSPARC-IIe 650MHz), No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc. All rights reserved.
OpenBoot 4.11.3, 1024 MB memory installed, Serial #52785242.
Ethernet address 0:3:ba:4d:3:74, Host ID: 8325705a.
 
 
Rebooting with command: boot kmdb -s
Boot device: /pci@1f,0/ide@d/disk@0,0:a File and args: kmdb -s
Loading kmdb...
SunOS Release 5.10 Version Generic_141444-09 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
WARNING: todblade: kernel debugger detected: hardware watchdog disabled
Booting to milestone "milestone/single-user:default".
Hostname: solarisbox
As soon as the hostname was displayed on the console, I sent a break and
dropped into the kmdb prompt.
cli>break -y s3
s3: Break sent.
cli>console -f s3
[Connected with input enabled on fru S3]
Escape Sequence is '#.'
I then added a breakpoint at exec_common to trace the exec of the sulogin process.
[0]> exec_common+0x16c:b
[0]> :c
kmdb: stop at exec_common+0x16c
kmdb: target stopped at:
exec_common+0x16c: orcc %g0, %o0, %i3
[0]> $C
000002a100728ff1 exec_common+0x16c(40e0c, 0, cdf18, 1813070, 0, 0)
000002a100729231 exece+0x10(40e0c, fd87bdf0, cdf18, b9fa8, 1010101, 80808080)
000002a1007292e1 syscall_trap32+0xcc(40e0c, fd87bdf0, cdf18, b9fa8, 1010101,
80808080)
[0]> 000002a100729231+0x7c7::print struct pathname
{
pn_buf = 0x30002b20080 "/sbin/sh"
pn_path = 0x30002b20080 "/sbin/sh"
pn_pathlen = 0x8
pn_bufsize = 0x400
}
[0]>
The above is not the exec we are interested in. I put a breakpoint at
exec_common+0x16c because the pathname passed in from userland is pulled into
the kernel by the pn_get call that completes just before exec_common+0x16c.
Look at the following disassembly:
[0]> exec_common+0x16c::dis
exec_common+0x144: ld [%g7 + 0x124], %l4
exec_common+0x148: st %l4, [%fp + 0x7eb]
exec_common+0x14c: call +0x8f724 <sigorset>
exec_common+0x150: add %g1, 0x3a8, %o1
exec_common+0x154: call -0xa6df0 <mutex_exit>
exec_common+0x158: ldx [%l1 + 0x10], %o0
exec_common+0x15c: add %fp, 0x7c7, %o2
exec_common+0x160: mov %i0, %o0
exec_common+0x164: call +0x4c130 <pn_get>
exec_common+0x168: mov %i4, %o1
exec_common+0x16c: orcc %g0, %o0, %i3
exec_common+0x170: bne,pn %icc, +0x6d8 <exec_common+0x848>
exec_common+0x174: add %fp, 0x7a7, %l7
exec_common+0x178: call +0x4c044 <pn_alloc>
exec_common+0x17c: add %fp, 0x7a7, %o0
exec_common+0x180: add %fp, 0x7c7, %o0
exec_common+0x184: add %fp, 0x7ef, %o3
exec_common+0x188: add %fp, 0x7f7, %o4
exec_common+0x18c: mov %l7, %o1
exec_common+0x190: call +0x30008 <lookuppn>
exec_common+0x194: mov 0x1, %o2
[0]>
 
I now had to keep continuing past exec_common+0x16c as shown below. I ignored
around 30 exec_common hits and blindly continued, as nothing was interesting or
relevant.
I was waiting for the "Requesting System Maintenance Mode" message to be
displayed on the console. This message is printed while the sulogin process is
created:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/svc/startd/fork.c#221
[0]> :c
Requesting System Maikmdb: stop at exec_common+0x16c
kmdb: target stopped at:
exec_common+0x16c: orcc %g0, %o0, %i3
Part of the "Requesting System Maintenance Mode" message has been displayed,
so this should be the process we are interested in.
[0]> $C
000002a100758ff1 exec_common+0x16c(38f3c, 0, ffbfff18, 1813070, 0, 0)
000002a100759231 exece+0x10(38f3c, fd07be58, ffbfff18, 7cf28, 0, ff2303a8)
000002a1007592e1 syscall_trap32+0xcc(38f3c, fd07be58, ffbfff18, 7cf28, 0,
ff2303a8)
[0]> 000002a100759231+0x7c7::print struct pathname
{
pn_buf = 0x30002b1e480 "/sbin/sulogin"
pn_path = 0x30002b1e480 "/sbin/sulogin"
pn_pathlen = 0xd
pn_bufsize = 0x400
}
 
Yes, indeed this is the process we are interested in. sulogin will later read
the /etc/shadow file for the root entry. From here on, it is better to track
the UFS read function and watch for the /etc/shadow file.
[0]> ufs_read+8:b
[0]> :c
ntenkmdb: stop at ufs`ufs_read+8
kmdb: target stopped at:
ufs`ufs_read+8: clr %l0
I stepped through uninteresting files. After reading the /etc/default/login and
/etc/nsswitch.conf files, sulogin opens the /etc/shadow file.
[0]> $c
ufs`ufs_read+8(30003c2ca00, 2a100759a10, 0, 300004d1d98, 0, ff3f6ae8)
fop_read+0x20(30003c2ca00, 2a100759a10, 0, 300004d1d98, 0, 135b98c)
read+0x274(101, 0, 30001d3da48, 400, 2001, 0)
syscall_trap32+0xcc(101, 261a0, 400, 0, ff3f4910, 821)
[0]> 30003c2ca00::print vnode_t v_path
v_path = 0x30001864908 "/etc/shadow"
[0]> ::pgrep sulogin
S PID PPID PGID SID UID FLAGS ADDR NAME
R 139 7 139 139 0 0x4a004000 00000300002df858 sulogin
[0]> 00000300002df858::pfiles
FD TYPE VNODE INFO
0 CHR 000003000280edc0 /devices/pseudo/cn@0:console
1 CHR 000003000280edc0 /devices/pseudo/cn@0:console
2 CHR 000003000280edc0 /devices/pseudo/cn@0:console
3 CHR 0000030003c2d200 /devices/pseudo/sysmsg@0:sysmsg
256 REG 0000030003c2ca00 /etc/shadow
257 REG 0000030003c2ca00 /etc/shadow
I then put a breakpoint at uiomove+8. Once the data has been read into memory,
UFS uses uiomove to copy it from kernel space to user space.
[0]> uiomove+8:b
[0]> :c
kmdb: stop at uiomove+8
kmdb: target stopped at:
uiomove+8: be,pn %xcc, +0x13c <uiomove+0x144>
[0]> $c
uiomove+8(fffffa003a6e0000, 15f, 0, 2a100759a10, fffffa003a6e0000, 15f)
vpm_data_copy+0x100(30003c2ca00, 0, 15f, 2a100759a10, 109a358, 15f)
ufs`rdip+0x468(0, 1fff, ffffffffffffe000, 186a060, 30003c6b720, 18ffd58)
ufs`ufs_read+0x208(30003c6b800, 2a100759a10, 0, 300004d1d98, 30003c6b800,
300000a1a80)
fop_read+0x20(30003c2ca00, 2a100759a10, 0, 300004d1d98, 0, 135b98c)
read+0x274(101, 0, 30001d3da48, 400, 2001, 0)
syscall_trap32+0xcc(101, 261a0, 400, 0, ff3f4910, 821)
[0]> fffffa003a6e0000 \s
0xfffffa003a6e0000: root:wqXY4dQaZCTfs:6445::::::
daemon:NP:6445::::::
bin:NP:6445::::::
sys:NP:6445::::::
adm:NP:6445::::::
lp:NP:6445::::::
uucp:NP:6445::::::
nuucp:NP:6445::::::
smmsp:NP:6445::::::
listen:*LK*:::::::
gdm:*LK*:::::::
webservd:*LK*:::::::
postgres:NP:::::::
svctag:*LK*:6445::::::
nobody:*LK*:6445::::::
noaccess:*LK*:6445::::::
nobody4:*LK*:6445::::::
[0]>
The idea is to modify the read-in data in memory and remove the encrypted
password for root.
[0]> fffffa003a6e0005 \v 3a
0xfffffa003a6e0005: 0x77 = 0x3a
[0]> fffffa003a6e0006 \v 36
0xfffffa003a6e0006: 0x71 = 0x36
[0]> fffffa003a6e0007 \v 34
0xfffffa003a6e0007: 0x58 = 0x34
[0]> fffffa003a6e0008 \v 34
0xfffffa003a6e0008: 0x59 = 0x34
[0]> fffffa003a6e0009 \v 35
0xfffffa003a6e0009: 0x34 = 0x35
[0]> fffffa003a6e000a \v 3a
0xfffffa003a6e000a: 0x64 = 0x3a
[0]> fffffa003a6e000b \v 3a
0xfffffa003a6e000b: 0x51 = 0x3a
[0]> fffffa003a6e000c \v 3a
0xfffffa003a6e000c: 0x61 = 0x3a
[0]> fffffa003a6e000d \v 3a
0xfffffa003a6e000d: 0x5a = 0x3a
[0]> fffffa003a6e000e \v 3a
0xfffffa003a6e000e: 0x43 = 0x3a
[0]> fffffa003a6e000f \v 3a
0xfffffa003a6e000f: 0x54 = 0x3a
[0]> fffffa003a6e0010 \v d
0xfffffa003a6e0010: 0x66 = 0xd
[0]> fffffa003a6e0000 \s
s:14761:::::: root::6445::::::
daemon:NP:6445::::::
bin:NP:6445::::::
sys:NP:6445::::::
adm:NP:6445::::::
lp:NP:6445::::::
uucp:NP:6445::::::
nuucp:NP:6445::::::
smmsp:NP:6445::::::
listen:*LK*:::::::
gdm:*LK*:::::::
webservd:*LK*:::::::
postgres:NP:::::::
svctag:*LK*:6445::::::
nobody:*LK*:6445::::::
noaccess:*LK*:6445::::::
nobody4:*LK*:6445::::::
[0]>
Once this was done, I removed all breakpoints and continued the boot process.
Now the root entry without a password is copied from kernel space to user space.
[0]> :z
[0]> :c
Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode
 
<root>#
At the "Root password" prompt, I only had to hit the "Enter" key to log in.
We can also get at the cached file data through the vnode and page structures.
[0]> 30003c6b800::print vnode_t v_pages | ::print page_t p_pagenum
p_pagenum = 0x1f974
[0]> _pagesize::print
0x2000
[0]> 0x1f974 * 0x2000 \s
0x3f2e8000: root:wqXY4dQaZCTfs:6445::::::
daemon:NP:6445::::::
bin:NP:6445::::::
sys:NP:6445::::::
adm:NP:6445::::::
lp:NP:6445::::::
uucp:NP:6445::::::
nuucp:NP:6445::::::
smmsp:NP:6445::::::
listen:*LK*:::::::
gdm:*LK*:::::::
webservd:*LK*:::::::
postgres:NP:::::::
svctag:*LK*:6445::::::
nobody:*LK*:6445::::::
noaccess:*LK*:6445::::::
nobody4:*LK*:6445::::::
[0]>
We could then have modified the bytes from 0x3f2e8005 to 0x3f2e8010 to get the
same result.
After logging into the system, I dumped the contents of /etc/shadow:
<root># cat /etc/shadow
s:14761:::::::::
daemon:NP:6445::::::
bin:NP:6445::::::
sys:NP:6445::::::
adm:NP:6445::::::
lp:NP:6445::::::
uucp:NP:6445::::::
nuucp:NP:6445::::::
smmsp:NP:6445::::::
listen:*LK*:::::::
gdm:*LK*:::::::
webservd:*LK*:::::::
postgres:NP:::::::
svctag:*LK*:6445::::::
nobody:*LK*:6445::::::
noaccess:*LK*:6445::::::
nobody4:*LK*:6445::::::
 
Even though cat shows the content of /etc/shadow as modified, this modification
exists only in the cache (kernel memory). Opening the /etc/shadow file in vi
shows something like this:
<snip>
root::6445::::::^Ms:14761::::::
daemon:NP:6445::::::
bin:NP:6445::::::
<snip>
Edit the first line and remove "^Ms:14761::::::".
After rebooting, the machine will not prompt for a root password until a new
password is set.

Finding Memory Leaks Within Solaris Applications Using libumem

Debugging Methodology:

Many application developers use the standard malloc() and free() memory management routines, which allow an application to be written without a dependence on any particular memory management programming interface. This section outlines the steps needed to take advantage of the libumem library to debug an application's memory transactions.

Library Interposition and libumem Flags

If the libumem library is interposed (by setting the LD_PRELOAD environment variable) when executing an application, the malloc() and free() implementations defined in libumem are used whenever the application calls malloc() or free(). To take advantage of libumem's debugging infrastructure, set the UMEM_DEBUG and UMEM_LOGGING flags in the environment in which the application runs. The most common values for these flags are UMEM_DEBUG=default and UMEM_LOGGING=transaction. With these settings, a thread ID, a high-resolution time stamp, and a stack trace are recorded for each memory transaction initiated by the application.

The following are examples of the commands used to set the appropriate debug flags and interpose the libumem library when executing an application.


(csh)

%(setenv UMEM_DEBUG default; setenv UMEM_LOGGING transaction; setenv LD_PRELOAD libumem.so.1; ./a.out)

or

(bash)

$ UMEM_DEBUG=default UMEM_LOGGING=transaction LD_PRELOAD=libumem.so.1 ./a.out

More details about the debug flags (UMEM_DEBUG and UMEM_LOGGING) can be found in the umem_debug(3MALLOC) man page.

MDB Commands

The developer can view the debug information pertaining to an application's memory management transactions by using MDB.  The following commands within MDB can be used to provide a great deal of information about the memory transactions that took place during the execution of the application.

::umem_status

    * Prints the status of umem, indicating whether the logging features are turned on or off

::findleaks

    * Prints a summary of the memory leaks found within the application
::umalog

    * Prints the memory transactions initiated by the application and the correlated stack traces
::umem_cache

    * Prints the details about each of the umem caches
[address]::umem_log

    * Prints the umem transaction log for the application
[address]::umem_verify

    * Verifies the integrity of the umem caches, which is useful in determining whether a buffer has been corrupted

address$<bufctl_audit

    * Prints the contents of the umem_bufctl_audit structure as defined in the /usr/include/umem_impl.h header file

Example:

Traditional Memory Leak

In order to examine whether an application such as SunMC has a memory leak, one can execute the following steps to narrow down the section of the code that is causing the leak.

1. Verify the OS release; the libumem library is only available on Solaris 9 Update 3 and later.

%uname -a
SunOS hostname 5.9 Generic_112233-05

2. Execute the application with the libumem library interposed and the appropriate debug flags set.

(csh)
%(setenv UMEM_DEBUG default; setenv UMEM_LOGGING transaction; setenv LD_PRELOAD libumem.so.1; ./a.out)

(bash)
$ UMEM_DEBUG=default UMEM_LOGGING=transaction LD_PRELOAD=libumem.so.1 ./a.out
3. Use the gcore(1) command to capture a core of the running application so that its memory transactions can be analyzed.

%ps -ef | grep a.out
user1     970   714  0 10:42:42 pts/4    0:00 ./a.out

%gcore 970
gcore: core.970 dumped
4. Use MDB to analyze the core for memory leaks using the commands described in the previous section.
$mdb core.970
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]

> ::umem_log
CPU ADDR     BUFADDR         TIMESTAMP THREAD  
  0 0002e0c8 00055fb8     159d27e121a0 00000001
  0 0002e064 00055fb8     159d27e0fce8 00000001
  0 0002e000 00049fc0     159d27da1748 00000001
    00034904 00000000                0 00000000
    00034968 00000000                0 00000000
    ... snip ...
 
Here we can see that there have been three transactions by thread #1 on CPU #0.

> ::umalog

T-0.000000000  addr=55fb8  umem_alloc_32
         libumem.so.1`umem_cache_free+0x4c
         libumem.so.1`process_free+0x68
         libumem.so.1`free+0x38
         main+0x18
         _start+0x108

T-0.000009400  addr=55fb8  umem_alloc_32
         libumem.so.1`umem_cache_alloc+0x13c
         libumem.so.1`umem_alloc+0x44
         libumem.so.1`malloc+0x2c
         main+0x10
         _start+0x108

T-0.000461400  addr=49fc0  umem_alloc_24
         libumem.so.1`umem_cache_alloc+0x13c
         libumem.so.1`umem_alloc+0x44
         libumem.so.1`malloc+0x2c
         main+4
         _start+0x108
The three transactions consist of one allocation from the 24-byte umem cache,
and one allocation and one release from the 32-byte umem cache. Note that the
high-resolution timestamps on the left are relative to the most recent memory
transaction initiated by the application.
> ::findleaks
CACHE     LEAKED   BUFCTL CALLER
0003d888       1 00050000 libumem.so.1`malloc+0x0
----------------------------------------------------------------------
 Total       1 buffer, 24 bytes

  This shows that there is one 24-byte buffer that has been leaked.