SAN Performance Metrics
I often get requests from application
owners to review performance stats. I thought I’d give a quick overview of some
of the things I look at, what the myriad of performance metrics in Navisphere
Analyzer and ECC Performance Manager mean, and how you might use some of them
to investigate a performance problem. Performance analysis is very much an art
(not a science) and it’s sometimes difficult to pinpoint exact causes based on
the mix of applications and workload on the array. Taking all of the metrics into
account with a holistic view is needed to be successful. Performing data
collection of application workloads over time is recommended because
application workload characteristics will likely vary over time. If you have a
major problem, I would always recommend opening an SR with EMC.
This post is just an overview of SAN
performance metrics and isn’t meant to dive into every possible scenario from
every angle. EMC already has excellent guides for performance best practices
that you can read here:
·
http://www.emc.com/collateral/hardware/white-papers/h5773-clariion-best-practices-performance-availability-wp.pdf
(Older version for CLARiiON)
·
http://www.scribd.com/doc/91233385/h8268-VNX-Block-Best-Practices
(Newer version for VNX)
Because we have EMC’s Performance
Manager tool installed in our environment, I always go to that tool first
rather than Navisphere Analyzer. Both use the same metrics, so the following
information will be useful regardless of which method you use.
The first thing I do is look at the
Storage Processors. This will give you a good indication of the overall health
of the array before you dive into the specific LUN (or LUNs) used by the
application.
·
SP Cache Dirty Pages (%). These are pages in write cache that
have received new data from hosts but have not yet been flushed to disk. You
should have a high percentage of dirty pages as it increases the chance of a
read coming from cache or additional writes to the same block of data being
absorbed by the cache. If an IO is served from cache the performance is better
than if the data had to be retrieved from disk. That’s why the default
watermarks are usually around 60/80% or 70/90%. You don’t want dirty pages to
reach 100%; they should fluctuate between the high and low watermarks (which
means the cache is healthy). Periodic spikes or drops outside the watermarks
are ok, but consistently hitting 100% indicates that the write cache is
overstressed.
·
SP Utilization (%). Check and see if either SP is running
higher than about 75%. If either is running that high, application response time
will increase. Also, both will need to be under 50% for non-disruptive
upgrades (NDU). We had to do a large scale migration of data from one SAN to another
at one point in order to get an NDU accomplished. You’ll also want to check for
proper balance. If one is much higher than the other, you should consider
migrating LUNs from one SP owner to another. I check SP balance on all of our arrays
on a daily basis.
·
SP Response time (ms). Make sure again that both SPs are even
and that response time is acceptable. I like to see response times under 10ms.
If you see that one SP has high utilization and response time but the other SP
doesn’t, look for LUNs owned by the busier SP that are using more array
resources. Looking at total IO on a per LUN basis can help confirm this. If both SPs
have relatively similar throughput but one SP has much higher bandwidth, that
could mean that there is some large block IO occurring.
·
SP Port Queue Full Count. This represents the number of times
that a front end port issued a QFULL response back to the hosts. If you are
seeing QFULL’s it could mean that the Queue Depth on the HBA is too large for
the LUNs being accessed. A Clariion/VNX front end port has a queue depth of
1600 which is the maximum number of simultaneous IO’s that port can process.
Each LUN on the array has a maximum queue depth that is calculated using a
formula based on the number of data disks in the RAID Group. For example, a
port with 512 queues and a typical LUN queue depth of 32 can support up to: 512
/ 32 = 16 LUNs on 1 Initiator (HBA) or 16 Initiators (HBAs) with 1 LUN each or
any combination not to exceed this number. Configurations that exceed this
number are in danger of returning QFULL conditions. A QFULL condition signals
that the target/storage port is unable to process more IO requests and thus the
initiator will need to throttle IO to the storage port. As a result of this,
application response times will increase and IO activity will decrease.
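The queue arithmetic in the QFULL example above can be sketched in a few lines of Python. This is only an illustration using the figures from that example (a port with 512 queues and a LUN queue depth of 32). The function names are hypothetical and are not part of Navisphere Analyzer, Performance Manager, or any EMC tool.

```python
def max_lun_paths(port_queue_depth: int, lun_queue_depth: int) -> int:
    """Number of initiator/LUN pairs a port can serve before QFULL is a risk."""
    if port_queue_depth <= 0 or lun_queue_depth <= 0:
        raise ValueError("queue depths must be positive")
    return port_queue_depth // lun_queue_depth


def qfull_risk(port_queue_depth: int, lun_queue_depth: int,
               initiators: int, luns_per_initiator: int) -> bool:
    """True when the worst-case outstanding IO exceeds the port queue depth."""
    worst_case = initiators * luns_per_initiator * lun_queue_depth
    return worst_case > port_queue_depth


print(max_lun_paths(512, 32))      # 16
print(qfull_risk(512, 32, 4, 4))   # False: 4 * 4 * 32 = 512
print(qfull_risk(512, 32, 4, 5))   # True:  4 * 5 * 32 = 640
```

The check assumes every initiator drives every LUN to its full queue depth at the same time, which is the worst case rather than the typical one.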
The next thing I do is look at the
specific LUNs that the application owner is asking about. The list below
includes the basic performance metrics that I most often look at when
investigating a performance problem.
·
Utilization (%) represents the fraction of an
observation period during which a LUN has any outstanding requests. When the
LUN becomes the bottleneck, the utilization will be at or close to 100%.
However, since I/Os can get serviced by multiple disks an increase in workload
might still result in a higher throughput. Utilization by itself is not a very
good indicator of the overall performance of the LUN; it needs to be factored
in with several other things. For example, if you are writing to a LUN (100%
Writes) and the location of the data is in a small physical space on the LUN,
it may be possible to get to 100% with write cache re-hits. This means that all
writes are being serviced by the write cache and since you are writing data to
the same locations over and over you do not flush any of the data to the disks.
This can cause your LUN Utilization to be 100% but there will actually be no IO
to the disks. Utilization is heavily affected by caching, both read and write. The
LUN can be very busy but may not have a problem. Use Utilization to help
identify busy LUNs, then look at queuing and response times to see if there
really is an issue.
·
Queue Length is the average number of requests within a polling
interval that are outstanding to this LUN. A queue length of zero indicates an
idle LUN. If three requests arrive at an idle LUN at the same time, only one of
them can be served immediately; the other two must wait in the queue. That
scenario would result in a queue length of 3. My general guideline for “bad
performance” on a LUN is a queue length greater than 2 for a single disk drive.
·
Average Busy Queue Length is the average number of outstanding
requests when the LUN was busy. This does not include any idle time. This value
should not exceed 2 times the number of spindles on a LUN. For example, if a
LUN has 25 spindles, a value of 50 is acceptable. Since this queue length is
counted only when the LUN is not idle, the value indicates the frequency
variation (burst frequency) of incoming requests. The higher the value, the
bigger the burst and the longer the average response time at this component. In
contrast to this metric, the average queue length also includes idle
periods when no requests are pending. If the LUN has just one outstanding
request 50% of the time and is idle the other 50%, the average busy queue length will
be 1. The average queue length, however, will be ½.
·
Response Time (ms) is the average time, in milliseconds,
that a request to this LUN is outstanding, including its waiting time. The
higher the queue length for a LUN, the more requests are waiting in its queue,
thus increasing the average response time of a single request. For a given
workload, queue length and response time are directly proportional. Keep in
mind that cache re-hits bring down the average response time (and service
times), whether they are reads or writes. LUN Response time is a good starting
point for troubleshooting. It gives a good indicator of what the host system is
experiencing. Usually if your LUN response time (Response time = queue length *
service time) is good then the host performance is good. High response times
don’t always mean that the CLARiiON is busy; they can also indicate that you’re
having issues with your host or fabric. We use the Brocade Health report on a
regular basis to identify hosts that have an excessive amount of traffic, and
we run the EMC HEAT report on hosts that have reported issues (which
can identify incorrect HBA drivers, bad HBAs, etc.). These are my general guidelines
for response time:
Less than 10 ms: very good
Between 10 – 20 ms: okay
Between 20 – 50 ms: slow, needs attention
Greater than 50 ms: I/O bottleneck
·
Service Time (ms) represents the time, in milliseconds, a
request spent being serviced by a component. It does not include time waiting
in a queue. Service time is mainly a characteristic of the system component.
However, larger I/Os take longer and therefore usually result in lower
throughput (IO/s) but better bandwidth (Mbytes/s). Service time is
simply the time it takes to actually send the I/O request to the storage and
get an answer back. In general, I like to see service times below 20ms.
·
Total Throughput (IO/sec) is the average number of host requests
that is passed through the LUN per second. This includes both read and write
requests. Smaller requests usually result in a higher total throughput than
larger requests. Examining total throughput (along with %Utilization) is a good
way to identify the busiest LUNs on the array. In general, here are the IOPs
limits by drive type:
RPM      Drive Type     IOPS
7,200    SATA, NL-SAS   ~80
10,000   SATA, NL-SAS   ~130
10,000   FC, SAS        ~140
15,000   FC, SAS        ~180
N/A      EFD            ~1500 (Read/Write, 60/40)
N/A      EFD            ~6000 (Read)
N/A      EFD            ~3000 (Write)
·
Write Throughput (I/O/sec) The average number of host write
requests that is passed through the LUN per second. Smaller requests usually
result in a higher write throughput than larger requests. When troubleshooting
specific LUNs, check the write IO size and see if the size is what you would
expect for the application you are investigating. Extremely large IO sizes
coupled with high IOPS may cause write cache contention.
·
Read Throughput (I/O/sec) The average number of host read
requests that is passed through the LUN per second. Smaller requests usually
result in a higher read throughput than larger requests.
·
Total Bandwidth (MB/s) The average amount of host data in
Mbytes that is passed through the LUN per second. This includes both read and
write requests. Larger requests usually result in a higher total bandwidth than
smaller requests.
·
Read Bandwidth (MB/s) The average amount of host read data
in Mbytes that is passed through the LUN per second. Larger requests usually
result in a higher bandwidth than smaller requests.
·
Write Bandwidth (MB/s) The average amount of host write data
in Mbytes that is passed through the LUN per second. Larger requests usually
result in a higher bandwidth than smaller requests. Keep in mind that writes
consume many more array resources than reads.
·
Read Size (KB) The average read request size in Kbytes seen by the LUN.
This number indicates whether the overall read workload is oriented more toward
throughput (I/Os per second) or bandwidth (Mbytes/second). For a finer
distinction of I/O sizes, use an IO Size Distribution chart for this LUN.
·
Write Size (KB) The average write request size in Kbytes seen by the
LUN. This number indicates whether the overall write workload is oriented more
toward throughput (I/Os per second) or bandwidth (Mbytes/second). For a finer
distinction of I/O sizes, use an IO Size Distribution chart for the LUNs.
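To make the queue length, busy queue length, and response time guidelines above a little more concrete, here is a minimal Python sketch. It assumes you have exported a list of outstanding-request counts sampled at equal intervals; that input format and all of the function names are hypothetical and not something either tool provides. The response time bands follow my guidelines above, with the boundary values (exactly 10, 20, or 50 ms) assigned by my own choice.

```python
def queue_lengths(samples):
    """Return (average queue length, average busy queue length).

    `samples` is a hypothetical list of outstanding-request counts taken at
    equal intervals within one polling period.
    """
    if not samples:
        raise ValueError("need at least one sample")
    busy = [s for s in samples if s > 0]          # idle samples are dropped
    avg = sum(samples) / len(samples)             # includes idle time
    avg_busy = sum(busy) / len(busy) if busy else 0.0
    return avg, avg_busy


def busy_queue_ok(avg_busy_queue_length, spindles):
    """Guideline: busy queue length should not exceed 2 x spindles."""
    return avg_busy_queue_length <= 2 * spindles


def estimated_response_ms(queue_length, service_time_ms):
    """Response time = queue length * service time."""
    return queue_length * service_time_ms


def classify_response_ms(ms):
    """Map a response time to the guideline bands (boundaries are a choice)."""
    if ms < 10:
        return "very good"
    if ms < 20:
        return "okay"
    if ms <= 50:
        return "slow, needs attention"
    return "I/O bottleneck"


print(queue_lengths([1, 0, 1, 0]))    # (0.5, 1.0)
print(busy_queue_ok(50, 25))          # True
print(classify_response_ms(estimated_response_ms(3, 5.0)))  # okay
```

The first call reproduces the example from the Average Busy Queue Length description: one outstanding request half the time and idle the other half gives an average queue length of ½ and an average busy queue length of 1.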
Below is an explanation of additional
performance metrics that I don’t use as frequently, but I’m including them for
completeness.
·
Forced Flushes/s Number of times per second the cache
had to flush pages to disk to free up space for incoming write requests. Forced
flushes are a measure of how often write requests will have to wait for disk
I/O rather than be satisfied by an empty slot in the write cache. In most well
performing systems this should be zero most of the time.
·
Full Stripe Writes/s Average number of write requests per
second that spanned a whole stripe (all disks in a LUN). This metric is
applicable only to LUNs that are part of a RAID5 or RAID3 group.
·
Used Prefetches (%) The percentage of prefetched data in
the read cache that was read during the last polling interval.
·
Disk Crossing (%) Percentage of host requests that
require I/O to at least two disks compared to the total number of host
requests. A single disk crossing can involve more than two disk drives.
·
Disk Crossings/s Number of times per second that a
request requires access to at least two disk drives. A single disk crossing can
involve more than two disks.
·
Read Cache Hits/s Average number of read requests per
second that were satisfied by either read or write cache without requiring any
disk access. A read cache hit occurs when recently accessed data is
re-referenced while it is still in the cache.
·
Read Cache Misses/s Average number of read requests per
second that did require one or more disk accesses.
·
Reads From Write Cache/s Average number of read requests per
second that were satisfied by write cache only. Reads from write cache occur
when recently written data is read again while it is still in the write cache.
This is a subset of read cache hits which includes requests satisfied by either
the write or the read cache.
·
Reads From Read Cache/s Average number of read requests per
second that were satisfied by the read cache only. Reads from read cache occur
when data that has been recently read or prefetched is re-read while it is
still in the read cache. This is a subset of read cache hits which includes
requests satisfied by either the write or the read cache.
·
Read Cache Hit Ratio The fraction of read requests served
from both read and write caches vs. the total number of read requests. A higher
ratio indicates better read performance.
·
Write Cache Hits/s Average number of write requests per
second that were satisfied by the write cache without requiring any disk
access. Write requests that are not write cache hits are referred to as write
cache misses.
·
Write Cache Misses/s Average number of write requests per
second that did require one or multiple disk accesses. Write requests that
cause forced flushes or that bypass the write cache due to their size are
examples of write cache misses.
·
Write Cache Rehits/s Average number of write requests per
second that were satisfied by the write cache since they had been referenced
before and not yet flushed to the disks. Write cache rehits occur when recently
accessed data is referenced again while it is still in the write cache. This is
a subset of Write Cache Hits.
·
Write Cache Hit Ratio The ratio of write requests that the
write cache satisfied without requiring any disk access vs. the total number of
write requests to this LUN. A higher ratio indicates better write performance.
·
Write Cache Rehit Ratio The ratio of write requests that the
write cache satisfied since they have been referenced before and not yet
flushed to the disks vs. the total number of write requests to this LUN. This
is a measure of how often the write cache succeeded in eliminating a write
operation to disk. While improving the rehit ratio is useful, it is more
beneficial to reduce the number of forced flushes.
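The cache ratios above are simple to derive from the per-second counters if you export them. The sketch below assumes, as the definitions above state, that every read is either a read cache hit or a miss and every write is either a write cache hit or a miss. The function and key names are hypothetical and the sample numbers are made up for illustration.

```python
def ratio(part, total):
    """Safe division: an idle LUN (total of zero) reports a ratio of 0.0."""
    return part / total if total else 0.0


def cache_ratios(read_hits, read_misses, write_hits, write_misses, write_rehits):
    """Derive the hit ratios from per-second counters for one LUN."""
    total_reads = read_hits + read_misses
    total_writes = write_hits + write_misses
    return {
        "read_cache_hit_ratio": ratio(read_hits, total_reads),
        "write_cache_hit_ratio": ratio(write_hits, total_writes),
        # Rehits are a subset of write cache hits, measured against all writes.
        "write_cache_rehit_ratio": ratio(write_rehits, total_writes),
    }


# Made-up sample counters for a single polling interval.
stats = cache_ratios(read_hits=750, read_misses=250,
                     write_hits=900, write_misses=100, write_rehits=450)
for name, value in stats.items():
    print(f"{name}: {value:.0%}")
```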