Subject : Troubleshooting System Hangs

Description :


When a system appears to hang, it can be somewhat difficult to determine
whether the system is really hung.  Sometimes a single hung application
can make the whole system look hung.  The following questions help
establish whether the system is really hung, how it can be diagnosed,
and what should be checked:

Can you rlogin or telnet to the system?

Does the system respond to ping?

Does the mouse still move in the window system?

Have any changes or modifications been made to the system recently?

Does the hang occur frequently?

What are the circumstances or symptoms when the hang occurs?

Can the hang be reproduced on command?

What is necessary to get out of the hang (i.e. can the
machine be L1-A'd)?


CHECKING FOR A RESOURCE DEPRIVATION HANG
----------------------------------------

Most system hangs are caused by resource deprivation: the system can no
longer obtain some resource it needs.  To determine whether this is the
case, start by running performance tools and examining their output.

The following is an example of a shell script, run from cron every
15 minutes, used to determine how CPU bound, I/O bound, or memory bound
the system is.

# Each tool appends to its own file so successive samples can be compared.
date >> file1
vmstat 30 10 >> file1                   # run queue, swap, free memory, scan rate
date >> file2
iostat -xtc 30 10 >> file2              # per-disk %b, svc_t, %w
date >> file3
/usr/ucb/ps -aux >> file3               # per-process SZ (data + stack, in KB)
date >> file4
echo kmastat | crash >> file4           # kernel memory allocator buckets
date >> file5
echo "map kernelmap" | crash >> file5   # free kernel heap segments

CPU Power
---------

In the output of the vmstat command, look at the size of the run queue
in the first column.  If there are 3 or more processes on the run queue
per CPU (i.e. 3 for a one-CPU system, 6 for a two-CPU system, and so
on), the system should be watched.  If the run queue stays at 5 or more
per CPU, the system is short on CPU power.
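
A minimal sketch of pulling just the run queue out of vmstat, assuming
the standard Solaris layout where the run queue (r) is the first column
and the first two lines are headers:

# Illustrative: print the run queue from 10 thirty-second samples;
# compare the values against 3 and 5 times the number of CPUs
vmstat 30 10 | awk 'NR > 2 { print $1 }'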

Virtual Memory
--------------

If the value in the vmstat swap column keeps going down and does not
recover, a memory shortage should be considered.  To check for a kernel
memory shortage, look at the output of the two crash commands (described
below).  To check whether applications are short of user memory, look at
the SZ column of the ps output; this column shows the size of each
process's data and stack in kilobytes.

If the vmstat swap column keeps decreasing but then recovers, note the
lowest value it reached.  If that value drops below 4000, exhaustion of
virtual memory space is a concern; add more swap space to the system.
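
Both checks can be scripted.  A minimal sketch, assuming the standard
Solaris vmstat layout (swap is the fourth column) and the /usr/ucb/ps
-aux format (SZ is the fifth column):

# Illustrative: lowest swap value seen across 10 thirty-second samples
vmstat 30 10 | awk 'NR > 2 { print $4 }' | sort -n | head -1
# Illustrative: the five largest processes by SZ (data + stack, in KB)
/usr/ucb/ps -aux | awk 'NR > 1 { print $5, $1, $11 }' | sort -n | tail -5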

Physical Memory
---------------

The sr (scan rate) column of vmstat shows the rate at which pages are
being scanned to find pages needed by the current processes.  If this
value stays above 200, the system is short on physical memory.
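
A minimal sketch for watching the scan rate, assuming sr is the twelfth
column of the standard Solaris vmstat output:

# Illustrative: print the sr (scan rate) column; sustained values above
# 200 point to a physical memory shortage
vmstat 30 10 | awk 'NR > 2 { print $12 }'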

Kernel Memory
-------------

Kernel memory is the region of memory used for kernel data allocations;
it is usually called the kernel heap.  The maximum size of the kernel
heap depends on the machine architecture and the amount of physical
memory.  If the machine cannot allocate any more kernel memory, it will
usually hang.  Kernel memory exhaustion can be detected from the output
of the crash kernelmap command, which shows how many free segments of
kernel memory remain and how large each segment is.  If only segments
of one or two pages are left, the machine is running out of kernel
memory (even if many such small segments remain).
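
The command is the same one collected by the cron script above; run by
hand against the live kernel (as root) it looks like this:

# Illustrative: list the free kernel heap segments; a list consisting
# only of one- and two-page segments means the heap is nearly exhausted
echo "map kernelmap" | crash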

The crash kmastat command shows how much memory has been allocated to
which bucket.  Prior to Solaris 2.4, this showed only 3 buckets, making
it difficult to tell which bucket (if any) was hogging the memory.
Starting with 2.4, kmastat breaks memory allocation down into many
different buckets.  If one of these buckets holds several MB of memory,
and there are kernel memory allocation failures, there is probably
a memory leak involving the large bucket.
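
One way to spot a growing bucket is to save kmastat snapshots
periodically and compare them.  A minimal sketch; the /var/tmp path is
only an example:

# Illustrative: keep timestamped kmastat snapshots for later comparison
echo kmastat | crash > /var/tmp/kmastat.`date '+%m%d.%H%M'`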

In order to diagnose a memory leak problem it is possible to turn
on some flags in Solaris 2.4 and above (see SRDB 12172).  With
these flags turned on, once the kmastat command shows significant
growth in the offending bucket, L1-A should be used to stop the 
machine and create a core file.  SunService can use this core file
to help determine the cause of the leak.
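
SRDB 12172 documents the exact flags.  On Solaris 2.4 and later the
kernel memory allocator debugging flags are set through the kmem_flags
variable in /etc/system; a commonly used setting is shown below, but
verify the value against SRDB 12172 before using it:

* Illustrative /etc/system entry; takes effect after a reboot
set kmem_flags=0xf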

Disk I/O
--------

To check for disks which are overly busy, look at the iostat output.  The
columns of interest are %b (% of time the disk is busy) and svc_t (average
service time in milliseconds).  If %b is greater than 20% and svc_t
is greater than 30ms, this is a busy disk.  If there are other disks which
are not busy, the load should be balanced.  If all disks are this busy,
additional disks should be considered.

There is no direct way to check for an overloaded SCSI bus, but if the %w
column (% of time transactions are waiting for service) is greater than 5%,
then the SCSI bus may be overloaded.
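
A minimal sketch for picking out busy disks, assuming the Solaris 2.x
extended iostat layout where svc_t, %w, and %b are the last three
columns of each disk line:

# Illustrative: flag disks with %b above 20 and svc_t above 30 ms
iostat -x 30 10 | awk '$NF+0 > 20 && $(NF-2)+0 > 30 { print $1, "svc_t=" $(NF-2), "%b=" $NF }'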

Information about what thresholds to check for the various performance
statistics is taken from "Sun Performance and Tuning" by Adrian Cockcroft,
ISBN 0-13-149642-5.

Additional performance-gathering scripts can be obtained from Infodoc 2242
for Solaris 2.x and Infodoc 11365 for SunOS 4.x.


GENERATING CORE FILES
---------------------

If looking at the performance statistics is not enough to diagnose the
problem, it is necessary to get a core file.  Infodoc 11837 describes
how to do this.

If it is not possible to get a core file, then the situation is called
a hard hang.  Contact SunService for information on diagnosing hard
hang situations.

Analyzing system hang core files
--------------------------------

Once a core file is obtained, the first information to look at is a
threadlist generated by the adb command.

$ adb -k unix.NUM vmcore.NUM | tee threadlist.NUM
physmem xxxxx
$