Subject : Troubleshooting System Hangs

Description :

It can be somewhat difficult to determine whether a system is truly hung. At times a single hung application can make the whole system appear to be hung. This document looks at how to determine whether the system really is hung, how to diagnose it, and what should be checked:

Can the system be reached with rlogin or telnet?
Does the system respond to ping?
Does the mouse move in the window system?
Have any changes or modifications been made to the system recently?
Does the hang occur frequently?
What are the circumstances or symptoms when the hang occurs?
Can the hang be reproduced on command?
What is necessary to get out of the hang (i.e. can the machine be L1-A'd)?

CHECKING FOR A RESOURCE DEPRIVATION HANG
----------------------------------------

Most system hangs are caused by the system running out of some resource. To determine whether this is the case, start by running performance tools. The following is an example of a shell script, run from cron at 15-minute intervals, that gathers the data needed to tell to what degree the system is CPU bound, I/O bound, or memory bound:

date >> file1
vmstat 30 10 >> file1
date >> file2
iostat -xtc 30 10 >> file2
date >> file3
/usr/ucb/ps -aux >> file3
date >> file4
echo kmastat | crash >> file4
date >> file5
echo "map kernelmap" | crash >> file5

CPU Power
---------

In the vmstat output, look at the size of the run queue in the first column. If the run queue holds more than 3 processes per CPU (i.e. 3 on a one-CPU system, 6 on a two-CPU system, and so on), the system bears watching. If the run queue exceeds 5 per CPU, the system is short of CPU power.

Virtual Memory
--------------

If the value in the vmstat swap column keeps falling and does not recover, a memory shortage is a possibility. To check for a shortage of kernel memory, examine the output of the two crash commands (described below). To check for a shortage of application (user) memory, examine the SZ column of the ps output.
This column shows the size of the process's data and stack in kilobytes. If the vmstat swap column keeps falling but then recovers, note the lowest value it reaches. If that value drops below 4000, be concerned about exhaustion of virtual memory space: add swap space to the system.

Physical Memory
---------------

The vmstat sr (scan rate) column is the rate at which pages are scanned in search of pages needed by the current processes. If this value exceeds 200, the system can be judged short of physical memory.

Kernel Memory
-------------

Kernel memory is the region of memory used for kernel data allocations. It is usually called the kernel heap. The maximum size of the kernel heap depends on the machine architecture and the amount of physical memory. If the machine can no longer allocate kernel memory, it will usually hang.

Kernel memory failures can be spotted in the output of the crash kernelmap command. This command reports how many segments of kernel memory remain free and how large each segment is. If only one or two single-page segments are left, the system is running out of kernel memory (even if there were originally many such segments).

The crash kmastat command shows how much memory has been allocated to which bucket. Prior to Solaris 2.4, this showed only 3 buckets, making it difficult to tell which bucket (if any) was hogging the memory. Starting with 2.4, kmastat breaks memory allocation down into many different buckets. If one of these buckets holds several MB of memory and there are kernel memory allocation failures, there is probably a memory leak involving the large bucket. In order to diagnose a memory leak problem it is possible to turn on some flags in Solaris 2.4 and above (see SRDB 12172). With these flags turned on, once the kmastat command shows significant growth in the offending bucket, L1-A should be used to stop the machine and create a core file. SunService can use this core file to help determine the cause of the leak.
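The vmstat thresholds described above (run queue per CPU, the 4000 KB swap floor, and the scan rate of 200) can be checked with a short filter over the collected file1. A minimal sketch in awk, assuming the standard Solaris vmstat column order (r in field 1, swap in field 4, sr in field 12); the CPU count and the sample figures in the here-document are hypothetical, and on a live system the output of "vmstat 30 10" would be fed in instead:

```shell
#!/bin/sh
# Flag vmstat samples that cross the thresholds described above.
# Field positions assume Solaris vmstat output: r=$1, swap=$4, sr=$12.
check_vmstat() {
    awk -v ncpu="${1:-1}" '
        NR <= 2 { next }                    # skip the two header lines
        $1  > 5 * ncpu { print "CPU starved: run queue " $1 }
        $4  < 4000     { print "swap low: " $4 " KB" }
        $12 > 200      { print "memory short: scan rate " $12 }
    '
}

# Hypothetical sample: first interval is healthy, second is starved.
check_vmstat 1 <<'EOF'
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0  42000  5200   0  12  3  0  0  0 10  1  0  0  0  120  300  80 10  5 85
 8 0 0   3600   900   0  90 40 55 60  0 310 5  0  0  0  400 1200 300 60 35  5
EOF
```

Only the second sample line is flagged, once for each of the three thresholds it crosses.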
Disk I/O
--------

To check for disks which are overly busy, look at the iostat output. The columns of interest are %b (% of time the disk is busy) and svc_t (average service time in milliseconds). If %b is greater than 20% and svc_t is greater than 30ms, the disk is busy. If there are other disks which are not busy, the load should be balanced across them. If all disks are this busy, additional disks should be considered. There is no direct way to check for an overloaded SCSI bus, but if the %w column (% of time transactions are waiting for service) is greater than 5%, the SCSI bus may be overloaded.

Information about what levels to check for the various performance statistics is taken from "Sun Performance and Tuning" by Adrian Cockcroft, ISBN 0-13-149642-5. Additional performance gathering scripts are available in Infodoc 2242 for Solaris 2.x and Infodoc 11365 for SunOS 4.x.

GENERATING CORE FILES
---------------------

If looking at the performance statistics is not enough to diagnose the problem, it is necessary to get a core file. Infodoc 11837 describes how to do this. If it is not possible to get a core file, the situation is called a hard hang. Contact SunService for information on diagnosing hard hang situations.

Analyzing system hang core files
--------------------------------

Once a core file has been obtained, the first thing to look at is a threadlist, generated with the adb command:

$ adb -k unix.NUM vmcore.NUM | tee threadlist.NUM
physmem xxxxx
$
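The Disk I/O thresholds given earlier (%b over 20% together with svc_t over 30ms, and %w over 5%) can be screened for in file2 with the same kind of filter. A minimal sketch in awk, assuming the Solaris "iostat -xtc" extended column layout (device, r/s, w/s, kr/s, kw/s, wait, actv, svc_t, %w, %b); the device names and figures in the here-document are hypothetical:

```shell
#!/bin/sh
# Flag disks that exceed the busy-disk thresholds described above.
# Field positions assume "iostat -xtc" extended output:
#   $1=device ... $8=svc_t $9=%w $10=%b
check_iostat() {
    awk '
        $8 ~ /^[0-9]/ {                     # data lines only, not headers
            if ($10 > 20 && $8 > 30)
                print $1 ": busy (" $10 "% busy, svc_t " $8 " ms)"
            if ($9 > 5)
                print $1 ": possible SCSI bus overload (%w = " $9 ")"
        }
    '
}

# Hypothetical sample: sd0 is idle, sd3 crosses both thresholds.
check_iostat <<'EOF'
                 extended disk statistics
disk      r/s  w/s   kr/s  kw/s wait actv  svc_t  %w  %b
sd0       1.0  2.0    8.0  16.0  0.0  0.1   12.0   0   4
sd3      40.0 35.0  320.0 280.0  1.2  2.5   48.0   8  62
EOF
```

Here only sd3 is reported, once as a busy disk and once as a possible SCSI bus overload; its load should be balanced onto quieter disks as described above.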