--// 昨天一边看一边使用金山词霸,翻译这篇文章,可能存在很多错误。希望对大家有用。 http://tech.e2sn.com/oracle/troubleshooting/latch-contention-troubleshooting Systematic Latch Contention Troubleshooting in Oracle By Tanel Poder This article was originally published at IOUG Select journal Q1 2010. Thanks to John Kanagaraj for editing the IOUG Select article and pushing me for deadlines :-) --//这篇文章最初发表在IOUG选择杂志2010年第一季度。 感谢约翰·卡纳加拉杰编辑了"IOUG选择"文章,并催促我完成截止日期:-) Thanks to IOUG team for giving me the IOUG Select Journal Editor's Choice award (2011) for this article!!! --//感谢IOUG团队为这篇文章授予我IOUG选择期刊编辑选择奖(2011年)! ! Contents 1 Systematic Latch Contention Troubleshooting in Oracle 1.1 What is a latch? 1.2 Systematic troubleshooting of latch contention 1.3 Question 1 - Who is trying to get the latch and why? 1.4 Question 2 - Who is holding the latch and why? 1.5 Using LatchProf to find who are holding latches the most 1.6 Using LatchProfX for finding the hot block causing cache buffers chains latch contention 1.7 Limitations of V$LATCHHOLDER and LatchProf(X) 1.8 Download LatchProf and LatchProfX /*+ 1 Oracle系统故障排除 1.1什么是闩锁? 1.2闩锁争用系统故障排除 1.3问题1-谁试图获得闩锁,为什么? 1.4问题2-谁拿着闩锁,为什么? 1.5利用Latchprof 找出拿着锁的人最多 1.6使用Latch ProfX查找导致缓存缓冲链锁存争用的热块 1.7 V$LATCHOLDER和LatchProf(X)的限制 1.8下载LatchProf and LatchProfX */ Introduction As an Oracle DBA, developer or performance analyst, you may have run into what is termed "latch contention" at various points. So what exactly is a "latch" and why do we have contention on this "latch". In this article, we will take an in-depth look at latches and how we determine and resolve such contention. Whether you are a newbie or an experienced old-timer, we hope this article will cast a little more light on this ill-understood subject. --//作为一名OracleDBA、开发人员或性能分析师,您可能在不同的时候遇到了所谓的"锁存争用"。 那么,什么是"锁存器",为什么我们 --//对这个"锁存器"有争议。 在本文中,我们将深入研究锁存器以及我们如何确定和解决这种争论。 无论你是新手还是经验丰富的老手 --//,我们希望这篇文章能对这个不被理解的话题多一点启发。 What is a latch? I will put it very simply: Latch is a lock. In fact, a latch is just another special type of lock in Oracle. Oracle uses many different types of locks internally, for example enqueue locks, library cache locks, library cache pins, buffer cache pins, mutexes and latches and these different lock types are useful for different purposes. Enqueue locks are sophisticated, provide ordered queuing capabilities and are used for cases where the locks are held for relatively long durations. Latches on the other hand are much less sophisticated, more lightweight and are used for protecting very short operations on memory structures such various internal linked list modifications, shared pool memory allocation, library cache object lookups and so on. Latches are very low-level locks managed deep inside Oracle kernel code and users or applications can not directly acquire nor release them. --//我会简单地说:Latch是一把锁。 事实上,栓锁只是Oracle中另一种特殊类型的锁。 Oracle内部使用许多不同类型的锁,例如 --//enqueue locks、library cache locks, library cache pins, buffer cache pins、互斥锁和锁存,这些不同的锁类型对于不同的 --//目的是有用的。 Enqueue locks是复杂的,提供有序的排队功能,并用于锁持有相对较长时间的情况。 另一方面,锁存器不那么复 --//杂,更轻,并且用于保护内存结构上非常短的操作,例如各种内部链表修改、共享池内存分配、库缓存对象查找等。 锁存器是在 --//Oracle内核代码的深层管理的非常低级的锁,用户或应用程序不能直接获取或释放它们。 So, how is a latch physically implemented? Every latch is just a memory structure in SGA, usually 100-200 bytes in size, depending on your Oracle version, hardware platform and whether you are running 32 or 64-bit Oracle. --//那么,锁存器是如何物理实现的呢? 每个锁存器只是SGA中的内存结构,通常大小为100-200字节,这取决于您的Oracle版本、硬件 --//平台以及运行32位还是64位Oracle。 Contrary to a common misconception, latches are in no way related to OS synchronization mechanisms such as mutexes or semaphores! Latches are just plain memory structures in SGA and the OS doesn't even have an idea that latches exist. Taking a latch does not require any system calls or OS interaction. If that were the case, latch operations would slow down considerably as such system calls are expensive. If a process wants to take a latch, it just reads latch memory to check whether the latch value is zero (i.e. latch is not taken) and if it is zero, then it changes the value to indicate that the latch is taken. For "exclusive-get-only" latches, the value will be hardcoded, such as "0xFF" on Intel platforms, but for "shared-get-capable" latches, the acquiring process will put its own Oracle PID in there if it takes that latch in exclusive mode. If a shared latch is held by many processes simultaneously then the latch value will just show the reference count (the number of shared mode holders) and tracking individual processes is done via state objects memory structures. --//与常见的误解相反,锁存器与OS同步机制(如互斥或信号量)没有任何关系!锁存只是SGA中的普通内存结构,操作系统甚至没有存在 --//锁存的想法。 使用锁存器不需要任何系统调用或OS交互。 如果是这样的话,锁存操作将大大减慢,因为这样的系统调用是昂贵的。 --//如果一个进程想要接受一个锁存器,它只是读取锁存内存来检查锁存值是否为零(即。 锁存器不被取),如果它是零,那么它改变值 --//以指示锁存器被取。 对于"独家获取唯一"锁存器,该值将被硬编码,例如英特尔平台上的"0x FF",但对于"共享获取能力"锁存器, --//如果它以独占模式接受该锁存器,则获取过程将将将其自己的Oracle PID放在其中。如果一个共享锁存器同时被许多进程保持,那么 --//锁存值将只显示参考计数(共享模式保持器数),并且跟踪单个进程是通过状态对象内存结构完成的。 So, if a latch is just some memory content in SGA, modified by processes at will, who then guarantees that two processes (running on different CPUs) don't accidentally take the same latch simultaneously, without knowing about each other? This would be a race condition, and if allowed, the Oracle instance could quickly end up in a crash or a hang. The solution lies in atomic "test-and-set" and "compare-and-swap" (CAS) instructions which all modern hardware supports. This allows the Oracle kernel code to perform latch value checking and changing in a single atomic instruction. On some platforms (such as Intel), this needs to be combined with a special flag which locks the memory line where the latch lives for a single CPU access only for the duration of the latch get. This is done with LOCK CMPXCHG instruction on Intel platform. CMPXCHG means Compare and Exchange and the LOCK prefix guarantees that only one CPU at a time can access that latch (remember, latches are usually held for very short periods, measured as low as just microseconds). --//那么,如果一个锁存器只是SGA中的一些内存内容,由进程随意修改,那么谁又保证两个进程(在不同的CPU上运行)不会意外地同时接 --//受同一个锁存器,而不会互相了解呢? 这将是一个竞赛条件,如果允许,Oracle实例可能会很快以崩溃或挂起告终。 解决办法在于 --//所有现代硬件都支持的原子"测试和设置"和"比较和交换"(CAS)指令。 这允许Oracle内核代码在单个原子指令中执行锁存值检查和更 --//改。在一些平台(如Intel)上,这需要与一个特殊的标志相结合,该标志锁定了锁存器仅在锁存器获取的持续时间内为单个CPU访问而 --//生活的内存线。这是用英特尔平台上的LOCKCMP X CHG指令完成的。 CMPX CHG的意思是比较和交换,LOCK前缀保证每次只有一个CPU --//可以访问该锁存器(记住,锁存器通常保持很短的时间,测量低到只有微秒)。 Since the Oracle SGA holds lots of information in various types of memory structures, it is not enough to have just one latch for each type of memory structure to be protected. For example, if you have an 8 GB buffer cache for 8 KB blocks, then you have one million buffers in the buffer cache. Reasonable concurrent access to all these million buffers can't be provided by a single latch (in this case a "cache buffers chains" latch). As well, the library cache holds thousands if not hundreds of thousands of cursors and objects in it, reasonable access to all of them would not be possible with a single "library cache" latch, especially if you have lots of CPUs in your database server box. Therefore, Oracle has split the memory structures that need protection into logical sub-pieces and each piece is protected by a separate sub-latch known as a child latch. Some structures such redo log buffer are physically split into ranges and each range is protected by (and accessed using) a separate latch (redo allocation latch for log buffer). In some cases,an algorithm such as a hash or modulus function is used for determining which exact child latch should be used when accessing an object. For example, if you run a database with large buffer cache (100GB+), you may have millions of cache buffers chains child latches in use. When accessing a data buffer block, Oracle will run a modulus function on the data block address (DBA) of the block to determine which cache buffers chains latch to get to satisfy that access. The right library cache latch is also determined by running modulus function on the hash value of library cache object name (the hash_value column in V$SQL for example). --//由于OracleSGA在各种类型的内存结构中保存了大量信息,因此只有一个锁存器来保护每种类型的内存结构是不够的。例如,如果您 --//有8KB块的8GB缓冲区缓存,那么缓冲区缓存中有100万个缓冲区。 对所有这些百万缓冲器的合理并发访问不能由一个锁存器提供(在 --//这种情况下是一个"缓存缓冲链"锁存器)。 此外,库缓存如果不包含几十万个游标和对象,那么使用一个"库缓存"锁存器就不可能合 --//理地访问所有游标和对象,特别是如果数据库服务器框中有大量CPU。 因此,Oracle将需要保护的内存结构拆分为逻辑子块,每个块 --//都由一个单独的子锁存器保护,称为子锁存器。一些结构,如重做日志缓冲区,物理上被分割成范围,每个范围由(并使用)一个单独 --//的锁存器(日志缓冲区的重做分配锁存器)保护。 在某些情况下,使用哈希或模数函数等算法来确定在访问对象时应该使用哪个精确 --//的子锁存器。 例如,如果运行具有大缓冲缓存(100GB)的数据库,则可能有数百万个缓存缓冲区链子锁存器正在使用。 在访问数据 --//缓冲块时,Oracle将在块的数据块地址(D BA)上运行模数函数,以确定要获得哪些缓存缓冲链锁存以满足该访问。 正确的库缓存锁 --//存器也是通过在库缓存对象名称的哈希值上运行模数函数来确定的(例如V$SQL中的hash_value列).. Since latches are held for very short periods, if a latch is busy due some other process, then the acquiring process doesn't give up immediately when it cannot get the latch on the first try. If a latch is busy, the acquiring process tries again few thousand times instead of going to sleep (and wait) immediately. The reasoning here is that if a latch is currently busy, it normally will be freed very quickly, so instead of taking the trouble of going to sleep (requires a syscall and context switch) we "spin" and burn some more CPU and test with the hope of getting the latch some (hundreds) of microseconds later. In other words, some CPU time is sacrificed but ultimately this will save time as it may avoid the needing for the process to "sleep" or suspend itself. This is called spinning (busy-waiting). Oracle registers a "latch free" wait event only after the process has failed to acquire the latch after the initial spinning and goes to sleep. --//由于锁存器被保持很短的时间,如果一个锁存器由于其他过程而繁忙,那么当它无法在第一次尝试中获得锁存时,获取过程不会立即 --//放弃。 如果锁存器繁忙,获取过程将再次尝试几千次,而不是立即睡觉(并等待)。 这里的推理是,如果一个锁存器目前很忙,它通 --//常会很快被释放,所以我们不用费心去睡觉(需要一个系统和上下文开关),而是"旋转",并燃烧更多的CPU和测试,希望得到一些(数 --//百)微秒后的锁存器。 换句话说,牺牲了一些CPU时间,但最终这将节省时间,因为它可能避免进程需要"睡眠"或暂停自己。 这就是 --//所谓的旋转(忙-等待)。 Oracle只在进程在初始旋转后未能获得锁存并进入睡眠后才会注册一个"无锁存器""等待事件"。 Note that some latch get operations explicitly don't perform a "spin-get" and hence the latch get will fail immediately after the first attempt to get it didn't succeed. This technique is used for some latches where Oracle doesn't care which exact child latch to take and is called immediate latch gets or "not willing to wait" latch gets. The "redo copy" latch is one such example as this helps the LogWriter (LGWR) process to know whether anyone is currently copying redo data to log buffer, but not who exactly is copying and where, as this does not matter to LGWR. --//请注意,一些锁存器get操作显式地不执行"spin-get",因此在第一次尝试获取失败后,锁存器get将立即失败。 这种技术用于一些 --//锁存器,其中Oracle不关心要采取哪一个精确的子锁存器,称为即时锁存get或"不愿意等待"锁存get。 "重做复制"锁存器就是这样 --//的一个例子,因为这有助于日志写手(LGWR)进程知道是否有人正在将重做数据复制到日志缓冲区,但不知道到底是谁在复制以及在哪 --//里,因为这对LGWR来说并不重要。 Systematic troubleshooting of latch contention --//锁存争用的系统故障排除. So after this brief introduction to latching in Oracle, let's see why latch contention occurs and how to troubleshoot it. Remember – essentially, a latch is a type of lock. It allows controlling concurrent access to some SGA memory structures so we wouldn't end up with race conditions, SGA corruptions and crashes. So, take note of this simple, but important concept in latch (or any lock) contention: Such latch contention occurs only when there is some other process already holding the latch which our process wants to acquire! --//因此,在简要介绍Oracle中的锁存之后,让我们看看为什么会发生锁存争用以及如何排除它。 记住-本质上,锁存器是一种锁。 它 --//允许控制对某些SGA内存结构的并发访问,这样我们就不会最终遇到竞争条件、SGA损坏和崩溃。 因此,请注意这个简单但重要的概 --//念,在锁存(或任何锁)争用:这样的锁存争用只有当有其他进程已经持有锁存,我们的进程想要获得! In other words. latch contention can only happen when these two conditions are satisfied: --//换句话说。 锁存争用只有在满足这两个条件时才能发生: 1) Someone must try to get the latch 2) Someone else must be already holding the latch --//1)一定有人要把闩锁 --//2)一定是有人拿着闩锁 Latch contention means that some process is forced to sleep as it cannot acquire the latch it wants due someone else already holding it. There will be no latch contention if both of the above conditions aren't met! Therefore, whenever we see latch contention, we need to find out two things: --//锁存争用意味着某些进程被迫休眠,因为它无法获得它想要的锁存器,因为其他人已经持有它。 如果上述两个条件都不满足,就不 --//会有锁存争用! 因此,当我们看到锁存争用时,我们需要找出两件事: 1) Who is trying to get the latch (and why)? 2) Who is already holding the latch (and why)? --//1)谁想得到闩锁(为什么)? --//2)谁已经拿着闩锁了(为什么)? These two questions need to be asked for any lock contention and is applicable here as well. The following sections explain how to answer these questions in Oracle. --//这两个问题需要询问任何锁争用,也适用于这里。 以下各节解释如何在Oracle中回答这些问题。 Question 1 - Who is trying to get the latch and why? The first question can be answered with the help of any session-level tool based on the Oracle Wait Interface, such as the V$SESSION_WAIT view, Active Session History or SQL_TRACE. All these tools would let you know for which exact (child) latch address the wait occurred. --//第一个问题可以在基于OracleWait接口的任何会话级别工具的帮助下回答,例如V$SESSION_WAIT视图、活动会话历史记录或 --//SQL_TRACE。 所有这些工具都会让您知道等待发生的确切(子)锁存器地址。 Note that I did not mention instance-level performance data collection tools such as Statspack or AWR reports in the above list. The problem with these tools is that they only gather and store system-wide performance data such V$SYSSTAT or V$SQL snapshots, and thus they don't know anything about which exact sessions are experiencing the latch contention and waits. These tools would show that someone is experiencing latch contention problems in the database instance, but finding out who and exactly why would involve a significant amount of guesswork, previous experience and luck. Thanks to this information gap, there currently are no systematic approaches to latch contention troubleshooting published and hence such troubleshooting is still perceived as a very complex task. This article aims to change this as there are ways to fill this information gap, which allows us to become systematic and not depend on luck. --//请注意,我没有在上面的列表中提到实例级性能数据收集工具,如Statspack或AWR报告。 这些工具的问题是,它们只收集和存储全 --//系统性能数据,如V$SYSSTAT或V$SQL快照,因此它们不知道哪些确切的会话正在经历锁存争用和等待。 这些工具将显示某人在数据 --//库实例中遇到锁存争用问题,但找出谁和为什么会涉及大量的猜测、以前的经验和运气。 由于这种信息差距,目前还没有系统的方 --//法来锁定争用故障排除发布,因此这种故障排除仍然被认为是一项非常复杂的任务。 本文旨在改变这一点,因为有办法填补这一信 --//息空白,这使我们能够成为系统的,而不是依赖于运气。 There are two common trigger points for latch contention troubleshooting. One is that a DBA sees (from Statspack or AWR report or other monitoring tools) that a significant amount of instance-wide response time is spent waiting for some latch (from the "Top 5 Timed events" section for example). The other trigger point would start from the end user (who's complaining) or some measured business task which takes too long. In this case the DBA would identify the session(s) used by that user or task and use some session-level tool (as mentioned above) for quantifying where the response time is spent. --//锁存争用故障排除有两个常见的触发点。 一个是DBA看到(从Statspack或AWR报告或其他监控工具)大量的实例响应时间被花费在等待 --//一些锁存(例如,从"前5名定时事件"部分)。 另一个触发点将从最终用户(谁在抱怨)或一些测量的业务任务开始,这需要太长时间。 --//在这种情况下,DBA将确定该用户或任务使用的会话,并使用某些会话级别的工具(如上文所述)量化响应时间的使用地点。 Question 2 - Who is holding the latch and why? The second question can be easily answered (starting in Oracle 8.0) by querying the V$LATCHHOLDER view which lists any process/session currently holding a latch. A description is shown below: --//第二个问题可以很容易地回答(从Oracle8.0开始),方法是查询V$LATCHOLDER视图,其中列出了当前持有锁存的任何进程/会话。 说 --//明如下: SQL> describe v$latchholder Name Null? Type ------ ----- ------------- PID NUMBER SID NUMBER LADDR RAW(4) NAME VARCHAR2(64) GETS NUMBER Although little known, this is an extremely useful view for systematic latch contention troubleshooting as it provides one important bit of information which no other Oracle view or trace provides - the session ID of the latch holder. Once you know the holder's SID, you can focus on that session and see exactly what it is doing that it needs to hold the latch for so long. --//虽然鲜为人知,但这是一个非常有用的视图,用于系统锁存争用故障排除,因为它提供了一个重要的信息,没有其他Oracle视图或跟 --//踪提供-锁存器的会话ID。 一旦你知道了持有人的SID,你就可以专注于那个会话,看看它到底在做什么,它需要保持锁存这么长时 --//间。 Columns of interest are SID (Session ID), NAME (Latch name), LADDR (Latch address in memory). LADDR is the latch's unique identifier in an instance and helps to determine which exact child latch was held in case the latch has multiple child latches. --//感兴趣的列是SID(会话ID)、名称(锁存名称)、LADDR(内存中的锁存地址)。 在实例中,LADDR是锁存器的唯一标识符,并有助于确定 --//在锁存器有多个子锁存器的情况下持有哪个精确的子锁存器。 As I said earlier, latches are usually held for very short durations, so manually querying this view may not return anything at all even after querying it multiple times. For this reason, I have written a tool for high-frequency profiling of latch holders which I will introduce now. It's called LatchProf and is a plain SQL script which doesn't require any modifications of the database or schema. You can download it for free from my website at http://blog.tanelpoder.com – just search for "latchprof" using the search button on the right side of the page. --//正如我前面所说,锁存器通常保持非常短的持续时间,因此手动查询此视图可能根本不会返回任何东西,即使在多次查询它之后。 --//出于这个原因,我已经编写了一个工具的高频分析锁存器,我现在将介绍。 它被称为LatchProf,是一个普通的SQL脚本,不需要对 --//数据库或模式进行任何修改。 您可以从我的网站免费下载它在http://blog.tanelpoder.com-只需搜索"latchprof"使用搜索按钮在 --//页面的右侧。 Using LatchProf to find who are holding latches the most --//利用Latchprof找出拿着锁的人最多 Let's assume that we have noticed shared pool and library cache latch contention, reported by a Statspack/AWR report or some monitoring tool. Remember what I mentioned earlier - latch and any other lock contention occurs only if someone is already holding the latch while I'm trying to get the same. Instead of guessing who are the troublemakers, I can just run LatchProf which samples V$LATCHHOLDER at high frequency and reports the top latch holders. --//假设我们已经注意到共享池和库缓存锁存争用,由Statspack/AWR报告或一些监控工具报告。 记住我前面提到的-锁存器和任何其他 --//锁争用只有当有人已经拿着锁,而我试图得到同样的。 我不是猜测谁是麻烦制造者,而是可以运行LatchProf,哪些样本 --//V$LATCHOLDER在高频,并报告顶部锁存器持有人。 I could start by looking which latch types are the busiest in the system. Note that as there are many latch types in Oracle and they also have lots of child latches then the aggregate "busyness" of a latch does not necessarily reflect the (child) latch with the highest contention. While the Oracle Wait Interface is the only correct source for detecting contention, LatchProf should be used for drilldown once the latch experiencing contention has been detected. --//首先,我可以看看系统中哪些锁存类型是最繁忙的。 请注意,由于Oracle中有许多锁存类型,并且它们也有许多子锁存器,那么锁 --//存器的聚合"繁忙"并不一定反映具有最高争用的(子)锁存器。 虽然Oracle等待接口是检测争用的唯一正确来源,但一旦检测到锁存 --//体验争用,则应使用LatchProf进行钻取。 To illustrate the capabilities of LatchProf, I start from a simple example which shows an overview of how busy different latch types are. By "busy" I mean how much of the time these latches are held by somebody. --//为了说明LatchProf的功能,我从一个简单的例子开始,它概述了不同锁存类型有多忙。我指的是这些闩锁有多少时间是由某人持有的。 LatchProf takes four parameters: Parameter 1 specifies which columns from V$LATCHHOLDER to report and group by. In the case below I just want to report latch holds by latch name (and not even break it down by SID for starters). --//参数1指定从V$LATCHOLDER报告和分组的列。 在下面的情况下,我只想报告锁存器持有的锁存器名称(甚至没有打破它的SID为启动)。 Parameter 2 specifies which SIDs to monitor. In the case below, I am interested in any SID which holds a latch (%). --//参数2指定要监视哪些SID。 在下面的情况下,我对任何持有锁存(%)的SID感兴趣。 Parameter 3 specifies which latches to monitor. This can be set either to latch name or latch address in memory. In the case below, I monitor all latches (%). --//参数3指定要监视的锁存器。 这可以设置为内存中的锁存名称或锁存地址。 在下面的情况下,我监视所有锁存器(%)。 Parameter 4 specifies how many times to sample V$LATCHHOLDER. I use 100000 samples below, which completed in a couple of seconds on my test database. The sampling speed depends on your server CPU/memory bus speed and the value of processes parameter. You should start from lower number like 1000 and adjust it so that LatchProf would complete its sampling in a couple of seconds, and that is usually enough for diagnosing ongoing latch contention problems. You shouldn't keep sampling for long periods since LatchProf runs constantly on the CPU. --//参数4指定采样V$LATCHOLDER的次数。 我使用下面的10万个样本,在我的测试数据库上几秒钟就完成了。 采样速度取决于服务器 --//CPU/内存总线速度和进程参数的值。 您应该从较低的数字开始,如1000,并调整它,以便LatchProf将在几秒钟内完成其采样,这通 --//常足以诊断正在进行的锁存争用问题。 您不应该长期保持采样,因为LatchProf在CPU上不断运行。 So, let's run LatchProf now. After few seconds of sampling, LatchProf returns the above profile: --//所以,让我们现在运行Latch教授。 经过几秒钟的采样,LatchProf返回上述配置文件: SQL> @latchprof name % % 100000 -- LatchProf 1.20 by Tanel Poder ( http://www.tanelpoder.com ) NAME Held Gets Held % Held ms Avg hold ms ----------------------------------- ---------- ---------- ------- ----------- ----------- library cache 3319 2815 3.32 91.273 .032 shared pool 1694 1131 1.69 46.585 .041 library cache lock 580 580 .58 15.950 .028 shared pool simulator 482 100 .48 13.255 .133 kks stats 269 258 .27 7.398 .029 enqueues 158 158 .16 4.345 .028 library cache pin 67 67 .07 1.843 .028 enqueue hash chains 62 62 .06 1.705 .028 row cache objects 4 4 .00 .110 .028 shared pool sim alloc 2 2 .00 .055 .028 10 rows selected. The output column meanings are as follows: Name - Latch name Held - During how many samples out of total samples (100000) the particular latch was held by somebody Gets - How many latch gets against that latch were detected during LatchProf sampling Held % - How much % of time was the latch held by somebody during the sampling. This is the main column you want to be looking at in order to see who/what holds the latch the most (the latchprof output is reverse-ordered by that column) Held ms - How many milliseconds in total was the latch held during the sampling Avg hold ms - Average latch hold time in milliseconds (normally latches are held from a few to few hundred microseconds) Note that LatchProf works on Oracle versions 9.2 and higher, but the Gets and Avg hold ms columns are shown only on Oracle 10g and higher as the required data isn't externalized in the V$LATCHHOLDER until 10g. --//请注意,LatchProf工作在Oracle版本9.2及更高版本上,但获取和Avg保持ms列仅显示在Oracle10g和更高版本上,因为所需数据在 --//V$LATCHHOLDER中不会外化到10g。 So, from the above output we can see that the library cache latches were the busiest during LatchProf run (3.32% of the sampling time there was some library cache latch held by some session). --//因此,从上面的输出我们可以看到,库缓存锁f运行期间,库缓存锁存器是最忙的(3.32%的采样时间有一些库缓存锁存器由某个会话 --//持有)。 When a latch is busy, one obvious question would be: "which sessions are holding the latch the most?" Sometimes all the sessions in an instance are both victims and troublemakers (each contributing to the latch contention problem just a little), but sometimes there are one (or a few) heavy troublemakers and most other sessions end up as victims. --//当一个锁存器繁忙时,一个明显的问题是:"哪个会话最能握住锁存器?"有时,一个实例中的所有会话都是受害者和麻烦制造者(每 --//个会话对锁存争用问题只有一点),但有时有一个(或几个)严重的麻烦制造者,大多数其他会话最终都是受害者。 Luckily LatchProf allows us to break down the latch holder statistics by SID and since I am interested only in the library cache latches for now, I can monitor only this latch, as shown in bold below: --//幸运的是,LatchProf允许我们用SID分解锁存器的统计数据,并且由于我现在只对库缓存锁存器感兴趣,所以我只能监视这个锁存器 --//,如下粗体所示: SQL> @latchprof sid,name % "library cache" 100000 -- LatchProf 1.20 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME Held Gets Held % Avg hold ms --- ------------------ ---- ---- ------ ----------- 120 library cache 5059 4618 5.06 .037 120 library cache lock 189 188 .19 .034 120 library cache pin 135 135 .14 .034 139 library cache lock 131 131 .13 .034 139 library cache 109 109 .11 .034 113 library cache lock 60 60 .06 .034 113 library cache 48 48 .05 .034 7 rows selected. In the listing above, we have just found out the SID of the main "offender", i.e. SID 120, which apparently holds some library cache latch(es) over an order of magnitude more (5.06% of time) than the next most offending holder sessions (139,113). --//在上面的列表中,我们刚刚发现了主犯的SID。 SID120,它显然持有一些库缓存锁存(ES)超过数量级(5.06%的时间)比下一个最违规 --//的持有人会话(139,113)。 Since we know the SID of the troublemaker, we can now see what this session is doing, determine which SQL its running by querying V$SESSION, ASH or by enabling SQL trace. In fact, the easiest way for getting an idea what kind of work a session is doing is running my Snapper script. This script is also available on my site (and doesn't make any changes to the database, by the way!) Snapper takes 4 parameters (out|trace - where to show the output, either DBMS_OUTPUT or tracefile, 5 - how many seconds to sleep between session performance snapshots, 1 - how many samples to take, 120 - SID to monitor): --//由于我们知道麻烦制造者的SID,我们现在可以看到这个会话正在做什么,通过查询V$Session、ASH或启用SQL跟踪来确定它运行的 --//SQL。 事实上,最简单的方法就是运行我的Snapper脚本。 这个脚本在我的网站上也是可用的(顺便说一下,不会对数据库做任何更 --//改)! 快照获取4个参数(输出|跟踪-在哪里显示输出,DBMS_OUTPUT或跟踪文件,5-在会话性能快照之间休眠多少秒,1-要采集多少 --//个样本,120-SID监视): SQL> @snapper stats 5 1 120 -- Session Snapper v2.02 by Tanel Poder ( http://www.tanelpoder.com ) -------------------------------------------------------------------------------------- SID, USERNAME, TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME -------------------------------------------------------------------------------------- 120, SYS , STAT, recursive calls , 8904, 1.78k, 120, SYS , STAT, recursive cpu usage , 241, 48.2, 120, SYS , STAT, CPU used by this session , 249, 49.8, 120, SYS , STAT, enqueue requests , 4452, 890.4, 120, SYS , STAT, enqueue releases , 4452, 890.4, 120, SYS , STAT, calls to get snapshot scn: kcmgss , 4452, 890.4, 120, SYS , STAT, parse time cpu , 214, 42.8, 120, SYS , STAT, parse time elapsed , 452, 90.4, 120, SYS , STAT, parse count (total) , 4452, 890.4, 120, SYS , STAT, parse count (hard) , 4452, 890.4, 120, SYS , STAT, execute count , 4452, 890.4, 120, SYS , TIME, hard parse elapsed time , 3522257, 704.45ms, 70.4% 120, SYS , TIME, parse time elapsed , 4425428, 885.09ms, 88.5% 120, SYS , TIME, PL/SQL execution elapsed time , 91410, 18.28ms, 1.8% 120, SYS , TIME, DB CPU , 2790000, 558ms, 55.8% 120, SYS , TIME, sql execute elapsed time , 5035731, 1.01s, 100.7% 120, SYS , TIME, DB time , 5035731, 1.01s, 100.7% -- End of snap 1, end=2009-11-10 03:37:30, seconds=5 PL/SQL procedure successfully completed. The first thing that immediately jumps out from SID 120's execution profile is that it does heavy hard parsing, 890 hard parses per second! (shown in bold). That's a pretty evident reason for the frequent library cache latch holding as reported by LatchProf. We can be almost sure that the application code is not using bind variables where it should use them. The next step would be to identify couple of SQL statements from this session, either by briefly enabling SQL_TRACE or just by querying V$SESSION.SQL_ID and mapping this to V$SQL, and then call the developer(s) who wrote that code and get them to fix it! --//从SID120的执行配置文件中立即跳出来的第一件事是,它做重硬解析,每秒890硬解析! (以粗体显示) 这是LatchProf报告的频繁库 --//缓存锁存保持的一个非常明显的原因。 我们几乎可以确定应用程序代码不是在应该使用绑定变量的地方使用它们。 下一步将是通过 --//简单启用SQL_TRACE或仅仅通过查询V$会话来识别本会话中的几个SQL语句。 SQL_ID并将其映射到V$SQL,然后调用编写该代码的开发 --//人员并让他们修复它! Let's assume that we could get this code fixed, but our monitoring system still reports some ongoing library cache latch contention and we don't see an obvious troublemaker anymore (all sessions reported are holding the monitored latches roughly equal amount of time) as seen below: SQL> @latchprof sid,name % "library cache" 100000 -- LatchProf 1.20 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME Held Gets Held % Held ms Avg hold ms ---------- --------------------- ----- ------ ------- ----------- ----------- 116 library cache lock 657 657 .66 12.549 .019 116 library cache 594 594 .59 11.345 .019 134 library cache lock 211 211 .21 4.030 .019 134 library cache 122 122 .12 2.330 .019 139 library cache 82 82 .08 1.566 .019 139 library cache lock 64 64 .06 1.222 .019 6 rows selected. So, we now have two ways to move forward: --//因此,我们现在有两种前进的方法: 1) Take one of the reported sessions and see what it is doing, by enabling SQL trace or similar. However this approach may not tell you which of the SQL statements is causing the latches to be held 2) Run LatchProf with SQLID option, which also reports the SQLIDs (or hash values in 9i) of the statements causing the latches to be held --//1)通过启用SQL跟踪或类似的操作,获取所报告的会话之一并查看它正在做什么。 但是,这种方法可能不会告诉您哪个SQL语句导致 --// 锁存器被持有 --//2)使用SQLID选项运行LatchProf,该选项还报告导致锁存保持的语句的SQLID(或9i中的哈希值) I will proceed with option 2, and will just specify SQLID (or SQLHASH in 9i) in the first parameter to LatchProf: --//我将继续执行选项2,只需在第一个参数中指定SQLID(或9i中的SQLHASH)给LatchProf: SQL> @latchprof name,sqlid % "library cache" 100000 -- LatchProf 1.20 by Tanel Poder ( http://www.tanelpoder.com ) NAME SQLID Held Gets Held % Avg hold ms -------------------- ------------- ----- ------ ------- ----------- library cache 5csdgx2jquqvj 858 858 .86 .028 library cache lock 8vdny41vqbq7t 377 377 .38 .028 library cache lock 5csdgx2jquqvj 369 369 .37 .028 This shows me that the main statement causing library cache latches to be held has SQL_ID '5csdgx2jquqvj'. --//这表明导致库缓存锁存的主语句有SQL_ID'5csdgx2j quqvj'。 Using the SQL ID I can find the text of the query as shown below: --//用SQLID,我可以找到查询的文本,如下所示: SQL> select executions, sql_text from v$sql where sql_id = '5csdgx2jquqvj'; EXECUTIONS SQL_TEXT ---------- ------------------------------------------------- 40522096 select count(*) from dual where rownum = 1 Apparently someone is running the above query really frequently! (Of course I should see how much the executions increases right now instead of just looking into the total number of executions since that cursor was loaded into the library cache). --//显然有人正在频繁地运行上述查询! (当然,我现在应该看到执行增加了多少,而不是仅仅查看执行的总数,因为光标被加载到库缓 --//存中)。 The common reasons for such frequently re-executed, but useless SQL can be either someone's really bad piece of code or bad connection pool design feature, which constantly polls DUAL table to verify if the connection is still alive. --//这种频繁重新执行但无用的SQL的常见原因可能是某人的真正糟糕的代码或糟糕的连接池设计特性,它不断轮询DUAL表以验证连接是 --//否仍然存在。 Note that it's possible to break the latch holders down by both SID and SQLID to get a more detailed view of latch holders: --//请注意,可以通过SID和SQLID将锁存器保持器断开,以获得更详细的锁存器保持器视图: SQL> @latchprof sid,name,sqlid % "library cache" 100000 -- LatchProf 1.20 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME SQLID Held Gets Held % Avg hold ms ---- ------------------- ------------- ----- ----- ------- ----------- 134 library cache 5csdgx2jquqvj 462 462 .46 .020 134 library cache lock 5csdgx2jquqvj 395 395 .40 .020 134 library cache lock 8vdny41vqbq7t 305 305 .31 .020 116 library cache lock 5csdgx2jquqvj 100 100 .10 .020 139 library cache 5csdgx2jquqvj 98 98 .10 .020 116 library cache 5csdgx2jquqvj 93 93 .09 .020 139 library cache lock 5csdgx2jquqvj 80 80 .08 .020 134 library cache 8vdny41vqbq7t 73 73 .07 .020 113 library cache 5csdgx2jquqvj 62 62 .06 .020 139 library cache lock 8vdny41vqbq7t 60 60 .06 .020 116 library cache lock 8vdny41vqbq7t 60 60 .06 .020 113 library cache lock 8vdny41vqbq7t 35 35 .04 .020 113 library cache lock 5csdgx2jquqvj 31 31 .03 .020 113 library cache 1 1 .00 .020 14 rows selected. The above example was meant as an introduction to the capabilities of LatchProf. I used library cache latches as an example, but LatchProf can show the same information for any other type of latch (there are hundreds of different types of latches in Oracle!) --//上面的例子是为了介绍Latchprof的能力。 我以库缓存锁存器为例,但LatchProf可以为任何其他类型的锁存器显示相同的信息 --//(Oracle中有数百种不同类型的锁存器)! The key difference between LatchProf and other latch contention troubleshooting approaches are: --//Latch Prof与其他锁存争用故障排除方法的关键区别在于: 1) LatchProf shows you who is holding the latches (who is causing the contention) instead of just showing who's waiting for them 2) LatchProf shows you which actual sessions are holding the latches (not some system-wide ambiguous metrics), thus you can simply focus on the troublemaking sessions and see what are they doing 3) LatchProf shows you which SQL statement caused taking the taking of the latch, thus narrowing down the root cause even more --//1)Latchprof告诉你是谁拿着门闩(谁引起了争论),而不是仅仅显示谁在等他们 --//2)LatchProf向您展示了哪些实际会话持有锁存器(而不是一些系统范围内的模糊度量),因此您可以简单地关注故障处理会话,看看 --// 它们在做什么 --//3)LatchProf向您展示了哪个SQL语句导致了锁存器的占用,从而使根本原因更加缩小 Having established the basics, let's now dig a bit deeper with another example. --//已经建立了基础,现在让我们用另一个例子更深入地挖掘。 Using LatchProfX for finding the hot block causing cache buffers chains latch contention --//使用LatchProfX查找导致缓存缓冲区链锁存争用的热块 Let's say your monitoring system reported significant cache buffers chains latch contention and you want to diagnose it. Cache buffers chains (CBC) latch contention can occur due a number of reasons. One of the most common reasons is that you have an extremely hot block in the buffer cache. Of course we don't know the root cause in advance, so let's systematically troubleshoot this problem. --//假设您的监控系统报告了重要的缓存缓冲区链锁存争用,您希望诊断它。 缓存缓冲链(CBC)锁存争用可能发生的原因有很多. 最常见 --//的原因之一是缓冲区缓存中有一个非常热的块。 我们当然不知道根本原因,所以让我们系统地排除这个问题。 First, there are many CBC latches in an Oracle instance. In my test server with a small buffer cache, I have 4096 CBC child latches. A big production database with a large buffer cache can easily have hundreds of thousands of child latches! You can check that using the following command: --//首先,在Oracle实例中有许多CBC锁存器。 在带有小缓冲区缓存的测试服务器中,我有4096个CBC子锁存器。 一个大的生产数据库, --//有一个大的缓冲缓存,可以很容易地拥有几十万个子锁存器!! 您可以使用以下命令检查: SQL> select count(*) from v$latch_children where name = 'cache buffers chains'; COUNT(*) -------- 4096 So, as a first step we should identify which CBC latches are experiencing the contention – i.e. is the contention lightly spread across many CBC latch children or is there a single child latch experiencing much heavier contention compared to others? --//因此,作为第一步,我们应该确定哪些CBC锁存器正在经历争用-即这一争用是否在许多CBC子锁存器中轻轻传播,或者是否有一个单 --//个子锁存器与其他子锁存器相比经历了更重的争用? We will need to use the Oracle Wait Interface since it can report the contention (waits) for latches. This will require us to use either SQL_TRACE or V$SESSION_WAIT/ASH or calculate deltas of V$LATCH_CHILDREN.WAIT_TIME to see which child latch experiences the most sleeps. I do like the first two approaches more as they allow to tie the latch waits together with a session and with whatever it was doing. For example, SQL_TRACE reports the following lines when a traced session is experiencing CBC latch contention: --//我们需要使用OracleWait接口,因为它可以报告锁存器的争用(等待)。 这将要求我们使用SQL_TRACE或V$SESSION_WAIT/ASH或计算 --//V$LATCH_CHILDREN.wait_time的Deltas看看哪个孩子闩锁经历的睡眠最多。 我确实更喜欢前两种方法,因为它们允许将锁存器与会 --//话以及它正在做的任何事情连接在一起。 例如,当跟踪会话遇到CBC锁存器争用时,SQL_TRACE报告以下行: WAIT #1: nam='latch: cache buffers chains' ela= 204 address=15742234376 ... WAIT #1: nam='latch: cache buffers chains' ela= 8 address=15742234376 ... WAIT #1: nam='latch: cache buffers chains' ela= 2 address=15742234376 ... WAIT #1: nam='latch: cache buffers chains' ela= 7 address=15742234376 ... The "address" part is important - it shows you exactly which child latch is experiencing the contention (every child latch in the instance has an unique memory address). I took the most prevalent address (in decimal) from the tracefile, converted it to hex and queried V$LATCH_CHILDREN with it as seen below: --//"地址"部分很重要,它向您准确地显示哪个子锁存器正在经历争用(实例中的每个子锁存器都有一个唯一的内存地址)。 我从trace --//file中获取了最流行的地址(十进制),将其转换为十六进制,并用它查询V$LATCH_CHILDREN如下: SQL> select name, gets, misses, sleeps 2 from v$latch_children 3 where addr = hextoraw(trim(to_char(15742234376, '0XXXXXXXXXXXXXXX'))); NAME GETS MISSES SLEEPS -------------------- -------- ------ ------ cache buffers chains 62178598 125541 805 Alternatively you can sample V$SESSION or query ASH (if you have the Diagnostics Pack licenses!) as I've shown below: --//或者,您可以示例V$会话或查询ASH(如果您有诊断包许可证)!如下所示: SQL> SELECT * FROM ( 2 SELECT 3 event 4 , TRIM(TO_CHAR(p1, 'XXXXXXXXXXXXXXXX')) latch_addr 5 , TRIM(ROUND(RATIO_TO_REPORT(COUNT(*)) OVER () * 100, 1))||'%' PCT 6 , COUNT(*) 7 FROM 8 v$active_session_history 9 WHERE 10 event = 'latch: cache buffers chains' 11 AND session_state = 'WAITING' 12 GROUP BY 13 event 14 , p1 15 ORDER BY 16 COUNT(*) DESC 17 ) 18 WHERE ROWNUM <= 10 19 / EVENT LATCH_ADDR PCT COUNT(*) --------------------------- ---------- ---- -------- latch: cache buffers chains 3AA4F6F08 4.1% 50 latch: cache buffers chains 3AA4D4908 3.7% 45 latch: cache buffers chains 3AA4E2558 3.4% 42 latch: cache buffers chains 3AA4F00E0 3.2% 39 latch: cache buffers chains 3AA519440 2.9% 36 latch: cache buffers chains 3AA527090 2.8% 34 latch: cache buffers chains 3AA4FDD30 2.8% 34 latch: cache buffers chains 3AA50B8B8 2.7% 33 latch: cache buffers chains 3AA4DB730 2.6% 32 latch: cache buffers chains 3AA534C18 2.6% 32 10 rows selected. From above query I get the same child latch address (in hex) as reported by SQL_TRACE. Now that I know the actual CBC child latch address, I can run LatchProf with filtering by individual child latch address! That way, I can monitor the holders of this particular child latch experiencing worst contention. --//从上面的查询中,我得到了与SQL_TRACE报告的相同的子闩锁地址(十六进制)。 现在我知道了实际的CBC子锁存地址,我可以通过单 --//个子锁存器地址来运行latchprof! 这样,我就可以监视这个特定子锁存器的持有者,体验到最坏的争用。 SQL> @latchprof sid,name,sqlid % 3AA4F6F08 100000 -- LatchProf 1.20 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME SQLID Held Gets Held % Avg hold ms --- -------------------- ------------- ---- ---- ------ ----------- 136 cache buffers chains f0cxkf0q803f8 58 58 .06 .011 120 cache buffers chains f0cxkf0q803f8 28 28 .03 .011 119 cache buffers chains f0cxkf0q803f8 16 16 .02 .011 148 cache buffers chains f0cxkf0q803f8 4 4 .00 .011 118 cache buffers chains 75621g9y3xmvd 1 1 .00 .011 The top SQLID causing holding of the CBC child latch at address 3AA4F6F08 is "f0cxkf0q803f8". --//导致CBC子锁存器保持在地址3AA4F6F08的顶部SQLID是"f0cxkf0q803f8"。 SQL> select executions, sql_text from v$sql where sql_id = 'f0cxkf0q803f8'; EXECUTIONS SQL_TEXT ---------- -------------------------------------------------- 2115217 SELECT /*+ first_rows index(customers, customers_pk) index(orders, order_status_ix) */ O.ORDER_ID, LINE_ITEM_ID, PRODUCT_ID, UNIT_PRICE, QUANTITY, ORDER_MODE, ORDER_STATUS, ORDER_TOTAL, SALES_REP_ID, PROMOTION_ID, C.CUSTOMER_ID, CUST_FIRST_NAME, CUST_LAST_NAME, CREDIT_LIMIT, CUST_EMAIL, ORDER_DATE FROM ORDERS O , ORDER_ITEMS OI, CUSTOMERS C WHERE O.ORDER_ID = OI.ORDER_ID AND O.CUSTOMER_ID = C.CUSTOMER_ID AND O.ORDER_STATUS <= 4 From this, I know both the top sessions causing the contention and also the top SQL's causing this latch contention! Since we now know the SQLID, the next step could be to check the execution plan of the given SQL and see why is it accessing the datablock protected by our latch so frequently. Sometimes, the contention comes from many frequent revisits of the block(s) due a bad execution plan, and may involve nested loop joins in improper locations, looping over many rows. However, in other situations, the problem may just occur due a hot block, such an index root block, visited by large number of concurrent sessions. --//由此,我知道导致争用的顶级会话和导致这个锁存争用的顶级SQL! 既然我们现在知道SQLID,下一步可以是检查给定SQL的执行计划 --//,看看为什么它会如此频繁地访问我们的锁存器保护的datablock。 有时,争用来自由于执行计划不好而频繁地重新访问块,并且可 --//能涉及嵌套循环在不适当的位置连接,循环在许多行上。 然而,在其他情况下,问题可能只是由于一个热块,这样的索引根块,访 --//问了大量并发会话。 Still, it's useful to know what kind of hot blocks are causing the latch contention. --//尽管如此,知道什么样的热块导致锁存争用是很有用的。 The good news is that we can exactly identify the exact data block which is so hot that it causes the contention. There are few approaches for identifying the hot block out there, but the most accurate of them is the LatchProfX script. The LatchProfX is available on my website, again for free. However, note the X in the end of script name - it means LatchProf eXtended or LatchProf which requires X$ tables, which means extended privileges. --//好消息是,我们可以准确地识别精确的数据块,它是如此的热,以至于它引起了争论。 识别热块的方法很少,但其中最准确的是 --//LatchProfX脚本。 Latch Prof X可在我的网站上免费获得。 然而,请注意脚本名称末尾的X-它意味着LatchProfeXtended或 --//LatchProf需要X$表,这意味着扩展特权。 LatchProfX shows an additional column, called "object". This column shows information about the object protected by a given latch. For cache buffers chains latches, this object shows the Data Block Address (DBA) of the block that we accessed, causing the latch get! See the example below: --//Latch Prof X显示了一个额外的列,称为"对象". 此列显示有关由给定锁存器保护的对象的信息。 对于缓存缓冲链锁存,此对象显 --//示我们访问的块的数据块地址(DBA),导致锁存器获取! 见下面的例子: SQL> @latchprofx sid,name,sqlid,object % 3AA4F6F08 100000 -- LatchProfX 1.20 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME SQLID OBJECT Held Gets Held % Avg hold ms --- -------------------- ------------- ------- ----- ----- ------- ----------- 148 cache buffers chains f0cxkf0q803f8 1C0005A 40 40 .04 .010 136 cache buffers chains f0cxkf0q803f8 1C0005A 39 37 .04 .011 120 cache buffers chains f0cxkf0q803f8 1C0005A 4 4 .00 .010 118 cache buffers chains 75621g9y3xmvd 1C0005A 1 1 .00 .010 It looks like the block at data block address 1C0005A (in hex) is the troublemaker! --//看起来数据块地址1C0005A的块(十六进制)是麻烦制造者! Lets check into which segment does this block belong: --//让我们检查一下这个块属于哪段: SQL> @dba 1C0005A RFILE# BLOCK# ---------- ---------- 7 90 --//1C0005A = set dba 7,90 = alter system dump datefile 7 block 90 = 29360218 Press enter to find the segment using V$BH (this may take CPU time), CTRL+C to cancel: --//按Enter键查找段使用V$BH(这可能需要CPU时间),CT RLC取消: STATE BLOCK_CLASS DATA_OBJ# OBJECT_TYPE object ---------- ------------------ ---------- ------------------- ----------------- xcur data block 62691 INDEX PARTITION SOE.CUSTOMERS_PK As seen above, this block belongs into an index partition of an index SOE.CUSTOMERS_PK. The block itself lives in relative file number 7 and block number 90 in that file. It is possible to dump the contents of the block using ALTER SYSTEM DUMP DATAFILE 7 BLOCK 90 command (note that this ALTER SYSTEM command takes absolute file ID as parameter, but the above number is a relative file ID and should be converted to absolute figure using DBA_DATA_FILES). --//如上所述,此块属于索引SOE的索引分区。 CUSTOMERS_PK. 块本身位于该文件中的相对文件号7和块号90中。 可以使用ALTER系统 --//DUMP数据文件7BLOCK90命令转储块的内容(注意,该ALTER系统命令以绝对文件ID作为参数,但上述数字是相对文件ID,应使用 --//DBA_DATA_FILES转换为绝对数字)。 However, the data_obj# column shows the data_object_id of the index partition segment and using this information, I can query and find out where this index partition segment's header block resides. Keep in mind that Index root blocks are always stored immediately next to index segment header so that Oracle would know where to physically start traversing the index when accessing it. --//但是,data_obj#列显示了索引分区段的data_object_id,并且使用这些信息,我可以查询并找到这个索引分区段的头块所在的位置 --//。 请记住,索引根块总是立即存储在索引段头旁边,以便Oracle在访问索引时知道从哪里开始物理遍历索引。 SQL> SELECT 2 header_file 3 , header_block 4 FROM 5 dba_segments 6 WHERE 7 (owner, segment_name, partition_name) = 8 (SELECT owner, object_name, subobject_name 9 FROM dba_objects WHERE data_object_id = 62691) 10 / HEADER_FILE HEADER_BLOCK ----------- ------------ 7 89 1 row selected. So from the above, the segment header block (7,89) resides immediately before our hot block (7,90) so we know that the hot block is an index root block. It is normal to have some contention on very commonly accessed blocks such index root and first level branch blocks (especially of primary key indexes and so on). Oracle has come up with many optimizations for relieving such contention - for example, it allows shared CBC latch access for examination of the index root and branch blocks on most platforms. --//因此,从上面可以看出,段头块(7,89)位于我们的热块(7,90)之前,因此我们知道热块是一个索引根块。 对于非常常见的访问块 --//,如索引根和一级分支块(特别是主键索引等),通常有一些争用。 甲骨文提出了许多优化来缓解这种争用-例如,它允许共享CBC锁 --//存访问,以检查大多数平台上的索引根和分支块。 Note that sometimes the latch contention shows inefficient physical design for a given concurrency requirement such as not spreading the contention points across multiple blocks/latches by partitioning the object to spread the "hot" data to multiple blocks in various partitions. --//请注意,有时锁存争用显示给定并发需求的低效率物理设计,例如通过分区对象将"hot"数据传播到不同分区中的多个块,而不将争 --//用点分散到多个块/块上。 Often though, the CBC latch contention is merely a symptom of an incorrect execution plan, such as a nested loop join operation that revisits the same data blocks again and again many times in a tight loop. For example the plan below would work well when the number of matching orders returned from ORDERS table is low, so we need to iterate through the CUSTOMERS table (and the CUSTOMERS_PK index which had contention) relatively small number of times. --//然而,通常情况下,CBC锁存争用只是不正确的执行计划的一个症状,例如嵌套循环连接操作,它在一个紧密的循环中反复多次访问 --//相同的数据块。 例如,当从Orders表返回的匹配订单数量较低时,下面的计划将很好地工作,因此我们需要迭代Customers表(以及 --//具有争用的CUSTOMERS_PK索引),次数相对较少。 --------------------------------------------------------- |Id|Operation |Name | --------------------------------------------------------- | 0|SELECT STATEMENT | | | 1| TABLE ACCESS BY GLOBAL INDEX ROWID |ORDER_ITEMS | | 2| NESTED LOOPS | | | 3| NESTED LOOPS | | | 4| TABLE ACCESS BY GLOBAL INDEX ROWID|ORDERS | |*5| INDEX RANGE SCAN |ORD_STATUS_IX| | 6| PARTITION HASH ITERATOR | | | 7| TABLE ACCESS BY LOCAL INDEX ROWID|CUSTOMERS | |*8| INDEX UNIQUE SCAN |CUSTOMERS_PK | |*9| INDEX RANGE SCAN |ITEM_ORDER_IX| --------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 5 - access("O"."ORDER_STATUS"<=4) 8 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID") 9 - access("O"."ORDER_ID"="OI"."ORDER_ID") Limitations of V$LATCHHOLDER and LatchProf(X) --//V$LAT CHOLDER和LatchProf(X)的局限性 There's a limitation in the V$LATCHHOLDER view. While it is enough for troubleshooting 99.9% of latch contention problems, there are couple of cases where V$LATCHHOLDER view does not show the latch holder even if it's actually held. --//在V$LATCHOLDER视图中有一个限制。 虽然它足以排除99.9%的锁存争用问题,但有几种情况下,V$LATCHOLDER视图不显示锁存保持架 --//,即使它实际上持有。 If your STATISTICS_LEVEL = BASIC then V$LATCHHOLDER does not maintain the latch holder information for ultrafast latches. Ultrafast latches are a small subset of latches (like cache buffers chains latches) which can be "even faster" by not maintaining some instrumentation, stats and counters. For all other latches (in practice everything else than CBC latches) will still maintain the latch holder info even if the statistics_level = basic (the undocumented parameter for setting this independently from statistics level is named _ultrafast_latch_statistics). Note that even when the ultrafast latch statistics info is not gathered, it's still possible to extract the latch holder info directly from SGA memory, thanks to state objects which Oracle has to maintain for process recovery. --//如果您的STATISTICS_LEVEL=BASIC,那么V$LATCHOLDER不维护超快锁存器的锁存器信息。 超快锁存器是一小部分锁存器(如缓存缓 --//冲区链锁存器),它可以通过不维护一些仪表、状态和计数器来"更快"。 对于所有其他锁存器(在实践中,除了CBC锁存器之外的一切 --//),即使statistics_level=basic(独立于统计级别设置这一点的无证参数被命名为_ultrafast_latch_statistics),仍然会维护锁存 --//器信息。 请注意,即使没有收集超快锁存统计信息,仍然可以直接从SGA内存中提取锁存器信息,这要归功于Oracle必须维护的用于 --//进程恢复的状态对象。 V$LATCHHOLDER scans through the process state object array (V$PROCESS/X$KSUPR) and looks into a field there which points to the latch held by a process. However, sometimes a process can hold multiple latches at the same time (library cache latch + shared pool latch for example). When a process holds multiple latches, then unfortunately V$LATCHHOLDER only reports the first one taken by that process. This may lead to cases where you see sessions waiting significantly for (let say) shared pool latch, but nobody seems to be holding it too much. However, as said in the beginning of this article, you should only start troubleshooting latch contention whether you see sessions actually wait for that latch! So if you see sessions waiting (the contention exists!), but nobody apparently holding the latch much in V$LATCHHOLDER, then you may be hitting this shortcoming of that V$ view. This is the case when you may need to fall back to old latch contention troubleshooting techniques (which do require lots of previous experience with this and involve quite an amount of guesswork ). Or alternatively you can hire me as a consultant, I have tools for reading the truth out of the state objects in SGA memory ;-) --//V$LATCHOLDER通过进程状态对象数组(V$PROCESS/X$KSUPR)扫描,并查看指向进程持有的锁存器的字段。 然而,有时一个进程可以同 --//时容纳多个锁存器(例如,库缓存锁存共享池锁存器)。 当一个进程持有多个锁存器时,不幸的是V$LAT CHOLDER只报告该进程采取的 --//第一个锁存器。 这可能导致您看到会话等待(让我们说)共享池闩锁的情况,但似乎没有人持有太多。 然而,正如本文开头所说的, --//无论您是否看到会话实际上等待该锁存器,您都应该只开始排除锁存争用! 因此,如果您看到会话等待(争用存在! ),但显然没有 --//人在V$LATCHOLDER中持有大量的锁存器,那么您可能正在触及V$视图的这个缺点。 这种情况下,您可能需要回到旧的锁存争用故障 --//排除技术(这确实需要大量以前的经验,并涉及相当多的猜测)。 或者你可以聘请我作为顾问,我有工具从SGA内存中的状态对象中读 --//取真相;-) Conclusion There's a huge variety of reasons for latch contention in Oracle (and there are hundreds of different types of latches in Oracle too) and I hope to discuss some of the most common reasons in a future article. However it helps to keep in mind two main facts about latch contention: --//甲骨文的锁存争用有各种各样的原因(甲骨文也有数百种不同类型的锁存器),我希望在未来的文章中讨论一些最常见的原因。 然而 --//,它有助于记住关于锁存争用的两个主要事实: 1) Latch contention and waits happen only if someone tries to get a latch which is already held by someone else! Therefore, after finding out which sessions are waiting and for which latches, it's good to find out which sessions are holding these latches and why (and that's why I wrote LatchProf and LatchProfX!) --//1)锁存争用和等待只有当有人试图得到一个锁存器,已经被其他人持有! 因此,在找出哪些会话正在等待,哪些锁存器之后,找出 --//哪些会话正在举行这些锁存器,以及为什么(这就是为什么我写了Latchprof和LatchprofX)是很好的! 2) Latch contention is usually a symptom of some other problem, such bad connection / cursor management, inefficient execution plan, physical design issues or even CPU starvation. --//2)Latch争用通常是其他一些问题的症状,例如糟糕的连接/光标管理、低效的执行计划、物理设计问题甚至CPU饥饿。 P.S. If you want to take on an interesting SQL challenge, then look into LatchProf or LatchProfX source code and understand how it manages to sample the V$LATCHHOLDER view up to a hundred thousand times per second while being just a single SQL statement! --//P.S.如果您想接受一个有趣的SQL挑战,那么请查看LatchProf或LatchProf X源代码,并了解它如何能够在仅仅是一个SQL语句的同时 --//,每秒采样多达十万次的V$LATCHOLDER视图? Download LatchProf and LatchProfX http://blog.tanelpoder.com/files/scripts/latchprof.sql http://blog.tanelpoder.com/files/scripts/latchprofx.sql
Systematic Latch Contention Troubleshooting in Oracle
来源:这里教程网
时间:2026-03-03 16:17:57
作者:
编辑推荐:
- Systematic Latch Contention Troubleshooting in Oracle03-03
- 20201215]记录工作中的错误.txt03-03
- 昆仑【2540437】主管GBase8s 数据库查看状态03-03
- Toad for Oracle 2020 安装教程(附安装方法步骤)03-03
- 数据库查询ASH03-03
- 百事3主管2540437[20201214]再遇SQL*Net break/reset to client.txt03-03
- 摩登主管2540437[20201210]sql语句优化.txt03-03
- 【性能调优】Oracle AWR报告指标全解析03-03
下一篇:
相关推荐
-
雷神推出 MIX PRO II 迷你主机:基于 Ultra 200H,玻璃上盖 + ARGB 灯效
2 月 9 日消息,雷神 (THUNDEROBOT) 现已宣布推出基于英
-
制造商 Musnap 推出彩色墨水屏电纸书 Ocean C:支持手写笔、第三方安卓应用
2 月 10 日消息,制造商 Musnap 现已在海外推出一款 Oce
热文推荐
- oracle查询v$lock锁里面block和被block的sql_text
- 20201215]记录工作中的错误.txt
20201215]记录工作中的错误.txt
26-03-03 - 昆仑【2540437】主管GBase8s 数据库查看状态
昆仑【2540437】主管GBase8s 数据库查看状态
26-03-03 - Toad for Oracle 2020 安装教程(附安装方法步骤)
Toad for Oracle 2020 安装教程(附安装方法步骤)
26-03-03 - ORACLE锁的种类和级别
ORACLE锁的种类和级别
26-03-03 - kubernetes-部署Oracle数据库步骤
kubernetes-部署Oracle数据库步骤
26-03-03 - ora-20003报错,ora-06512报错
ora-20003报错,ora-06512报错
26-03-03 - Oracle、NoSQL和NewSQL 数据库技术对比
Oracle、NoSQL和NewSQL 数据库技术对比
26-03-03 - exp和imp详解
exp和imp详解
26-03-03 - oracle删除表中数据(delete与truncate)
oracle删除表中数据(delete与truncate)
26-03-03
