Fundamental Oracle Flaw Revealed
The SCN soft limit is a moving line that cannot be crossed. The line moves up by 16,384 every second; as long as a database's SCN grows more slowly than that, all should be well.
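As a rough illustration of that moving line, the sketch below computes the limit as 16,384 times the number of seconds elapsed since 01/01/1988, the formula described in this article. The function names, the use of UTC, and the plain second count are assumptions made for illustration; Oracle's internal calculation may differ in detail.

```python
from datetime import datetime, timezone

# Assumption: the epoch is midnight UTC on 01/01/1988.
EPOCH = datetime(1988, 1, 1, tzinfo=timezone.utc)
SCN_PER_SECOND = 16_384  # rate at which the soft limit advances


def soft_limit(now: datetime, rate: int = SCN_PER_SECOND) -> int:
    """Approximate soft limit: rate times whole seconds since the epoch."""
    elapsed = int((now - EPOCH).total_seconds())
    return elapsed * rate


def headroom(current_scn: int, now: datetime, rate: int = SCN_PER_SECOND) -> int:
    """Distance between a database's SCN and the line; negative means crossed."""
    return soft_limit(now, rate) - current_scn


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("soft limit now:", soft_limit(now))
    # Hypothetical SCN value, chosen only to show the call.
    print("headroom:", headroom(5_000_000_000_000, now))
```

A shrinking headroom value over time would indicate an SCN growing faster than the line moves.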
But what happens when your SCN moves closer and closer to that line due to the spurious jumps caused by the backup bug, a simple mistake by an admin, or other means? How do you deal with this impending problem?
The answer: Shut down the database servers for a while so that the SCN stops incrementing while the line keeps moving away from it.
Plainly put, this means that every single interlinked Oracle instance across an entire company will need to be shut down just to move a bit further away from the line. If only a handful are shut down, they will very quickly jump right back to the high SCN whenever they connect with other Oracle instances. The only way to ensure this problem is completely eradicated is to shut down every affected system: backup servers, replicas, everything.
Not only that, but while they're down, admins will need to scour the infrastructure to be absolutely certain that no affected Oracle systems have escaped remediation. If they miss even one instance, they will have to perform the complete shutdown again.
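The effect of a single missed instance can be sketched with a toy model. It assumes, as described above, that two connected instances both end up at the higher of their two SCNs; the instance names and SCN values are hypothetical.

```python
def connect(scns: dict[str, int], a: str, b: str) -> None:
    """Model a connection: both instances adopt the higher of the two SCNs."""
    shared = max(scns[a], scns[b])
    scns[a] = shared
    scns[b] = shared


# Two remediated instances and one elevated instance that was overlooked.
scns = {"orders": 1_000, "billing": 1_200, "forgotten_replica": 9_000_000}

connect(scns, "forgotten_replica", "billing")  # billing picks up the high SCN
connect(scns, "billing", "orders")             # and passes it on to orders

print(scns)
```

After just two connections, every instance in the model carries the elevated value, which is why a partial shutdown does not help.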
Then there's the issue of how long to shut down. If you shut down every instance for a week, that would buy you about 10 billion ticks away from the line of no return. How many businesses would entertain the idea of shutting down their database systems for that long?
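The "about 10 billion" figure follows directly from the 16,384-per-second rate. The small calculation below assumes the SCN does not advance at all while the instances are down; the function name is illustrative.

```python
SCN_PER_SECOND = 16_384
SECONDS_PER_DAY = 86_400


def headroom_gained(days_down: float, rate: int = SCN_PER_SECOND) -> int:
    """Ticks of distance gained from the line during a full shutdown."""
    return int(days_down * SECONDS_PER_DAY * rate)


print(headroom_gained(7))  # 9909043200, i.e. roughly 10 billion
print(headroom_gained(1))  # 1415577600, i.e. roughly 1.4 billion per day
```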
While they're down, the SCN could potentially be "reset" -- but only by dumping out each database, dropping the database, and importing the dump into a fresh database. This would have to be done for every database running on every database server across the entire organization, all at the same time. With databases routinely in the multiterabyte range, this will take a while.
Again, only very large customers with many interconnected Oracle databases would be likely to run a significant risk of being affected by this problem. But the larger the Oracle environment, the longer this restoration would take. Typically, large organizations have the least tolerance for downtime.
Until recently, aside from the backup bug fix, Oracle's only response to the SCN elevation issue -- as far as we've been able to determine -- has been to release a patch that changes the soft limit calculation to 32,768 times the number of seconds since 01/01/1988, doubling the rate at which the soft limit increases. Oracle even made the multiplier modifiable, so admins can increase it further.
If this patch is applied to an Oracle instance, it will definitely increase the time the interlinked databases can run before hitting the SCN limit. However, it also introduces new variables.
Part of the problem is that you can't patch every system at once. Additionally, if you have a patched system with an elevated soft limit -- based on a multiplier of, say, 65,536 -- the SCN on that system could be higher than the SCN on an unpatched system using the original 16,384 multiplier, causing the unpatched system to refuse the connection or encounter another problem as it fails the soft limit check. There's also the issue of servers running older Oracle versions that may not have a patch available.
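The mismatch between patched and unpatched systems can be sketched as two instances applying the same check with different multipliers. This is a simplified model: the function name, the elapsed-seconds figure, and the treatment of an SCN exactly equal to the limit as acceptable are all assumptions.

```python
def accepts(incoming_scn: int, elapsed_seconds: int, local_rate: int) -> bool:
    """Soft limit check as seen by one instance with its own multiplier."""
    return incoming_scn <= elapsed_seconds * local_rate


# Hypothetical point in time, roughly 24 years after 01/01/1988.
elapsed = 760_000_000

# Hypothetical SCN: legal under a 65,536 multiplier, illegal under 16,384.
patched_scn = elapsed * 30_000

print(accepts(patched_scn, elapsed, local_rate=65_536))  # True
print(accepts(patched_scn, elapsed, local_rate=16_384))  # False
```

The same SCN passes on the patched system and fails on the unpatched one, which is the situation in which the unpatched system would refuse the connection.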
Furthermore, if this patch is a default inclusion in the next Oracle release, admins may suddenly discover that their existing servers are unable to communicate with new or upgraded servers that use the new, higher SCN calculation method, should the new servers have a sufficiently elevated SCN. If the SCN values line up just right, it's possible that a patched system could connect and set the SCN of an unpatched system just shy of the soft limit, causing the unpatched system to hit the limit through its own processing.
As mentioned, the risk of such a scenario playing out is very small except in large, highly interconnected environments where an elevated SCN can flow like a virus from server to server. But once a server is infected, there's no going back. Also, if the SCN is incremented arbitrarily -- or manually, with malicious intent -- then that 48-bit integer hard limit is suddenly not as astronomical as it might seem.
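For a sense of scale, the sketch below estimates how long an SCN advancing at a steady rate would take to reach a 48-bit ceiling, and how much of that time a single large jump removes. It treats the hard limit as 2^48 and uses a 365.25-day year; the names and the size of the jump are illustrative assumptions.

```python
HARD_LIMIT = 2 ** 48  # assumed ceiling for a 48-bit SCN
SECONDS_PER_YEAR = 365.25 * 86_400


def years_of_headroom(rate: int, current_scn: int = 0) -> float:
    """Years until an SCN growing at `rate` per second reaches the ceiling."""
    return (HARD_LIMIT - current_scn) / rate / SECONDS_PER_YEAR


# At the original 16,384 per second, starting from zero: roughly 544 years.
print(round(years_of_headroom(16_384)))

# A hypothetical arbitrary jump to 90% of the ceiling leaves about a tenth of that.
print(round(years_of_headroom(16_384, current_scn=HARD_LIMIT * 9 // 10)))
```

Doubling the multiplier halves these figures, since a soft limit that rises twice as fast permits the SCN to approach the hard limit twice as quickly.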
The community reaction
InfoWorld contacted a number of Oracle sources for this story. Several lacked familiarity with the problem; others noted that Oracle licensing agreements prevented them from commenting on any aspect of their product usage. The head of the Independent Oracle User Group (IOUG), Andy Flower, offered this statement on the record: "This bug with the SCN number is obviously something our membership would be concerned about -- and will need to consider what sort of challenges that may present and if any mitigation strategies will be needed. I'm sure it will be a topic that some of our larger members will probably get together and discuss."
Among the Oracle experts we spoke with, Shirish Ojha, senior Oracle DBA for Logicworks, a hosting and private cloud service provider, was the most familiar with SCN issues, including the bug numbers associated with the problem. He acknowledges that although few Oracle environments are likely to encounter the problem, the consequences may be severe. "If there is a dramatic jump in SCN due to any Oracle bug, there is a minimalistic probability of breach of this seemingly high number," said Ojha, who has earned the coveted title of Oracle Certified Master. "If this occurs in a high-transaction and large interconnected Oracle architecture, this will render all interconnected Oracle databases useless in a short period of time."
Ojha continues: "If this occurs, even though its probability is low, the potential [financial] loss ... is very high." By definition, he said, the problem has the potential to affect only large Oracle customers. But "once the SCN limit is reached, there is no easy way to get out of the problem, other than shutting down all databases and rebuilding databases from scratch."
Anton Nielsen, the president of C2 Consulting and an Oracle expert, focused on the potential risk of malicious attack using an elevated SCN: "In theory, the elevated SCN attack is similar to a DoS attack in two significant ways: It can bring a system to its knees, rendering it inoperable for a significant period of time, and it can be accomplished by a user with limited permissions. While a DoS can be perpetrated by anyone with network access to a Web server, however, the elevated SCN requires a database username and password with the ability to connect."