Mistake #6: Not involving management in data loss and recovery time decisions
I recently met with a management team to determine acceptable data loss and recovery time thresholds. My strategy was to bring a printed list of questions for the team to consider and answer. It was a truly productive discussion.
The client DBA, with whom I work closely, noticed that I had said very little during the meeting and asked why. My answer was, “Well, we DBAs are really only the humble servants in this situation. Management knows what they need, and our job is to listen and tell them what it will take to get them to that place.”
I hope you’ll consider involving your management in deciding how much data your company can afford to lose in a disaster, and also how long your company can afford to be without its database(s). They need to be involved, because this is more an operational decision than a technical one.
Some benefits of involving management in the decision making:
• Management will see that you are voluntarily and proactively involving them in what legitimately is a management-level decision. This will show good judgment on your part. And you like brownie points, right?
• You will have a chance to learn management’s expectations about these matters before a disaster happens. This could save your reputation, and maybe even your job, later on.
• You will have a chance to shape or temper management’s expectations. I have started conversations of this sort by asking, “How many minutes of transactional data can you afford to lose permanently?” Often the answer is, “None!” I don’t know if this is a problem with how I’m asking the question, a knee-jerk answer, or something they really believe is a business requirement. Whatever the case, it gives me a chance to explain what meeting that requirement might entail: possibly a more expensive setup than they have now, and also some database performance impact.
Another question: if there is a disaster, how soon must you be up and running again? I don’t hear “immediately” very often, but I have heard “within 10 minutes” before.
Both of these questions are your chance to talk with management about what your current equipment, configuration, and level of technical expertise can accomplish relative to their expectations, and what additional resources it would take to meet those expectations. Often I see management downshift their demands after understanding what they would cost. At the end of the meeting, management might instead agree that 30 minutes of data loss is acceptable, and that it’s OK for it to take 8 hours to be up and running again. What a difference that would make for you if disaster actually struck: management expectations would be set at the level you negotiated together, based on legitimate business need, the technology currently on board, and the current level of IT funding!
• You will get better policy coordination and buy-in from complementary IT areas. Data loss and recovery decisions will require involvement by, and have impacts on, technical components other than the database. If the recovery time/data loss decisions are made by management, they’ll have some “teeth” to them: the decision will be binding on the other component areas, such as system administration, application server administration, and network operations. In a broad-scale disaster, those areas might have to do their own recoveries before the database and application can be up and running, and their recovery times need to be part of recovery/loss policy making and planning.
• You might get more budget money next time. Having management think about recovery time and data loss issues could get you favorable treatment during the next budget negotiation cycle. Maybe management will better see the value of scrounging up money for that standby database server or springing for a few “U”s of storage at a remote data center, since you educated them about how it will improve business operations in case of a disaster.
Have you inquired about and worked to shape management’s recovery/data loss expectations at your shop?
Mistake #5: Not keeping your backups in well-chosen and varied locations
Texas is a big place, and one geographically diverse enough to see a large variety of natural disasters. We have tornadoes, hurricanes, brush fires, and floods. We even had a tiny earthquake in the Dallas area a couple of years ago.
We also have the normal variety of non-natural threats: thieves, saboteurs, arsonists, leaky roofs, power surges. Add to those the threats inherent in any IT infrastructure: human error, hardware failures, hackers, software bugs. That’s a lot of threats. What’s an IT guy/gal to do?
Make backups as often as your data loss/recovery time strategy dictates.
Either make the backups to a different server/disk pool than the database, or quickly copy them to a different location once the backup is complete.
If the secondary location isn’t offsite, get a copy offsite as soon as you can. The time period within which backups are required to have offsite copies should be governed by how much data you can afford to lose in a site disaster. (More about “affordability”, next time.)
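As a rough sketch, here’s what pointing RMAN at a separate backup disk pool might look like. The path is a placeholder, and how you then get a copy offsite (tape, rsync, cloud storage) is shop-specific:

    # RMAN: write the backup to a disk pool separate from the database files
    BACKUP DATABASE FORMAT '/backup_pool/orcl/db_%U';
    BACKUP ARCHIVELOG ALL FORMAT '/backup_pool/orcl/arch_%U';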
Be prepared for localized and whole-site disasters. You and your end users have worked hard to make your databases the valuable information stores they are today, and you owe yourselves (and them) no less.
Mistake #4: Not recover-testing your backups
This is a critical deficiency, and it is very common. And, to be truthful, this mistake is not unique to smaller Oracle shops. I’ve been to some larger Oracle shops that also skimp on recovery testing, but I find this to be more common in the smaller shops where resources tend to be spread more thinly.
Disasters happen every day. Even to well-run companies. In the blink of an eye, you could lose it all. Hardware failures, human error, sabotage, natural disasters – no shop can afford to feel immune to these threats. They are real; they are hellish; and I have the sad client stories to prove it.
Can you recover from a disaster? Just because your backup returned a normal completion code or didn’t throw any errors, does that mean you can recover your database? Just because your recovery strategy worked last quarter, does that mean it works now? Is life in the IT arena really that simple? (I wish it were.)
If you aren’t testing your backups, what is the reason? Will that reason seem important in retrospect, if you can’t get your data back when you need it?
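There’s no substitute for periodically restoring a backup to a scratch server and actually opening the database, but even a quick RMAN check like this sketch can catch missing or unreadable backup pieces before you need them:

    # RMAN: confirm that the backups needed for a full restore exist and are readable
    RESTORE DATABASE VALIDATE;
    RESTORE ARCHIVELOG ALL VALIDATE;

A check like this only proves the backup pieces are present and readable; restoring to a scratch server and opening the database is still the real exam.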
Wow, re-reading this entry, I realize that it sounds preachy, almost accusatory. But I feel passionate about this issue. I’ve seen the proverbial wailing and gnashing of teeth when it becomes clear that a complete recovery of a failed production database is not possible. I don’t want this to happen to any of my clients, and, from a purely selfish perspective, I don’t want to witness such desperation again. It’s disturbing.
Can you look your manager in the eye and tell him you are proud of your recovery testing strategy? Can we help you with your strategy? This link will take you to our site, where you can contact us.
Mistake #3: Not having a diversified backup strategy
In the last post, we discussed why you might or might not want to run in archivelog mode. Now we’ll discuss some related decisions.
As in the last two blog entries, I’ll be oversimplifying for the sake of illustration. Additionally, there is often more than one way to satisfactorily resolve a given disaster, and the best method might not be the one I’m using as an example here. It might not even involve resorting to backups at all. This post is just a tool to help you understand how diversified backups could help you.
So, there are two basic kinds of database backups: physical and logical. Physical backups are copies of the files (or parts of the files) that comprise the database. Physical backups come in several forms, for example, copies of the database files made with operating system backup commands, RMAN backups (this tool comes with the Oracle software), and disk mirror copies.
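As a simple sketch, a whole-database RMAN physical backup, taken while the database is open (which requires archivelog mode, discussed in the last post), might look like this:

    # RMAN: whole-database physical backup, including the archived redo logs
    CONFIGURE CONTROLFILE AUTOBACKUP ON;
    BACKUP DATABASE PLUS ARCHIVELOG;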
Logical backups are collections of SQL statements that can recreate database objects and their associated data. For example, a table backup could consist of a CREATE TABLE statement, some GRANT statements, some CREATE INDEX statements, and many INSERT INTO [table] statements. Collectively, these statements would get your table restored.
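A logical backup, by contrast, might be a Data Pump export run from the command line. A sketch, where the directory object and schema name are placeholders (on older releases, the classic exp utility plays a similar role):

    # Data Pump export of one schema to a dump file
    expdp system DIRECTORY=dp_dir DUMPFILE=hr_nightly.dmp LOGFILE=hr_nightly.log SCHEMAS=hr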
Let’s think about some sample database disasters:
Scenario 1: A loss of the server on which the database is hosted, causing a loss of the whole database. Remember archived redo logs, from the last blog entry? Those are important to back up. If you restore a physical database backup, the data will be in the state it was in at the time you made the backup. You might need a more recent version of the data. You can use the archived redo logs to roll the database forward in time, which will bring your data up to the time of the most recent available archived redo log.
What if you didn’t have a physical backup? Well, you’ll be creating a new, empty database and then loading your most recent logical backup into it. Let’s say that backup was from last night at 11:30 pm. You lost your server at 2:14 pm today. You will lose all transactions that happened since last night at 11:30 pm. Even if you have archived redo logs available, they won’t help you, because you can’t apply archived redo logs to logical backups.
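For comparison, here’s roughly what the physical-backup path from above might look like in RMAN on the rebuilt server. The sequence number is a placeholder for the last archived log you still have:

    # RMAN, on the rebuilt server, with the instance mounted;
    # 1234 is a placeholder for the last available archived log sequence
    RUN {
      SET UNTIL SEQUENCE 1234 THREAD 1;
      RESTORE DATABASE;
      RECOVER DATABASE;
    }
    ALTER DATABASE OPEN RESETLOGS;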
Scenario 2: A loss of a single datafile. That might happen because of corruption or erroneous file deletion.
In this case, the database won’t be happy until you get the lost datafile back. If you have a physical backup AND you’re running in archivelog mode AND you have all the archivelogs since the last physical backup AND your online redo logs aren’t damaged, you will be able to make a complete recovery. Just fish the lost datafile out of the physical backup, then instruct Oracle to recover it. It will re-apply the transactions made to the lost datafile, using the archived redo logs and then the online redo logs. Good thing you made a physical backup!
What if you didn’t have a physical backup? Depending on what was in the missing datafile, you could lose some transactions. You might have to create an empty database and restore the most recent logical backup, losing all transactions since 11:30 pm last night. You might be fortunate and discover that the missing datafile only had indexes in it, and then be able to just rebuild the indexes, assuming you have some way to reconstruct the index creation SQL (a logical backup can help with this). If the missing datafile contained data, you might do something fancy like restore the logical backup to a temporary database and then somehow transfer just the missing data to the damaged database, although there are some gotchas with that scenario. It really would have been best to have a physical backup with all archivelogs since the last backup.
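If you do have those, the single-datafile happy path can be as simple as this RMAN sketch. File number 7 is a placeholder; v$datafile or the alert log will tell you the real one:

    # RMAN: take the damaged file offline, pull it from the backup, roll it forward, bring it back
    SQL 'ALTER DATABASE DATAFILE 7 OFFLINE';
    RESTORE DATAFILE 7;
    RECOVER DATAFILE 7;
    SQL 'ALTER DATABASE DATAFILE 7 ONLINE';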
Scenario 3: A user error, such as the payroll process accidentally being run twice this morning, at 10:37 am and 11:05 am.
It is now 12:44 pm. This might not be simple to fix. You’ll have some choices to make. How many tables are involved and how much data is modified by a payroll run? Are you knowledgeable enough to manually back out the second payroll run? If the back-out is simple enough and you are confident you can surgically remove only the second run, go for it.
If just one table is affected by the payroll run (probably not, but let’s pretend for the sake of this example), and if you have a logical backup, and if you are willing to lose non-payroll changes to that table since last night, you could restore just that table from last night’s logical backup, then re-run payroll. Once.
If many tables are affected or if it’s not clear how to fix the problem manually, you might find that the best overall solution is to put the database back the way it was at 11:04 am. This is possible only if you have a physical backup and only if you have all the archived redo logs made between the time of the backup and 11:04 am. You will lose all transactions made after 11:04 am. Once the recovery is done, the database will be in the state it was in just before the second payroll run started.
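In RMAN, that point-in-time recovery might look roughly like this. The date is a placeholder, and the database has to be opened with RESETLOGS afterward:

    # RMAN, with the database mounted; the date below is a placeholder for 11:04 am today
    RUN {
      SET UNTIL TIME "TO_DATE('2010-06-01 11:04:00','YYYY-MM-DD HH24:MI:SS')";
      RESTORE DATABASE;
      RECOVER DATABASE;
    }
    ALTER DATABASE OPEN RESETLOGS;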
Scenario 4: A limited-scope data modification error. For example, 20 days after the fact, you realize that 75 rows had erroneously been deleted from a table that changes frequently.
Restoring the database from a physical backup to how it was 20 days ago, and leaving it that way, is probably out of the question, unless you are conducting a forensic analysis of some sort and are restoring to a non-production server for that purpose. But you’ll still need a way to fix your production database.
If you have logical backups, you could restore the table to a different database or a different part of the production database, extract the missing rows from the restored table, and restore them to the deficient production table.
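One way to do that with a Data Pump logical backup is to import last night’s copy of the table into a scratch schema, then copy the missing rows back. A sketch, where every file, object, and column name is a placeholder:

    # Data Pump: pull last night's copy of the table into a scratch schema
    impdp system DIRECTORY=dp_dir DUMPFILE=nightly.dmp TABLES=app.customers REMAP_SCHEMA=app:restore_scratch

    -- SQL*Plus: put back only the rows that are missing from production
    INSERT INTO app.customers
    SELECT s.*
    FROM   restore_scratch.customers s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   app.customers c
                       WHERE  c.customer_id = s.customer_id);
    COMMIT;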
If you don’t have a logical backup, and you really need these rows back, you’ll probably have to use a physical backup, and those are less well-suited to this particular problem. In this case, you could restore the physical backup to another server, extract the missing 75 rows, and restore them to the deficient production table.
As you have seen, your best chance to recover from a wide variety of possible disasters is to have a variety of backup types. Sometimes a blend of backup types can be used to solve a recovery problem. Some very large databases are awkward to back up (or back up regularly), but with our target customer base (which is small to medium sized businesses), most of the databases are small enough to back up within a reasonable amount of time, using more than one backup strategy, and using a reasonable amount of space for the backup files.
Will your backup strategy meet your recovery needs?
Mistake #2: Choosing (or defaulting to) an inappropriate archivelog setting
Let’s say you back up your database nightly. If your database were obliterated in a hardware failure at 3:58 pm, would it be important to recover the changes made to the database since last night? If it’s a training or test database, or a production database that doesn’t change much, perhaps just restoring the database backup from last night would be fine. Otherwise, you probably would be interested in recovering the work that’s been done since last night. (And having users insistently and repeatedly ask you when they’ll get their data back would only serve to increase your interest in completing the recovery.)
Remember the online redo logs from Mistake #1? As we discussed, they are vital to database recoveries. Let’s talk more about how they work.
Databases have more than one redo log. Let’s say that our database has three redo logs (logs 1, 2, and 3). Let’s also say it takes about 30 minutes for a redo log to fill. When log 1 fills with transactional data, it closes, and log 2 begins writing. When log 2 fills, log 3 starts writing. When log 3 fills, log 1 gets overwritten.
Losing the information in log 1 is fine, as long as you don’t need it later. You can get copies of each redo log before it is overwritten. These are called archived redo logs, and they are only generated if the database is running in archivelog mode.
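You can check which mode you’re in, and switch if need be, with a few SQL*Plus commands. A sketch; the switch requires a brief outage while the database is mounted:

    -- SQL*Plus: see whether the database is in ARCHIVELOG or NOARCHIVELOG mode
    SELECT log_mode FROM v$database;

    -- to switch to archivelog mode:
    SHUTDOWN IMMEDIATE
    STARTUP MOUNT
    ALTER DATABASE ARCHIVELOG;
    ALTER DATABASE OPEN;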
Now, back to our 3:58 pm database obliteration scenario. Let’s say your backup was made at 3 am last night; you are running in archivelog mode; and you back up your archive logs every hour, on the hour.
You would restore your backup from 3 am, then “apply” the archived redo logs made since the 3 am backup. Applying redo logs replays the transactions stored in the redo logs, “applying” them to your restored database. This is how you get your database changes back since this morning. You’d be able to replay enough transactions to get your database to the state it was in at 3 pm when you made your hourly archivelog backup. (I’m oversimplifying here to more easily illustrate the concept.)
If you weren’t running in archivelog mode, you’d restore your backup from last night at 3 am, then you’d be done, because it would not be possible to replay the transactions made since 3 am. You would lose all work done since 3 am this morning.
If you run in archivelog mode, some backup types become available to you that otherwise would not be, such as online (“hot”) backups taken while the database is open.
So, rollforward recoveries and more available backup types are the good points about running in archivelog mode.
There are some bad points, too:
• Archivelog mode can make your database run more slowly, although in many cases the difference is undetectable.
• You’ll have to manage the archivelogs – if the file system/drive they are on fills up, the database will hang. You’ll need some mechanism to delete them in a timely (not too early, not too late) fashion; one example is sketched just after this list.
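For example, if RMAN backs up your archivelogs, a periodic housekeeping job along these lines is one way to keep the destination from filling up. The seven-day window is just an example; tie it to your own backup and offsite schedule:

    # RMAN: delete archived logs older than a week that have already been backed up to disk
    DELETE NOPROMPT ARCHIVELOG ALL
      BACKED UP 1 TIMES TO DEVICE TYPE DISK
      COMPLETED BEFORE 'SYSDATE-7';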
But what if you don’t care about getting your transactions back since 3 am? And what if you don’t make a backup type that is capable of replaying transactions and applying them to the backup? In either case, running in archivelog mode will not help you. Don’t bother with it.
Do you know what your archive logging strategy is? Is that strategy serving your business well?
We’ve been working with Oracle long enough to see the same database implementation and administration mistakes being made over and over. Some of the mistakes were made when the database was set up; others are ongoing poor habits. This is the first in a series about Oracle database mistakes that seem more common in the smaller Oracle shops, usually shops that don’t have an experienced Oracle resource overseeing the database. So … let’s get started.
Mistake #1: Not multiplexing online redo logs
When a user saves a transaction to the database, the transaction is written first from database memory to the online redo logs. Online redo logs are simply a series of files that store database change records. At some point later, the change is written from database memory to the “real” database files.
Sometimes the database needs the information in the redo logs, such as when the database crashes before the database files can get written from memory, or when there is a catastrophic loss of one or more datafiles. In either case, the information in the datafiles is transactionally “older” than the information in the redo logs. The database will “replay” the transactions in the redo logs onto the datafiles in order to bring the datafiles back up to date. If you cannot bring the datafiles up to date (in other words, if you’ve lost your redo logs), you will either not be able to open your database or you will have to choose to lose some transactions. That’s how important the online redo logs are! (I’m oversimplifying, to more easily illustrate the point.)
You can configure your redo logs to multiplex (write more than one copy at the same time). This is advisable because, even if one copy gets corrupted or deleted, the other copy might still be OK. Failure to multiplex your online redo logs introduces a critical single point of failure for your database.
A further protection would be to keep the multiplexed copies in separate locations. This would protect you in case a single directory or file system gets damaged or lost; you would still have the redo log copies in the other directory/file system.
Also consider having at least one copy in a different location than your datafiles. That way, even if you lose everything in the datafile location, you’ll still have a copy of the online redo logs to contribute to your rollforward recovery. Without the online redo logs, you might not be able to recover all your transactions.
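Here’s a sketch of what that might look like in SQL*Plus. The group numbers and paths are placeholders; put the new members on a different disk or file system than the originals:

    -- SQL*Plus: see the current redo log groups and members
    SELECT group#, member FROM v$logfile ORDER BY group#;

    -- add a second member to each group, on a separate disk/file system
    ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/orcl/redo01b.log' TO GROUP 1;
    ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/orcl/redo02b.log' TO GROUP 2;
    ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/orcl/redo03b.log' TO GROUP 3;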
Are your online redo logs multiplexed to a good variety of locations?
McNeely Tech sponsored the Dallas Oracle Users Group meeting last week, which featured a presentation by Oracle Corporation about its souped-up Exadata product.
The first generation of Exadata was a combination of an HP hardware/operating system and Oracle software, each modified to work well with each other. That Exadata box pumped out some impressive performance stats.
The next generation, not surprisingly, is on Sun hardware, with a customized operating system and storage, and with modified Oracle software. This critter screams!
Have you seen a picture of the new Exadata box yet? It’s got a handsome enclosure, but it’s got a big “X” on the front. “X” to me means, it’s broken, don’t use it! Funny/odd.
“X” or not, an amazing machine.
I attended the Hotsos Symposium last week, which, as usual, was an excellent conference. It’s a small conference, relative to OpenWorld and Collaborate, yet it’s still sophisticated and prestigious enough to attract some of the best minds in the Oracle industry.
Great location – the Omni Mandalay. Great staff, great food, beautiful venue.
Every year at the Symposium, there’s an evening theme party. This year was a 1970s/disco theme. Last year, I went overboard and bought a fancy pirate outfit for the “Pirate” party, so this year, I took it easy on the old wallet and simply wore some jeans, a tie-dye shirt, and some platform sandals that I already owned. Add a bright headband with peace symbols on it, which I bought on eBay just for this party, and I was good to go! (What does it mean that items from my current wardrobe fit seamlessly into an occasion for which we are supposed to dress ridiculously? That MIGHT mean that I am NOT a slave to fashion! Can I catch at least a little break, given that I’m a tech geek?)
Good to see my “geek buddies”, too. Every year, I meet a few more of them at the Symposium, and it is the only time I get to see many of them.
I even ran into one of the two people who first taught me Oracle, all those years ago. (Great to see you! You look serene and well, my friend.)
It’s fortunate to have this gem of an event right here in Dallas, every year. On top of that, there is a vibrant local Oracle community within the Dallas Oracle Users Group.
Dallas is a good place to be an Oracle professional.
Mary Elizabeth McNeely
An oldie, but goodie, from the McNeely Tech archives:
It was a sickening moment.
“We should have listened to you, Mary Elizabeth,” the client said. He didn’t look well.
I had been pestering this client for quite a while to remedy their inadequate database disk mirroring and backup/recovery practices.
A few hours ago, his production database had been obliterated in a hardware failure. He tried but failed to recover any recent backup. Finally, he called to see if we could find any way to salvage recent transactions. Unfortunately, we couldn’t. The damage to the database was too extensive, and, additionally, there were no usable recent backups. The client hadn’t been recover-testing his backups, so he didn’t realize he had a problem until he tried to use them in an emergency. He had no other option than to recover an older backup.
The client was in the embarrassing position of having to ask partner organizations for re-sends of transactions made since the last good backup. Word eventually seeped out onto the street that a grave technical failure had occurred at his company.
It was no salve to my ego to have been proven right – yes, the client should have fixed their inadequate practices. But it was painful to me to even watch a situation like this unfold.
Please don’t break my heart by being the next one to call me with an unsalvageable loss. When was the last time you tested your backups?
This month marks the anniversary of McNeely Technology Solutions’ corporate formation.
Thank you to the Dallas database community (and some out of state clients, too) for five great years! We look forward to working with you for many more happy years.