Are You Sure You Can Recover?

    By: Tim Boles on Feb 13, 2018

    By Tim Boles | Edited by Michael Gangler 

    There have been studies done by various researchers and companies on the causes of data loss. Generally, these lists include hardware failure, human error, software corruption, and computer viruses. The intention of this article is to provide you a guide and thoughts on many situations you might encounter. Since everything depends on your unique system and circumstance, this is just a spring board for you to investigate your own system and solutions.

    Causes of Data Loss
    Hardware failures in real-life

    Let us take into consideration a few hardware failures that I have encountered and some things to consider. 

    A power surge took out the server and hard drives with the binaries of the database software.

    • Do you have copies of those binaries?
    • What patches have been installed since the last backup?

    Cooling system fails for the server room overnight. The room gets so hot that the SAN system crashes. When the SAN system is rebooted the drives that house the redo logs fail to come up.

    • Are all your redo logs on the same SAN?
    • Are all your redo logs on the same disks of the SAN?

    RMAN repository database is corrupted.

    • Did you back up your repository?
    • Do your scripts work without a repository?
    • Do you know how to backup/restore without it?


    Only Human
    The second most prevalent cause of data loss is human error. This makes up approximately one quarter of the reasons for data loss. It does not matter how careful you are, how meticulous, or change management conscious, humans are ultimately that, human, and make mistakes.

    A very stringent change control process can only go so far. I worked on one system that a shell script had been written that would transverse the directory tree and delete files that fell within a category. It worked fine on the test/development system, but when it was executed on the production server it deleted several database files. Now you might say to yourself, “Any database administrator worth their salt should be able to quickly restore those database files and have the system up and running in a few minutes.” What if that data file belonged to a Read-Only tablespace? That can have a lot of implications depending on your backup procedures.

    We could go on all day with different scenarios and your ability to recover from them but that is not the purpose of this article. We want to be sure you are ready to recover. However, to recover you must have a good backup.
     

    Backup Basics
    What to Backup?

    You may think your only job is to be sure that the data files are backed up. But there is so much more that needs to be considered. Don’t work in a bubble. Work with other administrators, application developers, and system engineers to make sure you understand your system architecture and what might need to be restored to make the system pristine again.

    Basic System Overview.jpg

     

    Figure 1 : Basic System Overview


    Get a system overview and then start brain storming all the connected pieces and who is responsible for making sure everything is backed up consistently. Communication is key here. I worked in one environment where the SAs were responsible for non-database file system backups and the DBAs for the database backups. The problem was that the SA’s thought that the Oracle Home would be classified as part of the database backup. When the database server needed to be restored who do you think was held accountable for the missing Oracle Home?

    Backing Up Non-Datafiles
    Here is a list of a few non-Datafiles that might be important to your situation. There might be many others. Do you know how often the following are being backed up, who controls the backups and most importantly how to get them restored?

    • Oracle Software Home (binaries)
    • BFILES
    • Password Files
    • pfiles (spfiles are covered with newer versions)
    • tnsnames.ora
    • listener.ora
    • sqlnet.ora
    • /etc/oratab
    • scripts (shell, sql)
    • Encryption keys (wallets)
    • Application files that need to be in-sync with the database.
    • Oracle Gateway software and configuration
    • OraInventory
    • Cronjobs (or other scheduling software)

    Multitenant Architecture
    The multitenant architecture of Oracle 12c introduces additional considerations. One example is PDBs do not have individual archivelogs. The archivelogs belong to the CDB. It is this type of change requires reviewing and testing of the current backup and recovery procedures to make sure they can still be used.

    Be A Boy Scout
    The Boy Scout motto is “Be Prepared!” That is exactly what every DBA should adhere to when it comes to data safety. When you think about your ability to restore a system are you prepared? What is your source for information to restore your database?

    • Memory / Experience
    • Oracle Documentation / Books
    • Internet Search Engines
    • Co-worker
    • Monitoring Tools (i.e. Oracle Enterprise Manager)

    The problem with those sources is that they may not be available, be out dated, or not relevant to your environment. There is no reason to be racking your brains trying to remember commands, system architecture, or procedures when you are in the middle of a crisis and need to restore a system quickly. You should have backup and recovery documentation detailing your system and another with procedures to address the more common restore needs.

    There are so many good reasons for creating and maintaining accurate documentation. 

    • Not dependent on single person being available
    • Boosts ability to concentrate – not sweating the small stuff
    • Gain experience and knowledge
    • Refine backup / restore procedures

    Backup & Recovery Documentation
    There are two types of B&R documentation. The first is a system overview and the other is recovery support for emergencies.

    System Overview

    • Backup implementation - procedures / rational
      Physical
      Hot / Cold
      Full / Incremental
    • Exports
      Full
      Schema, Table, (Transportable) Tablespace
      Non-datafile backups
      Application tires
    • Scheduling – (who / how / change control)
    • Retention policies (time and Off-site Location)

    Database Specifics 
    Many times when reviewing this type of information for a system I often ask around, ”Why did they setup things like that? ” Often to be answered, “Not sure it was that way when I got here.” That is so frustrating when you see things that seem to be better solutions but must research to find if there was a legitimate reason it was not already implemented. It is important that reasons for decisions be documented if they are more than just technology based. I worked on one system where they had to destroy backups every night because of data retention policies. You should note those policies and any supporting information so any “new” person to the system will have some type of reference. 

    Recovery Support
    A backup is only good if you can use it to restore. The time it takes to restore can be drastically reduced if you have the procedures already in place and know they work. Get some stakeholders together and brain storm some scenarios that might cause data loss on your systems and test those scenarios. Your recovery support documentation for each of these scenarios should include at least the following:

    • Outage Procedures – who is in charge / what groups are involved / steps to take
    • Script listing - location / usage / execution syntax / description
    • Test Documentation
    • Recovery scenarios tested
    • Documented restore procedures 

    Test Documentation
    Take the time and record the steps as you go through a recovery scenario. This helps you learn things you don’t know, know the systems better and builds a blueprint if the scenario ever truly occurs. However, remember things always change and you will never be able to document everything that can go wrong, but you will at least have a better chance of dealing with problems when they do occur.

    Take the time to document it and have another DBA review the general procedures and then the restoration procedures with your instructions. Make sure those procedures are tested periodically and specifically with major version changes. I speak from experience. I created a B&R test documentation on a system and went on to other projects. When I returned to that project a few years later one of the DBAs got onto my case saying that they had problems and attempted to use the documentation to recover and it failed. After talking to her I found three things:

    1. No one had “practiced” with the procedures since I had left
    2. They had upgraded from 9i to 10g while I was gone as well
    3. The commands that worked just fine for 9i had slightly different syntax in 10g

    Remember a system backup is only good if it can be used to restore the system, so test restores periodically with different scenarios. This is always a great project for new hires. Following is an appendix that outlines some brain stormed possible recovery scenarios. Some of them are simple or perhaps not applicable to your environment and I am sure you can think of others that are not on this list. It is only a springboard for your use.

     


     

    Tim Boles is a Senior Manager for Hitachi Consulting. He has over seventeen years experience as a DBA working primary in Oracle production systems. He holds a master’s degree in Computer Science from West Virginia University and an MBA in Information Systems from the University of Phoenix. He can be reached at tim2boles@gmail.com and twitter @timboles_dba.

     

     

     

     

    Appendix - Possible Data Loss Scenarios

    • Media Loss

    – Loss of Controlfile

    – Loss of a data file for a tablespace

    • System, rollback segment, UNDO, user data, index, read-only, partition

    – Loss of Redo Log file

    • Inactive online, current online, archived

    – Loss of entire redo group

    • Inactive online, current online, archived, all redo groups

    – Data Block Corruption

    • Physical, Logical, In backup

    – Loss of a BFILE

    • Recovery of Entire Database

    – Recovery to new file system

    – Point in Time Recovery of entire database

    – Recovery of RMAN catalog

    – Creation of standby database

    – Creation of duplicate database on test system

    – Recovery PDB / CDB / non-CDB

    • What If….

    – Database crashes during backup.

    – Oracle Home is destroyed.

    – Entire database server replaced.

    – SAN loses multiple drives.

    – Database crashes during table movement.

    – Database crashes during use of Flashback Technology

    – Read-Only tablespace was created before last backup.

    – Read-only tablespace was created after last backup

    – Recovery catalog is not available

    • User / Software Error

    – Recovery of dropped schema

    – Recovery of dropped table

    – Large transaction corrupts many tables

    – Data corruption in entire schema

    – Data corruption in schema 5 hours old but all other schemas need to remain the same.

    – Trigger or procedure is recompiled with wrong code

    • Recover previous trigger / procedure code

    • Restore data corrupted by trigger / procedure

     

     

    Are You Sure You Can Recover?

     

    There have been studies done by various researchers and companies on the causes of data loss.  Generally, these lists include hardware failure, human error, software corruption, and computer viruses.  The intention of this article is to provide you a guide and thoughts on many situations you might encounter.  Since everything depends on your unique system and circumstance, this is just a spring board for you to investigate your own system and solutions. 

     

    Causes of Data Loss

     

    Hardware Failures in Real-life

     

    Let us take into consideration a few hardware failures that I have encountered and some things to consider. 

     

    A power surge took out the server and hard drives with the binaries of the database software. 

    ·         Do you have copies of those binaries? 

    ·         What patches have been installed since the last backup?

     

    Cooling system fails for the server room overnight.  The room gets so hot that the SAN system crashes.  When the SAN system is rebooted the drives that house the redo logs fail to come up. 

    ·         Are all your redo logs on the same SAN?

    ·         Are all your redo logs on the same disks of the SAN?

     

    RMAN repository database is corrupted.

    ·         Did you back up your repository?

    ·         Do your scripts work without a repository?

    ·         Do you know how to backup / restore without it?

     

    Only Human

     

    The second most prevalent cause of data loss is human error.  This makes up approximately one quarter of the reasons for data loss.  It does not matter how careful you are, how meticulous, or change management conscious, humans are ultimately that, human, and make mistakes.

     

    A very stringent change control process can only go so far.  I worked on one system that a shell script had been written that would transverse the directory tree and delete files that fell within a category.  It worked fine on the test/development system, but when it was executed on the production server it deleted several database files.  Now you might say to yourself, “Any database administrator worth their salt should be able to quickly restore those database files and have the system up and running in a few minutes.”  What if that data file belonged to a Read-Only tablespace?  That can have a lot of implications depending on your backup procedures.

     

    We could go on all day with different scenarios and your ability to recover from them but that is not the purpose of this article.  We want to be sure you are ready to recover.  However, to recover you must have a good backup.

     

    Backup Basics

     

    What to Backup? 

     

    You may think your only job is to be sure that the data files are backed up.  But there is so much more that needs to be considered.  Don’t work in a bubble. Work with other administrators, application developers, and system engineers to make sure you understand your system architecture and what might need to be restored to make the system pristine again. 

     

    (IMAGE)

     

    Figure 1 : Basic System Overview

     

    Get a system overview and then start brain storming all the connected pieces and who is responsible for making sure everything is backed up consistently.  Communication is key here.  I worked in one environment where the SAs were responsible for non-database file system backups and the DBAs for the database backups.  The problem was that the SA’s thought that the Oracle Home would be classified as part of the database backup. When the database server needed to be restored who do you think was held accountable for the missing Oracle Home? 

     

    Backing Up non-Datafiles

     

    Here is a list of a few non-Datafiles that might be important to your situation.  There might be many others.  Do you know how often the following are being backed up, who controls the backups and most importantly how to get them restored?

     

              Oracle Software Home (binaries)

              BFILES

              Password Files

              pfiles (spfiles are covered with newer versions)

              tnsnames.ora

              listener.ora

              sqlnet.ora

              /etc/oratab

              scripts (shell, sql)

              Encryption keys (wallets)

              Application files that need to be in-sync with the database.

              Oracle Gateway software and configuration

              OraInventory

              Cronjobs (or other scheduling software)

     

    Multitenant Architecture

     

    The multitenant architecture of Oracle 12c introduces additional considerations.  One example is PDBs do not have individual archivelogs. The archivelogs belong to the CDB. It is this type of change requires reviewing and testing of the current backup and recovery procedures to make sure they can still be used.

     

    Be A Boy Scout

     

    The Boy Scout motto is “Be Prepared!”  That is exactly what every DBA should adhere to when it comes to data safety.  When you think about your ability to restore a system are you prepared?  What is your source for information to restore your database? 

     

              Memory / Experience

              Oracle Documentation / Books

              Internet Search Engines

              Co-worker

              Monitoring Tools (i.e. Oracle Enterprise Manager)

     

    The problem with those sources is that they may not be available, be out dated, or not relevant to your environment.   There is no reason to be racking your brains trying to remember commands, system architecture, or procedures when you are in the middle of a crisis and need to restore a system quickly.  You should have backup and recovery documentation detailing your system and another with procedures to address the more common restore needs.

     

    There are so many good reasons for creating and maintaining accurate documentation.

     

              Not dependent on single person being available

              Boosts ability to concentrate – not sweating the small stuff

              Gain experience and knowledge

              Refine backup / restore procedures

     

    Backup & Recovery Documentation

     

    There are two types of B&R documentation.  The first is a system overview and the other is recovery support for emergencies.

     

    System Overview

              Backup implementation - procedures / rational

              Physical

              Hot / Cold

              Full / Incremental

              Exports

              Full

              Schema, Table, (Transportable) Tablespace

              Non-datafile backups

              Application tires

              Scheduling – (who / how / change control)

              Retention policies (time and Off-site Location)

              Database specifics

     

    Many times when reviewing this type of information for a system I often ask around, ”Why did they setup things like that? ”  Often to be answered, “Not sure it was that way when I got here.”   That is so frustrating when you see things that seem to be better solutions but must research to find if there was a legitimate reason it was not already implemented.  It is important that reasons for decisions be documented if they are more than just technology based.  I worked on one system where they had to destroy backups every night because of data retention policies.  You should note those policies and any supporting information so any “new” person to the system will have some type of reference. 

     

    Recovery Support

     

    A backup is only good if you can use it to restore.  The time it takes to restore can be drastically reduced if you have the procedures already in place and know they work.   Get some stakeholders together and brain storm some scenarios that might cause data loss on your systems and test those scenarios.  Your recovery support documentation for each of these scenarios should include at least the following:

     

              Outage Procedures – who is in charge / what groups are involved / steps to take    

              Script listing - location / usage / execution syntax / description

              Test Documentation

              Recovery scenarios tested

              Documented restore procedures

     

    Test Documentation

     

    Take the time and record the steps as you go through a recovery scenario.  This helps you learn things you don’t know, know the systems better and builds a blueprint if the scenario ever truly occurs.  However, remember things always change and you will never be able to document everything that can go wrong, but you will at least have a better chance of dealing with problems when they do occur.

     

    Take the time to document it and have another DBA review the general procedures and then the restoration procedures with your instructions.   Make sure those procedures are tested periodically and specifically with major version changes. I speak from experience.  I created a B&R test documentation on a system and went on to other projects.  When I returned to that project a few years later one of the DBAs got onto my case saying that they had problems and attempted to use the documentation to recover and it failed.  After talking to her I found three things….

    1)    No one had “practiced” with the procedures since I had left

    2)    They had upgraded from 9i to 10g while I was gone as well.

    3)    The commands that worked just fine for 9i had slightly different syntax in 10g.

     

    Remember a system backup is only good if it can be used to restore the system, so test restores periodically with different scenarios.  This is always a great project for new hires. Following is an appendix that outlines some brain stormed possible recovery scenarios.  Some of them are simple or perhaps not applicable to your environment and I am sure you can think of others that are not on this list.  It is only a springboard for your use.  

     

    Appendix - Possible Data Loss Scenarios

     

              Media Loss

             Loss of Controlfile

             Loss of a data file for a tablespace

              System, rollback segment, UNDO, user data, index, read-only, partition

             Loss of Redo Log file

              Inactive online, current online, archived

             Loss of entire redo group

              Inactive online, current online, archived, all redo groups

             Data Block Corruption

              Physical, Logical, In backup

             Loss of a BFILE

              Recovery of Entire Database

             Recovery to new file system

             Point in Time Recovery of entire database

             Recovery of RMAN catalog

             Creation of standby database

             Creation of duplicate database on test system

             Recovery PDB / CDB / non-CDB

              What If….

             Database crashes during backup.

             Oracle Home is destroyed.

             Entire database server replaced.

             SAN loses multiple drives.

             Database crashes during table movement.

             Database crashes during use of Flashback Technology

             Read-Only tablespace was created before last backup.

             Read-only tablespace was created after last backup

             Recovery catalog is not available

              User / Software Error

             Recovery of dropped schema

             Recovery of dropped table

             Large transaction corrupts many tables

             Data corruption in entire schema

             Data corruption in schema 5 hours old but all other schemas need to remain the same.

    Released: February 13, 2018, 4:08 am | Updated: February 13, 2018, 8:50 am
    Keywords: Feature


    Copyright © 2018 Communication Center. All Rights Reserved
    All material, files, logos and trademarks within this site are properties of their respective organizations.
    Terms of Service - Privacy Policy - Contact

    Independent Oracle Users Group
    330 N. Wabash Ave., Suite 2000, Chicago, IL 60611
    phone: 312-245-1579 | email: ioug@ioug.org

    IOUG Logo

    Copyright © 1993-2018 by the Independent Oracle Users Group
    Terms of Use | Privacy Policy