Database Recovery Procedure for an Enterprise Site
Overview and Purpose
This document provides a standard operating procedure for recovering a Site database that has been lost or corrupted within a larger Enterprise system environment. The core recovery strategy does not involve a traditional backup restoration. Instead, it leverages the distributed nature of the Enterprise architecture to recreate the database structure locally and then initiate a "sync down" process. This procedure repopulates the new Site database with the authoritative master data from the central root system.
The following sections provide a structured, three-phase guide to execute this recovery, ensuring a systematic and verifiable restoration of service.
Phase 1: Initial Recovery Actions
This first phase covers the two mandatory actions required to trigger the automated recovery and synchronization process. Performing these steps in the correct sequence is critical for signaling to the Enterprise system that the affected Site is prepared to receive data from the root.
1. Recreate the Schema: The first step is to recreate the database schema for the affected Site. This step ensures a pristine database environment, free from any residual data, corruption, or schema drift that may have contributed to the failure.
2. Restart the Gateway: Immediately after the schema is recreated, you must restart the Gateway (GW) on the affected Site. This restart forces the Gateway service to re-evaluate its environment. Upon discovering the newly created, empty database, it reports its 'uninitialized' state to the Enterprise root system, which serves as the trigger for the sync-down process.
Together, these two actions signal to the Enterprise root system that the Site has a clean, empty database and is ready to be repopulated. Once both steps have completed successfully, proceed to the verification phase to confirm that data synchronization has begun.
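The ordering requirement in this phase can be sketched as a small wrapper that refuses to restart the Gateway unless schema recreation has finished without error. This is an illustrative sketch only: the function name run_initial_recovery and the two callables are hypothetical placeholders, because this procedure does not specify the actual commands used to recreate the schema or restart the Gateway.

```python
from typing import Callable, List


def run_initial_recovery(
    recreate_schema: Callable[[], None],
    restart_gateway: Callable[[], None],
) -> List[str]:
    """Run the two Phase 1 actions strictly in order.

    Both callables are hypothetical stand-ins for whatever site-specific
    commands perform each action. If recreate_schema raises, the exception
    propagates and the Gateway is never restarted.
    """
    completed: List[str] = []

    # Step 1: recreate the schema so the Site database is clean and empty.
    recreate_schema()
    completed.append("schema_recreated")

    # Step 2: restart the Gateway only after the schema exists, so that it
    # discovers the empty database and reports its uninitialized state.
    restart_gateway()
    completed.append("gateway_restarted")

    return completed
```

Passing the actions in as callables keeps the sketch independent of any particular database or service manager; an operator would substitute the real commands for their environment.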
Phase 2: Verifying Data Synchronization
Before waiting on the recovery, verify that data synchronization has actually started. This phase confirms that the recovery is underway and lets you monitor its progress by checking for specific communication logs between the root system and the affected Site. The "sync down" is the mechanism by which the central root system sends all necessary data down to the newly prepared Site database.
To confirm that this process is working correctly, check the logs on both the root system and the affected Site for the following messages.
Synchronization Confirmation Logs
| System Location | Log Message to Verify |
| Root System | Applying received changelogs from [site-name]-gw |
| Affected Site | Requesting changes for [X] objects from [enterprise-name]-gw |
The Requesting changes message from the Site confirms it is actively polling the root for data, acting as the client in this recovery. Conversely, the Applying received changelogs message on the root confirms it is responding to the Site's request and pushing the necessary data, acting as the server. Observing both messages indicates a healthy, bidirectional handshake. Once both are confirmed and data transfer is active, the recovery is progressing as expected.
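The log check above can be automated with a short script. The sketch below assumes that both logs are plain text, that each message appears on a single line exactly as worded in the table, and that [X] is an integer; the function name sync_confirmed and the sample timestamps are hypothetical and not part of any product API.

```python
import re
from typing import Iterable


def sync_confirmed(
    root_lines: Iterable[str],
    site_lines: Iterable[str],
    site_name: str,
    enterprise_name: str,
) -> bool:
    """Return True only if both confirmation messages are present.

    Assumes the message wording from the table in this procedure; the
    surrounding log format (timestamps, levels) is not assumed.
    """
    # Message expected in the root system's log.
    root_message = f"Applying received changelogs from {site_name}-gw"

    # Message expected in the affected Site's log; [X] is matched as digits.
    site_pattern = re.compile(
        r"Requesting changes for \d+ objects from "
        + re.escape(enterprise_name)
        + r"-gw"
    )

    root_ok = any(root_message in line for line in root_lines)
    site_ok = any(site_pattern.search(line) for line in site_lines)
    return root_ok and site_ok
```

In practice the two iterables would be open file handles for the respective log files; the function returns False if either side of the handshake is missing.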
Phase 3: Finalizing the Recovery
This section outlines the final stage of the recovery process. It involves confirming the completion of the data synchronization and performing a final recommended action to ensure long-term system stability.
The initial bulk synchronization may involve thousands of objects. Completion is indicated when the object count reported in the Site's Requesting changes messages drops to a low, steady state, reflecting only routine, real-time changes rather than the primary recovery load. The log messages will persist, but the object count within them will stabilize in the single or low double digits.
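One way to make the "low, steady state" criterion concrete is to extract the object counts from the Site log and require that the most recent few all fall below a threshold. The sketch below assumes the count appears as an integer in the Requesting changes message; the threshold of 20 and the window of 5 consecutive messages are illustrative assumptions, not values defined by this procedure, and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Sequence

# Matches the object count in the Site-side message from Phase 2.
COUNT_RE = re.compile(r"Requesting changes for (\d+) objects")


def object_counts(site_lines: Iterable[str]) -> List[int]:
    """Extract object counts, in log order, from Site log lines."""
    counts: List[int] = []
    for line in site_lines:
        match = COUNT_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def sync_stabilized(
    counts: Sequence[int],
    threshold: int = 20,  # assumed upper bound for "low double digits"
    window: int = 5,      # assumed number of consecutive low readings
) -> bool:
    """Return True if the last `window` counts are all at or below `threshold`."""
    if len(counts) < window:
        return False
    return all(count <= threshold for count in counts[-window:])
```

Requiring several consecutive low readings, rather than a single one, guards against declaring completion during a brief lull in the bulk transfer.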
Final Recommended Action: Gateway Restart
Once the sync count has stabilized, perform one final restart of the Site's Gateway.
This final restart is a recommended best practice: it ensures all services and connections are re-initialized against the fully populated database. It also purges any transient states or cached configurations held in memory during the high-volume data transfer, preventing potential post-recovery inconsistencies.
Following these three phases will successfully restore a lost Site database and return it to normal, healthy operation within the Enterprise environment.