
Ultimate Guide: Backup & Restore a Corrupted etcd Service with S3

30.01.2025
Last Updated: 30.01.2025


In modern distributed systems, etcd plays a crucial role as a reliable and fast key-value store that serves as the backbone for storing critical configuration and state data. From Kubernetes to other large-scale systems, etcd often acts as the “heart” that ensures clusters operate smoothly.

But what happens if this vital database is compromised, deleted, or corrupted? Data loss in etcd can lead to severe disruptions, loss of state, or even complete service outages. This is why proper backup management and recovery are essential for administrators and engineers alike.

To ensure your etcd backups are secure and accessible, storing them in an S3 bucket is a reliable option. S3 provides durability, availability, and the ability to automate backup uploads.

At a high level, the workflow is:

  • Install and configure the AWS CLI.
  • Create a bucket to store your backups. You can do this via the AWS Management Console or the CLI.
  • After creating a backup with etcdctl, upload it to the bucket.
  • Use a cron job or a script to automate regular backups and uploads to S3 (see the sketch after this list).
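
As a rough sketch of the bucket creation and scheduling steps (the bucket name, region, script path, and log location below are placeholders rather than values from this guide), this could look like:

# Create the backup bucket (name and region are hypothetical)
aws s3 mb s3://my-etcd-backups --region eu-central-1

# /etc/cron.d/etcd-backup -- run a daily backup at 02:00.
# The script is assumed to wrap the etcdctl snapshot and "aws s3 cp"
# commands shown later in this post.
0 2 * * * root /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1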

Storing etcd Backups in an S3 Bucket

Load the environment variables for etcd:

source /etc/etcd.env

Extract the client endpoints of the etcd cluster. The command below pulls the client URL field from each line of the member list and joins them with commas:

ETCD_ENDPOINTS_FOR_BACKUP=$(ETCDCTL_API=3 etcdctl member list --endpoints $ETCDCTL_ENDPOINTS --cacert $ETCD_TRUSTED_CA_FILE --cert $ETCD_CERT_FILE --key $ETCD_KEY_FILE | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',')

echo "Member list is $ETCD_ENDPOINTS_FOR_BACKUP"

Verify the status of the etcd endpoints

ETCDCTL_API=3 etcdctl endpoint status --endpoints $ETCD_ENDPOINTS_FOR_BACKUP --cacert $ETCD_TRUSTED_CA_FILE --cert $ETCD_CERT_FILE --key $ETCD_KEY_FILE

To create a snapshot of the etcd database, use the following command:

ETCDCTL_API=3 etcdctl --endpoints="$ETCDCTL_ENDPOINTS" \
        --cacert="$ETCD_TRUSTED_CA_FILE" \
        --cert="$ETCD_CERT_FILE" \
        --key="$ETCD_KEY_FILE" \
        snapshot save "$BACKUP_DIR/$PREFIX-$TIMESTAMP.db"

  • --endpoints="$ETCDCTL_ENDPOINTS": Specifies the endpoints of the etcd cluster
  • --cacert="$ETCD_TRUSTED_CA_FILE": Path to the trusted CA certificate for secure communication
  • --cert="$ETCD_CERT_FILE": Path to the client certificate
  • --key="$ETCD_KEY_FILE": Path to the client's private key
  • snapshot save: Saves the current state of the etcd database as a snapshot
  • $BACKUP_DIR/$PREFIX-$TIMESTAMP.db: Specifies the location and naming format for the snapshot file

If your environment variables are set as follows:

BACKUP_DIR=/var/backups/etcd
PREFIX=etcd-backup
TIMESTAMP=$(date +%Y%m%d%H%M%S) 

The snapshot will be saved as:

/var/backups/etcd/etcd-backup-20250115123045.db
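
Before uploading, it can be worth sanity-checking the snapshot file. Assuming the same etcdctl 3.x binary is on the path, a quick check is:

ETCDCTL_API=3 etcdctl snapshot status "$BACKUP_DIR/$PREFIX-$TIMESTAMP.db" --write-out=table

On recent etcd releases the same check is also exposed as etcdutl snapshot status; either way, the output reports the hash, revision, total keys, and size of the snapshot.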

After creating the snapshot, upload it to the S3 bucket:

aws s3 cp $BACKUP_DIR/$PREFIX-$TIMESTAMP.db s3://$S3_BUCKET/$ETCD_PREFIX_ENV/$ETCD_PREFIX_ENV_FOR_SNAPSHOTS/$PREFIX-$TIMESTAMP.db
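
The upload command relies on a few additional variables. Purely as an illustration (none of these values come from this guide), they might be set along these lines:

S3_BUCKET=my-etcd-backups                 # hypothetical bucket name
ETCD_PREFIX_ENV=production                # hypothetical environment prefix
ETCD_PREFIX_ENV_FOR_SNAPSHOTS=snapshots   # hypothetical key prefix for snapshot objects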

To restore from S3, download the snapshot to the local server:

aws s3 cp --profile etcd-backup-restore-s3 s3://$ETCD_S3_BUCKET/$ETCD_PREFIX_ENV/$ETCD_PREFIX_ENV_FOR_SNAPSHOTS/$ETCD_SNAPSHOT etcd-snapshot.db
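
The command above uses a named AWS CLI profile. If it does not exist yet on the restore host, it can be created with the standard AWS CLI prompts (the profile name matches the command above; the credentials you enter are your own):

aws configure --profile etcd-backup-restore-s3
# Prompts for the access key, secret key, default region, and output format
# of an IAM identity with read access to the backup bucket.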

On the node where you will perform the restore, load the etcd environment variables again:

source /etc/etcd.env

Additional Considerations for Restoration

When restoring etcd, the following steps should also be considered:

Stop the etcd Service
Before restoring the snapshot, it’s crucial to stop the etcd service to avoid conflicts during the restore process.

systemctl stop etcd

Rename the Existing Data Directory
It’s a good practice to rename the existing etcd data directory before restoring to avoid any potential data corruption.

mv /var/lib/etcd /var/lib/etcd.copy_$(date +'%Y-%m-%d_%H-%M-%S')

Once the old data is safely renamed, restore the snapshot into the etcd data directory:

ETCDCTL_API=3 etcdctl \
        --data-dir="/var/lib/etcd" \
        snapshot restore --skip-hash-check=true "$ETCD_SNAPSHOT" \
        --name="$ETCD_NAME" \
        --initial-cluster="$ETCD_INITIAL_CLUSTER" \
        --initial-advertise-peer-urls="$ETCD_INITIAL_ADVERTISE_PEER_URLS" \
        --initial-cluster-token="$ETCD_INITIAL_CLUSTER_TOKEN"
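
The restore flags are populated from the variables sourced from /etc/etcd.env, plus the path of the downloaded snapshot. Purely as an illustration (the names, IPs, and token below are made up), they might look like:

ETCD_NAME=etcd-node-1
ETCD_INITIAL_CLUSTER=etcd-node-1=https://10.0.0.1:2380,etcd-node-2=https://10.0.0.2:2380,etcd-node-3=https://10.0.0.3:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.1:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-token
ETCD_SNAPSHOT=etcd-snapshot.db   # local path of the snapshot downloaded from S3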

After the snapshot is restored, restart the etcd service to apply the changes:

systemctl start etcd
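
As a quick sanity check (standard systemd and journald tooling, nothing specific to this setup), confirm that the service came back up and inspect its recent logs:

systemctl status etcd
journalctl -u etcd --since "10 minutes ago"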

If you prefer, you can check the status of the etcd cluster after restoration:

etcdctl endpoint status --write-out=table \
    --endpoints "$ETCDCTL_ENDPOINTS" \
    --cacert="$ETCD_TRUSTED_CA_FILE" \
    --cert="$ETCD_CERT_FILE" \
    --key="$ETCD_KEY_FILE"

Exactly one etcd member should be reported as the leader.

Following these steps, you can successfully backup and restore your etcd service, ensuring the safety and availability of your critical data. Regular backups and a reliable restore procedure are key to maintaining the stability of your distributed systems.

For further automation, consider using Kubernetes Jobs to schedule and manage your etcd backups. This approach allows you to automate the backup process within your Kubernetes environment, ensuring that backups are performed regularly without manual intervention.

For the restore process, you can leverage Ansible roles to streamline and automate the recovery procedure. Using Ansible, you can define a set of tasks for restoring etcd from a snapshot, making the process more efficient and repeatable across different environments.

By automating both backup and restore procedures, you reduce the risk of human error and ensure a more reliable and consistent approach to managing your etcd service.

Test backup and restore regularly: it's essential to verify that these procedures work as expected during an actual disaster recovery scenario. Periodic tests help you identify potential issues before they become critical.
