Exporting Amazon RDS Snapshots to S3: Extracting single table data

21.11.2024

Amazon Relational Database Service (RDS) has become a cornerstone for many organizations looking to manage their relational databases in the cloud. One powerful feature of RDS is the ability to take snapshots, which serve as point-in-time backups of your database instances. However, did you know that you can export these RDS snapshots directly to an Amazon S3 bucket? In this blog post, we’ll explore the value of exporting RDS snapshots to S3 and how to use Python to extract data from specific databases, schemas, or tables, converting it to SQL or CSV format for further analysis.

By default, restoring a database from a snapshot means creating a new cluster from it, which is often more than you actually need. What if you only want part of the data from the snapshot? You could create a new cluster, connect to it, and dump what you need, but there is a lighter alternative: exporting the snapshot to S3.

Exporting RDS snapshots to S3 is straightforward. Here’s a quick overview of the steps involved:

If you haven’t already, create an S3 bucket where you want to store your RDS snapshots

Create an IAM role with permissions to allow RDS to write to your S3 bucket

Use the AWS Management Console, AWS CLI, or SDKs to export your RDS snapshot to the S3 bucket

aws rds start-export-task \
    --export-task-identifier my-snapshot-export \
    --source-arn arn:aws:rds:AWS_Region:123456789012:snapshot:snapshot-name \
    --s3-bucket-name amzn-s3-demo-bucket \
    --iam-role-arn iam-role \
    --export-only database.schema.table \
    --kms-key-id my-key

The export task supports partial database exports via the `--export-only` flag, followed by the database, schema, or table we want to export. More details are available in the AWS documentation.

(Screenshot: an example IAM role for Amazon RDS export with its attached policies)

The export does take time: in my test scenarios it took about 30-40 minutes, since RDS needs to spin up a whole cluster behind the scenes and perform the database dump.
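While you wait, you can check on the task from the CLI. A minimal status check, assuming the my-snapshot-export identifier from the command above (the --query expression just picks the status field out of the response):

aws rds describe-export-tasks \
    --export-task-identifier my-snapshot-export \
    --query 'ExportTasks[0].Status'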

Once the task completes, you can find the extracted data in the bucket, stored in Apache Parquet format. We can use a simple Python script to convert it into SQL statements.
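To work with the files locally, first sync the export prefix from the bucket. The exact key layout (the export identifier as a prefix, then database/schema/table folders containing the part files) is easiest to confirm in the S3 console, so the path below is only an illustration:

aws s3 sync s3://amzn-s3-demo-bucket/my-snapshot-export/ ./export/

Then set up a virtual environment and install the dependencies: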

python3 -m venv .venv
source .venv/bin/activate
pip install pandas pyarrow

import argparse

import pandas as pd


def parquet_to_sql(parquet_file, table_name, output_file):
    # Read the Parquet file into a DataFrame
    df = pd.read_parquet(parquet_file)

    # Get the column names from the DataFrame
    columns = df.columns.tolist()

    # Create a list to store SQL insert statements
    sql_statements = []

    # Generate an SQL INSERT statement for each row
    for index, row in df.iterrows():
        # Quote strings (escaping embedded single quotes), map missing values
        # to NULL, and pass everything else through unchanged
        values = [
            "'{}'".format(str(val).replace("'", "''")) if isinstance(val, str)
            else 'NULL' if pd.isna(val)
            else str(val)
            for val in row
        ]
        insert_statement = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({', '.join(values)});"
        sql_statements.append(insert_statement)

    # Write the SQL statements to the output file
    with open(output_file, 'w') as f:
        for statement in sql_statements:
            f.write(statement + '\n')

    print(f"SQL statements saved to {output_file}")


def main():
    # Set up the argument parser
    parser = argparse.ArgumentParser(description='Convert Parquet to SQL insert statements.')
    parser.add_argument('parquet_file', type=str, help='Path to the Parquet file')
    parser.add_argument('table_name', type=str, help='Name of the target SQL table')
    parser.add_argument('--output', type=str, default='output.sql', help='Output SQL file (default: output.sql)')

    # Parse command-line arguments
    args = parser.parse_args()

    # Convert the Parquet file to SQL
    parquet_to_sql(args.parquet_file, args.table_name, args.output)


if __name__ == "__main__":
    main()

Usage:

python main.py --output MY_OUTPUT.sql MY_EXPORT.gz.parquet TABLE_NAME

Example:

python main.py --output senders.sql part-00000-657cc6f6-1bc3-466f-9e2d-433e98af2b6f-c000.gz.parquet senders
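The introduction also mentioned CSV: since the export is plain Parquet, pandas can write CSV directly without going through SQL at all. A small sketch reusing the file from the example above (the output name is arbitrary):

import pandas as pd

df = pd.read_parquet('part-00000-657cc6f6-1bc3-466f-9e2d-433e98af2b6f-c000.gz.parquet')
df.to_csv('senders.csv', index=False)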

Exporting Amazon RDS snapshots to S3 can be a highly effective way to access and analyze data without the overhead of creating a new RDS cluster. By exporting directly to S3 and leveraging a simple Python script, you can selectively extract tables or specific data from snapshots for tasks like data warehousing, auditing, and archiving.

Always ensure that your S3 bucket is configured with strict access policies, especially if your snapshots contain sensitive information. Use AWS KMS to encrypt data in transit and at rest.

Enhance the Python script with error-handling mechanisms, including checks for data compatibility and logging, to make it more robust in production environments.

Be aware of potential S3 storage costs for large snapshots and consider setting up lifecycle policies to transition data to cheaper storage classes if long-term storage is required.

Use AWS Lambda and Step Functions to automate snapshot exports and conversions on a schedule, making data extraction streamlined and minimizing manual intervention.

If working with large datasets, consider optimizing the Python conversion script to handle larger Parquet files, for example by processing them in batches as sketched below.
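One way to do that, using the pyarrow library installed earlier, is to stream the file in record batches instead of loading it all into one DataFrame. A minimal sketch of that approach, reusing the same value formatting as the script above; the function name and batch size are just placeholders:

import pandas as pd
import pyarrow.parquet as pq

def parquet_to_sql_batched(parquet_path, table_name, output_file, batch_size=10000):
    # Open the Parquet file without reading it all into memory
    pf = pq.ParquetFile(parquet_path)
    with open(output_file, 'w') as f:
        # Process the file one record batch at a time
        for batch in pf.iter_batches(batch_size=batch_size):
            df = batch.to_pandas()
            columns = ', '.join(df.columns)
            for _, row in df.iterrows():
                values = ', '.join(
                    "'{}'".format(str(v).replace("'", "''")) if isinstance(v, str)
                    else 'NULL' if pd.isna(v)
                    else str(v)
                    for v in row
                )
                f.write(f"INSERT INTO {table_name} ({columns}) VALUES ({values});\n")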

By integrating these enhancements, you can establish a scalable, secure, and efficient workflow for exporting and processing RDS snapshots directly in S3, ready for analytics and integration into your broader data strategy.
