Introduction
As organizations increasingly rely on cloud storage for their data, migrating to Amazon S3 has become a popular choice. S3 provides scalability, durability, and cost-effectiveness, making it ideal for storing big data. When migrating data to S3 and monitoring it afterward, choosing the right strategies and tools is crucial. This blog explores various S3 migration strategies and effective monitoring solutions, with a particular focus on understanding your usage and gaining insight into your big data.
Migration Strategies
When migrating data to Amazon S3, organizations typically consider several strategies based on their specific needs and the volume of data. Here are three primary methods:
AWS DataSync: Simplified Data Transfer
AWS DataSync is a managed data transfer service that simplifies secure and efficient data movement between on-premises storage, AWS storage services, and other cloud storage systems. It supports various storage types, including file systems (e.g., NFS, SMB, HDFS), object storage, AWS services (like Amazon S3 and Amazon FSx), and third-party cloud providers. DataSync also enables edge storage transfers with devices like AWS Snowball Edge.
For a complete list of supported storage services, see the AWS DataSync documentation.
Modes of Operation
- Basic Mode is ideal for smaller data transfers but comes with certain limitations, especially regarding the types of objects that can be transferred. It may not support files or objects that exceed certain size or format constraints, and there are limits on the number of files, objects, and directories that can be transferred. You can find more details on these limitations in the AWS DataSync documentation.
- Enhanced Mode is recommended for larger data transfers, as it is designed to optimize data transfer speeds over the network. However, it can only be used for transfers between Amazon S3 buckets. For more information on the pricing differences between the two modes, visit the AWS DataSync Pricing page.
Choosing the Right Mode
When selecting the right mode for your use case, refer to the AWS DataSync Task Mode Comparison to help you choose between Basic and Enhanced modes based on your workload size and requirements.
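If you manage DataSync through the API rather than the console, the task mode is set when the task is created. Below is a minimal boto3 sketch; the location ARNs are placeholders, and Enhanced mode assumes both locations are Amazon S3:

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder location ARNs -- in Enhanced mode both locations must be S3.
response = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:eu-west-1:111122223333:location/loc-source",
    DestinationLocationArn="arn:aws:datasync:eu-west-1:111122223333:location/loc-dest",
    Name="s3-migration-task",
    TaskMode="ENHANCED",  # or "BASIC" for smaller transfers
)
print(response["TaskArn"])
```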
Logging and monitoring
CloudWatch Logs can be configured to monitor task executions, but they do not give you a granular, object-level breakdown of what exactly was transferred. If you encounter an issue, you will most likely need help from AWS Support to understand the root cause and find a permanent solution; there is no easy way to track down what was transferred and whether everything was fully migrated. AWS DataSync Logging Documentation
S3 Batch Operations
S3 Batch Operations is a powerful feature that allows organizations to perform bulk operations on millions or billions of S3 objects with a single request. This capability is beneficial for managing large datasets where manual handling of individual objects is impractical. It supports operations such as copying, tagging, or deleting large sets of S3 objects, automates repetitive tasks to save time and reduce manual effort, and operates in the background so that other workflows can continue uninterrupted. This makes it highly efficient for big data environments and ideal for large-scale object management scenarios. AWS Batch Operations Documentation
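To make this concrete, here is a hedged boto3 sketch of creating a Batch Operations copy job from a CSV manifest; the account ID, ARNs, and ETag are placeholders you would replace with your own values:

```python
import boto3

s3control = boto3.client("s3control")

# All identifiers below are placeholders for illustration.
response = s3control.create_job(
    AccountId="111122223333",
    Operation={"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-bucket"}},
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv",
            "ETag": "manifest-object-etag",
        },
    },
    # The completion report is what provides the object-level logs
    # discussed in the logging section below.
    Report={
        "Bucket": "arn:aws:s3:::report-bucket",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-reports",
        "ReportScope": "AllTasks",
    },
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/batch-operations-role",
    ConfirmationRequired=False,
)
print(response["JobId"])
```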
Comparison with DataSync Basic Mode
- In DataSync Basic Mode, you must segregate your data to avoid hitting service limits, or request a quota increase from AWS Support.
- S3 Batch Operations eliminates the need for data segregation, simplifying bulk object management. AWS Creation of Batch Operations Job
Logging and Monitoring
S3 Batch Operations provides detailed, object-level logs, making it easy to track:
- Which objects were successfully processed
- Errors or issues encountered during operations. AWS S3 Batch Operations Logging
In contrast, DataSync logs are less descriptive, often requiring assistance from AWS Support to diagnose and resolve problems. AWS DataSync Logging
Insights on Your S3 Buckets
Once your data is migrated to S3, monitoring is essential to ensure optimal performance and cost management. AWS provides various ways to monitor S3 buckets, including:
S3 metrics monitoring
The simplest way to monitor your overall bucket size is with the AWS-provided metrics in CloudWatch, where you can see the total bucket size and the number of objects. If you need insight into the size of a specific prefix, however, you will need additional tooling. Keep in mind that these storage metrics are only reported about once a day and can lag behind the actual state, which may become an issue if you have a requirement to monitor in real time. AWS S3 Metrics and Dimensions
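For example, the daily bucket-size metric can be pulled programmatically. A minimal boto3 sketch, assuming a hypothetical bucket named `my-bucket`:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# BucketSizeBytes is a daily storage metric, so query a multi-day window.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-bucket"},  # placeholder name
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=3),
    EndTime=datetime.utcnow(),
    Period=86400,  # one datapoint per day
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```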
S3 Storage Lens
S3 Storage Lens offers a comprehensive view of storage usage and activity trends across your S3 buckets. However, it’s important to note its limitation: if a prefix uses less than 1% of total storage, it may not be shown in the analysis. Despite this limitation, Storage Lens provides valuable insights into usage patterns and can help organizations identify underutilized resources, thereby optimizing storage costs. AWS S3 Storage Lens
When configuring Storage Lens, you can filter which buckets and Regions you wish to include in the dashboard. There is a metrics selection where you choose between `free metrics` and `advanced metrics and recommendations`. The advanced metrics are paid; for more information on pricing, check AWS S3 Management & Insights. The free tier covers usage metrics such as storage bytes and object counts, while the advanced tier adds options such as activity metrics, prefix aggregation, and Storage Lens groups.
There is also the possibility of exporting the metrics to an S3 destination, which can be combined with analytics tools for an even better understanding. A general Storage Lens dashboard breaks usage down by Accounts, AWS Regions, Storage classes, Buckets, Prefixes, and Storage Lens groups; clicking on any of these options gives you more in-depth information for that specific selection. You can also filter within an existing dashboard, which is very helpful when it covers several buckets or Regions.
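If you prefer to manage the dashboard as code rather than through the console, here is a hedged boto3 sketch of a Storage Lens configuration that enables the paid activity metrics, prefix aggregation (covered in more detail below), and a CSV export; the account ID and bucket ARN are placeholders:

```python
import boto3

s3control = boto3.client("s3control")

s3control.put_storage_lens_configuration(
    ConfigId="my-dashboard",
    AccountId="111122223333",  # placeholder account ID
    StorageLensConfiguration={
        "Id": "my-dashboard",
        "IsEnabled": True,
        "AccountLevel": {
            "ActivityMetrics": {"IsEnabled": True},  # paid tier
            "BucketLevel": {
                "ActivityMetrics": {"IsEnabled": True},
                "PrefixLevel": {
                    "StorageMetrics": {
                        "IsEnabled": True,
                        # Depth 1-10; prefixes below 1% of storage are dropped.
                        "SelectionCriteria": {
                            "MaxDepth": 5,
                            "MinStorageBytesPercentage": 1.0,
                        },
                    }
                },
            },
        },
        "DataExport": {
            "S3BucketDestination": {
                "Format": "CSV",
                "OutputSchemaVersion": "V_1",
                "AccountId": "111122223333",
                "Arn": "arn:aws:s3:::storage-lens-exports",  # placeholder
                "Prefix": "lens",
            }
        },
    },
)
```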
Storage Lens supports prefix aggregation, but this comes with limits: you can aggregate prefixes at a depth of 1 to 10, but only prefixes that account for at least 1% of the overall bucket storage will be shown.
This means that if you have one very large prefix and a few that are much smaller, the smaller ones may not appear in the dashboard at all. Below are two examples of that:
In the first example, I have three prefixes in my bucket, but two of them are significantly smaller than the third. Due to the limitations explained earlier, only the largest prefix is visible in the dashboard.
In the second example, the bucket contains three prefixes of the same size, and in this scenario all three appear in the dashboard.
Storage Lens Groups eliminate this limitation of the built-in prefix aggregation, as we'll see in more detail in the next section.
S3 Storage Lens Groups
S3 Storage Lens Groups are a feature that lets you filter your S3 Storage Lens dashboard using criteria such as object prefixes, suffixes, tags, age, and size, so you can narrow the dashboard down to exactly the information you need.
After a group is created, it needs to be integrated into your existing dashboard; this can be done in the metrics selection of your Storage Lens dashboard, under advanced metrics and recommendations.
You can create several different Storage Lens Groups and integrate them into the same dashboard for visibility.
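For example, a group tracking objects under two hypothetical prefixes could be created with boto3 like this (a sketch; the account ID and prefixes are placeholders):

```python
import boto3

s3control = boto3.client("s3control")

# Group every object whose key starts with one of these prefixes.
s3control.create_storage_lens_group(
    AccountId="111122223333",  # placeholder account ID
    StorageLensGroup={
        "Name": "logs-and-backups",
        "Filter": {"MatchAnyPrefix": ["logs/", "backups/"]},
    },
)
```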
S3 Inventory and Amazon Athena
S3 Inventory allows users to generate detailed reports on their objects and metadata at regular intervals. By integrating S3 Inventory with Amazon Athena, organizations can execute SQL queries on these reports, enabling in-depth analysis of their S3 data. This combination is especially valuable for auditing and analyzing data usage and compliance.
Inventory can be configured in the Management section of the specific bucket. During configuration, you can define prefixes, versions, report details, encryption settings, and metadata fields. It is essential to include the Size metadata, as it will be needed for Athena queries later.
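The same configuration can also be created via the API. Here is a minimal boto3 sketch with placeholder bucket names that exports daily Parquet reports and includes the Size field:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="my-bucket",  # placeholder source bucket
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # Size is required for the Athena size queries later on.
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-destination",  # placeholder
                "Format": "Parquet",
                "Prefix": "inventory",
            }
        },
    },
)
```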
This approach is more complex, as it requires a working knowledge of Athena. When setting up Athena, you need to create custom tables, schemas, and workgroups. AWS Inventory Documentation & AWS Athena Documentation
Once the inventory results are populated, you can set up your Athena resources. Workgroups in Athena are used to separate workloads; within a workgroup you can also specify a query result location, which gives you better insight into the queries that have been run.
Once your workgroup is created, you'll need to create a database and table. This can be done via SQL queries or through AWS Glue.
The table must then be created with the appropriate columns and formats. It's important to match the formats to your inventory output (CSV, ORC, or Parquet); otherwise, the queries will fail.
The exact column definitions and formats will depend on your own inventory configuration. You can find more information regarding the configuration here – AWS Athena Getting Started
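To give a concrete idea, here is a hedged sketch of creating such a table through the Athena API for a Parquet-format inventory; the database, workgroup, and S3 location are hypothetical and must match your own setup:

```python
import boto3

athena = boto3.client("athena")

# DDL for a Parquet-format inventory; names and location are placeholders.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS s3_inventory (
    bucket string,
    key string,
    size bigint,
    last_modified_date timestamp,
    storage_class string
)
STORED AS PARQUET
LOCATION 's3://inventory-destination/inventory/my-bucket/daily-inventory/hive/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "s3_inventory_db"},
    WorkGroup="s3-inventory-workgroup",
)
```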
Once the Athena resources are set up, we can start running queries to gather the information we need.
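For example, a query along these lines (run against the hypothetical table above) sums object counts and sizes per top-level prefix:

```python
import boto3

athena = boto3.client("athena")

# Aggregate object count and total size per top-level prefix.
query = """
SELECT split_part(key, '/', 1) AS prefix,
       count(*) AS object_count,
       sum(size) AS total_bytes
FROM s3_inventory
GROUP BY split_part(key, '/', 1)
ORDER BY total_bytes DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "s3_inventory_db"},
    WorkGroup="s3-inventory-workgroup",
)
```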
The queries can be adapted based on demand and the information you're trying to gather. Keep in mind that every query run produces new files in the output location, so it is a good idea to add lifecycle policies to keep it clean. If you need better visualization on top of this setup, it can also be integrated with Amazon QuickSight.
Below you can see an example of such a dashboard (keep in mind that in my example the folders have the same size and number of files):
More information about getting started with the setup can be found in the AWS documentation – AWS QuickSight Getting Started
S3 Storage Class Analysis and Amazon QuickSight
For organizations looking to derive actionable insights from their S3 data, integrating S3 Storage Class Analysis with Amazon QuickSight is a powerful option. Storage Class Analysis helps analyze access patterns and optimize storage configurations, while QuickSight provides a robust platform for visualizing data insights. This combination allows businesses to make data-driven decisions based on analysis of their S3 storage.
To create a storage class analysis configuration, go to the `Metrics` section of your bucket and click on `Create analytics configuration`.
You need to specify whether you would like to limit it to a specific prefix and whether you want to export the generated CSV file.
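The same configuration can be created programmatically as well. A minimal boto3 sketch with placeholder bucket names, limiting the analysis to a hypothetical `logs/` prefix and exporting the daily CSV:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_analytics_configuration(
    Bucket="my-bucket",  # placeholder source bucket
    Id="logs-analysis",
    AnalyticsConfiguration={
        "Id": "logs-analysis",
        "Filter": {"Prefix": "logs/"},  # hypothetical prefix to analyze
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::analytics-exports",  # placeholder
                        "Prefix": "storage-class-analysis",
                    }
                },
            }
        },
    },
)
```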
Afterward, the generated report can be visualized in AWS QuickSight. More details on how to configure a dashboard and the different options available can be found in the AWS S3 QuickSight Analyses Example.
This is how an example dashboard can look when using the data from S3 Storage Class Analysis in QuickSight. In this example the data is shown together, but you can have separate storage class analysis configurations for each of the prefixes you would like to visualize. The dashboard is completely customizable, so it can have a different layout.
Conclusion
Migrating to Amazon S3 offers numerous advantages, especially for organizations handling large volumes of data. Choosing the right migration strategy, whether AWS DataSync or S3 Batch Operations, is crucial for a successful transition. Additionally, utilizing monitoring tools like S3 Storage Lens, S3 Inventory with Athena, and S3 Storage Class Analysis with QuickSight enables organizations to gain valuable insights into their data usage and optimize their S3 environment effectively.
For visibility into client usage, Storage Lens, Inventory with Athena, and Storage Class Analysis with QuickSight provide different levels of insight and visualization, allowing businesses to fine-tune their S3 configurations. As businesses continue to embrace big data, leveraging the full capabilities of Amazon S3 will be instrumental in driving efficiency and innovation. For more detailed insights on managing and analyzing S3 buckets, check out the AWS Storage Blog.