Amazon Glacier: a multithreaded, multipart client in Perl


Amazon Glacier


In short — Amazon Glacier is a storage service with a very attractive price, designed for storing files and backups. The process of retrieving files, however, is quite difficult and/or expensive. Still, the service is well suited for secondary backups.
Glacier has already been covered in more detail on Habr.

About the post


I want to share an open-source Perl client for synchronizing a local directory with Glacier, mention some nuances of working with the service, and describe how the client works.

Functionality


So, mtglacier. Link on GitHub.
Software features:
  • The AWS protocol is implemented independently, without third-party libraries: both the Amazon Signature Version 4 signing process and the Tree Hash calculation are done from scratch
  • Multipart upload is implemented
  • Multithreaded upload/download/delete of archives
  • Multithreaded and multipart upload are combined, i.e. three threads can upload parts of file A, two more parts of file B, one can initiate the upload of yet another file, and so on
  • A local journal file (opened for writing only in append mode). After all, getting a listing of files from Glacier takes four hours of waiting
  • The ability to compare checksums of local files against the journal
  • When downloading, the ability to limit the number of files retrieved at a time (needed because of the peculiarities of billing for archive downloads)
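The Tree Hash mentioned above is a documented part of the Glacier API: the payload is split into 1 MiB chunks, each chunk is hashed with SHA-256, and the digests are then combined pairwise, level by level, until a single root digest remains. A minimal sketch of the algorithm (in Python for illustration — mtglacier itself does this in Perl with Digest::SHA):

```python
import hashlib

CHUNK = 1024 * 1024  # Glacier hashes the payload in 1 MiB chunks


def tree_hash(data: bytes, chunk_size: int = CHUNK) -> str:
    """Amazon Glacier tree hash: SHA-256 each chunk, then hash the
    concatenation of adjacent digest pairs until one root remains."""
    # Leaf level: a SHA-256 digest for every chunk (empty input -> one empty chunk)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        # Combine pairwise; an odd digest is promoted unchanged to the next level
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

For a payload of one chunk or less the tree hash degenerates to a plain SHA-256 of the data.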


Project goal


To implement four things in one program (each exists separately somewhere, but not all together):

  1. Implementation in Perl — I think the language/technology a program is written in does matter to the end user/administrator. So it is better to have a choice of implementations in different languages
  2. Planned support for Amazon S3
  3. Multipart + multithreaded operations.
    Multipart helps avoid the situation where you have uploaded several gigabytes to a remote server and the connection suddenly breaks. Multithreading speeds up the upload, and also speeds up uploading heaps of small files or deleting large numbers of files
  4. An independent protocol implementation, which is planned to be made reusable and published as separate modules on CPAN
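As a reference point for that independent protocol implementation: the Signature Version 4 signing process derives its signing key as a fixed chain of HMAC-SHA256 steps over the date, region, service name, and a terminator string. This is the documented AWS scheme, sketched here in Python for illustration:

```python
import hashlib
import hmac


def sigv4_signing_key(secret: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS Signature Version 4 signing key: a chain of
    HMAC-SHA256 steps, each keyed by the previous step's digest."""
    key = ("AWS4" + secret).encode()
    for part in (date, region, service, "aws4_request"):
        key = hmac.new(key, part.encode(), hashlib.sha256).digest()
    return key
    # The request signature is then HMAC-SHA256(signing_key, string_to_sign).
```

For Glacier the service name is `glacier` and the date is the request date in `YYYYMMDD` form.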


How it works



When you sync files to the service, mtglacier creates a journal (a text file) that records every file upload operation: for each transaction it stores the name of the local file, the upload time, the Tree Hash, and the archive_id received for the file.

When restoring files from Glacier to the local disk, the recovery data is taken from this journal (because a file listing from Glacier can only be obtained with a delay of four hours or more).

When deleting files from Glacier, deletion records are appended to the journal. On the next sync to Glacier, only the files that, according to the journal, are not there yet get processed.
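As a rough illustration of the append-only journal idea — the field layout below is hypothetical, mtglacier's actual on-disk format may differ — each upload could be recorded like this:

```python
import time


def journal_append(path: str, filename: str, tree_hash: str, archive_id: str) -> None:
    """Append one upload record to the journal (hypothetical layout:
    time, operation, archive_id, tree hash, local file name)."""
    line = "\t".join([str(int(time.time())), "CREATED", archive_id, tree_hash, filename])
    # Append mode only: existing records are never rewritten or reordered
    with open(path, "a") as fh:
        fh.write(line + "\n")
```

Because records are only ever appended, an interrupted run leaves earlier records intact, and deletions can be recorded as later lines rather than by rewriting the file.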

Restoring files is a two-pass procedure:
  1. Create a job to retrieve the files that are present in the journal but missing from the local disk
  2. After waiting four hours or more, re-run the command to download these files


Warnings


  • Before use, please check the prices on the Amazon website
  • Study the Amazon FAQ
  • No, really, be sure to check the prices
  • Before "playing" with Glacier, make sure you will be able to delete all the files afterwards. The thing is, you cannot delete files, or a non-empty vault, through the Amazon Console
  • When working with Glacier in multiple threads you need to pick the optimal number of threads. Amazon returns HTTP 408 Timeout when the upload in any particular thread is too slow (after that, mtglacier pauses the thread for one second and retries, but no more than five times). So a multithreaded upload can turn out slower than a single-threaded one
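The 408-retry behaviour described above — pause one second, retry the same request, give up after five timeouts — can be sketched like this (a Python illustration of the policy, not mtglacier's actual code; `do_request` stands for one part-upload attempt and returns an HTTP status):

```python
import time


def with_retries(do_request, max_retries=5, pause=1.0, sleep=time.sleep):
    """Retry a request on HTTP 408 Timeout, pausing between attempts;
    any other status is passed through to the caller immediately."""
    timeouts = 0
    while True:
        status = do_request()
        if status != 408:
            return status          # success or a non-timeout error: caller decides
        timeouts += 1
        if timeouts >= max_retries:
            return status          # give up after five timeouts in a row
        sleep(pause)               # back off before retrying this part
```

The `sleep` parameter is injected only so the policy can be tested without real delays.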


How to use


  1. Create a Vault in the Amazon Console
  2. Create a glacier.cfg (specify the same region in which you created the vault):
    key=YOURKEY
    secret=YOURSECRET
    region=us-east-1

  3. Synchronize a local directory to Glacier. Use the concurrency option to set the number of threads:
    ./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3

  4. You can add files and sync again
  5. Check file integrity (checked against the journal only):
    ./mtglacier.pl check-local-hash --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log

  6. Remove some of the files from /data/backup
  7. Create a data-recovery job. Use the max-number-of-files option to set the number of archives you want to restore. For now it is not recommended to specify more than a few dozen (the client does not yet fetch more than one page of the current jobs listing):
    ./mtglacier.pl restore --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --max-number-of-files=10

  8. Wait four hours or more
  9. Restore the deleted files:
    ./mtglacier.pl restore-completed --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log

  10. If the backup is no longer needed, delete all the files from Glacier:
    ./mtglacier.pl purge-vault --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log



Implementation


  • HTTP/HTTPS operations are implemented via LWP::UserAgent
  • Interaction with the Amazon API is written from scratch; the only module used is Digest::SHA (a core module)
  • An independent implementation of the Amazon Tree Hash (it's their own checksum algorithm; I found similar ones among TTH algorithms, please correct me if I'm wrong)
  • Multithreading is implemented by forking processes
  • Process aborts are handled with signals
  • Inter-process communication uses unnamed pipes
  • Job queues are custom OOP objects, FSM-like
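The fork-and-pipes scheme above can be sketched as follows (Python here for illustration — mtglacier does the equivalent in Perl; POSIX only, since it relies on `fork`). The parent forks one child per task and reads each result back over an unnamed pipe:

```python
import os


def run_in_children(tasks, worker):
    """Run worker(task) in a forked child per task; collect results
    from the parent over one unnamed pipe per child."""
    results, children = [], []
    for task in tasks:
        r, w = os.pipe()                 # unnamed pipe: child writes, parent reads
        pid = os.fork()
        if pid == 0:                     # child process: do the work, report, exit
            os.close(r)
            os.write(w, worker(task).encode())
            os.close(w)
            os._exit(0)
        os.close(w)                      # parent keeps only the read end
        children.append((pid, r))
    for pid, r in children:
        with os.fdopen(r) as fh:
            results.append(fh.read())    # EOF when the child closes its write end
        os.waitpid(pid, 0)               # reap the child so no zombies remain
    return results
```

In a real worker pool (as in mtglacier) the children would be long-lived and take jobs from a queue rather than exiting after one task; this sketch only shows the pipe plumbing.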

What is still missing


The main goal for the first beta was a stably working version (it is a beta, not an alpha) ready by the end of the week, so a lot is still missing.
Definitely coming:
  • A configurable chunk size
  • Multipart download
  • An SNS notification topic in the config
  • Integration with the outside world for receiving signals via SNS notifications
  • Internal refactoring
  • Unit tests will be published once I put them in order
  • Another test suite, probably a Glacier server emulator
  • And, of course, a production-ready (non-beta) version
Article based on information from habrahabr.ru
