Amazon Glacier: a multithreaded, multipart upload/download client in Perl
Amazon Glacier
In short: Amazon Glacier is a storage service with a very attractive price, created for storing files and backups. However, the process of retrieving files is quite difficult and/or expensive. Still, the service is well suited for secondary backups.
More about Glacier has already been written on Habr.
About the post
I want to share an open source Perl client for synchronizing a local directory with Glacier, tell about some nuances of working with Glacier, and describe how the client works.
Functionality
So, mtglacier. The link is on GitHub.
Software features:
- The AWS protocol is implemented independently, without third-party libraries: own implementation of the Amazon Signature Version 4 signing process and of the Tree Hash calculation (a sketch of the Tree Hash follows this list)
- Multipart upload is implemented
- Multi-threaded upload/download/delete of archives
- Multithreading and multipart upload are combined, i.e. three threads can upload parts of file A, two can upload parts of file B, one can initiate the upload of another file, and so on
- A local journal file (opened for writing only in append mode); after all, getting a listing of files from Glacier means waiting about four hours
- Ability to compare checksums of local files against the journal
- When downloading files, the ability to limit the number of files retrieved at a time (needed because of the peculiarities of how archive retrieval is billed)
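Since the Tree Hash comes up several times below, here is a minimal sketch of that algorithm in Perl, using only the core Digest::SHA module (which the client itself uses). The function name and command-line handling are illustrative, not taken from mtglacier.

    use strict;
    use warnings;
    use Digest::SHA qw(sha256);

    # Tree Hash sketch: SHA-256 of every 1 MiB chunk, then pairwise hashing of
    # concatenated digests, level by level, up to a single root hash.
    sub tree_hash_hex {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        my @level;
        while (read($fh, my $chunk, 1024 * 1024)) {
            push @level, sha256($chunk);              # leaf hashes
        }
        close $fh;
        @level = (sha256('')) unless @level;          # empty-file edge case
        while (@level > 1) {
            my @next;
            while (my @pair = splice @level, 0, 2) {
                # an unpaired last hash is carried up to the next level unchanged
                push @next, @pair == 2 ? sha256($pair[0] . $pair[1]) : $pair[0];
            }
            @level = @next;
        }
        return unpack 'H*', $level[0];                # hex string
    }

    print tree_hash_hex(shift @ARGV), "\n";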
Project goal
To implement four things in one program (each of them exists separately elsewhere, but not all together):
- Implementation in Perl. I think the language/technology a program is written in matters to the end user/administrator, so it is better to have a choice of implementations in different languages
- Amazon S3 support is definitely planned
- Multipart operations + multi-threaded operations. Multipart upload helps avoid the situation where you have uploaded several gigabytes to a remote server and the connection suddenly breaks; multithreading speeds up the upload, and also speeds up uploading heaps of small files or deleting large numbers of files
- Own implementation of the protocol, planned so that the code can be reused and published as separate modules on CPAN
How it works
When you sync files to the service, mtglacier creates a journal (a text file) that records every upload operation: for each transaction, the name of the local file, the upload time, the Tree Hash, and the archive_id of the resulting archive.
When restoring files from Glacier to the local disk, the data needed for recovery is taken from this journal (because a listing of the files in Glacier can only be obtained with a delay of four hours or more).
When files are deleted from Glacier, deletion records are appended to the journal. On the next sync to Glacier, only the files that are not yet in the journal are processed.
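As an illustration of the journal idea, here is a minimal sketch of an append-only record writer. The tab-separated layout, the field order, and the CREATED marker are illustrative assumptions, not the client's actual journal format.

    use strict;
    use warnings;

    # Append one upload record to the journal. The journal is only ever opened
    # in append mode; the field layout below is a made-up example.
    sub journal_append_created {
        my ($journal, %rec) = @_;
        open my $fh, '>>', $journal or die "open $journal: $!";
        print {$fh} join("\t",
            time(), 'CREATED',
            @rec{qw(archive_id tree_hash filename)}), "\n";
        close $fh or die "close $journal: $!";
    }

    journal_append_created('journal.log',
        archive_id => 'EXAMPLE-ARCHIVE-ID',
        tree_hash  => '0beec7b5...',        # hex Tree Hash of the uploaded file
        filename   => 'data/backup/file.bin',
    );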
The recovery procedure is a two-pass one:
- Create a job to retrieve the files that are present in the journal but missing from the local disk
- After waiting four hours or more, re-run the command to download these files
Warnings
- Before use, please read the pricing on the Amazon website
- Study the Amazon FAQ
- Before "playing" with Glacier, make sure you will be able to delete all the files in the end. The thing is, neither individual files nor a non-empty vault can be deleted through the Amazon Console
- When working with Glacier in several threads, you need to pick the optimal number of threads. Amazon returns HTTP 408 Timeout when the upload in a particular thread is too slow (after that, mtglacier pauses the thread for one second and retries, but no more than five times; see the sketch below). So a multi-threaded upload can turn out slower than a single-threaded one
No, really, be sure to check the prices.
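To make the retry behaviour concrete, here is a sketch of how a single worker could handle HTTP 408 with LWP::UserAgent: pause one second and retry, giving up after five attempts. The helper name and the bare GET request are placeholders, not mtglacier code.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    # Retry a request on HTTP 408 (Request Timeout): sleep one second between
    # attempts, give up after $max_attempts tries.
    sub request_with_retry {
        my ($ua, $request, $max_attempts) = @_;
        $max_attempts ||= 5;
        my $response;
        for my $attempt (1 .. $max_attempts) {
            $response = $ua->request($request);
            return $response unless $response->code == 408;
            sleep 1;
        }
        return $response;    # still a 408 after all attempts
    }

    my $ua = LWP::UserAgent->new;
    my $resp = request_with_retry(
        $ua, HTTP::Request->new(GET => 'https://glacier.us-east-1.amazonaws.com/'), 5);
    print $resp->status_line, "\n";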
How to use
- Create a Vault in the Amazon Console
- Create a glacier.cfg, specifying the same region in which you created the vault:
key=YOURKEY
secret=YOURSECRET
region=us-east-1
- Synchronize the local directory to Glacier. Use the concurrency option to specify the number of threads:
./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3
- You can add more files and sync again
- Check file integrity (the check uses only the journal):
./mtglacier.pl check-local-hash --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
- Delete some of the files from /data/backup
- Create a data retrieval job. Use the max-number-of-files option to set the number of archives you want to restore. At the moment it is not recommended to specify more than a few dozen (only one page of the current jobs list is fetched so far):
./mtglacier.pl restore --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --max-number-of-files=10
- Wait 4 hours or more
- Restore the deleted files:
./mtglacier.pl restore-completed --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
- If the backup is no longer needed, remove all the files from Glacier:
./mtglacier.pl purge-vault --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
Implementation
- HTTP/HTTPS operations are implemented through LWP::UserAgent
- Interaction with the Amazon API is written from scratch; only the Digest::SHA module (a core module) is used
- Own implementation of the Amazon Tree Hash (it is their own checksum algorithm; I found similar ones among TTH algorithms, please correct me if I am wrong)
- Multi-threading is implemented by forking processes
- Process aborts are handled with signals
- Job queues are custom OOP objects, FSM-like
- Inter-process communication uses unnamed pipes (a sketch follows this list)
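To show the overall shape of this design, here is a minimal sketch of fork-based workers talking to the parent over unnamed pipes. The worker count, the task format, and the fake work inside the loop are assumptions for illustration; the real client additionally keeps job queues as FSM-like objects and handles signals.

    use strict;
    use warnings;
    use IO::Handle;

    my $concurrency = 3;
    my @tasks = map { "task-$_" } 1 .. 10;

    # One pair of unnamed pipes per worker: parent -> child for tasks,
    # child -> parent for results.
    my @workers;
    for (1 .. $concurrency) {
        pipe(my $task_r, my $task_w)     or die "pipe: $!";
        pipe(my $result_r, my $result_w) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                              # child: drain the task pipe
            close $task_w; close $result_r;
            $result_w->autoflush(1);
            while (my $task = <$task_r>) {
                chomp $task;
                # ... real work would go here: upload a part, delete an archive ...
                print {$result_w} "done $task\n";
            }
            exit 0;
        }
        close $task_r; close $result_w;               # parent keeps the other ends
        push @workers, { pid => $pid, task_w => $task_w, result_r => $result_r };
    }

    # Dispatch tasks round-robin, then close the task pipes: EOF tells each
    # worker it can finish after draining its queue.
    my $i = 0;
    for my $task (@tasks) {
        my $w = $workers[ $i++ % $concurrency ];
        print { $w->{task_w} } "$task\n";
    }
    for my $w (@workers) {
        my $task_w = $w->{task_w};
        close $task_w;
    }

    # Collect results and reap the workers.
    for my $w (@workers) {
        my $result_r = $w->{result_r};
        while (my $line = <$result_r>) { print $line; }
        waitpid $w->{pid}, 0;
    }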
What is still missing
The main goal for the first beta was a stable, working version (it is not an alpha), ready by the end of the week, so quite a lot is still missing.
Definitely planned:
- Adjustable chunk size
- Multipart download
- An SNS notification topic in the config
- Integration with the outside world for receiving a signal via SNS notifications
- Internal refactoring
- Unit tests will be published once I put them in order
- There will be another test suite, probably a Glacier server emulator
- And, of course, a production-ready version (not beta)