Amazon Glacier: a multithreaded, multipart client in Perl


Amazon Glacier


In short — Amazon Glacier is a storage service with a very attractive price, designed for storing files and backups. The process of retrieving files, however, is quite difficult and/or expensive. Still, the service is well suited for secondary backups.
Glacier has already been covered in more detail on Habr.

About the post


I want to share an open-source Perl client for synchronizing a local directory with Glacier, mention some nuances of working with the service, and describe how the client works.

Functionality


So, mtglacier. Link on GitHub.
Software features:
  • The AWS protocol is implemented independently, without third-party libraries: both the Amazon Signature Version 4 signing process and the Tree Hash calculation are done from scratch
  • Multipart upload is implemented
  • Multithreaded upload/download/delete of archives
  • Multithreaded and multipart upload are combined, i.e. three threads can upload parts of file A, two more parts of file B, one can initiate the upload of yet another file, and so on
  • A local journal file (opened for writing only in append mode). After all, getting a listing of files from Glacier takes four hours of waiting
  • The ability to compare checksums of local files against the journal
  • When downloading, the ability to limit the number of files retrieved at a time (needed because of the peculiarities of billing for archive downloads)
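The Tree Hash mentioned above is a documented part of the Glacier API: the payload is split into 1 MiB chunks, each chunk is hashed with SHA-256, and the digests are then combined pairwise, level by level, until a single root digest remains. A minimal sketch of the algorithm (in Python for illustration — mtglacier itself does this in Perl with Digest::SHA):

```python
import hashlib

CHUNK = 1024 * 1024  # Glacier hashes the payload in 1 MiB chunks


def tree_hash(data: bytes, chunk_size: int = CHUNK) -> str:
    """Amazon Glacier tree hash: SHA-256 each chunk, then hash the
    concatenation of adjacent digest pairs until one root remains."""
    # Leaf level: a SHA-256 digest for every chunk (empty input -> one empty chunk)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        # Combine pairwise; an odd digest is promoted unchanged to the next level
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

For a payload of one chunk or less the tree hash degenerates to a plain SHA-256 of the data.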


Project goal


To implement four things in one program (each exists separately somewhere, but not all together):

  1. Implementation in Perl — I think the language/technology a program is written in does matter to the end user/administrator. So it is better to have a choice of implementations in different languages
  2. Planned support for Amazon S3
  3. Multipart + multithreaded operations.
    Multipart helps avoid the situation where you have uploaded several gigabytes to a remote server and the connection suddenly breaks. Multithreading speeds up the upload, and also speeds up uploading heaps of small files or deleting large numbers of files
  4. An independent protocol implementation, which is planned to be made reusable and published as separate modules on CPAN
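As a reference point for that independent protocol implementation: the Signature Version 4 signing process derives its signing key as a fixed chain of HMAC-SHA256 steps over the date, region, service name, and a terminator string. This is the documented AWS scheme, sketched here in Python for illustration:

```python
import hashlib
import hmac


def sigv4_signing_key(secret: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS Signature Version 4 signing key: a chain of
    HMAC-SHA256 steps, each keyed by the previous step's digest."""
    key = ("AWS4" + secret).encode()
    for part in (date, region, service, "aws4_request"):
        key = hmac.new(key, part.encode(), hashlib.sha256).digest()
    return key
    # The request signature is then HMAC-SHA256(signing_key, string_to_sign).
```

For Glacier the service name is `glacier` and the date is the request date in `YYYYMMDD` form.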


How it works



When you sync files to the service, mtglacier creates a journal (a text file) that records every file upload operation: for each transaction it stores the name of the local file, the upload time, the Tree Hash, and the archive_id received for the file.

When restoring files from Glacier to the local disk, the recovery data is taken from this journal (because a file listing from Glacier can only be obtained with a delay of four hours or more).

When deleting files from Glacier, deletion records are appended to the journal. On the next sync to Glacier, only the files that, according to the journal, are not there yet get processed.
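As a rough illustration of the append-only journal idea — the field layout below is hypothetical, mtglacier's actual on-disk format may differ — each upload could be recorded like this:

```python
import time


def journal_append(path: str, filename: str, tree_hash: str, archive_id: str) -> None:
    """Append one upload record to the journal (hypothetical layout:
    time, operation, archive_id, tree hash, local file name)."""
    line = "\t".join([str(int(time.time())), "CREATED", archive_id, tree_hash, filename])
    # Append mode only: existing records are never rewritten or reordered
    with open(path, "a") as fh:
        fh.write(line + "\n")
```

Because records are only ever appended, an interrupted run leaves earlier records intact, and deletions can be recorded as later lines rather than by rewriting the file.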

Restoring files is a two-pass procedure:
  1. Create a job to retrieve the files that are present in the journal but missing from the local disk
  2. After waiting four hours or more, re-run the command to download these files


Warnings


  • Before use, please check the prices on the Amazon website
  • Study the Amazon FAQ
  • No, really, be sure to check the prices
  • Before "playing" with Glacier, make sure you will be able to delete all the files afterwards. The thing is, you cannot delete files, or a non-empty vault, through the Amazon Console
  • When working with Glacier in multiple threads you need to pick the optimal number of threads. Amazon returns HTTP 408 Timeout when the upload in any particular thread is too slow (after that, mtglacier pauses the thread for one second and retries, but no more than five times). So a multithreaded upload can turn out slower than a single-threaded one
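The 408-retry behaviour described above — pause one second, retry the same request, give up after five timeouts — can be sketched like this (a Python illustration of the policy, not mtglacier's actual code; `do_request` stands for one part-upload attempt and returns an HTTP status):

```python
import time


def with_retries(do_request, max_retries=5, pause=1.0, sleep=time.sleep):
    """Retry a request on HTTP 408 Timeout, pausing between attempts;
    any other status is passed through to the caller immediately."""
    timeouts = 0
    while True:
        status = do_request()
        if status != 408:
            return status          # success or a non-timeout error: caller decides
        timeouts += 1
        if timeouts >= max_retries:
            return status          # give up after five timeouts in a row
        sleep(pause)               # back off before retrying this part
```

The `sleep` parameter is injected only so the policy can be tested without real delays.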


How to use


  1. Create a Vault in the Amazon Console
  2. Create a glacier.cfg (specify the same region in which you created the vault):
    key=YOURKEY
    secret=YOURSECRET
    region=us-east-1

  3. Synchronize a local directory to Glacier. Use the concurrency option to set the number of threads:
    ./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3

  4. You can add files and sync again
  5. Check file integrity (checked against the journal only):
    ./mtglacier.pl check-local-hash --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log

  6. Remove some of the files from /data/backup
  7. Create a data-recovery job. Use the max-number-of-files option to set the number of archives you want to restore. For now it is not recommended to specify more than a few dozen (the client does not yet fetch more than one page of the current jobs listing):
    ./mtglacier.pl restore --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --max-number-of-files=10

  8. Wait four hours or more
  9. Restore the deleted files:
    ./mtglacier.pl restore-completed --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log

  10. If the backup is no longer needed, delete all the files from Glacier:
    ./mtglacier.pl purge-vault --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log



Implementation


  • HTTP/HTTPS operations are implemented via LWP::UserAgent
  • Interaction with the Amazon API is written from scratch; the only module used is Digest::SHA (a core module)
  • An independent implementation of the Amazon Tree Hash (it's their own checksum algorithm; I found similar ones among TTH algorithms, please correct me if I'm wrong)
  • Multithreading is implemented by forking processes
  • Process aborts are handled with signals
  • Inter-process communication uses unnamed pipes
  • Job queues are custom OOP objects, FSM-like
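The fork-and-pipes scheme above can be sketched as follows (Python here for illustration — mtglacier does the equivalent in Perl; POSIX only, since it relies on `fork`). The parent forks one child per task and reads each result back over an unnamed pipe:

```python
import os


def run_in_children(tasks, worker):
    """Run worker(task) in a forked child per task; collect results
    from the parent over one unnamed pipe per child."""
    results, children = [], []
    for task in tasks:
        r, w = os.pipe()                 # unnamed pipe: child writes, parent reads
        pid = os.fork()
        if pid == 0:                     # child process: do the work, report, exit
            os.close(r)
            os.write(w, worker(task).encode())
            os.close(w)
            os._exit(0)
        os.close(w)                      # parent keeps only the read end
        children.append((pid, r))
    for pid, r in children:
        with os.fdopen(r) as fh:
            results.append(fh.read())    # EOF when the child closes its write end
        os.waitpid(pid, 0)               # reap the child so no zombies remain
    return results
```

In a real worker pool (as in mtglacier) the children would be long-lived and take jobs from a queue rather than exiting after one task; this sketch only shows the pipe plumbing.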

What is still missing


The main goal for the first beta was a stably working version (it is a beta, not an alpha) ready by the end of the week, so a lot is still missing.
Definitely coming:
  • A configurable chunk size
  • Multipart download
  • An SNS notification topic in the config
  • Integration with the outside world for receiving signals via SNS notifications
  • Internal refactoring
  • Unit tests will be published once I put them in order
  • Another test suite, probably a Glacier server emulator
  • And, of course, a production-ready (non-beta) version
Article based on information from habrahabr.ru
