In this article, you will learn about the basics of Paxata backup tasks.
Three components require backup to protect against data loss on the running servers:
- Metadata Storage (MongoDB)
- Data Library Storage (HDFS)
- Properties Files (particularly pes.properties)
Notably, pipeline cache files on executors do not need to be backed up, because a lost cache is recovered automatically by cache retrieval.
Many backup tools exist for each component. This article recommends the most basic tools that can complete the backup task on their own. More advanced tools may offer better reliability and manageability.
Metadata Storage (MongoDB)
mongodump --out /tmp/mongobackup_`date +"%m-%d-%y"`
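The command above can be wrapped in a small helper so the dated directory name is built in one place. This is only a sketch: the function names, the `backup_root` argument, and the `DRY_RUN` switch are illustrative and are not Paxata or MongoDB settings. It assumes `mongodump` is on the PATH and can reach the MongoDB instance with its default connection settings, as in the command above.

```shell
# Sketch only: backup_root and DRY_RUN are illustrative names, not Paxata settings.

# Print the dated dump directory, in the same format as the command above.
mongo_backup_dir() {
  local backup_root="${1:-/tmp}"
  printf '%s/mongobackup_%s\n' "$backup_root" "$(date +"%m-%d-%y")"
}

# Dump the databases into the dated directory.
# With DRY_RUN=1 the command is printed instead of executed.
backup_mongo() {
  local out
  out="$(mongo_backup_dir "${1:-/tmp}")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "mongodump --out $out"
  else
    mongodump --out "$out"
  fi
}
```

Calling `backup_mongo` with no argument writes under /tmp as before. Passing a directory, for example `backup_mongo /backups`, writes the dump there instead, which may be preferable because /tmp is often cleared on reboot.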
Data Library Storage (HDFS)
DistCp lets you copy a directory from HDFS to another cluster or to an S3 bucket.
hadoop distcp hdfs://CDH5-nameservice/user/paxata/library s3a://bucket/librarybackup
Cloudera BDR is an enterprise solution based on DistCp.
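For repeated backups, the DistCp call could be wrapped as sketched below. The `-update` option is a standard DistCp flag that copies only files that are missing or different at the target. Whether incremental copies suit your retention policy is an assumption here, not a Paxata recommendation. The function name and the `DRY_RUN` switch are illustrative, and the default paths are the example paths used above.

```shell
# Sketch only: backup_library and DRY_RUN are illustrative names.
# Arguments: source HDFS path, target path (both default to the example paths above).
backup_library() {
  local src="${1:-hdfs://CDH5-nameservice/user/paxata/library}"
  local dst="${2:-s3a://bucket/librarybackup}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "hadoop distcp -update $src $dst"
  else
    # -update skips files that already exist unchanged at the target.
    hadoop distcp -update "$src" "$dst"
  fi
}
```

Running `DRY_RUN=1 backup_library` prints the command so the paths can be checked before any data is copied.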
Properties Files (particularly pes.properties)
Upload the files from the server's local file system to an S3 bucket. Run the command from the directory that contains the properties files:
aws s3 sync . s3://bucket/propertybackup
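Because `aws s3 sync .` uploads everything in the current directory, a slightly more defensive variant is sketched below. It takes the properties directory as an argument, since its location is not stated in this article. It checks that pes.properties is present and uploads only `*.properties` files by using the `--exclude` and `--include` filters of `aws s3 sync`. The function name and the `DRY_RUN` switch are illustrative.

```shell
# Sketch only: backup_properties and DRY_RUN are illustrative names.
# Arguments: directory holding pes.properties, S3 target (defaults to the example above).
backup_properties() {
  local props_dir="$1"
  local dest="${2:-s3://bucket/propertybackup}"
  # Refuse to run if the main properties file is not where we expect it.
  if [ ! -f "$props_dir/pes.properties" ]; then
    echo "pes.properties not found in $props_dir" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "aws s3 sync $props_dir $dest --exclude '*' --include '*.properties'"
  else
    # Exclude everything, then re-include only the properties files.
    aws s3 sync "$props_dir" "$dest" --exclude "*" --include "*.properties"
  fi
}
```

The filters are applied in order, so excluding everything first and then including `*.properties` limits the upload to the properties files.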