HDFSFileTransfer
| Developer(s) | Daniel Kozlowski |
|---|---|
| Stable release | 1.0 / August 31, 2014 |
| Written in | Bash |
| Engine | |
| Operating system | Linux |
| License | MIT License, Boost Software License |
| Website | http://sourceforge.net/projects/hdfsfiletransfer/ |
HDFSFileTransfer is a tool written in Bash for quickly transferring files into HDFS. It can copy files within a single physical machine as well as between two machines. It is meant to copy files from the file system of a Linux machine that has an HDFS cluster installed into another HDFS cluster. For example, given two single-node Hadoop clusters installed on two different Linux machines, the script can copy files from one machine to the other. HDFSFileTransfer is dual-licensed under the MIT License and the Boost Software License, and its source code is freely available.
Features
* Local Copy
The script runs on the same machine on which HDFS is installed. It does the following:
- Copy files from the local file system
- Paste the files into HDFS
- Archive the copied files on the local file system under the archive folder
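The three local-copy steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `copy_and_archive` and its three arguments are hypothetical, and it assumes the `hdfs` command is on the `PATH`.

```bash
#!/usr/bin/env bash
# Sketch of one local-copy pass: put each file into HDFS, then archive it.
# copy_and_archive and its arguments are hypothetical names for illustration.

copy_and_archive() {
    local src_dir="$1" hdfs_dir="$2" archive_dir="$3"
    local file

    mkdir -p "$archive_dir" || return 1

    for file in "$src_dir"/*; do
        # Skip sub-directories and the literal pattern left when nothing matches.
        [ -f "$file" ] || continue

        # Copy the file from the local file system into HDFS.
        hdfs dfs -put "$file" "$hdfs_dir" || return 1

        # Move the copied file into the local archive folder.
        mv "$file" "$archive_dir"/ || return 1
    done
}
```

A call might look like `copy_and_archive /data/incoming /user/hduser/temp /data/incoming/archive`, where all three paths are placeholders.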
* Copy From One Machine To Another With Different HDFS Cluster
The script can copy files from the local file system of a Linux machine that has an HDFS cluster installed into another HDFS cluster. It is not, however, meant to copy directly from one HDFS cluster to another. For example, given two single-node Hadoop installations on two different Linux machines (the author set this up using virtual machines), the script can copy files from one machine to the other.
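The article does not say how the files reach the second machine. One possible approach, shown here purely as an assumption, is to stream each local file over SSH into `hdfs dfs -put -`, which reads the file content from standard input. The function name `copy_to_remote_hdfs`, the host, and the paths are hypothetical; the sketch also assumes key-based SSH login and that `hdfs` is on the remote user's `PATH`.

```bash
#!/usr/bin/env bash
# Sketch: send one local file into the HDFS cluster of another machine.
# This is an assumed mechanism, not necessarily what HDFSFileTransfer does.

copy_to_remote_hdfs() {
    local file="$1" remote="$2" hdfs_dir="$3"
    local name

    name=$(basename "$file")

    # "-put -" tells hdfs to read the data from standard input, so the
    # file never has to be staged on the remote local file system.
    # The single quotes protect the path on the remote shell; they would
    # break for names that themselves contain a single quote.
    ssh "$remote" "hdfs dfs -put - '$hdfs_dir/$name'" < "$file"
}
```

For example, `copy_to_remote_hdfs /data/incoming/data.txt hduser@node2 /user/hduser/temp` would write `/user/hduser/temp/data.txt` on the cluster reachable from `node2`.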
* Email errors encountered during the process
The process includes several validations. If a validation fails, the process is given an error status. The error handling then:
- terminates the transfer process
- sends an email containing the error message to a user/group of people
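A minimal sketch of such an error handler is shown below. The names `fail`, `LOG_FILE` and `MAIL_TO`, the default values, and the subject line are all hypothetical; the sketch assumes a working `mail` (mailx) command on the machine.

```bash
#!/usr/bin/env bash
# Sketch of an error handler: log the error, email it, stop the transfer.
# LOG_FILE and MAIL_TO are hypothetical settings with placeholder defaults.

LOG_FILE="${LOG_FILE:-/tmp/hdfs_transfer.log}"
MAIL_TO="${MAIL_TO:-hadoop-admins@example.com}"

fail() {
    local message="$1"

    # Record the error with a timestamp in the log file.
    printf '%s ERROR %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$message" >> "$LOG_FILE"

    # Email the error message to the user or group address.
    printf '%s\n' "$message" | mail -s "HDFS transfer failed" "$MAIL_TO"

    # Terminate the transfer process with a non-zero status.
    exit 1
}
```

A validation could then be written as `[ -d "$dir" ] || fail "Local folder not found: $dir"`.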
Diagram Process
The process is broken down into two parts:
- initial validations - three checks that run at the very beginning, each time the script starts up
- a 'while' loop - all of the copy and archive steps; because these run in a loop the script can run as a daemon, but there is also an option to run the process on demand
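The two-part structure can be sketched as a small driver function. Everything here is illustrative: `run_transfer`, its `once`/`daemon` mode argument, and the polling interval are assumptions, and `run_initial_validations` and `transfer_once` are empty stand-ins for the real checks and the real copy-and-archive pass.

```bash
#!/usr/bin/env bash
# Sketch of the overall flow: validate once at start-up, then loop.
# The two functions below are stand-ins to be replaced with real logic.

run_initial_validations() { return 0; }
transfer_once() { return 0; }

run_transfer() {
    local mode="${1:-daemon}" interval="${2:-60}"

    # Part 1: the checks that run each time the script starts.
    run_initial_validations || return 1

    # Part 2: the copy-and-archive loop.
    while true; do
        transfer_once || return 1

        # In "once" mode, stop after a single pass instead of polling.
        if [ "$mode" = "once" ]; then
            break
        fi

        sleep "$interval"
    done
}
```

Calling `run_transfer once` performs a single pass, while `run_transfer daemon 30` keeps polling every 30 seconds until a pass fails.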
Validations
The list of all validations carried out within the copying process:
- Local Folders - checks that every local folder the files are copied from (the SOURCE site) exists. If any folder does not exist, the error is written to a log file, an email is sent to the user, and the process is terminated.
- Hadoop DataNode - checks that the DataNode process is up and running, using the jps command. If the process is down, the error is written to a log file, an email is sent to the user, and the process is terminated.
- HDFS Folders - checks that every HDFS folder the files are copied into (the DESTINATION site) exists. If any folder does not exist, the error is written to a log file, an email is sent to the user, and the process is terminated.
- Number Of Copied Files - carried out every time files are copied from SOURCE to DESTINATION. The process counts the files picked up from the SOURCE folder and the files pasted into the DESTINATION folder and compares the two numbers. If they do not match, the error is written to a log file, an email is sent to the user, and the process is terminated.
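The four validations might look roughly like the functions below. The function names are hypothetical and the code is a sketch rather than the tool's implementation. It relies on `jps` listing one running JVM per line and on `hdfs dfs -test -d`, which exits with status 0 only when the path is an existing directory. Each function only reports the problem and returns non-zero; logging, emailing and terminating are left to the caller.

```bash
#!/usr/bin/env bash
# Sketch of the four validations. All function names are hypothetical.

# Local Folders: every SOURCE folder must exist on the local file system.
check_local_dirs() {
    local dir
    for dir in "$@"; do
        [ -d "$dir" ] || { echo "Local folder not found: $dir" >&2; return 1; }
    done
}

# Hadoop DataNode: jps prints "<pid> <main class>" for each running JVM.
check_datanode() {
    jps | grep -qw 'DataNode' || { echo "DataNode process is not running" >&2; return 1; }
}

# HDFS Folders: every DESTINATION folder must exist in HDFS.
check_hdfs_dirs() {
    local dir
    for dir in "$@"; do
        hdfs dfs -test -d "$dir" || { echo "HDFS folder not found: $dir" >&2; return 1; }
    done
}

# Number Of Copied Files: compare files picked up with files pasted.
check_file_counts() {
    local picked_up="$1" pasted="$2"
    [ "$picked_up" -eq "$pasted" ] || {
        echo "File count mismatch: picked up $picked_up, pasted $pasted" >&2
        return 1
    }
}
```

The first three would run at start-up, for example `check_local_dirs /data/incoming && check_datanode && check_hdfs_dirs /user/hduser/temp`, and the last one after every copy pass.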
Limitations
* Copying files with blank space(s) in their names - Hadoop 2.4.0
Problem: Copying a file whose name contains spaces fails.
Running the following command: hdfs dfs -put /home/testUser/New filename.txt /user/hduser/temp
returns the following error: put: unexpected URISyntaxException
Solution: See the project documentation available on the SourceForge website.
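The project's documented solution is not reproduced here. As one generic way to sidestep the problem, and only as an assumption rather than the project's own fix, file names can be stripped of spaces before the copy step. The function name `replace_spaces` and the choice of an underscore as the replacement character are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of a workaround: rename files so that no name contains a space.
# This is not necessarily the solution described in the project's docs.

replace_spaces() {
    local dir="$1" file name new_name

    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        name=$(basename "$file")

        # Only touch names that actually contain a space.
        case "$name" in
            *" "*) ;;
            *) continue ;;
        esac

        new_name=$(printf '%s' "$name" | tr ' ' '_')

        # Do not overwrite an existing file that already has the new name.
        if [ -e "$dir/$new_name" ]; then
            echo "Skipping $name: $new_name already exists" >&2
            continue
        fi

        mv "$file" "$dir/$new_name" || return 1
    done
}
```

Running `replace_spaces /home/testUser` would rename `New filename.txt` to `New_filename.txt`, after which the `-put` command receives a path without spaces.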
External links
This article "HDFSFileTransfer" is from Wikipedia. The list of its authors can be seen in its page history. Articles copied from the Draft namespace on Wikipedia can be found in Wikipedia's Draft namespace rather than the main one.
