Learning Hadoop 2

Command-line access to the HDFS filesystem

Within the Hadoop distribution, there is a command-line utility called hdfs, which is the primary way to interact with the filesystem from the command line. Run it without any arguments to see the available subcommands. There are many; several exist only to start or stop various HDFS components. The general form of the hdfs command is:

hdfs <sub-command> <command> [arguments]

The two main subcommands we will use in this book are:

  • dfs: This is used for general filesystem access and manipulation, including reading/writing and accessing files and directories
  • dfsadmin: This is used for administration and maintenance of the filesystem. We will not cover this command in detail, though. Have a look at the -report command, which gives a listing of the state of the filesystem and all DataNodes:
    $ hdfs dfsadmin -report

Note

Note that the dfs and dfsadmin commands can also be used with the main Hadoop command-line utility, for example, hadoop fs -ls /. This was the approach in earlier versions of Hadoop but is now deprecated in favor of the hdfs command.
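As a quick illustration (assuming a running HDFS cluster), the deprecated and current forms are interchangeable and produce the same output:

```shell
# Deprecated form via the general-purpose hadoop utility
hadoop fs -ls /

# Preferred form via the dedicated hdfs utility
hdfs dfs -ls /
```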

Exploring the HDFS filesystem

Run the following to get a list of the available commands provided by the dfs subcommand:

$ hdfs dfs

As will be seen from the output of the preceding command, many of these look similar to standard Unix filesystem commands and, not surprisingly, they work as expected. In our test VM, we have a user account called cloudera. Using this user, we can list the root of the filesystem as follows:

$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x - hbase hbase 0 2014-04-04 15:18 /hbase
drwxr-xr-x - hdfs supergroup 0 2014-10-21 13:16 /jar
drwxr-xr-x - hdfs supergroup 0 2014-10-15 15:26 /schema
drwxr-xr-x - solr solr 0 2014-04-04 15:16 /solr
drwxrwxrwt - hdfs supergroup 0 2014-11-12 11:29 /tmp
drwxr-xr-x - hdfs supergroup 0 2014-07-13 09:05 /user
drwxr-xr-x - hdfs supergroup 0 2014-04-04 15:15 /var

The output is very similar to that of the Unix ls command. The file attributes work the same way as the user/group/world attributes on a Unix filesystem (including the t sticky bit, as can be seen on /tmp), followed by the owner, group, and modification time of each directory. The column between the group name and the modification date is the size; this is 0 for directories but will have a value for files, as we'll see in the code following the next information box:

Note

If relative paths are used, they are taken from the home directory of the user. If there is no home directory, we can create it using the following commands:

$ sudo -u hdfs hdfs dfs -mkdir /user/cloudera
$ sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera

The mkdir and chown steps require superuser privileges (sudo -u hdfs).

Back as the cloudera user, we can now create a directory within our home directory, as follows:

$ hdfs dfs -mkdir testdir
$ hdfs dfs -ls
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2014-11-13 11:21 testdir

Then, we can create a file, copy it to HDFS, and read its contents directly from its location on HDFS, as follows:

$ echo "Hello world" > testfile.txt
$ hdfs dfs -put testfile.txt testdir

Note that there is an older command called -copyFromLocal, which works in the same way as -put; you might see it in older documentation online. Now, run the following command and check the output:
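To illustrate the equivalence (the file names below are just the examples used in this section), -put and -copyFromLocal behave identically, and -get and -copyToLocal are the corresponding download commands:

```shell
# Upload a local file to HDFS; these two forms are equivalent
# (a second upload of the same name would fail, as the file already exists)
hdfs dfs -put testfile.txt testdir
hdfs dfs -copyFromLocal testfile.txt testdir

# Download a file from HDFS to the local filesystem; also equivalent
hdfs dfs -get testdir/testfile.txt local-copy.txt
hdfs dfs -copyToLocal testdir/testfile.txt local-copy2.txt
```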

$ hdfs dfs -ls testdir
Found 1 items
-rw-r--r-- 3 cloudera cloudera 12 2014-11-13 11:21 testdir/testfile.txt

Note the new column between the file attributes and the owner; this is the replication factor of the file. Now, finally, run the following command:

$ hdfs dfs -tail testdir/testfile.txt
Hello world
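Relatedly, the replication factor shown in the earlier -ls output can be changed per file with the -setrep subcommand. A brief sketch, assuming the cluster has enough DataNodes to honor the requested value:

```shell
# Change the replication factor of a file to 2
hdfs dfs -setrep 2 testdir/testfile.txt

# With -w, block until the new replication level has been reached
hdfs dfs -setrep -w 2 testdir/testfile.txt
```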

Most of the remaining dfs subcommands are fairly intuitive; play around with them. We'll explore snapshots and programmatic access to HDFS later in this chapter.
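For instance, a few of the more commonly used dfs subcommands (all standard HDFS commands, though the file names here are only examples continuing from the session above):

```shell
hdfs dfs -cat testdir/testfile.txt                  # print an entire file to stdout
hdfs dfs -cp testdir/testfile.txt testdir/copy.txt  # copy a file within HDFS
hdfs dfs -mv testdir/copy.txt testdir/moved.txt     # move/rename within HDFS
hdfs dfs -du -h testdir                             # per-file usage, human-readable sizes
hdfs dfs -rm testdir/moved.txt                      # delete a file (moved to trash if enabled)
hdfs dfs -rm -r testdir                             # delete a directory recursively
```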