Hadoop noob here.
I've searched for tutorials on getting started with Hadoop and Python without much success. I don't need to do any work with mappers and reducers yet; for now it's more of an access issue.
As part of a Hadoop cluster, there are a bunch of .dat files on HDFS.
In order to access those files from my client (local computer) using Python,
what do I need to have on my computer?
How do I query for filenames on HDFS?
Any links would be helpful too.
You should have login access to a node in the cluster. Let the cluster administrator pick the node, set up the account, and tell you how to access the node securely. If you are the administrator, let me know whether the cluster is local or remote; if remote, whether it is hosted on your computer, inside a corporation, or on a third-party cloud (and if so, whose), and I can provide more relevant information.
To query file names in HDFS, log in to a cluster node and run hadoop fs -ls [path]. The path is optional; if it is not provided, the files in your home directory are listed. If -R is given as an option, all files under the path are listed recursively. There are additional options for this command; for more information about it and the other Hadoop file system shell commands, see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.
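If you would rather invoke that command from Python on the node, here is a minimal sketch using the standard library's subprocess module. It assumes the hadoop binary is on your PATH, and /user/me/data is a hypothetical path; the parsing is a rough sketch that keeps the last whitespace-separated token of each listing line (so it would mishandle paths containing spaces):

    import subprocess

    def hdfs_ls(path=''):
        """List file paths under an HDFS path by shelling out to 'hadoop fs -ls'."""
        cmd = ['hadoop', 'fs', '-ls']
        if path:
            cmd.append(path)
        out = subprocess.check_output(cmd, universal_newlines=True)
        # Each listing line ends with the file path; skip the 'Found N items' header.
        return [line.rsplit(None, 1)[-1]
                for line in out.splitlines()
                if line and not line.startswith('Found')]

    print(hdfs_ls('/user/me/data'))  # hypothetical HDFS directory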
An easy way to query HDFS file names in Python is esutil.hdfs.ls(hdfs_url='', recurse=False, full=False), which executes hadoop fs -ls hdfs_url in a subprocess. The module also has functions for a number of other Hadoop file system shell commands (see the source at http://code.google.com/p/esutil/source/browse/trunk/esutil/hdfs.py). esutil can be installed with pip install esutil. It is on PyPI at https://pypi.python.org/pypi/esutil, its documentation is at http://code.google.com/p/esutil/, and its GitHub site is https://github.com/esheldon/esutil.
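For example, a quick sketch based on the signature above (assuming ls returns the listed paths as a list of strings, and again using /user/me/data as a hypothetical HDFS directory):

    import esutil

    # List the contents of a hypothetical HDFS directory.
    files = esutil.hdfs.ls('/user/me/data')

    # Keep only the .dat files mentioned in the question.
    dat_files = [f for f in files if f.endswith('.dat')]
    print(dat_files)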