I want to process CSV files in Python with Hadoop, but I need to reference another file that contains lookup information.
I read that I can use the -files command-line option, which creates a symlink to the local file, but how do I reference that file from my Python mapper?
Once I had set the job up on Amazon EMR, I was able to copy the file to S3 and reference it directly using the -cacheFile option:
bin/hadoop ... -cacheFile s3://my-bucket/files/cachefile.csv#reference
In Python I could then open this file by the symlink name given after the # (here, "reference"):
with open("reference") as reference_file:
    # "reference" is the symlink in the task's working directory
    references = reference_file.read().splitlines()
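
For completeness, here is a rough sketch of how the whole mapper might look once the lookup file is available under the name "reference". It is only an illustration under some assumptions that are not in my original job: that both the lookup file and the input records are comma-separated, that the first column of each is the join key, and that Python 3 is used. The function names load_lookup and map_lines are made up for this example.

```python
#!/usr/bin/env python
import csv
import sys


def load_lookup(path="reference"):
    """Read the cached lookup CSV into a dict keyed by its first column.

    Assumption: the first column is the key, the remaining columns are values.
    """
    lookup = {}
    with open(path, newline="") as reference_file:
        for row in csv.reader(reference_file):
            if row:  # skip blank lines
                lookup[row[0]] = row[1:]
    return lookup


def map_lines(lines, lookup):
    """Yield one tab-separated line per input record whose key is in the lookup."""
    for row in csv.reader(lines):
        if row and row[0] in lookup:
            # Emit: key, the record's own fields, then the looked-up fields
            yield "\t".join([row[0]] + row[1:] + lookup[row[0]])


if __name__ == "__main__":
    # Hadoop streaming feeds input records on stdin and reads output from stdout
    for out in map_lines(sys.stdin, load_lookup()):
        print(out)
```

Loading the lookup once into a dict before reading stdin keeps each record's lookup cheap. Opening the file by its bare name should work the same way when it is shipped with -files, since that option also places a symlink in the task's working directory.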