I just wanted to enable final output compression in one of my Scalding jobs (because I needed to reorganize a data set of several TB).
Unfortunately, Scalding always produced uncompressed files. After some googling, I came across a GitHub issue that addressed exactly this problem. Following a few links, I ended up at the sample code in this repo, which can be used to write compressed TSVs.
Solution:
- Set the compression parameters as stated in the docs. Beware of your Hadoop version (YARN/MRv2 vs. MR1 use different property names):
```scala
// http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.5.0/CDH4-Installation-Guide/cdh4ig_topic_23_3.html
// MR1
// compress map output
set("mapred.compress.map.output", "true")
set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
// compress final output
set("mapred.output.compress", "true")
set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
```
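On YARN/MRv2 the old `mapred.*` keys are deprecated; as far as I can tell from the docs, the equivalent `mapreduce.*` keys are:

```scala
// YARN / MRv2 equivalents of the mapred.* keys above
// compress map output
set("mapreduce.map.output.compress", "true")
set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec")
// compress final output
set("mapreduce.output.fileoutputformat.compress", "true")
set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec")
```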
- Get the `CompressedDelimitedScheme` and `CompressedTsv` from https://github.com/morazow/WordCount-Compressed (a sketch of both classes follows after the next step)
- Pipe your output to a compressed TSV:
```scala
myPipe.write(CompressedTsv("/tmp/foo"))
```
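For reference, the repo's trick boils down to overriding Scalding's `DelimitedScheme` so that the HDFS sink scheme enables Cascading's compression flag. Roughly like the sketch below (untested, reconstructed from memory; the exact `TextDelimited` constructor arguments depend on your Scalding/Cascading versions, so prefer the code in the repo):

```scala
import cascading.scheme.hadoop.{ TextDelimited => CHTextDelimited }
import cascading.scheme.hadoop.TextLine.Compress
import cascading.tuple.Fields
import com.twitter.scalding._

// A DelimitedScheme whose sink asks Cascading to compress its output.
// Which codec is used is still decided by the properties set above.
trait CompressedDelimitedScheme extends DelimitedScheme {
  override def hdfsScheme =
    HadoopSchemeInstance(
      new CHTextDelimited(fields, Compress.ENABLE, skipHeader, writeHeader,
        separator, strict, quote, types, safe))
}

// Drop-in replacement for Tsv that writes compressed part files
case class CompressedTsv(p: String, override val fields: Fields = Fields.ALL)
  extends FixedPathSource(p) with CompressedDelimitedScheme
```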
- Check your output and its content:

```bash
hadoop fs -ls /tmp/foo
```

It should list part files like `/tmp/foo/part-00000.snappy`. To inspect the content, use `-text`, which decompresses via the configured codec (`-cat` would only dump the raw Snappy bytes):

```bash
hadoop fs -text /tmp/foo/part-00000.snappy
```
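Side note: reading the compressed output back does not require the compressed source classes, because Hadoop's text input format picks a compression codec by file extension and decompresses transparently (assuming the Snappy codec is available on the cluster). A quick, hypothetical sanity-check job (the two-column layout and job name are just assumptions for this example):

```scala
import com.twitter.scalding._

// Hypothetical follow-up job: read the compressed TSV back with a plain Tsv
// source (Hadoop decompresses .snappy part files on read) and count the rows.
class CountRows(args: Args) extends Job(args) {
  Tsv("/tmp/foo", ('key, 'value))
    .read
    .groupAll { _.size('count) }
    .write(Tsv("/tmp/foo-count"))
}
```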