Scalding hiding NPEs in “operator Each failed executing operation”

Yesterday I was surprised by a failing Scalding task. Everything worked fine locally, and all I got was something like “job failed, see cluster log”. In the cluster log I saw the following:

2014-10-24 14:38:41,222 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201410101555_2230_m_000005_3: cascading.pipe.OperatorException: [com.twitter.scalding.T…][com.twitter.scalding.RichPipe.each(RichPipe.scala:471)] operator Each failed executing operation
at cascading.tuple.TupleEntryCollector.safeCollect(
at cascading.tuple.TupleEntryCollector.add(
at cascading.operation.Identity$2.operate(
at cascading.operation.Identity.operate(


Enable output compression in Scalding

I just wanted to enable final output compression in one of my Scalding jobs (because I needed to reorganize a data set of several TB).

Unfortunately, Scalding always produced uncompressed files. After some googling, I came across a GitHub issue that addressed exactly this problem. Via some links I got the sample code from this repo, which can be used to write compressed TSVs.


Scalding Exception: diverging implicit expansion for type com.twitter.algebird.Semigroup[T]

I was once again working on some Scalding jobs and once again got an ... interesting exception:

In a groupBy operation, I wanted to sum something up using:

.groupBy('a) {
  _.sum('a -> 'c)
}

And was rewarded with this one:

[error] example.scala:20: diverging implicit expansion for type com.twitter.algebird.Semigroup[T]
[error] starting with method eitherSemigroup in object Semigroup
[error]       _.sum('a -> 'c)
[error]            ^
[error] one error found
[error] (compile:compile) Compilation failed



Spot the mistake? It’s the missing type parameter on sum:

.groupBy('a) {
  _.sum[Int]('a -> 'c)  // <-- [Int]
}
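
Why the compiler needs the hint can be illustrated with a small plain-Scala sketch. This is not Algebird's real code: MiniSemigroup, SumSketch and the sample column are hypothetical names made up for illustration. The point is only that when the values come in untyped, nothing in the call tells the compiler which instance to pick, so the caller has to name the type.

```scala
// Hypothetical miniature of a type-class based sum. This is NOT Algebird's
// real code; all names below are made up for illustration.
trait MiniSemigroup[T] { def plus(a: T, b: T): T }

object MiniSemigroup {
  implicit val intSemigroup: MiniSemigroup[Int] = new MiniSemigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }
  implicit val stringSemigroup: MiniSemigroup[String] = new MiniSemigroup[String] {
    def plus(a: String, b: String): String = a + b
  }
}

object SumSketch {
  // The values arrive untyped, much like fields in a tuple, so T cannot be
  // inferred from the argument. The caller has to name it.
  def sum[T](values: Seq[Any])(implicit sg: MiniSemigroup[T]): T =
    values.map(_.asInstanceOf[T]).reduce((a, b) => sg.plus(a, b))

  def main(args: Array[String]): Unit = {
    val column: Seq[Any] = Seq(1, 2, 3)
    println(sum[Int](column)) // [Int] selects the Int instance
  }
}
```

Calling sum(column) without the type parameter leaves T open, and the compiler cannot choose between the available instances on its own.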

Scalding: unable to compare stream elements in position: 0

I’m currently working quite a bit with Twitter’s Scalding.
Recently I split up a job into sub-jobs and suddenly got an exception in my join:

Caused by: cascading.CascadingException: unable to compare stream elements in position: 0

If I had remembered the Fields API in detail, I would have thought about this paragraph (it’s about sorting, but the consequence is the same):

Note: When reading from a CSV, the data types are set to String, hence the sorting will be alphabetically, therefore to sort by age, an int, you need to convert it to an integer. For example …
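
To see what the note means in practice, here is a minimal plain-Scala sketch (no Scalding involved; SortSketch and the sample ages are made up for illustration). The same values sort differently depending on whether they are still strings or have been converted to integers.

```scala
object SortSketch {
  // Hypothetical ages as they would arrive from a text file: all strings.
  val raw: Seq[String] = Seq("10", "9", "2")

  // Sorting the strings compares them character by character.
  val asStrings: Seq[String] = raw.sorted

  // Converting first gives the numeric order one usually wants.
  val asInts: Seq[Int] = raw.map(_.toInt).sorted

  def main(args: Array[String]): Unit = {
    println(asStrings) // List(10, 2, 9)
    println(asInts)    // List(2, 9, 10)
  }
}
```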


Ensure you are joining on the same data types on both sides, and convert them beforehand if necessary. For example:

.map('myField -> 'myField) { x: Int => x }
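
The same idea, sketched outside of Scalding with plain Scala collections: one side of the join has its key as a String (as if read from a text file), the other as an Int, and the keys are brought to a common type before matching. JoinKeySketch, the field layout and the sample rows are assumptions made up for this illustration.

```scala
object JoinKeySketch {
  // Hypothetical sample data: ids read from text are Strings ...
  val fromCsv: Seq[(String, String)] = Seq(("1", "alice"), ("2", "bob"), ("3", "carol"))
  // ... while the other side already carries Int ids.
  val typed: Seq[(Int, Int)] = Seq((1, 30), (2, 25))

  // Convert the String ids to Int first, then join on the Int key.
  def joinById(
      left: Seq[(String, String)],
      right: Seq[(Int, Int)]
  ): Seq[(Int, String, Int)] = {
    val agesById: Map[Int, Int] = right.toMap
    left
      .map { case (id, name) => (id.trim.toInt, name) }
      .flatMap { case (id, name) =>
        agesById.get(id).map(age => (id, name, age)).toList
      }
  }

  def main(args: Array[String]): Unit =
    joinById(fromCsv, typed).foreach(println)
}
```

Rows without a partner on the other side (id 3 here) are dropped, as in an inner join.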

IntelliJ IDEA and Scala being awfully slow on Windows 8.1

At work we mostly work in Scala, and most of us use IntelliJ IDEA for coding. The choice of operating system is up to the developer. As I am quite comfortable with Windows (and use MS Office quite often), I am a happy Windows 8.1 user (btw: who the hell needs a start button when you have the Windows key!? Anyway … different story).

The Problem

Some time after I started a Scalding project, IntelliJ became very slow and froze for a few seconds about once per minute. Overall it was a very inconvenient and unproductive situation.