Boilerplate-free Avro in Scala - Part 1

Here at 51zero, we frequently use Avro as the format when interacting with Parquet-based Hive stores. However, using the Java API from Scala can be an exercise in tedium - you have to write manual conversions to and from the GenericRecord type that underpins Avro.

So we’ve developed a conversion library that allows boilerplate-free conversions between Scala case classes, Avro schemas and Avro records. Even better, the conversions don’t rely on runtime reflection, which is how the Java “conversions” take place; instead they use compile-time macros, which means there’s no speed penalty for using them.

(You’ll need to import the various classes referenced in this article, but we won’t include those in the code snippets, for brevity.)

Let’s start by defining some case classes that we’ll convert.

case class Pizza(name: String, ingredients: Seq[Ingredient], calories: Int)
case class Ingredient(name: String, sugar: Double, fat: Double)

Then we can invoke the apply method on the AvroSchema object, which will invoke the macro and return an org.apache.avro.Schema object in all its glory.

val schema = AvroSchema[Pizza]

The schema generated from the example case classes is shown below (printed via schema.toString(true); we’ve assumed the classes live in a com.example package, so the namespace will reflect wherever yours are defined). You’ll notice that the library handles nested case classes, as well as collections. In addition it supports options (which are treated as an Avro union of null and the base type), eithers (which are treated as a union of the two types), sets, lists, maps and Java enums.
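{
  "type": "record",
  "name": "Pizza",
  "namespace": "com.example",
  "fields": [
    { "name": "name", "type": "string" },
    {
      "name": "ingredients",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Ingredient",
          "fields": [
            { "name": "name", "type": "string" },
            { "name": "sugar", "type": "double" },
            { "name": "fat", "type": "double" }
          ]
        }
      }
    },
    { "name": "calories", "type": "int" }
  ]
}

To illustrate the union handling, here’s a minimal sketch (Topping is a hypothetical class, not part of our pizza model):

case class Topping(name: String, extraCost: Option[Double])

// the extraCost field in the generated schema becomes the union ["null", "double"]
val toppingSchema = AvroSchema[Topping]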

Serializing

Generating schemas is fine, but most of the time we also want to read and write data. We can easily serialize instances of case classes without needing to write any conversions to a record type, and without creating an AvroWriter.

Simply invoke the apply method on AvroOutputStream, specifying the type to accept, and what you get back is a regular Java-style output stream - but one that accepts instances of your type and writes them to the underlying stream.

So, back to our Pizza class: we can easily serialize some pizzas:

val pepperoni = Pizza("pepperoni", Seq(Ingredient("pepperoni", 12, 4.4), Ingredient("onions", 1, 0.4)), 98)
val hawaiian = Pizza("hawaiian", Seq(Ingredient("ham", 1.5, 5.6), Ingredient("pineapple", 5.2, 0.2)), 91)

val os = AvroOutputStream[Pizza](new File("pizzas.avro"))
os.write(Seq(pepperoni, hawaiian))
os.close()

Let’s read that data back in. The reading process is, as you’d expect, very similar to the writing process, except this time you use AvroInputStream and call iterator instead of write. The iterator method returns a standard Scala iterator over the records in the stream.

val is = AvroInputStream[Pizza](new File("pizzas.avro"))
val pizzas = is.iterator.toSet
is.close()
println(pizzas.mkString("\n"))

This will print:

Pizza(pepperoni,List(Ingredient(pepperoni,12.0,4.4), Ingredient(onions,1.0,0.4)),98)
Pizza(hawaiian,List(Ingredient(ham,1.5,5.6), Ingredient(pineapple,5.2,0.2)),91)

The iterator is lazy, of course: it will only fetch records from the stream as they are requested. If you want to read them all at once, simply invoke toList on the iterator itself.
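For example, here’s a minimal sketch of lazy consumption, reusing the pizzas.avro file from earlier - find advances the iterator only until the first match, so any remaining records are never deserialized:

val is = AvroInputStream[Pizza](new File("pizzas.avro"))
// records are pulled from the stream one at a time, as the iterator advances
val lowCal = is.iterator.find(_.calories < 95)
is.close()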

Compared to using the raw Java API, you can see the savings. The equivalent writer would be something like:

val schema = //.. build schema using the builder api

// open up the Java API writer
val datumWriter = new GenericDatumWriter[GenericRecord](schema)
val dataFileWriter = new DataFileWriter[GenericRecord](datumWriter)
dataFileWriter.create(schema, new File("pizzas.avro"))

// here I create the instances I want to write
val pepperoni = Pizza("pepperoni", Seq(Ingredient("pepperoni", 12, 4.4), Ingredient("onions", 1, 0.4)), 98)
val hawaiian = Pizza("hawaiian", Seq(Ingredient("ham", 1.5, 5.6), Ingredient("pineapple", 5.2, 0.2)), 91)
val pizzas = Seq(pepperoni, hawaiian)
// for each instance I need to create the record and call write with it
for (pizza <- pizzas) {
  val record = new Record(schema)
  record.put("name",  pizza.name)
  record.put("calories", pizza.calories)
  // and so on - the nested ingredients list is where it gets tedious (sketched below)
  dataFileWriter.write(record)
}
dataFileWriter.close()

As you can see, for the pizza type it’s not too bad, but imagine having to write this for all your classes, including nested classes. The record builder lines alone would run into the dozens.
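To make that concrete, here’s a rough sketch of what the “and so on” comment above expands to for just the nested ingredients field (pulling the Ingredient sub-schema out of the parent schema):

// fetch the element schema for the nested Ingredient records
val ingredientSchema = schema.getField("ingredients").schema().getElementType
val ingredientRecords = new java.util.ArrayList[GenericRecord]()
for (ingredient <- pizza.ingredients) {
  val ingredientRecord = new Record(ingredientSchema)
  ingredientRecord.put("name", ingredient.name)
  ingredientRecord.put("sugar", ingredient.sugar)
  ingredientRecord.put("fat", ingredient.fat)
  ingredientRecords.add(ingredientRecord)
}
record.put("ingredients", ingredientRecords)

And that’s a single nested collection - every nested type in every class multiplies this out.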

Hopefully this is a little time-saver for you; it has certainly helped us when working with our clients.

In the next article, we’ll show how to apply this technique to Avro-based Parquet files. In the meantime, we’d love to know your thoughts in the comments below, and please contact us for more information.