The Gandiva Expression Compiler

Gandiva is an expression compiler that uses LLVM to generate efficient native code for projections and filters on Arrow record batches.

Using Gandiva involves three steps. First, you use the TreeExprBuilder to build a condition to filter on or a set of expressions to project rows. The builder provides a library of functions that can be composed into complex expressions. Second, you create a Filter or Projector from the expression. This step compiles your expression into efficient native code. Finally, you use the filter or projector you created to operate on Arrow record batches.

Note

As present, Gandiva is only built on PyArrow Conda distributions, not PyPI wheels.

Building Gandiva Expressions

Gandiva expressions are built with TreeExprBuilder. For example, to express 2 < x < 3:

import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder

builder = TreeExprBuilder()

field_x = builder.make_field(pa.field('x', pa.float64()))
scalar_2 = builder.make_literal(2.0, pa.float64())
scalar_3 = builder.make_literal(3.0, pa.float64())

expr = builder.make_and([
    builder.make_function('greater_than', [field_x, scalar_2], pa.bool_()),
    builder.make_function('less_than', [field_x, scalar_3], pa.bool_())
])

Each of the builder methods returns a new Node in the expression tree. This includes:

Executing Filter Expressions

To create a filter, you can convert a boolean expression node into a Condition, and instantiate the Gandiva filter using make_filter(). When creating the filter instance, the expression is converted into LLVM IR and optimized, allowing it to be reused for multiple record batches.

import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder, make_filter

builder = TreeExprBuilder()

field_x = builder.make_field(pa.field('x', pa.float64()))
scalar_2 = builder.make_literal(2.0, pa.float64())
expr = builder.make_function('greater_than', [field_x, scalar_2], pa.bool_())

condition = builder.make_condition(expr)

record_batch = pa.record_batch([pa.array([1.0, 3.0], pa.float64())], ['x'])
filter_executor = make_filter(record_batch.schema, condition)

The Filter.evaluate() method runs the filter on a record batch. It returns a SelectionVector, which contains the indices of the matching rows. This can either be used later when projecting rows with Projector.evaluate() or can be used to immediately filter the record batch by converting into an Arrow array with SelectionVector.to_array() and using with the pyarrow.RecordBatch.take() method.

selection_vector = filter_executor.evaluate(record_batch, pa.default_memory_pool())
result = record_batch.take(selection_vector.to_array())

Executing Projection Expressions

To create a Projector, convert nodes into expressions using TreeExprBuilder.make_expression() and then pass that list into the make_projector() function. For example, to project a record batch with a float x into one with x and log(x):

import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder, make_projector

builder = TreeExprBuilder()

field_x = builder.make_field(pa.field('x', pa.float64()))
log_x = builder.make_function('log', [field_x], pa.float64())

expressions = [
  builder.make_expression(field_x, pa.field('x', pa.float64())),
  builder.make_expression(log_x, pa.field('log_x', pa.float64()))
]

record_batch = pa.record_batch([pa.array([1.0, 3.0], pa.float64())], ['x'])
projector = make_projector(record_batch.schema, expressions, pa.default_memory_pool())

result = projector.evaluate(record_batch)