The Gandiva Expression Compiler
Gandiva is an expression compiler that uses LLVM to generate efficient native code for projections and filters on Arrow record batches.
Using Gandiva involves three steps. First, you use the TreeExprBuilder to build a condition to filter on or a set of expressions to project rows. The builder provides a library of functions that can be composed into complex expressions. Second, you create a Filter or Projector from the expression. This step compiles your expression into efficient native code. Finally, you use the filter or projector you created to operate on Arrow record batches.
Note
At present, Gandiva is only included in PyArrow Conda distributions, not in PyPI wheels.
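Code that must also run on installations without Gandiva may want to probe for the module before using it. The following is a minimal sketch; the helper name gandiva_available is hypothetical and not part of PyArrow.

```python
def gandiva_available() -> bool:
    """Return True if this PyArrow build ships the Gandiva bindings."""
    try:
        # Imported only to test availability; the module object is not used.
        import pyarrow.gandiva  # noqa: F401
    except ImportError:
        # Raised when PyArrow itself or its Gandiva extension is missing.
        return False
    return True


print("Gandiva available:", gandiva_available())
```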
Building Gandiva Expressions
Gandiva expressions are built with TreeExprBuilder. For example, to express 2 < x < 3:
import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder
builder = TreeExprBuilder()
field_x = builder.make_field(pa.field('x', pa.float64()))
scalar_2 = builder.make_literal(2.0, pa.float64())
scalar_3 = builder.make_literal(3.0, pa.float64())
expr = builder.make_and([
    builder.make_function('greater_than', [field_x, scalar_2], pa.bool_()),
    builder.make_function('less_than', [field_x, scalar_3], pa.bool_())
])
Each of the builder methods returns a new Node in the expression tree. This includes:
TreeExprBuilder.make_field() creates a field node: a reference to a column in the record batches. The type must match the type of the array in that column.
TreeExprBuilder.make_literal() creates a literal node: a hardcoded literal value.
TreeExprBuilder.make_function() creates a function node, which may take other nodes as arguments. See available functions (C++ documentation).
TreeExprBuilder.make_if(), TreeExprBuilder.make_and(), and TreeExprBuilder.make_or() create new nodes based on if-else, boolean “and”, and boolean “or” logic. (For “not”, use the not(bool) function in make_function.)
Executing Filter Expressions
To create a filter, convert a boolean expression node into a Condition with TreeExprBuilder.make_condition(), then instantiate the Gandiva filter using make_filter(). The expression is converted into LLVM IR and optimized when the filter instance is created, so the same instance can be reused for multiple record batches.
import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder, make_filter
builder = TreeExprBuilder()
field_x = builder.make_field(pa.field('x', pa.float64()))
scalar_2 = builder.make_literal(2.0, pa.float64())
expr = builder.make_function('greater_than', [field_x, scalar_2], pa.bool_())
condition = builder.make_condition(expr)
record_batch = pa.record_batch([pa.array([1.0, 3.0], pa.float64())], ['x'])
filter_executor = make_filter(record_batch.schema, condition)
The Filter.evaluate() method runs the filter on a record batch. It returns a SelectionVector, which contains the indices of the matching rows. This can either be used later when projecting rows with Projector.evaluate(), or be used to filter the record batch immediately by converting it into an Arrow array with SelectionVector.to_array() and passing that array to the pyarrow.RecordBatch.take() method.
selection_vector = filter_executor.evaluate(record_batch, pa.default_memory_pool())
result = record_batch.take(selection_vector.to_array())
Executing Projection Expressions
To create a Projector, convert nodes into expressions using TreeExprBuilder.make_expression() and then pass that list into the make_projector() function. For example, to project a record batch with a float x into one with x and log(x):
import pyarrow as pa
from pyarrow.gandiva import TreeExprBuilder, make_projector
builder = TreeExprBuilder()
field_x = builder.make_field(pa.field('x', pa.float64()))
log_x = builder.make_function('log', [field_x], pa.float64())
expressions = [
    builder.make_expression(field_x, pa.field('x', pa.float64())),
    builder.make_expression(log_x, pa.field('log_x', pa.float64()))
]
record_batch = pa.record_batch([pa.array([1.0, 3.0], pa.float64())], ['x'])
projector = make_projector(record_batch.schema, expressions, pa.default_memory_pool())
result = projector.evaluate(record_batch)