The Gandiva Expression Compiler
Gandiva is a runtime expression compiler that uses LLVM to generate efficient native code for projections and filters on Arrow record batches. Gandiva only handles projections and filters; for other transformations, see Compute Functions.
Gandiva was designed to take advantage of the Arrow memory format and modern hardware. Compiling expressions using LLVM allows the execution to be optimized to the local runtime environment and hardware, including available SIMD instructions. To minimize optimization overhead, all Gandiva functions are pre-compiled into LLVM IR (intermediate representation).
Building Expressions
Gandiva provides a general expression representation where expressions are
represented by a tree of nodes. The expression trees are built using
gandiva::TreeExprBuilder. The leaves of the expression tree are typically
field references, created by gandiva::TreeExprBuilder::MakeField(), and
literal values, created by gandiva::TreeExprBuilder::MakeLiteral(). Nodes
can be combined into more complex expression trees using:
gandiva::TreeExprBuilder::MakeFunction() to create a function node. See available functions below.
gandiva::TreeExprBuilder::MakeIf() to create if-else logic.
gandiva::TreeExprBuilder::MakeAnd() and gandiva::TreeExprBuilder::MakeOr() to create boolean expressions. (For “not”, use the not(bool) function in MakeFunction.)
gandiva::TreeExprBuilder::MakeInExpressionInt32() and the other “in expression” functions to create set membership tests.
Once an expression tree is built, it is wrapped in either gandiva::Expression
or gandiva::Condition, depending on how it will be used.
Expression is used in projections while Condition is used in filters.
As an example, here is how to create an Expression representing x + 3 and a
Condition representing x < 3:
auto field_x_raw = arrow::field("x", arrow::int32());
auto field_x = TreeExprBuilder::MakeField(field_x_raw);
auto literal_3 = TreeExprBuilder::MakeLiteral(3);
auto field_result = arrow::field("result", arrow::int32());
auto add_node = TreeExprBuilder::MakeFunction("add", {field_x, literal_3}, arrow::int32());
auto expression = TreeExprBuilder::MakeExpression(add_node, field_result);
auto less_than_node = TreeExprBuilder::MakeFunction("less_than", {field_x, literal_3},
                                                    arrow::boolean());
auto condition = TreeExprBuilder::MakeCondition(less_than_node);
For simpler expressions, there are also convenience functions that allow you to
use functions directly in MakeExpression and MakeCondition:
auto expression = TreeExprBuilder::MakeExpression("add", {field_x, literal_3}, field_result);
auto condition = TreeExprBuilder::MakeCondition("less_than", {field_x, literal_3});
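To make the tree structure concrete, here is a toy model in plain C++. It is not Gandiva's implementation, and every name in it (ToyNode, ToyField, ToyLiteral, ToyFunction, MakeToyAdd3) is hypothetical. Leaves are a field reference and a literal, and a function node combines its children. The sketch interprets the tree for a single int32 value, whereas Gandiva compiles the tree to native code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical toy node type: every node evaluates to an int32 for a given x.
struct ToyNode {
  virtual ~ToyNode() = default;
  virtual int32_t Eval(int32_t x) const = 0;
};
using ToyNodePtr = std::shared_ptr<ToyNode>;

// Leaf: a reference to the single input field "x".
struct ToyField : ToyNode {
  int32_t Eval(int32_t x) const override { return x; }
};

// Leaf: a constant value.
struct ToyLiteral : ToyNode {
  explicit ToyLiteral(int32_t v) : value(v) {}
  int32_t Eval(int32_t) const override { return value; }
  int32_t value;
};

// Inner node: applies a binary function to the results of two child nodes.
struct ToyFunction : ToyNode {
  using Fn = std::function<int32_t(int32_t, int32_t)>;
  ToyFunction(Fn f, ToyNodePtr l, ToyNodePtr r)
      : fn(std::move(f)), left(std::move(l)), right(std::move(r)) {
    assert(left && right);
  }
  int32_t Eval(int32_t x) const override {
    return fn(left->Eval(x), right->Eval(x));
  }
  Fn fn;
  ToyNodePtr left, right;
};

// Builds the tree for "x + 3": a function node over a field and a literal.
inline ToyNodePtr MakeToyAdd3() {
  auto field_x = std::make_shared<ToyField>();
  auto literal_3 = std::make_shared<ToyLiteral>(3);
  return std::make_shared<ToyFunction>(
      [](int32_t a, int32_t b) { return a + b; }, field_x, literal_3);
}
```

Calling MakeToyAdd3()->Eval(4) walks the tree from the root and adds the literal to the field value.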
Projectors and Filters
Gandiva’s two execution kernels are gandiva::Projector and
gandiva::Filter. Projector consumes a record batch and projects
into a new record batch. Filter consumes a record batch and produces a
gandiva::SelectionVector containing the indices that matched the condition.
For both Projector and Filter, optimization of the expression IR happens
when creating instances. They are compiled against a static schema, so the
schema of the record batches must be known at this point.
Continuing with the expression and condition created in the previous
section, here is an example of creating a Projector and a Filter:
auto schema = arrow::schema({field_x});
std::shared_ptr<Projector> projector;
auto status = Projector::Make(schema, {expression}, &projector);
ARROW_CHECK_OK(status);
std::shared_ptr<Filter> filter;
status = Filter::Make(schema, condition, &filter);
ARROW_CHECK_OK(status);
Once a Projector or Filter is created, it can be evaluated on Arrow record batches. These execution kernels are single-threaded on their own, but are designed to be reused to process record batches in parallel.
Execution is performed with gandiva::Projector::Evaluate() and
gandiva::Filter::Evaluate(). Filters produce gandiva::SelectionVector,
a vector of row indices that matched the filter condition. When filtering and
projecting record batches, you can pass the selection vector into the projector
so that the projection is only evaluated on matching rows.
Here is an example of evaluating the Filter and Projector created above:
auto pool = arrow::default_memory_pool();
int num_records = 4;
auto array = MakeArrowArrayInt32({1, 2, 3, 4}, {true, true, true, true});
auto in_batch = arrow::RecordBatch::Make(schema, num_records, {array});
// Just project
arrow::ArrayVector outputs;
status = projector->Evaluate(*in_batch, pool, &outputs);
ARROW_CHECK_OK(status);
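// Evaluate filter. The selection vector holds 16-bit row indices and needs
// at most one slot per input row.
std::shared_ptr<gandiva::SelectionVector> result_indices;
status = gandiva::SelectionVector::MakeInt16(num_records, pool, &result_indices);
ARROW_CHECK_OK(status);
status = filter->Evaluate(*in_batch, result_indices);
ARROW_CHECK_OK(status);
// Project with filter. The projector must be compiled for the selection
// vector's mode, so build one for that mode first.
status = Projector::Make(schema, {expression}, result_indices->GetMode(),
                         ConfigurationBuilder::DefaultConfiguration(), &projector);
ARROW_CHECK_OK(status);
arrow::ArrayVector outputs_filtered;
status = projector->Evaluate(*in_batch, result_indices.get(),
                             pool, &outputs_filtered);
ARROW_CHECK_OK(status);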
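The relationship between a filter's selection vector and a filtered projection can be sketched without Gandiva. The functions below are hypothetical stand-ins in plain C++: the first returns the row indices where x < 3, and the second evaluates x + 3 only on those rows. The sketch assumes the filtered output holds one value per selected row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models a filter for "x < 3": returns the indices of the matching rows,
// which is the role gandiva::SelectionVector plays.
inline std::vector<uint16_t> ToyFilterLessThan3(const std::vector<int32_t>& x) {
  std::vector<uint16_t> selected;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (x[i] < 3) selected.push_back(static_cast<uint16_t>(i));
  }
  return selected;
}

// Models a projection of "x + 3" evaluated only on the selected rows.
inline std::vector<int32_t> ToyProjectAdd3(const std::vector<int32_t>& x,
                                           const std::vector<uint16_t>& selected) {
  std::vector<int32_t> out;
  out.reserve(selected.size());
  for (uint16_t row : selected) {
    assert(row < x.size());
    out.push_back(x[row] + 3);
  }
  return out;
}
```

For the input {1, 2, 3, 4} used above, the toy filter selects rows 0 and 1, and the toy projection then computes values for those two rows only.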
Available Gandiva Functions
Common Types
To be succinct, we describe the types in the following groups:
Integer: int8, int16, int32, int64, uint8, uint16, uint32, uint64
Float: float, double
Decimal: decimal128
Numeric: Integer or Float or Decimal
Date: date64[ms], date32[day]
Time: time32[ms]
Timestamp: timestamp[ms]
When we wrap the input types in parentheses, that means those inputs must be in that order.
Warning
Functions that take decimals and return decimals do not currently respect or
enforce the scale and precision of the provided return type. For example,
add(decimal(28, 4), decimal(28, 3)) will always return a decimal(28, 4),
so if you specify a return type decimal(28, 3) the resulting array will
have incorrect results. It is up to the user to match the return type with the
expectations of the functions.
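A small model shows why a mismatched return type gives wrong values. A decimal is stored as an unscaled integer plus a scale, so reading the same digits with a different scale changes the number. The ToyDecimal type and helpers below are hypothetical, and int64_t stands in for the 128-bit unscaled value.

```cpp
#include <cassert>
#include <cstdint>

// Toy decimal: value = unscaled / 10^scale. Real decimal128 uses a 128-bit
// unscaled integer; int64_t keeps the sketch short.
struct ToyDecimal {
  int64_t unscaled;
  int32_t scale;
};

// Returns 10^n for small non-negative n.
inline int64_t ToyPow10(int32_t n) {
  assert(n >= 0);
  int64_t p = 1;
  for (int32_t i = 0; i < n; ++i) p *= 10;
  return p;
}

// Adds two decimals and produces the larger of the two input scales, which
// matches the add(decimal(28, 4), decimal(28, 3)) case in the warning.
inline ToyDecimal ToyAdd(ToyDecimal a, ToyDecimal b) {
  int32_t scale = a.scale > b.scale ? a.scale : b.scale;
  int64_t ua = a.unscaled * ToyPow10(scale - a.scale);
  int64_t ub = b.unscaled * ToyPow10(scale - b.scale);
  return ToyDecimal{ua + ub, scale};
}
```

Adding 1.2345 (scale 4) and 2.500 (scale 3) yields the unscaled value 37345 at scale 4, which is 3.7345. If the output is declared with scale 3, the same digits are read as 37.345.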
Comparisons

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| not | Unary | | | |
| isnull, isnotnull | Unary | Any | | |
| equal, eq, same | Binary | Any | | (1) |
| is_distinct_from, is_not_distinct_from | Binary | Any | | (2) |
| less_than, less_than_or_equal_to, greater_than, greater_than_or_equal_to | Binary | Any | | |

(1) eq and same are aliases for equal.
(2) is_not_distinct_from is the “NULL-safe” version of equal, meaning it treats two NULL values as equal, while equal considers NULL values as unknown and never equal.
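The difference described in note (2) can be modeled with std::optional, where std::nullopt stands in for NULL. The two functions are hypothetical sketches of the semantics, not Gandiva code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Models `equal`: the result is NULL (unknown) if either input is NULL.
inline std::optional<bool> ToyEqual(std::optional<int32_t> a,
                                    std::optional<int32_t> b) {
  if (!a.has_value() || !b.has_value()) return std::nullopt;
  return *a == *b;
}

// Models `is_not_distinct_from`: never NULL, and NULL matches NULL.
inline bool ToyIsNotDistinctFrom(std::optional<int32_t> a,
                                 std::optional<int32_t> b) {
  if (!a.has_value() || !b.has_value()) {
    return !a.has_value() && !b.has_value();
  }
  return *a == *b;
}
```

With two NULL inputs, ToyEqual yields no value while ToyIsNotDistinctFrom yields true.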
Casting and Conversion
Casts convert values between types. These may raise errors, for example if casting between numeric types causes overflow or if attempting to cast an invalid date string to a date type.

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| castBIGINT | Unary | | | (1) |
| castINT | Unary | | | |
| castFLOAT4 | Unary | | | |
| castFLOAT8 | Unary | | | |
| castDECIMAL, castDECIMALNullOnOverflow | Unary | | | |
| castDATE | Unary | | | (2) |
| castDATE | Unary | | | (3) |
| castTIME | Unary | | | |
| castTIMESTAMP | Unary | | | |
| castVARCHAR | Binary | (Any, int64) | | (4) |
| castVARBINARY | Binary | (Any, int64) | | (4) |
| castBIT, castBOOLEAN | Unary | | | (5) |
| to_time | Unary | Numeric | | (6) |
| to_timestamp | Unary | Numeric | | (7) |
| to_date | Unary | | | (7) |
| to_date | Binary | (string, string) | | (8) |
| to_date | Ternary | (string, string, int32) | | (8) |
| convert_toDOUBLE, convert_toDOUBLE_be, convert_toFLOAT, convert_toFLOAT_be, convert_toINT, convert_toINT_be, convert_toBIGINT, convert_toBIGINT_be, convert_toBOOLEAN_BYTE, convert_toTIME_EPOCH, convert_toTIME_EPOCH_be, convert_toTIMESTAMP_EPOCH, convert_toTIMESTAMP_EPOCH_be, convert_toDATE_EPOCH, convert_toDATE_EPOCH_be, convert_toUTF8 | Unary | | | (9) |
| convert_fromUTF8, convert_fromutf8 | Unary | | | |
| convert_replaceUTF8, convert_replaceutf8 | Binary | ( | | (10) |

(1) castBIGINT(day_time_interval) -> int64 returns the number of milliseconds in the interval.
(2) castDATE(int64) -> date64[ms] returns the date using the input as milliseconds since the UNIX epoch 1970-01-01.
(3) castDATE(int32) -> date32[day] returns the date using the input as days since the UNIX epoch 1970-01-01.
(4) For castVARCHAR and castVARBINARY, the second parameter (of type int64) represents the maximum number of bytes to return. If the string representation of the value is larger than that maximum, the result is truncated. For example, castVARCHAR("12345", 3) would return 123.
(5) castBOOLEAN is an alias for castBIT. Converts "true" or "1" to true and "false" or "0" to false.
(6) to_time takes a timestamp in seconds and converts it into a time, dropping the date information.
(7) to_timestamp returns the timestamp using the input as milliseconds since the UNIX epoch 1970-01-01.
(8) to_date(string, string [, int32]) parses the first string into a date based on the format string given in the second parameter. The optional int32 parameter suppresses errors when set to 1.
(9) Variants that end in _be return bytes in big-endian order, while the main variants return bytes in platform-native endianness.
(10) The “replace” variations take a second string parameter, which is the character used to replace any bytes that are not valid Unicode.
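Notes (4) and (9) describe byte-level behavior that is easy to show in plain C++. The helpers below are hypothetical models, not Gandiva functions: one truncates a string representation to a maximum byte count, the other lays out a 32-bit integer in big-endian order. Byte truncation can split a multi-byte UTF-8 character, so the sketch is only meaningful for ASCII input.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Models note (4): keep at most max_bytes bytes of the string representation.
inline std::string ToyCastVarchar(const std::string& value, int64_t max_bytes) {
  assert(max_bytes >= 0);
  return value.substr(0, static_cast<std::size_t>(max_bytes));
}

// Models note (9) for a 32-bit integer: most significant byte first,
// regardless of the platform's native byte order.
inline std::array<uint8_t, 4> ToyInt32ToBigEndian(int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  return {static_cast<uint8_t>(u >> 24), static_cast<uint8_t>(u >> 16),
          static_cast<uint8_t>(u >> 8), static_cast<uint8_t>(u)};
}
```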
Arithmetic

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| add | Binary | Numeric | Numeric | |
| subtract | Binary | Numeric | Numeric | |
| multiply | Binary | Numeric | Numeric | |
| divide | Binary | Numeric | Numeric | |
| mod, modulo | Binary | Integer, | Integer, | (1) |
| div | Binary | Integer, Float | Integer, Float | (2) |
| bitwise_and, bitwise_or, bitwise_xor | Binary | Integer | Integer | |
| bitwise_not | Unary | Integer | Integer | |

(1) modulo is an alias for mod.
(2) div performs integer division, which for Integer types is identical to divide, but for Float types truncates the result to an integer.
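For Float inputs, the contrast in note (2) can be sketched as follows. The functions are hypothetical, and the sketch assumes truncation toward zero (std::trunc); the text above does not spell out how negative quotients are handled.

```cpp
#include <cassert>
#include <cmath>

// Models `divide` on floats: an ordinary floating-point quotient.
inline double ToyDivide(double a, double b) { return a / b; }

// Models `div` on floats, assuming the quotient is truncated toward zero.
inline double ToyDiv(double a, double b) { return std::trunc(a / b); }
```

With the inputs 7.5 and 2.0, ToyDivide gives 3.75 and ToyDiv gives 3.0.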
Math

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| cbrt, exp, log, log10 | Unary | Integer, Float | | |
| log | Binary | (Integer or Float, Integer or Float) | | (1) |
| power, pow | Binary | ( | | (2) |
| sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, cot, atan2 | Unary | Integer, Float | | |
| radians, degrees | Unary | Integer, Float | | (3) |
| random, rand | Nullary | None | | (4) |
| random, rand | Unary | | | (4) |

(1) The binary log function uses the first parameter as the base and the second as the operand. In other words, log(a, b) = log(b) / log(a).
(2) pow is an alias for power.
(3) radians converts degrees to radians and degrees converts the other way.
(4) rand is an alias for random. The unary version takes an int32 seed. Both versions return a 64-bit float in the range of [0, 1).
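Notes (1) and (3) translate directly into a few lines of C++. The function names and the kPi constant are hypothetical; the sketch uses the standard library's natural logarithm.

```cpp
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

// Models note (1): log(base, operand) = log(operand) / log(base).
inline double ToyLog(double base, double operand) {
  return std::log(operand) / std::log(base);
}

// Models note (3): conversions between degrees and radians.
inline double ToyRadians(double degrees) { return degrees * kPi / 180.0; }
inline double ToyDegrees(double radians) { return radians * 180.0 / kPi; }
```

For example, ToyLog(2.0, 8.0) is approximately 3, because the first argument is the base.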
Rounding

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| round | Unary | Numeric | Numeric | |
| round | Binary | (Numeric, int32) | Numeric | (1) |
| abs, ceil, floor | Unary | Decimal | Decimal | |
| truncate, trunc | Unary | Decimal | Decimal | (2) |
| truncate, trunc | Binary | (Decimal or | | (2) (3) |

(1) The second parameter (of type int32) is the precision, with positive values rounding to places right of the decimal point and negative values to the left. For example, round(123.456, 2) returns 123.46 and round(123.456, -2) returns 100.0.
(2) trunc is an alias for truncate.
(3) The second parameter (of type int32) is the precision, which works similarly to the precision parameter of round.
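The precision parameter in notes (1) and (3) can be sketched for doubles as below. These hypothetical helpers use std::round (which rounds halves away from zero) and std::trunc; whether Gandiva rounds ties the same way is an assumption, and the decimal variants are not modeled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Rounds to `precision` places: positive values act right of the decimal
// point, negative values act left of it.
inline double ToyRound(double value, int32_t precision) {
  double factor = std::pow(10.0, std::abs(precision));
  return precision >= 0 ? std::round(value * factor) / factor
                        : std::round(value / factor) * factor;
}

// Same precision handling, but drops the extra digits instead of rounding.
inline double ToyTruncate(double value, int32_t precision) {
  double factor = std::pow(10.0, std::abs(precision));
  return precision >= 0 ? std::trunc(value * factor) / factor
                        : std::trunc(value / factor) * factor;
}
```

ToyRound(123.456, 2) and ToyRound(123.456, -2) reproduce the two values quoted in note (1), up to floating-point error.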
Date and Time

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| add, date_add | Binary | Integer, | | (1) |
| subtract, date_sub, date_diff | Binary | ( | | (1) |
| extractMillennium, extractCentury, extractDecade, extractYear, extractQuarter, extractMonth, extractWeek, extractDay, extractHour, extractMinute, extractSecond, extractDoy, extractDow, extractEpoch | Unary | | | (2) |
| date_trunc_Millennium, date_trunc_Century, date_trunc_Decade, date_trunc_Year, date_trunc_Quarter, date_trunc_Month, date_trunc_Week, date_trunc_Day, date_trunc_Hour, date_trunc_Minute, date_trunc_Second | Unary | | | |
| last_day | Unary | | | |
| months_between | Binary | | | |
| timestampdiffSecond, timestampdiffMinute, timestampdiffHour, timestampdiffDay, timestampdiffWeek, timestampdiffMonth, timestampdiffQuarter, timestampdiffYear | Binary | | | |
| timestampaddSecond, timestampaddMinute, timestampaddHour, timestampaddDay, timestampaddWeek, timestampaddMonth, timestampaddQuarter, timestampaddYear | Binary | | | |

(1) In add and subtract, the integer parameter represents the number of days to add to or subtract from the given date or timestamp. date_add is an alias for add, and date_sub and date_diff are aliases for subtract.
(2) year is an alias for extractYear, month an alias for extractMonth, weekofyear and yearweek aliases for extractWeek, day and dayofmonth aliases for extractDay, hour an alias for extractHour, minute an alias for extractMinute, and second an alias for extractSecond.
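Because the integer argument in note (1) counts days, the arithmetic depends on the unit of the date type. The sketch below covers the two date types listed under Common Types; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kMillisPerDay = 24LL * 60 * 60 * 1000;

// date64[ms] stores milliseconds since the UNIX epoch, so each day added
// moves the value by 86,400,000.
inline int64_t ToyAddDaysDate64(int64_t date_ms, int32_t days) {
  return date_ms + static_cast<int64_t>(days) * kMillisPerDay;
}

// date32[day] stores days since the UNIX epoch, so days are added directly.
inline int32_t ToyAddDaysDate32(int32_t date_days, int32_t days) {
  return date_days + days;
}
```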
String Manipulation

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| bin | Unary | | | (1) |
| space | Unary | | | (2) |
| starts_with, ends_with, is_substr | Binary | | | |
| like, ilike | Binary | | | |
| like | Ternary | | | (3) |
| locate, position | Binary | | | (4) |
| locate, position | Ternary | ( | | (4) |
| octet_length, bit_length | Unary | | | |
| char_length, length | Unary | | | |
| lengthUtf8 | Unary | | | |
| reverse, ltrim, rtrim, btrim | Unary | | | |
| ltrim, rtrim, btrim | Binary | | | (5) |
| ascii | Unary | | | |
| base64 | Unary | | | |
| unbase64 | Unary | | | |
| upper, lower, initcap | Unary | | | |
| substr, substring | Binary | ( | | (6) |
| substr, substring | Ternary | ( | | (6) |
| byte_substr, bytesubstring | Ternary | ( | | |
| left, right | Binary | ( | | |
| lpad, rpad | Binary | ( | | |
| lpad, rpad | Ternary | ( | | |
| concat, concatOperator | 2 to 10 | | | (7) |
| binary_string | Unary | | | |
| split_part | Ternary | ( | | |
| replace | Ternary | ( | | |

(1) bin converts integers to their binary representation as a string. For example, bin(7) = "111".
(2) space creates a string that is a sequence of spaces whose length is the given integer.
(3) like has a ternary variation where the third parameter is an escape character, making it possible to match patterns with % in them.
(4) locate returns the starting index of the first instance of the first string parameter in the second string parameter. Note that the index is 1-based. The optional int32 argument allows you to provide a start position to skip a portion of the string. position is an alias for locate.
(5) The binary variations of ltrim, rtrim, and btrim take as a second parameter a string containing the list of characters to trim.
(6) substr returns a substring of the original string, with the integer parameters controlling the position and length. In the binary variation, the second parameter is the length from the start of the string (if positive) or the length from the end of the string (if negative). In the ternary variation, the second parameter is the starting position (1-indexed) and the third parameter acts as the offset, like the second parameter in the binary variation.
(7) concat treats null inputs as empty strings, whereas concatOperator returns null if one of the inputs is null.
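Notes (4) and (7) can be modeled in plain C++. The helpers below are hypothetical and work on bytes, so they only match the described behavior for ASCII input. Returning 0 when the substring is not found is an assumption; the notes do not state what locate returns in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Models locate(substring, string [, start]) with a 1-based result.
// Assumption: 0 means "not found".
inline int32_t ToyLocate(const std::string& sub, const std::string& str,
                         int32_t start = 1) {
  assert(start >= 1);
  std::size_t pos = str.find(sub, static_cast<std::size_t>(start - 1));
  return pos == std::string::npos ? 0 : static_cast<int32_t>(pos + 1);
}

// std::nullopt stands in for a NULL string.
using NullableString = std::optional<std::string>;

// Models concat: NULL inputs behave like empty strings.
inline NullableString ToyConcat(const NullableString& a,
                                const NullableString& b) {
  return a.value_or("") + b.value_or("");
}

// Models concatOperator: the result is NULL if any input is NULL.
inline NullableString ToyConcatOperator(const NullableString& a,
                                        const NullableString& b) {
  if (!a || !b) return std::nullopt;
  return *a + *b;
}
```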
Hash

| Function names | Arity | Input types | Output type | Notes |
|---|---|---|---|---|
| hash | Unary | Any | | (1) |
| hash32, hash32AsDouble | Unary | Any | | (2) |
| hash32 | Binary | (Any, | | (2) (3) |
| hash64, hash64AsDouble | Unary | Any | | (2) |
| hash64 | Binary | (Any, | | (2) (3) |
| hashSHA1, sha1, sha | Unary | Any | | (4) |
| hashSHA256, sha256 | Unary | Any | | (5) |

(1) Uses the hash function from C++ std::hash.
(2) Uses MurmurHash3. hash32 is an alias for hash32AsDouble and hash64 is an alias for hash64AsDouble.
(3) The second parameter is a seed.
(4) sha1 and sha are aliases for hashSHA1.
(5) sha256 is an alias for hashSHA256.