Queries
Buzz analytics queries are defined with SQL. A query is composed of several statements, one for each stage (see the examples below). This differs from most distributed engines, where a scheduler splits a single SQL query into multiple stages that run on different executors. The reason is that in Buzz the executors (HBees and HCombs) have very different behaviors and capabilities. This is unusual, and designing a query planner that takes it into account is not straightforward.
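To illustrate why a query is written as separate statements per stage, here is a minimal Python sketch of a two-stage aggregation. It is not Buzz code. It assumes that each HBee aggregates its own subset of the files and that an HComb merges the partial results; the sample rows and the `map_stage` / `reduce_stage` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical sample data: payment_type values found in three files,
# each file standing in for the input of one HBee.
files = [
    ["CASH", "CARD", "CASH"],
    ["CARD", "CARD"],
    ["CASH", "CARD", "CASH", "CASH"],
]


def map_stage(rows):
    """Per-file stage, mirroring:
    SELECT payment_type, COUNT(payment_type) ... GROUP BY payment_type
    """
    return Counter(rows)


def reduce_stage(partials):
    """Merge stage, mirroring:
    SELECT payment_type, SUM(payment_type_count) ... GROUP BY payment_type
    """
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the partial counts key by key
    return total


partials = [map_stage(rows) for rows in files]
result = reduce_stage(partials)
print(dict(result))
```

The second stage has to sum the partial counts rather than count again, which is why the two statements differ and cannot simply be the same SQL run twice.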
To get the list of files to scan, the engine needs a data catalog. The most common catalog today is probably Hive, but there are many alternatives, such as Delta Lake or Apache Iceberg. Buzz currently supports the following types of catalog:
- DeltaLake: the table is queried by specifying its URI on S3:
{
  "steps": [
    {
      "sql": "SELECT payment_type, COUNT(payment_type) as payment_type_count FROM nyc_taxi GROUP BY payment_type",
      "name": "nyc_taxi_map",
      "step_type": "HBee",
      "partition_filter": "pickup_date<='2009-01-05'"
    },
    {
      "sql": "SELECT payment_type, SUM(payment_type_count) FROM nyc_taxi_map GROUP BY payment_type",
      "name": "nyc_taxi_reduce",
      "step_type": "HComb"
    }
  ],
  "capacity": {
    "zones": 1
  },
  "catalogs": [
    {
      "name": "nyc_taxi",
      "type": "DeltaLake",
      "uri": "s3://cloudfuse-taxi-data/delta-tables/nyc-taxi-daily"
    }
  ]
}
- Static: the list of files, their partitioning, and their sizes must be added directly in the source code:
{
  "steps": [
    {
      "sql": "SELECT payment_type, COUNT(payment_type) as payment_type_count FROM nyc_taxi GROUP BY payment_type",
      "name": "nyc_taxi_map",
      "step_type": "HBee",
      "partition_filter": "month<='2009/06'"
    },
    {
      "sql": "SELECT payment_type, SUM(payment_type_count) FROM nyc_taxi_map GROUP BY payment_type",
      "name": "nyc_taxi_reduce",
      "step_type": "HComb"
    }
  ],
  "capacity": {
    "zones": 1
  },
  "catalogs": [
    {
      "name": "nyc_taxi",
      "type": "Static",
      "uri": "nyc_taxi_cloudfuse_sample"
    }
  ]
}
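
In both examples, the first step reads from the catalog name (`nyc_taxi`) and the second step reads from the name of the first step (`nyc_taxi_map`). The Python sketch below builds the same query definition and checks that chaining. The `check_step_sources` helper is hypothetical and not part of Buzz, and it uses a naive regular expression rather than a real SQL parser, so treat it as an illustration of the structure only.

```python
import json
import re

# Same structure as the DeltaLake example above.
query = {
    "steps": [
        {
            "sql": "SELECT payment_type, COUNT(payment_type) as payment_type_count "
                   "FROM nyc_taxi GROUP BY payment_type",
            "name": "nyc_taxi_map",
            "step_type": "HBee",
            "partition_filter": "pickup_date<='2009-01-05'",
        },
        {
            "sql": "SELECT payment_type, SUM(payment_type_count) "
                   "FROM nyc_taxi_map GROUP BY payment_type",
            "name": "nyc_taxi_reduce",
            "step_type": "HComb",
        },
    ],
    "capacity": {"zones": 1},
    "catalogs": [
        {
            "name": "nyc_taxi",
            "type": "DeltaLake",
            "uri": "s3://cloudfuse-taxi-data/delta-tables/nyc-taxi-daily",
        }
    ],
}


def check_step_sources(query):
    """Map each step name to the table it reads from.

    Hypothetical helper: raises ValueError if a step reads from a table
    that is neither a catalog nor the output of an earlier step.
    """
    known = {catalog["name"] for catalog in query["catalogs"]}
    sources = {}
    for step in query["steps"]:
        # Naive extraction of the first table name after FROM.
        match = re.search(r"\bFROM\s+(\w+)", step["sql"], re.IGNORECASE)
        if match is None or match.group(1) not in known:
            raise ValueError(f"step {step['name']!r} reads from an unknown table")
        sources[step["name"]] = match.group(1)
        known.add(step["name"])  # later steps may read this step's output
    return sources


sources = check_step_sources(query)
print(json.dumps(sources, indent=2))
```

Serializing `query` with `json.dumps(query)` yields a payload with the same shape as the examples above.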