
Multi-Backend Zonal Statistics Execution with Raven

Gereon Dusella

Technische Universität Berlin

Germany

[email protected]

Haralampos Gavriilidis

Technische Universität Berlin

Germany

[email protected]

Laert Nuhu∗

Deutsche Kreditbank AG

Germany

[email protected]

Volker Markl

Technische Universität Berlin, DFKI

Germany

[email protected]

Eleni Tzirita Zacharatou

IT University of Copenhagen

Denmark

[email protected]

ABSTRACT

The recent explosion in the number and size of spatial remote sensing datasets from satellite missions creates new opportunities for data-driven approaches in domains such as climate change monitoring and disaster management. These approaches typically involve a feature engineering step that summarizes remote sensing pixel data located within zones of interest defined by another spatial dataset, an operation called zonal statistics. Although several spatial systems support zonal statistics operations, they differ significantly in terms of interfaces, architectures, and algorithms, making it hard for users to select the best system for a specific workload. To address this limitation, we propose Raven, a zonal statistics framework that provides users with a unified interface across multiple execution backends, while facilitating easy benchmarking and comparisons across systems. This demonstration showcases Raven’s multi-backend execution environment, domain-specific declarative language, optimization techniques, and benchmarking capabilities.

CCS CONCEPTS

• Information systems → Spatial-temporal systems; • Applied computing → Earth and atmospheric sciences.

KEYWORDS

unified spatial data analytics; zonal statistics; parcel-based classification; spatial join; satellite imagery; big spatial data

ACM Reference Format:

Gereon Dusella, Haralampos Gavriilidis, Laert Nuhu, Volker Markl, and Eleni Tzirita Zacharatou. 2024. Multi-Backend Zonal Statistics Execution with Raven. In Companion of the 2024 International Conference on Management of Data (SIGMOD-Companion ’24), June 9–15, 2024, Santiago, AA, Chile. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3626246.3654730

1 INTRODUCTION

Over the past decade, the launch of an ever-increasing number of satellites has led to the accumulation of unprecedented volumes of Earth Observation (EO) data [ ]. For example, the Sentinel archive alone contains Earth images captured by eight satellites, amounting to 6.64 petabytes [ ]. The efficient processing of EO data offers an opportunity to substantially improve our understanding of our planet’s state and the changes that occur on it [ ].

∗Work performed while at Technische Universität Berlin

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

SIGMOD-Companion ’24, June 9–15, 2024, Santiago, AA, Chile

ACM ISBN 979-8-4007-0422-2/24/06

https://doi.org/10.1145/3626246.3654730

To extract meaningful features from EO imagery that can be used to train ML models, it is often necessary to compute aggregate information for image pixels within specific zones of interest defined by another spatial dataset [ ], a process commonly known as zonal statistics (ZS). Remote sensing images are available in raster format, a multidimensional array representation where each pixel corresponds to a geographical region, while the pixel value reflects some characteristics of that region. However, spatial datasets defining zones of interest, like city boundaries from OpenStreetMap [14], are often in vector format, representing geographical features with points, lines, and polygons. As a result, computing zonal statistics requires combining heterogeneous raster and vector datasets. For example, to train an ML model for monitoring and predicting changes in vegetation health in different land plots over time, one needs to generate aggregated (e.g., mean and median) Normalized Difference Vegetation Index (NDVI) statistics as features for each land plot [ ]. This feature engineering process requires joining remote sensing imagery data (raster) with land plot data (vector).
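To make the operation concrete, the following minimal sketch computes mean and median NDVI per land plot on a toy scene. It assumes the land plots have already been rasterized into a grid of zone identifiers aligned with the imagery; the reflectance values, zone labels, and function names are illustrative and not part of Raven. NDVI is computed per pixel as (NIR − Red) / (NIR + Red).

```python
from statistics import mean, median


def ndvi(nir, red):
    """Normalized Difference Vegetation Index for a single pixel."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def zonal_statistics(values, zones):
    """Group pixel values by zone id and aggregate each group.

    `values` and `zones` are equally shaped 2-D lists. A zone id of None
    marks a pixel that lies outside every zone of interest.
    """
    groups = {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is not None:
                groups.setdefault(zone, []).append(value)
    return {
        zone: {"mean": mean(pixels), "median": median(pixels)}
        for zone, pixels in groups.items()
    }


# Toy 2x3 scene: near-infrared and red reflectance per pixel (made-up values).
nir = [[0.8, 0.6, 0.5], [0.9, 0.4, 0.3]]
red = [[0.2, 0.2, 0.5], [0.1, 0.4, 0.1]]
# Land plots "A" and "B", already rasterized onto the same grid.
zones = [["A", "A", "B"], ["A", "B", None]]

ndvi_grid = [
    [ndvi(n, r) for n, r in zip(nir_row, red_row)]
    for nir_row, red_row in zip(nir, red)
]
stats = zonal_statistics(ndvi_grid, zones)
print(stats)
```

Real workloads differ from this sketch mainly in scale and in how the raster-vector join is evaluated, which is where the systems discussed next diverge.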

Current spatial systems for zonal statistics confront users with a jungle of interfaces, capabilities, and requirements. This plethora of different systems poses challenges in selecting the best system for a given workload. First, for systems lacking support for both raster and vector data, users need to perform additional pre-processing steps, i.e., rasterizing vector datasets or vectorizing raster datasets. Furthermore, they might need to perform file format conversions, given that most systems support only a limited number of file formats. The diversity of interfaces across spatial systems poses further challenges. First, it locks users into their initial system choice due to the significant effort required to rewrite applications. Second, it introduces a substantial barrier when testing different systems for optimal performance. For example, while both Beast [9] and PostGIS [16] support joining raster and vector data, Beast employs a map-reduce-like API, while PostGIS offers an SQL-like API. To provide a good user experience, avoid vendor lock-in, and optimize performance, there is a need to abstract ZS operations and enable their unified execution across multiple spatial systems.

To address the challenges in processing zonal statistics over large-scale heterogeneous datasets, we developed Raven, a zonal statistics framework that offers users a unified interface across multiple spatial systems serving as execution backends. Raven exposes a DSL tailored for zonal statistics, abstracts system-specific details, and optimizes execution. Furthermore, it supports effortless system benchmarking, assisting users in selecting the most efficient system for their workload. To the best of our knowledge, Raven is the first system for unified spatial analytics. Previous efforts to unify data analytics focus on integrating structured and semi-structured data with SQL [ ] and map-reduce-like interfaces [ ]; however, these efforts do not provide support for spatial operations.

In this demonstration, we first aim to illustrate the complexity of

implementing zonal statistics in different spatial systems. We let the

audience interact with the systems and showcase their diversity in

terms of interfaces, capabilities, and requirements. We then dive into

Raven’s internals, enabling the audience to implement a zonal statis-

tics task within our tool. We discuss how Raven translates this task

into different system APIs, performs necessary pre-processing, and

manages the execution lifecycle. Furthermore, we highlight how

Raven guides users in selecting the best system for their task by exe-

cuting this task on multiple state-of-the-art spatial systems and gen-

erating performance metrics that offer insights into performance

variations among these systems. Finally, we highlight the benefits

of Raven by implementing an exemplary application and letting

users manipulate different parameters and datasets interactively.

2 RAVEN OVERVIEW

Today’s data scientists face multiple challenges when implement-

ing zonal statistics, due to the varying interfaces and configuration

parameters exposed by today’s spatial systems, the varying pre-

processing steps that these systems require, and their divergent

runtime performance capabilities. In response to these challenges,

Raven aims to 1) offer an easy-to-use zonal statistics interface and

2) highlight performance differences in spatial systems. To achieve

this, Raven exposes a declarative zonal statistics interface based on

a DSL that we developed. Using this DSL, Raven can transparently

optimize and execute a given zonal statistics task on multiple spatial

systems. As a result, Raven provides system independence, thereby

helping users avoid vendor lock-ins. Furthermore, by automating

execution and providing detailed performance results, Raven sim-

plifies selecting the most efficient system for a given workload. In

the following, we give a brief overview of Raven’s components.

Figure 1: Raven Architecture (diagram omitted; it connects the Data Scientist, Pipeline Planner, Pipeline Manager, Execution Interface with IR Converter and SpS-Connector, Preprocessor, Spatial System, and Experiment Analyzer)

2.1 Architecture Overview

Figure 1 presents Raven’s architecture. Raven takes as input a zonal statistics task expressed in its DSL (the query) and relies on its Pipeline Planner for optimization. Additionally, the Pipeline Planner takes as input a “System Capabilities” file, specifying the operations supported by the execution backends. Based on this information, it determines the need for pre-processing steps, such as format or Coordinate Reference System (CRS) conversions, and selects the appropriate join type. The Pipeline Planner outputs an Abstract Syntax Tree (AST) including all required operations, from pre-processing to joining and aggregation. Then, the Pipeline Manager is responsible for assembling and executing the pipeline. Here, Raven relies on (system-developer-provided) implementations of the Execution Interface, which includes an IR (Internal Representation) Converter and an SpS (Spatial System) Connector. The IR Converter translates Raven’s AST into system-specific code using parameterized templates, and the SpS-Connector executes this code on the underlying systems and retrieves the results. Raven stores execution metrics, e.g., runtime and resource consumption for each step, in its experiment database, which the Experiment Analyzer uses to gain insights into the execution.

 1  # Datasets definition
 2  zs_result = ZSGen.build(
 3      raster="/data/sentinel2a_mol_band9",
 4      vector="/data/ALKIS_bezirk_MOL")
 5  # Aggregation operations
 6      .group("oid")
 7      .summarize({"max": ZSGen.MAX, "avg": ZSGen.AVG})
 8      .join_using(ZSGen.INTERSECT)
 9  # Systems
10      .system([ZSSystem.PostGIS(params),...])

Listing 1: A simple ZS Task in Raven’s DSL

1 Raven is open-source, available at: https://github.com/polydbms/RaVeN
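The split between the two parts of the Execution Interface can be sketched as a pair of abstract classes. This is only an illustration of the separation of concerns described above: the class names, method signatures, and the toy backend are assumptions and do not reflect Raven’s actual code.

```python
from abc import ABC, abstractmethod


class IRConverter(ABC):
    """Translates a backend-independent plan into system-specific code."""

    @abstractmethod
    def to_code(self, plan: dict) -> str:
        ...


class SpSConnector(ABC):
    """Runs system-specific code on a spatial system and fetches results."""

    @abstractmethod
    def run(self, code: str) -> list:
        ...


class ToyConverter(IRConverter):
    # Hypothetical backend with a one-line query language.
    def to_code(self, plan: dict) -> str:
        return f"ZONAL {plan['agg']} OF {plan['raster']} BY {plan['vector']}"


class ToyConnector(SpSConnector):
    # "Executes" by echoing the code; a real connector would talk to a system.
    def run(self, code: str) -> list:
        return [code]


def execute(plan: dict, converter: IRConverter, connector: SpSConnector) -> list:
    """Convert the plan first, then hand the generated code to the connector."""
    return connector.run(converter.to_code(plan))


result = execute(
    {"agg": "avg", "raster": "r", "vector": "v"}, ToyConverter(), ToyConnector()
)
print(result)
```

Under this design, adding a backend means supplying one converter and one connector, while the planner and manager stay unchanged.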

2.2 Raven’s Domain-Specific Language

Performing zonal statistics on raw raster and vector data involves multiple steps. First, data require pre-processing to handle variations in format, Coordinate Reference Systems (CRSs), and standards governing geometry representation and interpretation. Second, data might require filtering based on specified conditions. The next processing stage involves joining and aggregating the data. In this stage, one can apply various interpretations for the join condition and implement optimizations, such as tuning tile sizes. To abstract these steps, we designed a simple DSL for Raven, allowing users to easily express zonal statistics on both raster and vector datasets. The DSL exposes primitives for zonal statistics computation, such as defining transformations, filter predicates, and join conditions. Listing 1 shows an example. Here, the user loads a raster and a vector dataset (Lines 2–4), selects a grouping key, two aggregate functions, and a join method (Lines 6–8), and chooses PostGIS for execution (Line 10). Note that, for brevity, we do not show DSL primitives related to other parameters and spatial systems.

Subsequently, Raven converts programs expressed in its DSL

to system-specific implementations, optimizes them, and executes

them across the user-specified spatial systems as described next.
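A declarative, chainable interface of this kind can be approximated with a small builder that only records the task and defers all execution. The sketch below mirrors the method names of Listing 1, but the `ZSTask` class and the plain-string constants are hypothetical stand-ins, not Raven’s implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ZSTask:
    """Hypothetical task description collected by a fluent DSL."""

    raster: str
    vector: str
    group_key: Optional[str] = None
    aggregates: Dict[str, str] = field(default_factory=dict)
    join: Optional[str] = None
    systems: List[str] = field(default_factory=list)

    def group(self, key: str) -> "ZSTask":
        self.group_key = key
        return self

    def summarize(self, aggregates: Dict[str, str]) -> "ZSTask":
        self.aggregates.update(aggregates)
        return self

    def join_using(self, method: str) -> "ZSTask":
        self.join = method
        return self

    def system(self, systems: List[str]) -> "ZSTask":
        self.systems.extend(systems)
        return self


# Each call returns the task itself, so the calls chain as in Listing 1.
task = (
    ZSTask(raster="/data/sentinel2a_mol_band9", vector="/data/ALKIS_bezirk_MOL")
    .group("oid")
    .summarize({"max": "MAX", "avg": "AVG"})
    .join_using("INTERSECT")
    .system(["postgis"])
)
print(task)
```

Because the builder stores a description rather than running anything, the same task object can later be planned and executed on several backends.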

2.3 Zonal Statistics Pipelines

The AST generated by Raven’s Pipeline Planner (cf. Figure 1) encapsulates the end-to-end processing of a zonal statistics task. This includes pre-processing operations, such as changing format to support loading into the given system, aligning CRSs, and filtering the datasets, as well as performing the join and aggregation. Raven currently optimizes the execution plan using simple heuristics, such as reducing redundant data loading by filter pushdown.

2 Tiles are used to divide raster data into smaller chunks.

Figure 2: Raven’s Config. Panel

Figure 3: Zonal Statistics Pipeline Viewer

Figure 4: Benchmark Results
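The planning logic described above can be illustrated with a small function that derives the pipeline steps from a task and a capabilities description. The dictionary keys, step names, format labels, and the choice to reproject the vector data onto the raster CRS are assumptions made for this sketch; Raven’s planner and its capabilities file may be structured differently.

```python
def plan_pipeline(task, capabilities):
    """Return an ordered list of (operation, argument) steps for one backend."""
    steps = []
    # Filter pushdown: apply the predicate before conversion and ingestion,
    # so later steps handle less data.
    if task.get("filter"):
        steps.append(("filter_vector", task["filter"]))
    # Convert formats the backend cannot load directly.
    if task["raster_format"] not in capabilities["raster_formats"]:
        steps.append(("convert_raster", capabilities["raster_formats"][0]))
    if task["vector_format"] not in capabilities["vector_formats"]:
        steps.append(("convert_vector", capabilities["vector_formats"][0]))
    # Align coordinate reference systems (assumed: vector follows raster).
    if task["raster_crs"] != task["vector_crs"]:
        steps.append(("reproject_vector", task["raster_crs"]))
    steps.append(("ingest", None))
    steps.append(("join", task["join"]))
    steps.append(("aggregate", tuple(task["aggregates"])))
    return steps


task = {
    "raster_format": "jp2",
    "vector_format": "shapefile",
    "raster_crs": "EPSG:25833",
    "vector_crs": "EPSG:4326",
    "filter": "area > 1000",
    "join": "intersect",
    "aggregates": ["max", "avg"],
}
capabilities = {
    "raster_formats": ["geotiff"],
    "vector_formats": ["shapefile", "geojson"],
}

steps = plan_pipeline(task, capabilities)
for step in steps:
    print(step)
```

In this example the vector format is already supported, so only the raster is converted, and the filter runs before any conversion or loading.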

The Pipeline Manager (cf. Figure 1) transforms the AST into system-specific code. To achieve this, it employs parameterized templates in the APIs of the supported backend systems through the IR Converter. System developers need to provide these templates when integrating a system into Raven. Finally, Raven executes the system-specific code through the SpS-Connector. In our reference implementation, we support pre-processing using GDAL and zonal statistics execution on Beast [9], PostGIS [16], RasDaMan [3], heavy.ai [13], and Sedona [18]. The Pipeline Manager composes the pipeline by filling the templates with information from the AST, and coordinates the execution.
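Template filling of this kind can be sketched with Python’s `string.Template`. The SQL below is an illustrative guess at what a PostGIS template might look like, built from the PostGIS raster functions `ST_Intersects`, `ST_Clip`, and `ST_SummaryStats`; it is not Raven’s actual template and has not been run against a database. The table names, the `ast` dictionary layout, and the helper names are hypothetical.

```python
from string import Template

# Parameterized template: placeholders are filled from the task description.
POSTGIS_TEMPLATE = Template(
    "SELECT v.$group_key,\n"
    "       $aggregates\n"
    "FROM $vector_table AS v\n"
    "JOIN $raster_table AS r ON ST_Intersects(r.rast, v.geom)\n"
    "GROUP BY v.$group_key;"
)

# Per-tile summary statistics of the raster clipped to the zone geometry.
STATS = "ST_SummaryStats(ST_Clip(r.rast, v.geom), 1, true)"

# Combine per-tile statistics into one value per zone.
AGGREGATE_SQL = {
    "max": f"MAX(({STATS}).max)",
    "avg": f"SUM(({STATS}).sum) / NULLIF(SUM(({STATS}).count), 0)",
}


def render_postgis(ast):
    """Fill the template with values taken from a (hypothetical) AST dict."""
    aggregates = ",\n       ".join(
        f"{AGGREGATE_SQL[function]} AS {alias}"
        for alias, function in ast["aggregates"].items()
    )
    return POSTGIS_TEMPLATE.substitute(
        group_key=ast["group_key"],
        aggregates=aggregates,
        vector_table=ast["vector_table"],
        raster_table=ast["raster_table"],
    )


sql = render_postgis(
    {
        "group_key": "oid",
        "aggregates": {"max": "max", "avg": "avg"},
        "vector_table": "plots",
        "raster_table": "scene",
    }
)
print(sql)
```

A production template would additionally need identifier quoting and validation, since plain string substitution offers no protection against malformed names.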

2.4 Benchmarking Mode

The performance of zonal statistics tasks in different spatial systems

can vary significantly depending on data and workload. In addition

to facilitating the seamless execution of zonal statistics across mul-

tiple systems with diverse configurations, Raven also allows users

to benchmark these systems. To facilitate benchmarking, Raven fea-

tures a dedicated benchmarking mode. This mode allows users to

execute multiple pipelines and produce detailed performance plots,

e.g., a performance breakdown by pipeline stage. These plots

enable Raven’s users to compare different systems and parameter

combinations. As a result, users can gain insights into potential bot-

tlenecks and enhance system performance by fine-tuning available

parameters. Overall, Raven’s integrated benchmarking component

provides valuable tools for optimizing zonal statistics tasks across

diverse spatial systems.
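A per-stage breakdown of this kind can be collected with a simple timing harness. The sketch below assumes each pipeline is a list of named stages backed by Python callables and reports the mean wall-clock time per stage; the system and stage names are placeholders, and Raven’s own metrics (e.g., resource consumption) are not modeled.

```python
import time
from collections import defaultdict


def run_benchmark(pipelines, repetitions=3):
    """Time every stage of every pipeline and return mean seconds per stage.

    `pipelines` maps a system name to a list of (stage_name, callable) pairs.
    """
    timings = defaultdict(lambda: defaultdict(list))
    for system, stages in pipelines.items():
        for _ in range(repetitions):
            for stage, action in stages:
                start = time.perf_counter()
                action()
                timings[system][stage].append(time.perf_counter() - start)
    return {
        system: {stage: sum(t) / len(t) for stage, t in stages.items()}
        for system, stages in timings.items()
    }


def busy_work(n):
    """Stand-in for a real pipeline stage."""
    return lambda: sum(range(n))


pipelines = {
    "system_a": [("preprocess", busy_work(10_000)), ("ingest", busy_work(5_000)),
                 ("execute", busy_work(20_000))],
    "system_b": [("ingest", busy_work(8_000)), ("execute", busy_work(15_000))],
}

report = run_benchmark(pipelines)
for system, breakdown in report.items():
    print(system, breakdown)
```

Keeping the stages separate in the report is what makes it possible to see whether a system loses time in pre-processing, ingestion, or the join itself.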

2.5 Integration with QGIS

To enhance the interactivity of our demonstration, we seamlessly

integrated Raven into QGIS [20]. This integration enables users

to visually formulate a query through a user-friendly UI, which

is then translated into Raven’s IR AST. The UI automatically pulls

information about the loaded data (or layers) from QGIS. Further-

more, users can specify other processing parameters, such as the

type of vectorization applied. When a user selects multiple sys-

tems or multiple conflicting processing parameters (e.g., different

tile sizes), Raven automatically switches to benchmarking mode.

The benchmarking results are displayed in the main QGIS UI after

Raven concludes its benchmark run.
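The switch to benchmarking mode can be thought of as expanding the user’s selections into individual runs and checking whether more than one run results. The following sketch illustrates this idea under that assumption; the function name, parameter names, and values are hypothetical and not taken from Raven.

```python
from itertools import product


def expand_runs(systems, parameters):
    """Expand selected systems and parameter values into individual runs.

    Returns the list of run configurations and a flag that is True when
    more than one run is needed, i.e., when benchmarking mode applies.
    """
    names = sorted(parameters)
    combinations = product(systems, *(parameters[name] for name in names))
    runs = [
        {"system": combo[0], **dict(zip(names, combo[1:]))}
        for combo in combinations
    ]
    return runs, len(runs) > 1


# Two systems and two tile sizes yield four runs: benchmarking mode.
runs, benchmark = expand_runs(["postgis", "sedona"], {"tile_size": [100, 500]})
print(len(runs), benchmark)

# One system with one value per parameter yields a single, regular run.
single_runs, single_benchmark = expand_runs(["postgis"], {"tile_size": [100]})
print(len(single_runs), single_benchmark)
```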

In addition to the UI, we integrated Raven into the Processing

API of QGIS. This API empowers users to build complex pipelines

with multiple inputs, outputs, and parameters. Consequently, users

can harness Raven through QGIS to tackle more complex problems

and execute recurring tasks effortlessly.

3 DEMONSTRATION PLAN

The goal of this demonstration is twofold. First, it aims to show

the intricacy of implementing zonal statistics using state-of-the-

art spatial data management systems. Second, it aims to showcase

the capabilities of Raven. Specifically, we show its ease of use for

expressing zonal statistics and its practical utility for spatial appli-

cations. Furthermore, we show how Raven assists data scientists in

selecting the most efficient spatial system for a given task, leverag-

ing its benchmarking mode. Attendees can experience a live demo

of Raven using the QGIS UI. The UI runs on a local laptop, where

Raven can be run directly. For scenarios involving large datasets, the

laptop connects to a remote server hosting Raven. In the following,

we describe our demonstration scenarios.

3.1 Exploring State-of-the-Art Systems

First, we introduce the audience to the fundamental characteristics

of geospatial data, emphasizing the differences between raster and

vector data. Then, we describe the task of zonal statistics and ex-

plain how raster data can be combined with vector data. Finally, we

show how users implement zonal statistics tasks in state-of-the-art systems by presenting an example task, e.g., computing NDVI statistics, and demonstrating its implementation using the interfaces of various systems, e.g., Beast, PostGIS, and RasDaMan.

Figure 5: Zonal Statistics Output

3.2 Implementing Zonal Statistics with Raven

Our second demonstration scenario aims to familiarize the audi-

ence with Raven. For this purpose, we guide the audience through

Raven’s DSL and show its ease of use for implementing zonal sta-

tistics tasks. Leveraging the integration of Raven with QGIS (cf.

Section 2.5), users can seamlessly compose zonal statistics tasks

in Raven’s DSL through a GUI, as shown in Figure 2. In the first part of the panel, users can choose among available raster and vector datasets and apply filter predicates. In the second part, users can specify values for different processing and optimization parameters. Then, in the third part, users can define

evaluation parameters mostly relevant for Raven’s benchmarking

mode. In this scenario, we also discuss how Raven abstracts the

APIs of different spatial systems, translates zonal statistics tasks

into these APIs, performs pre-processing of raw raster and vector

data, and efficiently orchestrates the execution lifecycle. To that

end, we use the ZS Pipeline Viewer (cf. Figure 3) that visualizes

the generated query plan for each system based on Raven’s IR. The

visualization uses different colors to highlight different types of

operations, i.e., pre-processing, ingestion, and actual execution.

3.3 Benchmarking Zonal Statistics with Raven

In our third demonstration scenario, we aim to illustrate the perfor-

mance discrepancies among state-of-the-art spatial data manage-

ment systems when executing a zonal statistics task. Therefore, we

leverage Raven’s benchmarking mode to execute a given task across

multiple systems and explore different configurations. Throughout

the entire execution lifecycle, Raven gathers statistics that provide

valuable information on each processing stage: from format con-

versions to CRS alignment, data filtering, and finally joining raster

and vector data.

To show the performance characteristics of different systems, we

run pre-defined zonal statistics tasks and discuss the performance

breakdown graphs generated by Raven (cf. Figure 4). Additionally,

users can define their own zonal statistics tasks and benchmarking

parameters using Raven’s configuration panel, thereby gaining a

better understanding of how different settings influence the per-

formance of different systems. Overall, in this scenario, we demon-

strate Raven’s versatility in benchmarking spatial data management

systems and emphasize its utility in guiding users to select the right

system for their needs.

3.4 Raven as a Spatial Application Backend

In our fourth demonstration scenario, we want to show how one can

integrate Raven with current spatial applications. For this purpose,

we describe our integration with QGIS. This integration allows com-

puting zonal statistics on any available underlying spatial system

and then returning the result to QGIS to produce different data vi-

sualizations. In this demonstration scenario, attendees can actively

interact with Raven through the GUI and visually explore raster and

vector data, along with the produced zonal statistics (cf. Figure 5).

We will supply raster and vector datasets, but we also encourage

attendees to bring their datasets that we can explore with Raven.

ACKNOWLEDGEMENTS

We gratefully acknowledge funding from the German Federal Min-

istry of Education and Research under the grants BIFOLD24B and

01IS17052 (as part of the Software Campus project PolyDB).

REFERENCES

[1] European Space Agency. 2024. Copernicus Data Space Ecosystem. https://dataspace.copernicus.eu/

[2] Ahmet Kerem Aksoy, Pavel Dushev, Eleni Tzirita Zacharatou, Holmer Hemsen, Marcela Charfuelan, Jorge-Arnulfo Quiané-Ruiz, Begüm Demir, and Volker Markl. 2022. Satellite image search in AgoraEO. PVLDB 15, 12 (2022), 3646–3649.

[3] Peter Baumann, Andreas Dehmel, Paula Furtado, Roland Ritsch, and Norbert Widmann. 1998. The multidimensional database system RasDaMan. In Proc. SIGMOD. 575–577.

[4] Kaustubh Beedkar, Bertty Contreras-Rojas, Haralampos Gavriilidis, Zoi Kaoudi, Volker Markl, Rodrigo Pardo-Meza, and Jorge-Arnulfo Quiané-Ruiz. 2023. Apache Wayang: A Unified Data Analytics Framework. SIGMOD Rec. 52, 3 (2023).

[5] Michael J Carey et al. 1995. Towards heterogeneous multimedia information systems: The Garlic approach. In Proc. RIDE-DOM.

[6] Adriana Grazia Castriotta. 2023. Copernicus Sentinel Data Access Annual Report Y2022. Technical Report 1. European Commission. 119 pages.

[7] Russell G. Congalton and Kate Green. 2020. Assessing the Accuracy of Remotely Sensed Data: Principles and Practices. CRC Press, Taylor & Francis Group.

[8] Arne de Wall, Björn Deiseroth, Eleni Tzirita Zacharatou, Jorge-Arnulfo Quiané-Ruiz, Begüm Demir, and Volker Markl. 2021. Agora-EO: A Unified Ecosystem for Earth Observation – A Vision for Boosting EO Data Literacy –. In Proc. Big Data from Space (BiDS).

[9] Ahmed Eldawy, Vagelis Hristidis, Saheli Ghosh, Majid Saeedan, Akil Sevim, A.B. Siddique, Samriddhi Singla, Ganesh Sivaram, Tin Vu, and Yaming Zhang. 2021. Beast: Scalable Exploratory Analytics on Spatio-temporal Data. In Proc. CIKM.

[10] Lukas Kondmann et al. 2021. DENETHOR: The DynamicEarthNET dataset for Harmonized, inter-Operable, analysis-Ready, daily crop monitoring from space. In Proc. NeurIPS Datasets and Benchmarks.

[11] Stefanie Holzwarth et al. 2020. Earth Observation Based Monitoring of Forests in Germany: A Review. Remote Sensing 12, 21 (2020), 3570.

[12] Haralampos Gavriilidis, Kaustubh Beedkar, Jorge-Arnulfo Quiané-Ruiz, and Volker Markl. 2023. In-situ cross-database query processing. In Proc. ICDE. 2794–2807.

[13] HEAVY.AI. 2024. https://heavy.ai/

[14] OpenStreetMap. 2024. https://www.openstreetmap.org.

[15] Paul J. Pinter, Jr., Jerry L. Hatfield, James S. Schepers, Edward M. Barnes, M. Susan Moran, Craig S.T. Daughtry, and Dan R. Upchurch. 2003. Remote Sensing for Crop Management. Photogrammetric Engineering & Remote Sensing 69, 6 (2003).

[16] PostGIS. 2023. https://postgis.net/

[17] Jerry C. Ritchie, Paul V. Zimba, and James H. Everitt. 2003. Remote Sensing Techniques to Assess Water Quality. Photogrammetric Engineering & Remote Sensing 69, 6 (2003), 695–704.

[18] Apache Sedona. 2024. https://sedona.apache.org/

[19] Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al. 2019. Presto: SQL on everything. In Proc. ICDE. 1802–1813.

[20] QGIS Geographic Information System. 2024. http://qgis.org