GDPRbench is an open-source benchmark that models the functionality of a database system deployed by a company that collects and processes personal data. GDPR significantly affects the design and operation of database systems that hold personal data. Yet existing benchmarks like TPC and YCSB do not recognize the abstraction of personal data, including its legal and interfacing requirements. We designed and implemented GDPRbench after carefully analyzing the GDPR articles and reviewing legal cases from the first year of the GDPR rollout.
Collectively, GDPR articles describe control- and data-path operations that a database system must support. We refer to this set as GDPR queries.
In contrast to traditional CRUD queries, GDPR queries skew heavily towards metadata-based operations, that is, queries conditioned on attributes such as purpose, time-to-live, objections, and user ID. GDPR also restricts who can perform which operations and under what conditions.
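To make the idea of a metadata-based query concrete, here is a minimal Python sketch. It is not GDPRbench code: the `Record` fields, the function names `read_data_by_purpose` and `delete_by_user`, and the `ALLOWED` permission table are all hypothetical, chosen only to illustrate queries that filter on purpose, time-to-live, objections, and user ID rather than on a primary key.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A personal-data record with GDPR-style metadata (hypothetical layout)."""
    key: str
    data: str
    user_id: str
    purposes: set = field(default_factory=set)    # purposes the data was collected for
    objections: set = field(default_factory=set)  # purposes the user has objected to
    expires_at: float = float("inf")              # absolute time at which the TTL elapses


def read_data_by_purpose(records, purpose, now):
    """Return data of unexpired records usable for `purpose`.

    The predicate is on metadata only: purpose, objections, and TTL.
    """
    return [
        r.data
        for r in records
        if purpose in r.purposes
        and purpose not in r.objections
        and r.expires_at > now
    ]


def delete_by_user(records, user_id):
    """Return the records that remain after erasing everything owned by `user_id`."""
    return [r for r in records if r.user_id != user_id]


# Illustrative permission table (an assumption, not GDPRbench's actual policy).
ALLOWED = {
    "processor": {"read_data_by_purpose"},
    "customer": {"delete_by_user"},
}


def is_allowed(role, operation):
    """Check whether `role` may invoke `operation` under the table above."""
    return operation in ALLOWED.get(role, set())
```

Note that neither query touches the record key. A store that indexes only by key would have to scan every record to answer them, which is one reason metadata-heavy workloads stress a database differently from plain CRUD.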
Core Workloads & Metrics
We define four workloads that correspond to the four core entities of GDPR: controller, customer, processor, and regulator. Each workload is composed of the GDPR queries outlined above. We draw on legal cases and real-world usage patterns to determine the default proportion of queries within each workload and the distribution of the records they act on. Both defaults are configurable.
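As a rough illustration of what a configurable query mix and record distribution could look like, the Python sketch below generates a sequence of operations from a weighted mix. The query names and proportions in `QUERY_MIX` are placeholders rather than GDPRbench's defaults, and `generate_workload` is a hypothetical helper, not part of the benchmark.

```python
import random

# Hypothetical query mix for one workload. The names and proportions are
# placeholders for illustration, not GDPRbench's default values.
QUERY_MIX = {
    "read_data_by_purpose": 0.6,
    "update_metadata_by_user": 0.3,
    "delete_by_ttl": 0.1,
}


def generate_workload(query_mix, record_count, operation_count, seed=None):
    """Return a list of (query_name, record_index) pairs.

    Queries are drawn in proportion to the weights in `query_mix`.
    Records are chosen uniformly here; a skewed distribution (for example
    Zipfian) could be substituted at the marked line.
    """
    rng = random.Random(seed)
    names = list(query_mix)
    weights = [query_mix[name] for name in names]
    queries = rng.choices(names, weights=weights, k=operation_count)
    # Record distribution: uniform over [0, record_count).
    return [(query, rng.randrange(record_count)) for query in queries]
```

Changing the workload then amounts to editing the weights or swapping the record-selection line, without touching the code that executes the queries.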
The benchmark then characterizes a database system's GDPR compliance using three metrics: correctness against GDPR workloads, time taken to respond to GDPR queries, and storage space overhead.
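The three metrics can be sketched in Python as follows. The exact definitions are assumptions made for illustration: correctness is taken as the fraction of responses that match the expected results, completion time as wall-clock time to run a batch of queries, and space overhead as the ratio of total database size to the size of the personal data alone. The function names are hypothetical and do not come from GDPRbench.

```python
import time


def correctness(responses, expected):
    """Fraction of responses equal to the expected results (assumed definition)."""
    if not expected or len(responses) != len(expected):
        raise ValueError("responses and expected must be non-empty and equal in length")
    matches = sum(1 for got, want in zip(responses, expected) if got == want)
    return matches / len(expected)


def completion_time(run_query, queries):
    """Run every query and return (elapsed_seconds, responses)."""
    start = time.perf_counter()
    responses = [run_query(q) for q in queries]
    return time.perf_counter() - start, responses


def space_overhead(total_bytes, personal_data_bytes):
    """Ratio of total storage used to the size of personal data (assumed definition)."""
    if personal_data_bytes <= 0:
        raise ValueError("personal_data_bytes must be positive")
    return total_bytes / personal_data_bytes
```

Under these definitions a space overhead of 3.0 would mean the database stores two bytes of metadata, indexes, or logs for every byte of personal data.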