Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Stefanie Scherzinger is a Full Professor at the University of Passau, Germany.
She earned her Ph.D. from Saarland University in 2008.

She then pursued an industry career, first working for IBM, then for Google.
Her real-world experiences shaped her perspective and laid the foundation for her academic endeavors.

In 2012, Stefanie Scherzinger returned to academia, assuming a professorship at OTH Regensburg, Germany.
Since early 2020, she has been chairing the "Scalable Database Systems" group at the University of Passau.

Her research is deeply influenced by her industry background, with a particular focus on maintaining database applications, especially those powered by NoSQL data stores. Stefanie Scherzinger is committed to the long-term maintainability of software systems and places great emphasis on the importance of achieving long-term reproducibility in computer science research.

About the Talk

Deploying a machine learning pipeline is a resource-demanding task that requires a combination of data and software engineering expertise. However, even with meticulous testing, the risk of encountering run-time errors during pipeline operation remains a significant concern. In this presentation, our focus lies in addressing the run-time errors caused by mismatches between the data to be processed and the code responsible for its processing. We present strategies for the early detection of these mismatches through static and dynamic analysis, leveraging techniques from JSON Schema reasoning. Specifically, we showcase our recent contributions to essential tasks such as schema validation, schema extraction, and checking schema containment. Furthermore, we provide an outlook on the challenges introduced by the latest drafts of JSON Schema. Lastly, we conclude with a discussion on application domains for our contributions, extending beyond the fortification of machine learning pipelines against run-time errors.

Detecting Data-Code Mismatches in Machine Learning Pipelines

Summary written by Sebastian Mitte and Dinh Trung Hieu Le

Motivation

Detecting and locating run-time errors in ML pipelines is very time-consuming, especially when the error message is non-specific and reproducing or debugging the error requires substantial computation. In the lecture, Professor Scherzinger presented solutions to these problems based on the early detection of data-code mismatches, employing both static and dynamic analysis techniques informed by JSON Schema reasoning. This approach is pivotal in advancing tasks like schema validation, schema extraction, and checking schema containment. It also addresses the evolving challenges brought forth by recent updates to the JSON Schema standard. These strategies have significant potential to enhance the robustness and efficiency of machine learning pipelines, offering a pathway to reduce the likelihood of run-time errors and improve overall system reliability.

JSON and JSON Schema

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON Schema is a standard, logic-based language for describing and validating JSON documents (their structure, types, and values). Furthermore, the open document semantics of JSON Schema permits any JSON content except the explicitly prohibited properties, values, or structures. A validator (e.g., jsonschemavalidator.net) can apply a schema to a JSON file to evaluate the file's format, as sketched in the example after the keyword list below.

{ "title": "Product", "description": "A product from Acme's catalog", "type": "object", "properties": { "productId": { "description": "The unique identifier for a product", "type": "integer" }, "productName": { "description": "Name of the product", "type": "string" } }, "required": ["productId", "productName", "price"] }

  • title and description state the intent of the schema. These keywords do not add any constraints to the data being validated.
  • type defines the first constraint on the JSON data. In the product catalog example above, this keyword specifies that the data must be a JSON object.
  • properties is a validation keyword. When you define properties, you create an object where each property represents a key in the JSON data that is being validated.
  • required specifies which of the properties defined in the object are mandatory.
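
As a minimal illustration (not from the talk), the following Python sketch applies this schema with the open-source jsonschema package; any draft-compliant validator behaves analogously:

    from jsonschema import validate, ValidationError

    product_schema = {
        "title": "Product",
        "type": "object",
        "properties": {
            "productId": {"type": "integer"},
            "productName": {"type": "string"},
        },
        "required": ["productId", "productName", "price"],
    }

    # "price" is required but missing, so validation fails immediately.
    try:
        validate(instance={"productId": 1, "productName": "Anvil"},
                 schema=product_schema)
    except ValidationError as err:
        print(err.message)  # e.g. "'price' is a required property"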

IBM LALE

One application Professor Scherzinger emphasized where her research can be helpful is IBM's LALE library. LALE is a Python library for semi-automated data science and hyperparameter tuning in which JSON Schema is used to check the input and output interface of each operator as well as the input data. Because the interfaces of all operators are known, type or value mismatches can be detected early and reported with expressive error messages.
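
A small sketch of how such a pipeline looks, assuming the lale package and its scikit-learn wrappers are installed (the hyperparameter values are invented for illustration):

    from lale.lib.sklearn import PCA, LogisticRegression

    # Operators carry JSON Schemas for their hyperparameters and data;
    # ">>" composes them into a pipeline.
    pipeline = PCA(n_components=2) >> LogisticRegression()

    # An out-of-range hyperparameter value is rejected against the
    # operator's JSON Schema at construction time, with a schema-level
    # error message, instead of failing deep inside training.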

Schema validation

Scenario 1: Unexpected Input [1]

In machine learning pipelines, effectively handling unexpected or improperly formatted data is crucial. One solution is to apply JSON Schema validation at run time: as data enters the system, it is immediately checked against a predefined schema, ensuring it matches the expected format and structure. This proactive approach turns potential run-time errors into immediate validation errors, allowing for swift identification and correction. The advantage is twofold: it prevents incorrect data from propagating through the system, reducing the risk of more complex issues downstream, and it simplifies debugging by clearly pinpointing the source of errors, whether they stem from data inconsistencies, software bugs, hardware failures, or network issues. Additionally, dynamic data analysis during this stage adds an extra layer of integrity, continuously monitoring and analyzing the data flow.
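
The following sketch (hypothetical names, not from the talk) shows the fail-fast effect: a malformed batch is rejected at the entry point with a precise schema error, rather than causing an obscure failure further down the pipeline:

    from jsonschema import validate

    BATCH_SCHEMA = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {"age": {"type": "integer", "minimum": 0}},
            "required": ["age"],
        },
    }

    def ingest(batch):
        validate(instance=batch, schema=BATCH_SCHEMA)  # fails fast here ...
        return [row["age"] * 2 for row in batch]       # ... not somewhere in here

    ingest([{"age": "42"}])  # raises: "'42' is not of type 'integer'"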

Challenges

Schema validation faces several challenges:

  • Dynamic references are hard for developers to understand, and this way of linking schemas with parameters makes validation PSPACE-hard (without them, it is PTIME-complete). Dynamic references can thus crash a validator and are therefore a security risk.
  • The conditional semantics of JSON Schema means that validation only checks the parts of the data that the schema explicitly constrains (see the sketch after this list).
  • JSON Schema is not compositional: to understand the meaning of one part of a schema, you always have to consider several keywords together.
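
The conditional semantics can be surprising; in the following sketch (using the jsonschema package again), two instances one might expect to be rejected are in fact valid:

    from jsonschema import validate, ValidationError

    schema = {"properties": {"a": {"type": "string"}}}

    validate(instance={}, schema=schema)  # valid: "a" is simply absent
    validate(instance=42, schema=schema)  # valid: the instance is not even an object

    try:
        validate(instance={"a": 1}, schema=schema)  # only a present "a" is constrained
    except ValidationError as err:
        print(err.message)  # "1 is not of type 'string'"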

Paper

Validation of Modern JSON Schema: Formalization and Complexity

The authors prove that validation in this setting is highly complex: specifically, it is PSPACE-complete. This complexity stems mainly from dynamic references, rather than from annotation-based validation. They also analyze the complexity relative to the schema and to the data separately. Their findings reveal that the problem remains PSPACE-complete when the schema grows, even if the instance is fixed. However, if the schema is kept constant and only the data varies in size, the problem falls into PTIME, meaning it can be solved relatively quickly.

Developed validator

Facing these challenges, Professor Scherzinger and her team are currently developing a new JSON Schema validator called JTutor. It provides a detailed proof tree for all branches, showing for each part of the schema whether it was tested and whether it succeeded, thereby aiding the debugging process. JTutor will also be able to handle a limited number of dynamic references, which earlier validators could not, by rewriting them.

Containment check

Scenario 2: Incompatible Operators [1]

When two operators are connected into a pipeline, the compatibility of their input and output schemas can be determined. LALE uses this check between connected operators to ensure that the first operator's output is always a valid input for the second operator, since the interfaces of both operators are known.

Not elimination

The algorithm with "Not elimination" [2]: transforming JSON Schemas so that they no longer contain the "not" keyword, which is used for negation. This is significant because "not" can make schemas difficult to understand and reason about. For example, Professor Scherzinger thought that the `not` seem like optional - if its `not required`.

The most common use of `not` in JSON Schema is the `not required` pattern, an older method of specifying that certain properties are not allowed. Regarding Boolean operators, the study analyzed the total occurrences of the various JSON Schema keywords and the number of files in which they appear, essentially assessing how often the operators of the JSON Schema language are used. While `not` itself is not used frequently, it is nonetheless very important, because other operators embody implicit forms of negation. For example, `oneOf` is an exclusive union of choices: oneOf [A, B, C] means that exactly one of A, B, and C holds (e.g., A and not B and not C). So although explicit `not` is rare, such implicit negation is commonly employed.

An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents [2]

These negations can be eliminated by pushing the `not` downward in an algebraic form of the schema that Professor Scherzinger and her chair designed.
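
As an illustrative sketch (not the paper's actual algorithm), the negation of a type assertion can be pushed away by enumerating the complementary types:

    # All seven JSON Schema type names.
    ALL_TYPES = {"object", "array", "string", "number", "integer", "boolean", "null"}

    def eliminate_not_type(schema):
        """Rewrite {"not": {"type": T}} into an equivalent positive schema."""
        inner = schema.get("not", {})
        if set(inner) == {"type"}:  # only the simple {"not": {"type": ...}} case
            negated = {inner["type"]} if isinstance(inner["type"], str) else set(inner["type"])
            return {"type": sorted(ALL_TYPES - negated)}
        return schema  # everything else is left untouched in this sketch

    print(eliminate_not_type({"not": {"type": "string"}}))
    # {'type': ['array', 'boolean', 'integer', 'null', 'number', 'object']}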

Witness generation

The LALE containment checker supports no recursion and barely any negation, so Professor Scherzinger and others developed a solution [3] that understands recursion and negation by reducing containment checking to witness generation:

A witness for a schema S is an instance that satisfies S; if S is contained in a schema S', every witness for S is also a witness for S'. The algorithm therefore tries to generate a witness for S ∧ ¬S': if a witness is found, S is not contained in S'; if the algorithm cannot find one, S is contained in S'. This new containment checker for classical JSON Schema is the first complete algorithm for the task; previously, only the EXPTIME-completeness of the problem was known.
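
The reduction itself is compact; in the sketch below, generate_witness stands in for the paper's bottom-up witness generator (a placeholder, not a real API), taken here as a function that returns a satisfying instance or None:

    def is_contained(s, s_prime, generate_witness):
        """Check S ⊆ S' by searching for a witness of S ∧ ¬S'."""
        counterexample = generate_witness({"allOf": [s, {"not": s_prime}]})
        return counterexample is None  # no witness of S ∧ ¬S' means containment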

The algorithm consists of three parts: not elimination, rewriting the schema algebra into disjunctive normal form (DNF), and bottom-up witness generation. Not elimination is a prerequisite for the bottom-up phase: witnesses are generated over multiple iterations, and if a `not` remained at a higher level of the hierarchy, the algorithm would have to restart in the last step, because a witness for not-X cannot be derived from a witness for X.

Figure: File Size vs. Runtime [1]

They developed a new, straightforward algorithm specifically for generating JSON Schema witnesses and tested it thoroughly to assess its correctness and speed. These tests used a variety of schema collections, including thousands of schemas from real-world scenarios. Their analysis focuses on how completely the algorithm covers the JSON Schema language (with the exception of the uniqueItems operator) and whether it performs efficiently, in a practical amount of time, on many real-world examples: it succeeded on 99 % of the JSON Schemas collected from GitHub.

Witness Generation for JSON Schema [3]

Schema Extraction

Scenario 3: Data without Schema [1]

If we receive data without an accompanying schema, we can extract a schema directly from the data itself and then perform containment checking by comparing the extracted schemas. Current state-of-the-art methods for schema extraction focus mostly on the structure of the JSON data [4], such as how it is nested and whether it contains arrays or objects. Professor Scherzinger and her team, however, are working on identifying tagged unions within JSON Schema. A tagged union is a design pattern in which a specific property of an object (referred to as the tag) determines, by its value, which subschema applies to the other properties of the object. The team aims to formalize these relationships as conditional functional dependencies and to represent them using the JSON Schema operators if, then, and else. [5, 6]

Extracting JSON Schemas with tagged unions [6]
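
For illustration, a tagged union might look as follows in JSON Schema (the shape tag and its subschemas are invented for this example, not taken from [5, 6]):

    tagged_union_schema = {
        "type": "object",
        "properties": {"shape": {"enum": ["circle", "rectangle"]}},
        "required": ["shape"],
        # The value of the tag "shape" selects the subschema for the rest.
        "if": {"properties": {"shape": {"const": "circle"}}},
        "then": {"required": ["radius"],
                 "properties": {"radius": {"type": "number"}}},
        "else": {"required": ["width", "height"],
                 "properties": {"width": {"type": "number"},
                                "height": {"type": "number"}}},
    }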

Conclusion

JSON Schema is widely used in various fields, such as NoSQL document stores for validating inputs, Web APIs for ensuring correct data formats and interactions, and in the creation of forms based on JSON data. This broad application spectrum highlights the versatility and importance of JSON Schema in managing and validating data across different technological domains.

Prof. Dr. Stefanie Scherzinger's research focuses on advanced JSON Schema applications, including witness generation algorithms and containment checks. She explores schema extraction, particularly identifying tagged unions in JSON Schema. Her work also involves addressing machine learning pipeline issues through early error detection using JSON Schema reasoning. Additionally, she is contributing to the development of JTutor, a JSON Schema validator, and has made significant advancements in debugging and accuracy in schema validation.

Resources

[1] Lecture "Detecting Data-Code Mismatches in Machine Learning Pipelines" (tele-task.de)

[2] An Empirical Study on the “Usage of Not” in Real-World JSON Schema Documents (link.springer.com)

[3] Witness Generation for JSON Schema (arxiv.org)

[4] Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores (researchgate.net)

[5] Tagger: A Tool for the Discovery of Tagged Unions in JSON Schema Extraction (uni-regensburg.de)

[6] Extracting JSON Schemas with Tagged Unions (arxiv.org)