Recently, while looking for ways to improve Schemathesis' test suite, I found an old issue titled "Test with generated schemas". The idea is to generate random Open API specs and verify that Schemathesis does not fail with internal errors, ensuring it will work with a wider range of real-life API schemas.

Initially, I considered using hypothesis-jsonschema with Open API's official meta schemas. However, this approach was hindered by a limitation in the current hypothesis-jsonschema implementation: its lack of support for recursive references, which are common in Open API specs.

This article explores an alternative approach to generating valid Open API specs by leveraging Hypothesis' from_type and abusing Python's "typing" module. In the following sections, we will go through the implementation details, challenges faced, and potential future improvements.

If you develop a tool that works with Open API specs, you may find useful ideas for enhancing your tool's test suite and improving its resilience.

Schemathesis is a powerful tool for testing web applications built with Open API and GraphQL specifications. The prototype described in this article immediately found a bug in it.

Open API meta-schemas

Let's first understand the structures we need to generate. Open API specifications are formally described using JSON Schema, for example, the Open API 2.0 meta-schema looks like this:

{
  "title": "A JSON Schema for Swagger 2.0 API.",
  "id": "http://swagger.io/v2/schema.json#",
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "required": [
    "swagger",
    "info",
    "paths"
  ],
  "properties": {
    "swagger": {
      "type": "string",
      "enum": [
        "2.0"
      ],
      "description": "The Swagger version of this document."
    },
    "info": {
      "$ref": "#/definitions/info"
    },
  // ...
}

One important aspect of the Open API meta-schema is the presence of recursive references. For example, the schema definition references itself in multiple places, such as through the allOf keyword, where each sub-schema is itself a schema:

  "definitions": {
    "schema": {
      "type": "object",
      "description": "A deterministic version of a JSON Schema object.",
      "properties": {
        "allOf": {
          "type": "array",
          "minItems": 1,
          "items": {
            "$ref": "#/definitions/schema"
          }
        },
    // ...

This recursive structure allows for the definition of complex data structures in the API schema.

Generating valid Open API specs requires adhering to these meta-schemas and additional constraints that cannot be expressed with JSON Schema alone. These constraints are documented in the Open API specification, such as ensuring parameters are unique based on their "name" and "in" keys.
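To make that constraint concrete, here is a minimal illustration (not Schemathesis code) of the "unique by name and location" rule for parameters, which JSON Schema alone cannot express:

```python
def parameters_are_unique(parameters: list[dict]) -> bool:
    # Open API requires that no two parameters share the same
    # combination of the "name" and "in" keys.
    seen = set()
    for parameter in parameters:
        key = (parameter["name"], parameter["in"])
        if key in seen:
            return False
        seen.add(key)
    return True

# The same name is allowed in different locations...
valid = [{"name": "id", "in": "path"}, {"name": "id", "in": "query"}]
# ...but not twice in the same location.
invalid = valid + [{"name": "id", "in": "query"}]
```

Any generator of valid specs has to enforce such rules on top of the meta-schema itself.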

The Power of Hypothesis

Hypothesis provides a wide range of strategies for generating test data, including the ability to generate recursive structures using the recursive strategy. However, applying the recursive strategy manually to Open API's recursive definitions looks tedious and complicated.

One of Hypothesis' underrated features is its ability to infer strategies for Python types using the from_type strategy. In the past, I've often avoided using Python's type annotations and instead defined all strategies manually. It turns out that from_type is incredibly useful and works with many standard data types out of the box, including support for recursion.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Foo:
    bar: "Bar"

@dataclass
class Bar:
    foo: Optional[List["Foo"]]

>>> from hypothesis import strategies as st
>>> st.from_type(Bar).example()
Bar(foo=[Foo(bar=Bar(foo=[])), Foo(bar=Bar(foo=None)), Foo(bar=Bar(foo=[]))])

The core idea is to express the Open API meta-schemas using Python's dataclasses and type annotations, then use Hypothesis' from_type strategy to generate instances of these dataclasses and transform them into basic Python types that match JSON semantics, resulting in valid Open API specs.

However, there are a few challenges that need to be addressed.

Configure data generation

It's important to ensure that the generated data adheres to the constraints specified by JSON Schema keywords like pattern or uniqueItems. Hypothesis provides mechanisms to configure data generation, and we can use Python's Annotated type to associate these constraints with our dataclass fields.

Here is an example of how to create custom types that wrap Hypothesis strategies to enforce specific constraints:

from typing import Annotated
from hypothesis import strategies as st

class Pattern:
    def __class_getitem__(cls, pattern: str) -> object:
        # Strings matching the given regex pattern
        strategy = st.from_regex(pattern)
        return Annotated[str, strategy]

class UniqueList:
    def __class_getitem__(cls, type_: type):
        # Lists with unique elements of the given type
        strategy = st.lists(st.from_type(type_), unique=True)
        return Annotated[list, strategy]

Annotated is used to add context-specific metadata to a type annotation.

These custom types can be used with dataclasses to generate lists of unique strings, each matching the provided pattern:

@dataclass
class MatchingItems:
    items: UniqueList[Pattern[r"\A[0-9]+abc\Z"]]

>>> st.from_type(MatchingItems).example()
MatchingItems(items=['7abc', '3915848abc', '0abc', '6abc'])

As far as I can see, this approach of attaching Hypothesis strategies to dataclass fields using Annotated could handle most of the JSON Schema keywords used in Open API meta-schemas.
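Other keywords map just as naturally onto strategy parameters. For instance, a hypothetical bounded-string type (not part of the library, just a sketch of the same pattern) could mirror the minLength and maxLength keywords:

```python
from typing import Annotated
from hypothesis import strategies as st

class BoundedString:
    # Hypothetical helper mirroring the JSON Schema
    # "minLength" / "maxLength" keywords.
    def __class_getitem__(cls, bounds: tuple) -> object:
        min_length, max_length = bounds
        strategy = st.text(min_size=min_length, max_size=max_length)
        return Annotated[str, strategy]
```

Hypothesis' from_type picks up a strategy stored in Annotated metadata, so fields annotated with BoundedString[1, 5] generate strings of one to five characters.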

Resolving References with Placeholder Classes

To generate valid Open API specs, all references within the spec must be resolvable. We can achieve this by deferring the reference generation step to a postprocessing phase using placeholder classes.

During the initial generation phase, we generate placeholder classes representing references instead of valid reference strings. These placeholders are transformed into valid references during postprocessing when all target objects are available.
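The placeholder itself can be as simple as a dataclass carrying enough context to build a JSON pointer later. This is a hypothetical sketch (the actual Reference class may look different), with the scope field naming the definitions bucket the reference should point into:

```python
from dataclasses import dataclass

@dataclass
class Reference:
    # Which top-level bucket this placeholder resolves into,
    # e.g. "definitions" or "responses" (hypothetical field).
    scope: str

    def resolve(self, name: str) -> str:
        # Turn the placeholder into a valid local JSON pointer.
        return f"#/{self.scope}/{name}"
```

For example, `Reference("definitions").resolve("Pet")` yields `"#/definitions/Pet"` once the target name is known.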

Here's an example implementation using a composite strategy in Hypothesis:

from dataclasses import is_dataclass
from hypothesis import strategies as st
from ._types import Reference

@st.composite
def asdict(draw: st.DrawFn, root: object) -> object:
    def _asdict(obj: object) -> object:
        if is_dataclass(obj):
            if isinstance(obj, Reference):
                items = root.get_items_for_reference(obj)
                item = draw(st.sampled_from(items))
                return {"$ref": item.as_reference()}
            result = {}
            # ... (serialization logic for other dataclass fields)
            return result
        # ... (serialization logic for other types)
    return _asdict(root)

The asdict function is a composite strategy that serializes the root schema object. When a Reference placeholder is encountered, it chooses a random target from the available items using sampled_from and converts it to a reference string. If no valid target is found, a new one can be created, ensuring the generation of valid Open API specs with resolvable references:

@dataclass
class Swagger:
    ...

def openapis(version):
    if version == "2.0":
        return st.from_type(Swagger).flatmap(asdict)
    # TODO: Add more spec versions!
    raise ValueError(f"Unsupported version: {version}")

Handling Invalid Field Names

Some field names in the Open API spec, like $ref, are not valid Python identifiers and cannot be used directly as dataclass fields. To handle this, we perform a postprocessing step to rename these fields appropriately.

Here's an example:

@st.composite
def asdict(draw: st.DrawFn, root: object) -> object:
    def _asdict(obj: object) -> object:
        if is_dataclass(obj):
            # ...
            result = {}
            # ... (serialization logic for other dataclass fields)
            if hasattr(obj, "map_fields"):
                # Call a custom function to rename fields properly     
                result = obj.map_fields(result)
            return result
        # ... (serialization logic for other types)
    return _asdict(root)

A dataclass representing a schema object can define a map_fields method to handle field renaming:

@dataclass
class Schema:
    ref: str

    def map_fields(self, data: dict) -> dict:
        if "ref" in data:
            data["$ref"] = data.pop("ref")
        return data

Other aspects of the Open API spec that cannot be directly expressed using Python types can be handled similarly in the postprocessing step, such as resolving inter-dependencies between parameter names and path template variables.
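For example, aligning generated path parameters with the variables in a path template could look like this hypothetical postprocessing helper (a sketch under the assumption that surplus generated parameters are simply dropped):

```python
import re

def align_path_parameters(path: str, parameters: list[dict]) -> list[dict]:
    # Rename generated "path" parameters so they match the template
    # variables; surplus path parameters are discarded. A full
    # implementation would also create parameters for unmatched variables.
    variables = re.findall(r"\{([^{}]+)\}", path)
    path_params = [p for p in parameters if p.get("in") == "path"]
    other = [p for p in parameters if p.get("in") != "path"]
    aligned = [
        {**param, "name": name}
        for param, name in zip(path_params, variables)
    ]
    return other + aligned

params = [{"in": "path", "name": "generated"}, {"in": "query", "name": "q"}]
```

Here `align_path_parameters("/pets/{petId}", params)` renames the generated path parameter to "petId" while leaving the query parameter untouched.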

Conclusion

The ideas presented in this article are just the beginning. I've created a proof-of-concept implementation and published it as hypothesis-openapi, which currently supports a small subset of Open API 2.0 and has no dependencies besides Hypothesis. Currently, I am working on adding Open API 3.x support and integrating it into Schemathesis.

If you build tools that work with Open API specs, I invite you to contribute to this library. Your contributions, whether adding support for more features, improving strategies, or enhancing documentation, are highly appreciated. Let's work together to make hypothesis-openapi a powerful and comprehensive tool for the Open API ecosystem.

Please check out the hypothesis-openapi repository, try it out, and share your thoughts and ideas. I look forward to collaborating with you!

Best
Dmitry