
Conversation

@jpbrodrick89
Contributor

@jpbrodrick89 jpbrodrick89 commented Dec 4, 2025

Relevant issue or PR

Description of changes

This PR just migrates the endpoint inputs used in test_examples.py into JSON files and otherwise leaves the tests untouched. They are all stored in a test_cases folder inside the tesseract_api folder. The naming convention is {descriptor:camelCase}_{endpoint}_input.json, except for the apply endpoint, where the convention is just {descriptor:camelCase}_inputs.json.
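
For example (hypothetical descriptor names and folder layout), a test_cases folder might look like:

  examples/vectoradd/test_cases/
    basicVectors_inputs.json            # apply endpoint (note: *_inputs.json)
    basicVectors_jacobian_input.json    # jacobian endpoint
    badShape_inputs.json                # apply input expected to fail validation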

Follow-up work will come in a few stages (@dionhaefner lmk if you'd prefer this to be one PR or a sequence of PRs):

  1. Feed all tests their respective inputs, store the outputs in either a specific format (i.e. json, json+base64, or json+binref) or in all formats, and then test against them. The tests then effectively become regression tests (we should consider whether that is indeed what we want, and also whether pytest-regressions already possesses the functionality we require).
  2. Refactor so that we don't need to manually associate endpoints with input files/Tesseracts. Careful thought is required on how to handle status codes: either the status code is inferred directly from the descriptor, i.e. bad... -> 422 (my preference), or we also have files containing the expected status code (which seems a crazy idea).
  3. Introduce a public tesseract test/regress/test-regress CLI command to allow users to leverage such functionality.

Optional (could come at any stage): Introduce a tesseract testgen command to allow users to automate step 2 above.

Critical concern

The default base64 array encoding is not human-readable and may make tests hard to reason about without additional diagnostic effort. Potential solutions: always use the list format in CI tests (less exhaustive than currently); keep three json files, one for each encoding (but then we can't guarantee they stay in sync); or store lists and generate the other json+... encodings on the fly (probably the best option if we think human readability is crucial).
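
To make the readability concern concrete, here is a minimal sketch (array values hypothetical) of what the same data looks like in the two encodings:

  import base64

  import numpy as np

  arr = np.array([1.0, 2.0, 3.0])  # hypothetical example array

  # "json" encoding stores the values as a plain, reviewable list:
  print(arr.tolist())  # [1.0, 2.0, 3.0]

  # "json+base64" stores the raw buffer, which is opaque in a diff:
  print(base64.b64encode(arr.tobytes()).decode())
  # 'AAAAAAAA8D8AAAAAAAAAQAAAAAAAAAhA' (little-endian float64)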

Testing done

  • CI passes without modifying expected results.

@codecov

codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.37%. Comparing base (12197b3) to head (51665a2).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #411      +/-   ##
==========================================
- Coverage   76.62%   76.37%   -0.26%     
==========================================
  Files          29       29              
  Lines        3359     3420      +61     
  Branches      525      533       +8     
==========================================
+ Hits         2574     2612      +38     
- Misses        553      576      +23     
  Partials      232      232              


@jpbrodrick89 jpbrodrick89 changed the title from "refactor[test]: move example test inputs to json files" to "test: move example test inputs to json files" on Dec 4, 2025
@pasteurlabs pasteurlabs locked and limited conversation to collaborators Dec 4, 2025
@jpbrodrick89 jpbrodrick89 reopened this Dec 5, 2025
Contributor Author


Should we call these empty_input.json or handle them in a different way (probably partially defeats the point of the refactor but is more explicit)?

@dionhaefner
Contributor

Thanks @jpbrodrick89. This looks like a sensible first step.

Wrt the questions you raise, I wonder if a "test case" structure like this would make sense:

{
  "inputs": { }
  "expected_result": {
     "status_code": 400,
     "outputs": { }
  },
  "endpoint": "apply"
}

Unfortunately you can't use file aliasing with this, but almost: (untested)

$ tesseract run mytess apply "$(jq '.inputs' testcase.json)"

... and perhaps to be augmented with a dedicated tesseract test command later ...

What do people think?

@jpbrodrick89
Contributor Author

jpbrodrick89 commented Dec 9, 2025

Good idea, @dionhaefner. I'm in two minds, but I think this is probably the right way to go. There are challenges we need to think carefully about, but these probably apply regardless of the json structure.

Advantages of single json

  • Better organised (everything in one file)
  • More maintainable
  • Extensible (we could add additional fields such as assertions, or reference a mixture of output formats).

Disadvantages of single json

  • Harder/less intuitive to generate regression tests: e.g. tesseract test apply -o example_outputs.json @example_inputs.json needs to be followed up by a jq -n --slurpfile invocation that splices the inputs and outputs into a single test case file (see the formatted sketch after this list) -> this means introducing tesseract gentest essentially becomes mandatory.
  • Not consistent with the approach in pytest-regressions, where all output files are separate from their inputs.
  • How do we treat check-gradients? Just have a status code and leave the outputs field blank?
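
For readability, the command chain from the first bullet above, written out as a shell sketch (command and file names as hypothesised there):

  $ tesseract test apply -o example_outputs.json @example_inputs.json
  $ jq -n \
      --slurpfile in example_inputs.json \
      --slurpfile out example_outputs.json \
      '{inputs: $in[0], expected_result: {status_code: 200, outputs: $out[0]}, endpoint: "apply"}' \
      > test_case.json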

General challenges of both approaches

What exactly are we trying to achieve here? Just regression testing or a framework for more general testing of tesseract run behaviour?

Do we want to support "fuzzy tests", e.g. allow for slight floating point differences? pytest-regressions mainly handles this by rounding for data_regression, or with more customisable atol/rtol for purely numeric outputs (e.g. ndarray_regression, which would likely not work for us). We could arguably use data_regression straight out of the box to some degree if this suits our purposes.
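
For illustration, this is roughly what out-of-the-box data_regression usage would look like (output values hypothetical; round_digits is, as far as I can tell, the rounding knob referred to above):

  def test_apply_regression(data_regression):
      # In practice these outputs would come from running the example
      # Tesseract's apply endpoint on the stored inputs (values hypothetical).
      outputs = {"result": [5.0, 7.0, 9.0]}
      # Rounding floats gives the coarse "fuzzy" comparison discussed above;
      # the expected file is (re)generated by running pytest with --force-regen.
      data_regression.check(outputs, round_digits=6)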

Do we want to check against all output formats, always do one or make this customisable?

Will we deprecate our "contains" tests in favour of the new full regressions?

Do we want to reuse the apply inputs automatically in the gradient endpoints, extending them with the AD fields?
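
For example (hypothetical values, and assuming the jac_inputs/jac_outputs fields of the jacobian endpoint), a jacobian test case could embed the apply inputs verbatim and add only the AD fields:

  {
    "inputs": {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    "jac_inputs": ["a"],
    "jac_outputs": ["result"]
  }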

@dionhaefner
Contributor

I care less about the Tesseract Core test suite than about providing a general way to test user-built Tesseracts. I'd expect those to live in repos with Makefiles and build scripts, not necessarily full-fledged Python packages covered by pytest. So a simple command line tool along the lines of tesseract test mytestcase.json seems more useful than adhering to a framework like pytest-regressions. Does that make sense?

@jpbrodrick89
Contributor Author

jpbrodrick89 commented Dec 15, 2025

Yes, it makes sense and should be easy to implement with just the most basic functionality and limited feedback (tests fail or pass; if they fail, just provide a raw diff). My main concern is that if we want to go beyond basic functionality in the future, we may have to reinvent the wheel. I did a quick survey with Claude of alternative regression testing frameworks. Claude seemed to be really fond of Venom, but when I got it to think harder about a DIY approach we realised that the DIY approach (leveraging DeepDiff) would probably be simpler in the end (Venom would require us to use yaml instead of json, so it is somewhat more invasive and adds an extra dependency). Hurl was probably the runner-up, but it only supports testing HTTP requests and not CLI commands.

What other functionality is a must-have? In my mind the top three are

  • fuzzy comparisons of numerical values with custom tolerances
  • easily generate full test files when outputs are missing
  • intuitive human-readable feedback on how and why tests fail

pytest-regressions covers all of the above (although fuzzy comparison is limited to rounded values when data is stored in text/json format). Note that, as tesseract-core is a Python package after all, there is nothing preventing us from adding pytest as a requirement under the hood and only accessing it through our tesseract test command.
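
For instance, a minimal sketch of what "pytest under the hood" could look like (command wiring and function name hypothetical):

  import sys

  import pytest  # would become a (possibly optional) dependency of tesseract-core


  def tesseract_test_command(test_dir: str) -> int:
      """Hypothetical `tesseract test` entry point that defers to pytest.

      Users never interact with pytest directly; we just point it at the
      folder of generated test modules/specs and return its exit code.
      """
      return pytest.main(["-q", test_dir])


  if __name__ == "__main__":
      sys.exit(tesseract_test_command(sys.argv[1]))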

If you are interested in understanding a possible implementation, here is Claude's raw, unedited, unrefined basic plan for a DIY approach.

Architecture:

.test.json (spec) → tesseract test CLI → Python script → DeepDiff

Test Spec Format (same):

  {
    "inputs": {"a": [1.0, 2.0, 3.0]},
    "endpoint": "apply",
    "expected_result": {
      "status_code": 200,
      "outputs": {"result": [5.0, 7.0, 9.0]},
      "tolerance": 0.01
    }
  }

Python Test Runner (scripts/tesseract_test_runner.py):

  #!/usr/bin/env python3
  """Standalone test runner for Tesseract test specs."""

  import json
  import subprocess
  import sys
  from pathlib import Path
  from deepdiff import DeepDiff
  import base64
  import numpy as np

  def decode_tesseract_value(value):
      """Decode various Tesseract formats to Python objects."""
      if isinstance(value, dict):
          # Handle base64-encoded arrays
          if value.get("object_type") == "array":
              buffer = base64.b64decode(value["data"]["buffer"])
              dtype = np.dtype(value["dtype"])
              return np.frombuffer(buffer, dtype=dtype).reshape(value["shape"])
          # Recursively decode nested dicts
          return {k: decode_tesseract_value(v) for k, v in value.items()}
      elif isinstance(value, list):
          return [decode_tesseract_value(v) for v in value]
      return value

  def run_test(test_spec_file: Path) -> bool:
      """Run a single test and compare output."""
      with open(test_spec_file) as f:
          spec = json.load(f)

      # Check if expected output exists
      if "expected_result" not in spec or "outputs" not in spec["expected_result"]:
          print(f"SKIP {test_spec_file.name}: No expected output (run gentest first)")
          return None

      # Run Tesseract
      endpoint = spec.get("endpoint", "apply")
      inputs_json = json.dumps(spec["inputs"])

      result = subprocess.run(
          ["tesseract", endpoint],
          input=inputs_json,
          capture_output=True,
          text=True
      )

      # Check status code
      expected_status = spec["expected_result"].get("status_code", 0)
      if result.returncode != expected_status:
          print(f"FAIL {test_spec_file.name}: Status code {result.returncode} != {expected_status}")
          return False

      # Parse actual output
      try:
          actual = json.loads(result.stdout)
      except json.JSONDecodeError:
          print(f"FAIL {test_spec_file.name}: Invalid JSON output")
          return False

      # Decode both expected and actual
      expected = decode_tesseract_value(spec["expected_result"]["outputs"])
      actual = decode_tesseract_value(actual)

      # Compare with tolerance
      tolerance = spec["expected_result"].get("tolerance", 0)

      if tolerance > 0:
          # Use significant_digits for fuzzy comparison
          significant_digits = int(-np.log10(tolerance))
          diff = DeepDiff(
              expected,
              actual,
              significant_digits=significant_digits,
              ignore_order=False,
              ignore_type_in_groups=[(int, float)]  # Treat int/float as compatible
          )
      else:
          # Exact comparison
          diff = DeepDiff(expected, actual)

      if diff:
          print(f"FAIL {test_spec_file.name}:")
          print(f"  {diff}")
          return False

      print(f"PASS {test_spec_file.name}")
      return True

  def main():
      import argparse
      parser = argparse.ArgumentParser()
      parser.add_argument("paths", nargs="+", help="Test files or directories")
      parser.add_argument("--pattern", default="*.test.json")
      args = parser.parse_args()

      # Collect test files
      test_files = []
      for path_str in args.paths:
          path = Path(path_str)
          if path.is_file():
              test_files.append(path)
          elif path.is_dir():
              test_files.extend(path.glob(args.pattern))

      # Run tests
      results = {"pass": 0, "fail": 0, "skip": 0}
      for test_file in sorted(test_files):
          result = run_test(test_file)
          if result is True:
              results["pass"] += 1
          elif result is False:
              results["fail"] += 1
          else:
              results["skip"] += 1

      # Summary
      print(f"\n{results['pass']} passed, {results['fail']} failed, {results['skip']} skipped")
      sys.exit(0 if results["fail"] == 0 else 1)

  if __name__ == "__main__":
      main()

Test Generation Script (scripts/tesseract_gentest.py):

  #!/usr/bin/env python3
  """Generate expected outputs for test specs."""

  import json
  import subprocess
  import sys
  from pathlib import Path

  def gentest(test_spec_file: Path):
      """Generate expected output for a test spec."""
      with open(test_spec_file) as f:
          spec = json.load(f)

      if "expected_result" in spec and "outputs" in spec["expected_result"]:
          print(f"SKIP {test_spec_file.name}: Already has expected output")
          return

      # Run Tesseract
      endpoint = spec.get("endpoint", "apply")
      inputs_json = json.dumps(spec["inputs"])

      result = subprocess.run(
          ["tesseract", endpoint],
          input=inputs_json,
          capture_output=True,
          text=True
      )

      if result.returncode != 0:
          print(f"ERROR {test_spec_file.name}: Tesseract failed")
          print(result.stderr)
          sys.exit(1)

      # Parse output
      output = json.loads(result.stdout)

      # Update spec with expected output
      if "expected_result" not in spec:
          spec["expected_result"] = {}
      spec["expected_result"]["status_code"] = 0
      spec["expected_result"]["outputs"] = output

      # Write back
      with open(test_spec_file, "w") as f:
          json.dump(spec, f, indent=2)

      print(f"GENERATED {test_spec_file.name}")

  def main():
      import argparse
      parser = argparse.ArgumentParser()
      parser.add_argument("paths", nargs="+", help="Test files or directories")
      parser.add_argument("--pattern", default="*.test.json")
      args = parser.parse_args()

      # Collect test files
      test_files = []
      for path_str in args.paths:
          path = Path(path_str)
          if path.is_file():
              test_files.append(path)
          elif path.is_dir():
              test_files.extend(path.glob(args.pattern))

      for test_file in sorted(test_files):
          gentest(test_file)

  if __name__ == "__main__":
      main()

Pros:

  • ✅ Simple, single-layer system (JSON → Python → DeepDiff)
  • ✅ DeepDiff handles all format complexity (arrays, nested structures)
  • ✅ DeepDiff has built-in numerical tolerance via significant_digits
  • ✅ No YAML conversion needed
  • ✅ Easy to customize/extend (it's your code)
  • ✅ Can integrate directly into tesseract CLI
  • ✅ Only ~150 lines of Python total

Cons:

  • ❌ Requires Python in environment (but you already have it for Tesseract)
  • ❌ Requires deepdiff package (extra dependency)
  • ❌ No parallel execution out of the box (but easy to add)
  • ❌ No fancy CI output formats (but can add JUnit XML)

Workflow:

1. User creates input-only test (same)

  cat > test_cases/new_test.test.json <<EOF
  {
    "inputs": {"a": [1, 2, 3]},
    "endpoint": "apply",
    "tolerance": 0.01
  }
  EOF

2. Generate expected output

python3 scripts/tesseract_gentest.py test_cases/new_test.test.json

→ Runs tesseract, updates JSON file with expected_result

3. Run tests

python3 scripts/tesseract_test_runner.py examples/vectoradd/test_cases/

→ Direct execution, no intermediate steps

@dionhaefner
Contributor

Quick update: We agreed to go for an implementation in tesseract-runtime next to check-gradients, re-using some of the same deep diff logic already present there (no new deps).

@jpbrodrick89 jpbrodrick89 force-pushed the jpb/refactor-test-inputschemas branch from f31d8bb to 1e20b2b on December 17, 2025