Skip to content

Snapshot record then reply testing#740

Open
mateusz834 wants to merge 1 commit intodevelopfrom
dev/MP/snapshots
Open

Snapshot record then reply testing#740
mateusz834 wants to merge 1 commit intodevelopfrom
dev/MP/snapshots

Conversation

@mateusz834
Copy link
Copy Markdown
Member

@mateusz834 mateusz834 commented Apr 23, 2026

This change implements snapshot, record then reply testing for all integration tests.

Re-enables recently skipped (falky) tests:

  • test_agent_understands_other_agents (snapshot was edited manually)
  • test_supervisor_resumes_subagent_thread_across_invocations
  • test_supervisor_resumes_subagent_thread_across_invocations_structured

Introduces a deterministic thread_id mock generator, such that snapshots are deterministic, for:

  • test_supervisor_resumes_subagent_thread_across_invocations
  • test_supervisor_resumes_subagent_thread_across_invocations_structured

Modified test_tool_execution_service_access using tool middleware, to make the test deterministic.

E2E tests still call the LLMs directly.

@mateusz834 mateusz834 force-pushed the dev/MP/snapshots branch 16 times, most recently from 1198d47 to d029c1d Compare April 24, 2026 06:45
@mateusz834 mateusz834 marked this pull request as ready for review April 24, 2026 06:45
This change implements snapshot, record then reply testing
for all integration tests.

Re-enables recently skipped (falky) tests:

- test_agent_understands_other_agents (snapshot was edited manually)
- test_supervisor_resumes_subagent_thread_across_invocations
- test_supervisor_resumes_subagent_thread_across_invocations_structured

Introduces a deterministic thread_id mock generator, such that
snapshots are deterministic, for:

- test_supervisor_resumes_subagent_thread_across_invocations
- test_supervisor_resumes_subagent_thread_across_invocations_structured

Modified test_tool_execution_service_access using tool middleware, to make
the test deterministic.
@mateusz834 mateusz834 requested review from Ickerday and szykol and removed request for szykol April 24, 2026 07:16
Comment thread tests/ai_testlib.py
def _before_record_request(request: Request) -> Request | None:
url = parse.urlparse(request.uri) # pyright: ignore[reportUnknownArgumentType, reportUnknownVariableType]
if url.hostname == internal_ai_hostname:
request.headers = {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we keep some specific headers for like Authorization and keep comparing them?

OPENAI_API_KEY = "ollama"


class TestAgent(AITestCase):
Copy link
Copy Markdown

@splunk-dtaborski splunk-dtaborski Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if correct assumption but i think most of the time will be decorating all tests in the file. Maybe we could add one more decoration on class level that would add function decorator to each test automatically.

High-level idea:
instead of

class TestIntegration:
  @ai_snapshot_test()
  def func1():
    ...
  @ai_snapshot_test()
  def func2():
    ...

Do

@snapshot_test()
class TestIntegration:
    def func1():
      ...

    def func2():
      ...

I dont say it has to be specifically in this PR however i wanted to point it out.

Copy link
Copy Markdown
Member Author

@mateusz834 mateusz834 Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a nice idea, but i feel that the complexity of it might be huge, i.e. what if we want to have some test that is not snapshoted, then we need some kind of exclude decorator.
I also see us running these tests with multiple snapshots, say from openai, gemini, anthropic (as subtests, created by this decorator) and there might be a need to say skip specific providers from specific tests or such. For example in test_structured_output.pl for models that do not support provider strategy.

CC @szykol @Ickerday for opinions.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we wanted to achieve something like this we could just add a fixture instead.
The fixtures can be autoused, so that all files in the scope of the fixture behave the same.

But I agree with Mateusz I don't see an issue with having those tests explicitly say they are snapshot. Also turning of the snapshotting of a specific test is a matter of removing the decorator, no further work needed.

Comment thread tests/ai_testlib.py
return model


def ai_snapshot_test() -> Callable[
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] the name suggests it can be used only for AI. What do you think about record_snapshot name?

Copy link
Copy Markdown
Member Author

@mateusz834 mateusz834 Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name suggests it can be used only for AI.

We do filtering here, so i think it only works for AI 😄 .

Copy link
Copy Markdown

@splunk-dtaborski splunk-dtaborski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the solution.

[
HumanMessage(
content="hi, my name is Chris. Generate a nickname for me",
content="i, my name is Chris. Generate a nickname for me",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a typo by mistake

os.path.join(os.path.dirname(__file__), "testdata", "tool_context.py"),
)
@patch("splunklib.ai.agent._testing_app_id", "app_id")
@ai_snapshot_test()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is calling the decorator needed?

@ai_snapshot_test looks a little easier to the eye

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this requires changing the code in the decorator

Comment thread tests/ai_testlib.py
).hostname
assert internal_ai_hostname is not None

REDACTED_APP_KEY = "[[[--APPKEY-REDACTED-]]]"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be moved to the top of file

Comment thread tests/ai_testlib.py
assert settings.internal_ai.client_id.lower() not in out.lower()
assert settings.internal_ai.client_secret.lower() not in out.lower()

return out
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if we should look for the other secrets for our testing LLM before serializing - to make sure we do not leak anything by mistake when creating new snapshots.

or add some ci stage for that.

WDYT?

Comment thread tests/ai_testlib.py
before_record_request=_before_record_request,
before_record_response=_before_record_response,
# record_on_exception=False,
# drop_unused_requests=True,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be removed if not needed

Comment thread tests/ai_testlib.py
return decorator


def deterministic_thread_ids() -> Callable[
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's cool, we don't need to change the test body to accommodate that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants