This repository contains an extensible benchmark framework to support evaluation of LLM-based agents for Model-Based Systems Engineering (MBSE).
The benchmark contains models from the following two domains, as well as a partially LLM-generated set of questions and reference answers for them. The models are transformed to RDF using existing semantic mappings.
- A large-scale system engineering model, the Thirty Meter Telescope (TMT) SysML model,
- Business process models (BPMN) from the Signavio Academic Models.
The benchmark framework compares the performance of LLM modeling agents using several state-of-the-art text- and model-based metrics from the literature and generates several diagrams to visualize the results. We demonstrate the usage of the benchmark framework by evaluating our LLM agents, which use an ontological knowledge base and SPARQL queries to assist modeling in safety- and business-critical domains.
We tested the environment on Ubuntu 24.04 LTS.
Minimum requirements:
- Node.js: >= 18
- npm: >=9
Check your local versions:
```
node -v
npm -v
```
If node or npm is missing, install them with:
```
sudo apt install nodejs npm
```
Preferred install command (execute in repo root):
```
npm ci
```
The BERTScore metric requires the Xenova/bert-base-cased model (~100-110 MB).
- agent_evaluation/:
  - download_embedding_model.ts: script that downloads the embedding model (Xenova/bert-base-cased) used for embedding in the BERTScore metric
  - hf_env.ts: keeps the embedding model cached in .cache/transformers/
Download the model:
```
npx tsx agent_evaluation/download_embedding_model.ts
```
Note: the model is cached in .cache/transformers/ (repo-local, configured via agent_evaluation/hf_env.ts).
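For reference, a minimal sketch of what the download step amounts to, assuming the standard @huggingface/transformers caching API; the repository's hf_env.ts and download_embedding_model.ts may differ in detail:
```ts
// Illustrative sketch: download and cache the embedding model inside the repository.
import { env, pipeline } from "@huggingface/transformers";

// Keep the model cache repo-local instead of in the user-level cache.
env.cacheDir = ".cache/transformers";

async function main() {
  // Building the feature-extraction pipeline pulls and caches the model weights.
  const extractor = await pipeline("feature-extraction", "Xenova/bert-base-cased");
  const embedding = await extractor("warm-up sentence", { pooling: "mean", normalize: true });
  console.log("Model ready, embedding dims:", embedding.dims);
}

main().catch(console.error);
```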
For the documented Ubuntu setup, we install the required Python packages with the system package manager; hence the sudo in the system-level package installation commands.
Pandas and Matplotlib are used to create the plots.
```
sudo apt install python3-pandas python3-matplotlib
```
Direct dependencies (declared in package.json, resolved by package-lock.json):
- @huggingface/transformers: ^3.8.1,
- @langchain/anthropic: ^0.3.26,
- @langchain/core: ^0.3.73,
- @langchain/langgraph: ^0.4.9,
- @langchain/openai: ^0.6.11,
- @langchain/tavily: ^0.1.5,
- dotenv: ^16.6.1,
- openai: ^6.7.0,
- tsx: ^4.0.0,
- typescript: ^5.9.2
.env (put in repo root):
Create a .env file with:
```
OPENAI_API_KEY=sk-or-....
OPENAI_BASE_URL=https://openrouter.ai/api/v1
# --- LLM Judge Configuration ---
LLM_JUDGE_MODEL=anthropic/claude-opus-4.6
LLM_JUDGE_TEMPERATURE=0.2
# --- Keyword Counter Agent Configuration ---
KEYWORD_COUNTER_AGENT_MODEL=anthropic/claude-opus-4.6
KEYWORD_COUNTER_AGENT_TEMPERATURE=0
```
The BPMN benchmark uses an aggregated SAP-SAM TTL file.
Manual prerequisites:
- Clone the BPMN input models repository into input-models/:
```
pushd input-models/
git clone --branch DLT4BPM https://github.com/fstiehle/bpmn-sol-llm-benchmark.git
popd
```
- Download the SAP-SAM Zenodo archive manually from https://zenodo.org/records/7012043 and unpack it into input-models/sap_sam_2022.
The following step requires docker and rdflib.
```
sudo apt install python3-rdflib
```
If Docker is missing on Ubuntu, install Docker Engine first:
```
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
cat <<EOF | sudo tee /etc/apt/sources.list.d/docker.sources > /dev/null
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
```
Then run:
```
sudo bash converters/bpmn-to-kg/build_sap_sam_aggregated_ttl.sh
```
This generates:
- model_databases/SAP-SAM/aggregated_model.ttl: the final aggregated .ttl file,
- model_databases/SAP-SAM/aggregated_model_skipped_models.csv: the files that were not aggregated,
- model_databases/SAP-SAM/aggregated_model_matched_models.csv: the files that were aggregated.
We use Apache Jena Fuseki 5.5.0 for this setup.
Run the following commands in a separate terminal session, not in the repo terminal.
```
sudo apt install openjdk-17-jre-headless unzip wget curl tmux
mkdir -p ~/tools
mkdir -p ~/fuseki-base
cd ~/tools
wget https://archive.apache.org/dist/jena/binaries/apache-jena-fuseki-5.5.0.zip
echo "0306f2c9e81fe17fc42beb66bdec9479cd9078ea6b9e3c5a7a78dce44ff904b538aaa2a4619e8174d8917cfb0e9b1c094dba066046019999f2ce10478d1b9a71 *apache-jena-fuseki-5.5.0.zip" \
  | sha512sum -c || echo "CHECKSUM ERROR"
unzip apache-jena-fuseki-5.5.0.zip
tmux new -s fuseki
cd ~/tools/apache-jena-fuseki-5.5.0
export FUSEKI_BASE=~/fuseki-base
./fuseki-server
```
If Apache Jena Fuseki 5.5.0 has already been downloaded and extracted, you only need the following commands to start it:
```
tmux new -s fuseki
cd ~/tools/apache-jena-fuseki-5.5.0
export FUSEKI_BASE=~/fuseki-base
./fuseki-server
```
- model_databases/: the model graphs in TTL format and related license files
  - TMT/:
    - COPYRIGHT, LICENSE: copyright and license of the TMT model,
    - tmt.ttl.zip: the zipped version of the TMT model in TTL format.
  - SAP-SAM/:
    - aggregated_model_skipped_models.csv,
    - aggregated_model_matched_models.csv,
    - aggregated_model.ttl.
- converters/bpmn-to-kg/: semantic mapping from BPMN process models to a knowledge graph
Note: We use Apache Jena Fuseki 5.5.0
- Open the UI: http://localhost:3030/
- Manage datasets → Add new dataset
- Create dataset sap-sam-export_aggregated and dataset sysml
  - choose TDB2 (persistent) if you want the data to survive server restarts
- For each dataset, open its page → Add data → upload the matching .ttl file:
  - sap-sam-export_aggregated ← aggregated_model.ttl
  - sysml ← tmt.ttl
- Press "upload now"
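Alternatively, the data can be uploaded from the command line. The following is an illustrative sketch (not a repo script) assuming Fuseki's standard Graph Store Protocol endpoint at /<dataset>/data and that the datasets above already exist; tmt.ttl must be unzipped from model_databases/TMT/tmt.ttl.zip first.
```ts
// Illustrative command-line alternative to the Fuseki UI upload.
import { readFile } from "node:fs/promises";

async function uploadTurtle(datasetUrl: string, ttlPath: string) {
  const body = await readFile(ttlPath, "utf8");
  const res = await fetch(`${datasetUrl}/data?default`, {
    method: "POST",
    headers: { "Content-Type": "text/turtle" },
    body,
  });
  console.log(ttlPath, "→", res.status, res.statusText);
}

async function main() {
  await uploadTurtle("http://localhost:3030/sap-sam-export_aggregated", "model_databases/SAP-SAM/aggregated_model.ttl");
  // Unzip model_databases/TMT/tmt.ttl.zip before running this step.
  await uploadTurtle("http://localhost:3030/sysml", "model_databases/TMT/tmt.ttl");
}

main().catch(console.error);
```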
- fuseki_endpoints/:
  - sysml_endpoint.txt: the SysML representation of the Thirty Meter Telescope (also referred to as TMT) project
  - bpmn_endpoint.txt: aggregated model of the SAP Signavio Academic Models
- Format: the first non-empty (non-comment) line is the endpoint URL (e.g. http://localhost:3030/sysml/sparql)
Fill these so the benchmark can query the right dataset:
- fuseki_endpoints/bpmn_endpoint.txt → http://localhost:3030/sap-sam-export_aggregated/sparql
- fuseki_endpoints/sysml_endpoint.txt → http://localhost:3030/sysml/sparql
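To check that an endpoint is reachable before running the agents, a minimal smoke test can be used; this is a sketch based on the standard SPARQL HTTP protocol, not a repo script:
```ts
// Illustrative SPARQL smoke test against a configured endpoint.
const endpoint = "http://localhost:3030/sysml/sparql"; // or read it from fuseki_endpoints/sysml_endpoint.txt
const query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";

async function main() {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/sparql-query",
      Accept: "application/sparql-results+json",
    },
    body: query,
  });
  const json = await res.json();
  console.log("Triple count:", json.results.bindings[0].triples.value);
}

main().catch(console.error);
```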
- system_prompts/:
  - BPMN_SystemPrompt.ts: system prompt for the BPMN assistant
  - SysML_SystemPrompt.ts: current system prompt for the SysML assistant
  - SysML_SystemPromptBeforeExploration.ts: previous system prompt for the SysML assistant
- questions/:
  - sysml_questions.txt: test questions for the SysML assistant (format: TSV, q001<TAB>Question text)
  - bpmn_questions.txt: test questions for the BPMN assistant (format: TSV, q001<TAB>Question text)
  - question_augmentation/: question augmentation-related files; sap-sam_aggregated/ and tmt/ both have the same structure:
    - augmentation_process.md: the process of augmentation
    - augmentation_prompt.md: the prompt used for augmentation
    - model_content/: the model element attributes and types needed for the augmentation
    - expert_questions/: the expert questions (for BPMN, split into 2 categories: model and organization)
    - augmented_questions/: the questions produced by the augmentation (for BPMN, split into 2 categories: model and organization)
  - mutant_questions/: the mutated versions of the SysML and BPMN question sets
  - question_categories/: for BPMN and SysML, a JSON with the IDs of the questions grouped into categories
- agent_workflow/:
  - agent_interface.ts: the interface any agent we want to run and evaluate has to implement
  - agent.ts: the logic of our agents, implemented in LangGraph
  - agent_graph.mmd: the visual representation of the agent's graph
  - run_functions.ts: the functions needed for running the agent
  - run_questions.ts: runs all questions for all listed LLM models and writes one JSON per question
- tools/: the implementation of the tools our agents use
  - EndpointSparqlTool.ts: lets the agent run SPARQL queries against a given endpoint (a sketch of such a tool follows this list)
  - FinalAnswerTools.ts: when the agent concludes it has enough information to form a final answer, it calls this tool to end the run and log the final answer
  - DecisionLogTool.ts: the agent logs its decisions to either form a refined query or to give a final answer
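As an illustration of how a SPARQL tool can be wired into a LangChain-based agent, here is a minimal sketch built on DynamicTool from @langchain/core/tools; the tool name and error handling are illustrative, and the repository's EndpointSparqlTool.ts may be implemented differently:
```ts
// Illustrative SPARQL tool (not the repository's actual EndpointSparqlTool implementation).
import { DynamicTool } from "@langchain/core/tools";

export function makeSparqlTool(endpointUrl: string) {
  return new DynamicTool({
    name: "run_sparql_query",
    description: "Run a SPARQL query against the model knowledge graph and return the JSON results.",
    func: async (query: string) => {
      const res = await fetch(endpointUrl, {
        method: "POST",
        headers: {
          "Content-Type": "application/sparql-query",
          Accept: "application/sparql-results+json",
        },
        body: query,
      });
      // The agent receives the raw SPARQL result JSON (or the error text) as a string.
      return await res.text();
    },
  });
}
```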
Before running, go through all steps of environment and Fuseki setup (detailed in point 1).
Run all SysML questions:
```
npx tsx agent_workflow/run_questions.ts questions/sysml_questions.txt fuseki_endpoints/sysml_endpoint.txt system_prompts/SysML_SystemPrompt.ts
```
Run all mutated SysML questions:
```
npx tsx agent_workflow/run_questions.ts questions/mutant_questions/sysml_mutant_questions.txt fuseki_endpoints/sysml_endpoint.txt system_prompts/SysML_SystemPrompt.ts
```
Run all BPMN questions:
```
npx tsx agent_workflow/run_questions.ts questions/bpmn_questions.txt fuseki_endpoints/bpmn_endpoint.txt system_prompts/BPMN_SystemPrompt.ts
```
Run all mutated BPMN questions:
```
npx tsx agent_workflow/run_questions.ts questions/mutant_questions/bpmn_mutant_questions.txt fuseki_endpoints/bpmn_endpoint.txt system_prompts/BPMN_SystemPrompt.ts
```
By default, the agent runs on all questions with the following LLM models:
- anthropic/claude-opus-4.6,
- google/gemini-3.1-flash-lite-preview,
- qwen/qwen3.5-plus-02-15,
- openai/gpt-5.4-nano,
- minimax/minimax-m2.5
Other models can be tried by giving their ChatOpenAI conventional name.
LLM models can be compared here: https://openrouter.ai/compare/
Optionally, a ChatOpenAI conventional name of an LLM model can be given as parameter, so the agent will only run with that model:
```
npx tsx agent_workflow/run_questions.ts <questions.txt> <endpoint.txt> <systemPromptFile> [modelName]
```
- automated_test_results/<datasetKind>/runs/<runId>/: (datasetKind is sysml or sap-sam-export_aggregated in our case)
  - questions/: a JSON run log of all questions
  - run.meta.json: meta-level information about the run (LLM model, question set, tokens, etc.)
- expected_answers/claude-4.6-opus_runs/: we used our assistants with Claude Opus 4.6 as the basis for our expected answers, and we store its run results here
  - sysml_answers.txt: expected answers for the SysML questions (also stored in a q001<TAB>Expected answer TSV format)
  - bpmn_answers.txt: expected answers for the BPMN questions (also stored in a q001<TAB>Expected answer TSV format)
  - keywords/: the keywords of the expected answers for both assistants
  - model_elements/: the relevant model element URIs for the questions of both models
- evaluation_metrics/:
  - metric_types.ts: all the implemented metrics have to use this interface
  - index.ts: the collection of all the metrics; the ones commented out won't be run
  - bert_score.ts: BERTScore-based metric
  - chr_f.ts: chrF metric
  - cosine_similarity.ts: cosine similarity using sentence-level embeddings (see the sketch after this list)
  - keyword_counter_agent.ts: keyword count using a specific agent
  - keyword_count_exact_match.ts: keyword count using sets
  - llm_judge_majority.ts: 3 pass/fail judge calls, majority vote
  - llm_judge_score_avg.ts: 3 judge scores from 1..5, then average
  - model_element_uri_metric.ts: metric measuring model element URI recall and precision
  - rouge_lf1.ts: ROUGE-L F1 metric
- agent_evaluation/:
  - replace_sid_ids_with_original_id.py: standardizes model IDs in the run results of the BPMN assistant
  - download_embedding_model.ts: script that downloads the embedding model (Xenova/bert-base-cased) used for embedding in the BERTScore metric
  - hf_env.ts: keeps the embedding model cached in .cache/transformers/
  - keyword_provider.ts: helps get keywords for the keyword-count-based metrics
  - model_element_provider.ts: helps get model element URIs for the model-element-URI-based metrics
  - evaluation_config.json: configuration JSON file connecting the question sets, endpoints, dataset kinds, expected answers and model element ID files
  - evaluation_resolver.ts: resolves the full evaluation context from an existing run.meta.json and the external evaluation_config.json
  - evaluation_core.ts: the core logic of evaluating a run of the agent with the metrics
  - evaluate_run.ts: evaluates a single run of the agent with the metrics
  - evaluate_all.ts: evaluates all runs of the agent on a selected dataset kind (SysML or BPMN in our case) with the metrics
  - aggregate_results.ts: aggregates the results into a CSV summary file
  - create_result_plots.py: creates plots based on the content of the summary
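For reference, the cosine similarity underlying the embedding-based metric boils down to the following computation; this is a generic sketch over two embedding vectors, not the code in evaluation_metrics/cosine_similarity.ts:
```ts
// Cosine similarity between two embedding vectors of equal length
// (1 = same direction, 0 = orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. const score = cosineSimilarity(agentAnswerEmbedding, expectedAnswerEmbedding);
```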
Before evaluating BPMN runs, always run replace_sid_ids_with_original_id.py:
```
python3 agent_evaluation/replace_sid_ids_with_original_id.py
```
Evaluate a specific SysML run:
```
npx tsx agent_evaluation/evaluate_run.ts automated_test_results/sysml/runs/<runId> agent_evaluation/evaluation_config.json
```
Evaluate a specific BPMN run:
```
npx tsx agent_evaluation/evaluate_run.ts automated_test_results/bpmn/runs/<runId> agent_evaluation/evaluation_config.json
```
Evaluate all runs under SysML:
```
npx tsx agent_evaluation/evaluate_all.ts automated_test_results/sysml/runs agent_evaluation/evaluation_config.json
```
Evaluate all runs under BPMN:
```
npx tsx agent_evaluation/evaluate_all.ts automated_test_results/sap-sam-export_aggregated/runs agent_evaluation/evaluation_config.json
```
Optional parameters for agent_evaluation/evaluate_all.ts:
- --model <modelName>: evaluates only the runs with the given LLM model as modelName (for example, "openai/gpt-5.4-nano")
- --questions-file <repo-relative-path>: evaluates only the runs with the given question file as repo-relative-path (for example, "questions/sysml_questions.txt")
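For example, to re-evaluate only the SysML runs of a single model on the standard question set (assuming the flags are appended after the two positional arguments):
```
npx tsx agent_evaluation/evaluate_all.ts automated_test_results/sysml/runs agent_evaluation/evaluation_config.json --model "openai/gpt-5.4-nano" --questions-file questions/sysml_questions.txt
```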
Aggregate the latest evaluation of each run:
```
npx tsx agent_evaluation/aggregate_results.ts automated_test_results
```
Aggregate all evaluations of each run:
```
npx tsx agent_evaluation/aggregate_results.ts automated_test_results --all
```
Create the plots:
```
python3 agent_evaluation/create_result_plots.py <summary_dir>
```
Optional parameters: only the results of the selected filters will be plotted:
- Dataset kind: sysml or sap-sam-export_aggregated
  - --datasetkind <value1> <value2>
- Category:
  - for SysML: Junior, Medior, Senior
  - for BPMN:
    - single-model, general
    - single-model, model-specific
    - organization, general, generic
    - organization, general, domain-specific
    - organization, specific organization, generic
    - organization, specific organization, domain-specific
  - --category <value1> <value2>
- LLM model:
  - --llm-model <value1> <value2>
- Diagram type: bar chart, scatter plot, scatter plot matrix, or heatmap
  - --diagramtype <bar|scatter|matrix|heatmap>
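An illustrative invocation combining some of these filters (flag placement after the summary directory is assumed):
```
python3 agent_evaluation/create_result_plots.py <summary_dir> --datasetkind sysml --category Senior --diagramtype bar
```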
- automated_test_results/<datasetKind>/runs/<runId>/: (datasetKind is sysml or sap-sam-export_aggregated in our case)
  - evaluations/:
    - <evalID>/: evaluation happening at the given timestamp
      - metrics: JSONL logs of the scores of all metrics measured on the run
      - manifest.json: metadata of the evaluation
    - LATEST.json: evaluation ID of the latest evaluation of the run
- _summary/<summaryID>/: aggregation of results happening at the given timestamp
  - metrics_results.csv: a CSV file aggregating all the metric scores of the summary
  - plots/<plottingID>/: plots of the summary happening at the given timestamp
    - bar: bar charts
    - scatter: scatter plots
    - matrix: scatter plot matrix of metrics
Copyright (c) 2026 The Authors
The benchmark is available under the Apache License 2.0.
Related artifacts:
- Thirty Meter Telescope SysML model
  - the SysML model in RDF is available in the model_databases/TMT folder
  - the original SysML model is available under the Apache License 2.0
  - RDF conversion tool: cameo2rdf repo
    - available under the MIT License
- BPMN2KG: Business Process Model and Notation to Knowledge Graph
  - converters/bpmn-to-kg/: semantic mapping from BPMN process models to a knowledge graph
  - available under the MIT License
- SAP Signavio Academic Models
  - available under the SAP-SAM dataset license
- bpmn-sol-llm-benchmark repo
  - our benchmark uses the subset of the SAP-SAM dataset available in the bpmn-sol-llm-benchmark repo
  - related paper: On LLM Assisted Generation of Smart Contracts from Business Processes