Automated Software Test Generation and Augmentation for Improved Debloating

Navy STTR Topic: DON26TZ01-NV008
Office of Naval Research (ONR)
Pre-release 4/13/26   Opens to accept proposals 5/6/26   Closes 6/3/26 12:00pm ET

DON26TZ01-NV008 TITLE: Automated Software Test Generation and Augmentation for Improved Debloating

OUSW (R&E) CRITICAL TECHNOLOGY AREA(S): Quantum and Battlefield Information Dominance (Q-BID)

COMPONENT TECHNOLOGY PRIORITY AREA(S): Advanced Computing and Software; Integrated Network Systems-of-Systems; Integrated Sensing and Cyber

PROJECTED CMMC LEVEL REQUIREMENT: Level 2 (Self)

OBJECTIVE: Develop an automated solution for developing, enhancing, expanding, and augmenting software tests to more safely broaden the employment of proactive cyber techniques such as debloating and post-construction software refactoring. Technology is needed to refine a suite of tests to a level such that it may serve as a practical expression of a software transformation objective to drive other tools as well as validate their output. Technology should leverage multi-modal methods such as ingesting code and documentation as well as be compatible with DevOps processes.

DESCRIPTION: Modern software development practices such as industrialized code reuse and artificial intelligence (AI) assistance enable developers to produce increasingly complex and capable software more quickly and cheaply than ever before. The tools to ensure that all this software is well-tested and that all of the included code is well-tailored to the deployment scenario, however, have lagged by comparison.

Modern applications often include hundreds to thousands of libraries and other dependencies, yet only a small portion of the code in each is typically needed by users in a given deployment scenario. The excess code that remains tends to be less used in general, less well-scrutinized, and full of obscure features that are often found (sometimes only years later) to contain vulnerabilities. To address this problem, numerous tools have been developed to identify bloat and then modify the software by removing unneeded code [Ref 1]. The configurations, usage logs, and tests that are fed as inputs to code transformation tools to tell them what to cut are referred to as debloat specifications [Refs 1, 2].
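As an illustration only, the role of a test suite as a debloat specification can be sketched in a few lines of Python. The three parser functions, the `LIBRARY` set, and the helper `functions_exercised` are hypothetical names invented for this sketch; they do not come from any tool cited in this topic. The sketch assumes a coverage-driven approach in which any function never reached by the tests becomes a removal candidate.

```python
import sys

# Hypothetical stand-in for a library that offers more than one deployment needs.
def parse_csv(line):
    return line.split(",")

def parse_tsv(line):
    return line.split("\t")

def parse_fixed_width(line, width):
    return [line[i:i + width] for i in range(0, len(line), width)]

LIBRARY = {parse_csv, parse_tsv, parse_fixed_width}

# The only test the developers wrote; it acts as the debloat specification.
def test_csv():
    assert parse_csv("a,b") == ["a", "b"]

def functions_exercised(tests, library):
    """Run each test and record the names of the library functions it calls."""
    by_code = {f.__code__: f.__name__ for f in library}
    seen = set()

    def profiler(frame, event, arg):
        # "call" fires whenever a Python function is entered.
        if event == "call" and frame.f_code in by_code:
            seen.add(by_code[frame.f_code])

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        for test in tests:
            test()
    finally:
        sys.setprofile(previous)
    return seen

retained = functions_exercised([test_csv], LIBRARY)
removable = {f.__name__ for f in LIBRARY} - retained
print(sorted(retained))   # ['parse_csv']
print(sorted(removable))  # ['parse_fixed_width', 'parse_tsv']
```

If the deployment actually needs `parse_tsv`, nothing in this specification says so, and a tool driven by it would cut the function. Closing that kind of gap is the purpose of test augmentation.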

Because the economics of code reuse will continue to drive library and package developers to maximize generality, debloating must happen through a separate process that begins after those components are built into a specific application. Having a process modify code separately from the one that originally designed, implemented, and tested those components adds risk: flawed or incomplete transformations are not uncommon. The evaluation in [Ref 2] found that 37% of the debloated binaries its authors created failed to correctly execute the functionality that was meant to be retained.

Many factors can contribute to a transformation yielding a broken application, but one of the biggest is a low-quality debloat specification. Developer-authored tests are often limited, and users of debloating tools can rarely specify in exact detail all the features they actually need for a given deployment scenario. These incomplete specifications can lead tools to be overly aggressive in removing things like security checks and exception handlers that are critical to application safety and robustness [Ref 3].
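The exception-handler risk can be made concrete with a small Python sketch. It assumes a line-coverage-driven debloater, which is only one of several possible designs; `read_port` and `lines_exercised` are hypothetical names used for illustration. A suite with only well-formed input never reaches the handler, so the handler looks like dead code until an error-path test is added.

```python
import sys

def read_port(text):
    """Hypothetical function under debloating: parse a TCP port number."""
    try:
        port = int(text)
    except ValueError:
        return None  # handler that keeps malformed input from crashing the app
    return port

def lines_exercised(func, inputs):
    """Return the line offsets inside func that run for the given inputs."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function of interest.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for value in inputs:
            func(value)
    finally:
        sys.settrace(previous)
    return hit

happy_path = lines_exercised(read_port, ["8080"])
augmented = lines_exercised(read_port, ["8080", "not-a-port"])

# Lines that only the augmented suite reaches: the exception handler.
handler_only = augmented - happy_path
print("lines a happy-path-only specification would mark as unused:", len(handler_only))
```

Under these assumptions, a single added input that exercises the error path is enough to keep the handler in the retained set.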

To better address the problem of low-quality and incomplete debloat specifications, new technology is needed to more fully incorporate and automate the capturing of desired software behaviors for input to a debloat tool. The technology should be able to take advantage of code analysis as well as analysis of related artifacts such as documentation, build configs, existing tests, and even user input, as long as providing that input can be made practical and easy for the user. Various works have explored methods and techniques for capturing exception handlers [Ref 3], balancing reduction with a targeted amount of generality [Ref 4], and leveraging AI to incorporate new tests [Refs 5, 6, 7]. All may inform strategies for automated test generation and augmentation that can lead to higher quality debloat specifications.

PHASE I: Define and develop a concept for automated multi-modal processing of code and other DevOps repository artifacts, such as user guides, to generate and augment a suite of tests that can serve as the inputs to proactive cyber security tools, namely debloating. Work toward a design that can develop tests based on unstructured documents and interact with a user to refine the tests. The Phase I Option, if exercised, would develop the initial test augmentation capability needed to create the full prototype in Phase II.

PHASE II: Develop a prototype containerized test augmentation capability to validate the concept defined in Phase I. Demonstrate that the prototype can automatically process code, DevOps repository artifacts, and, if necessary, user interview inputs in order to develop, enhance, expand, and augment software tests. Ensure that, by the end of Phase II, the prototype is deployable in a software factory environment and can develop enough tests to sufficiently, reliably, and robustly enable the debloat of an application using only (1) its existing limited test suite, (2) unstructured program documents such as user guides, and (3) real-time user input at the non-expert level.

PHASE III DUAL USE APPLICATIONS: Integrate the Phase II developed test augmentation capability with Program of Record systems and their applications. Field containerized solutions that integrate with existing build pipelines.

Potential commercial applications include automated software testing and fuzzing harness generation, a growing need due to the proliferation of AI-generated code.

REFERENCES:

  1. Alhanahnah, M.; Boshmaf, Y. and Gehani, A. "SoK: Software Debloating Landscape and Future Directions." Workshop on Forming an Ecosystem Around Software Transformation (FEAST), 2024. https://www.arxiv.org/abs/2407.11259
  2. Brown, M. et al. "A Broad Comparative Evaluation of Software Debloating Tools." 33rd USENIX Security Symposium, 2024. https://www.usenix.org/system/files/usenixsecurity24-brown.pdf
  3. Alhanahnah, M. and Jhumka, A. "Software Debloating from Exception-Handler Lenses." Workshop on Forming an Ecosystem Around Software Transformation (FEAST), 2024, pp. 19-24. https://dl.acm.org/doi/10.1145/3689937.3695793
  4. Xin, Q.; Zhang, Q. and Orso, A. "Studying and understanding the tradeoffs between generality and reduction in software debloating." 37th IEEE/ACM International Conference on Automated Software Engineering, 2022. https://dl.acm.org/doi/abs/10.1145/3551349.3556970
  5. Lin, B. et al. "Large Language Models-Aided Program Debloating." arXiv:2503.08969, 12 March 2025. https://www.arxiv.org/pdf/2503.08969
  6. Khandaker, S. et al. "AugmenTest: Enhancing Tests with LLM-Driven Oracles." ICST, Naples, Italy, 31 March-4 April 2025. arXiv preprint arXiv:2501.17461, 2025. https://conf.researchr.org/details/icst-2025/icst-2025-papers/25/AugmenTest-Enhancing-Tests-with-LLM-driven-Oracles
  7. Dinella, E. et al. "Toga: A neural method for test oracle generation." Proceedings of the 44th International Conference on Software Engineering, 05 July 2022, pp. 2130-2141. https://doi.org/10.1145/3510003.3510141

KEYWORDS: Cyber; Software Testing; Automation; artificial intelligence; AI; machine learning; ML; large language model; LLM; Debloating; Feature Specification


Topic Q & A

05/09/2026  Q. Does the Navy have preferred application types for debloating validation, such as cybersecurity tools, web services, command line tools, embedded software, mission applications, or software that interfaces with physical systems? This would help proposers design the right validation framework, since physical system software may require different preservation checks than cybersecurity or web applications.
   A. Any type of application that exhibits modern software components is acceptable. Developing approaches for embedded and cyber-physical systems is not required, as we recognize that doing so adds complexities beyond those of enterprise applications.
04/29/2026  Q. The topic references generating tests from unstructured documentation to improve preservation specifications for debloating workflows. Should proposals assume that improving exception-path preservation and configuration-dependent execution behavior are priority areas for augmentation, given their known role in debloating regression failures?
   A. Both examples are desirable areas for augmentation.
04/29/2026  Q. Does the Navy anticipate that the Phase II prototype should integrate with containerized CI/CD environments (e.g., Platform One-style pipelines), or is the expected transition environment primarily research-grade tooling used to evaluate debloating effectiveness?
   A. In Phase II, the prototype should integrate with CI/CD environments such as Navy software factories.
04/29/2026  Q. Are the generated tests intended primarily to increase behavioral coverage for coverage-driven debloating tools, or should they also function as semantic preservation specifications capable of validating correctness beyond execution-path coverage?
   A. The primary intent behind the generated tests is to more effectively drive the debloaters.
04/29/2026  Q. Should Phase I focus primarily on demonstrating the feasibility of automated feature-intent extraction and documentation-driven test generation, or is the expectation to also implement an operational prototype capable of producing executable test suites from real repositories?
   A. The Phase I prototype is a proof of concept to demonstrate feasibility of approach.
04/29/2026  Q. The topic description suggests AI-based extraction of feature intent from unstructured artifacts. Is the Navy primarily interested in LLM-driven semantic interpretation of documentation for test generation, or are hybrid approaches combining static analysis, symbolic reasoning, and LLM-assisted inference equally acceptable?
   A. All approaches are acceptable that can address the challenges described in the topic description. We do not require any specific technique.
04/29/2026  Q. The Objective and Description indicate that generated test suites will serve as inputs to proactive cybersecurity tools such as debloating systems. Has the Navy identified specific debloating tools that proposers should assume as integration targets, or should proposals remain tool-agnostic while demonstrating compatibility with representative research or open-source debloating frameworks?
   A. Proposals should focus on test augmentation. The tests could be consumed by various debloating tools.
04/29/2026  Q. Should we assume there is access to the source and repo docs such as Readme.md, API docs, deploy config data (helm), and current test cases?
   A. Yes.
04/29/2026  Q. What languages will be the focus: C/C++, Java, Python, golang?
   A. Any source language is acceptable for Phase I proof of concept.
04/27/2026  Q. For DON26BZ01-NV030, could you clarify whether the expected solution is intended to operate primarily before build artifact generation (e.g., optimizing material selection, geometry processing, and AM parameter sets prior to creation of the machine build package), or whether the Navy is also interested in post-generation analysis of build artifacts to identify unused or unnecessary structures, parameters, or features that could be removed to improve performance or reduce defects?

In particular, if post-generation approaches are within scope, should proposers assume access to sufficient internal representation of the build package to support feature-level analysis, or is the expectation that optimization occurs earlier in the digital-twin workflow where full lifecycle parameter visibility is available?
   A. The scope can include analysis of build artifacts and the configuration used to generate them.

** TOPIC NOTICE **

The Navy Topic above is an "unofficial" copy from the Navy Topics in the DoW FY-26 Release 1 SBIR BAA. Please see the official DoW Topic website at www.dodsbirsttr.mil/submissions/solicitation-documents/active-solicitations for any updates.

The DoW issued its Navy FY-26 Release 1 SBIR Topics pre-release on April 13, 2026, which opens to receive proposals on May 6, 2026, and closes June 3, 2026 (12:00pm ET).

Direct Contact with Topic Authors: During the pre-release period (April 13 through May 5, 2026), proposing firms have an opportunity to directly contact the Technical Point of Contact (TPOC) to ask technical questions about the specific BAA topic. The TPOC contact information is listed in each topic description. Once DoW begins accepting proposals on May 6, 2026, no further direct contact between proposers and topic authors is allowed unless the Topic Author is responding to a question submitted during the pre-release period.

DoD On-line Q&A System: After the pre-release period, until May 20, 2026, at 12:00 PM ET, proposers may submit written questions through the DoW On-line Topic Q&A at https://www.dodsbirsttr.mil/submissions/login/ by logging in and following instructions. In the Topic Q&A system, the questioner and respondent remain anonymous but all questions and answers are posted for general viewing.

DoW Topics Search Tool: Visit the DoW Topic Search Tool at www.dodsbirsttr.mil/topics-app/ to find topics by keyword across all DoW Components participating in this BAA.

Help: If you have general questions about the DoD SBIR program, please contact the DoD SBIR Help Desk via email at DoDSBIRSupport@reisystems.com

