Product · 6 MIN READ · DR. XENIA MOUNTROUIDOU AND JANE HUNG · SEP 4, 2025 · TAGS: AI & automation
In part one, we outlined our goal: to enhance transparency by using a Large Language Model (LLM) to generate clear explanations for benign security alerts. We established that this approach was feasible and that we had the right data to proceed.
Now, we get into the technical execution. The first and most critical step in any LLM project is structuring the data to guide the model toward a correct and precise outcome. This process, known as feature engineering, forms the bedrock of our AI Resolutions.
TL;DR
- “Context is king” for LLMs. We developed structured data from Expel Workbench™’s rich evidence collection to guide our AI Resolutions (AIR) toward accurate and precise outcomes.
- Evaluation is one of the best ways to ensure precision in critical security applications. Since there’s no single “correct” answer for a benign alert explanation, Expel uses a combination of LLM judges (who grade responses based on a rubric), data-derived metrics, and human analyst feedback to evaluate the AIR’s output.
- The system was built on a hybrid prompt design. After extensive experimentation, Expel found that a combination of chain of thought (CoT) and self-reflection (SR) techniques yielded the best results in generating high-quality, accurate comments.
Feature engineering
We used Expel Workbench™ data as context to guide the LLM in generating accurate close comments for benign alerts. This involved structuring data with appropriate semantics, similar to feature engineering in traditional ML, but focused on helping the model infer precise text. We used a structured dictionary with features grouped into specific categories.
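For illustration, a minimal sketch of what such a feature dictionary might look like follows; the field names and groupings are hypothetical, not Expel Workbench's production schema.

```python
# Hypothetical feature dictionary passed to the LLM as context.
# Field names and groupings are illustrative only.
alert_features = {
    "detection": {
        "vendor": "example-edr",
        "rule_name": "Suspicious PowerShell encoded command",
        "alert_time": "2025-08-01T14:32:00Z",
    },
    "evidence": {
        "iocs": ["hxxps://updates.example-vendor[.]com/agent.ps1"],
        "ioas": ["powershell.exe -EncodedCommand <redacted>"],
        "investigation_findings": [
            "Decoded command matches a signed vendor update script.",
        ],
    },
    "organization": {
        "type": "healthcare provider",
        "notes": "IT team routinely deploys software via PowerShell.",
    },
}
```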
The LLM’s primary objective was to detail the detection and justify the alert based on the provided evidence. A secondary goal was to use Indicators of Compromise (IOCs) to explain why, despite a correct detection trigger, the evidence indicated no malicious activity. This was achievable through our extensive automated workflows, which generate IOCs, Indicators of Alerts (IOAs), and investigation findings. Finally, the organization’s type provides context for why certain activities might be normal.
Our features were developed using two key methods:
- Exploratory analysis: We utilized statistical analysis and visualization to understand the characteristics of data commonly found in effective close comments.
- Experimentation: We conducted experiments using metrics that identify errors and semantic inaccuracies.
This brings us to the most important part of the development process: experimentation.
Generating with accuracy
Developing LLM features for critical security applications necessitates extensive experimentation and the application of robust metrics to guarantee accuracy. A key challenge in this experimentation is defining quality metrics for evaluation, particularly because LLM applications often can’t be characterized strictly through mathematical, objective measures. This requires a creative approach to metric development.
Metrics
Evaluating LLM applications through metrics is an ongoing area of research. In our specific scenario, defining a “good” close comment for benign alerts lacked an objective truth. We explored several metrics, which are presented in the table below:
| Name | Definition | Type | Values | LLM judge?¹ | Metric category |
|---|---|---|---|---|---|
| Grade | A customer satisfaction survey rubric assigns a numerical grade. | Integer | 1-5 | ✓ | Quality |
| Completeness | The grade indicates the proportion of input data used. | Integer | 1-3 | ✓ | Coverage |
| BERT F1 similarity | This function compares word semantics to determine the similarity between a close comment and the input data. | Float | 0-100 | X | Coverage |
| Correctness | Numerical grade determined by the accuracy of the comment’s assumptions based on the provided data. | Integer | 1-3 | ✓ | Precision |
| Errors | Count of application-specific error types, for example, incorrect timestamps. | Integer | >0 | X | Application |
| Situational awareness | Did the model demonstrate awareness of the environment by using correlated data? | Boolean | True/false | ✓ | Coverage |
| Adoption | To what extent did the analyst incorporate the close comment? | Float | 0-100 | X | User acceptance |

¹ A checkmark indicates that an LLM judge was used for this metric.
Evaluating experiments requires creative approaches, as shown in the table above. We combined LLM judge metrics, where language models grade responses against specific instruction rubrics, with derivative metrics computed from the data using objective mathematical functions. For instance, an LLM can assess completeness using a clear rubric defining “complete,” “partially complete,” and “incomplete.” This can be supplemented by deterministic metrics (such as BERT similarity) to determine how much of the original data was utilized in the close comment.
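As an illustration of the LLM-judge side, the sketch below shows how a completeness rubric could be turned into a grading call. The rubric wording and the `call_llm` helper are placeholders, not Expel’s actual implementation.

```python
# Illustrative LLM-judge rubric for the completeness metric (1-3 scale).
COMPLETENESS_RUBRIC = """\
You are grading a SOC close comment for completeness.
3 (complete): the comment uses all relevant evidence provided.
2 (partially complete): the comment uses some, but not all, relevant evidence.
1 (incomplete): the comment ignores most of the provided evidence.
Return only the integer grade."""

def judge_completeness(close_comment: str, evidence: dict, call_llm) -> int:
    """Ask an LLM judge to grade completeness; `call_llm` is any function
    that takes a prompt string and returns the model's text response."""
    prompt = (
        f"{COMPLETENESS_RUBRIC}\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Close comment:\n{close_comment}"
    )
    return int(call_llm(prompt).strip())
```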
Our metrics are categorized as follows:
- Quality metrics: These measure the precision and coverage of the close comment in relation to SOC and customer requirements.
- Coverage metrics: These show the extent to which data was utilized in composing the comment.
- Application metrics: These assess the application’s performance.
- User acceptance metrics: These indicate the adoption rate of the application.
Customer satisfaction grades were generated using a well-defined rubric and an LLM judge. While an LLM judge can produce varied results given its statistical nature, we mitigated this by incorporating a correctness metric and error metrics. Although the correctness metric also used an LLM judge, its smaller scale reduced the potential for error.
Furthermore, error counting serves as a crucial derivative metric, significantly enhancing application quality. By deterministically counting logical errors, such as incorrect hostnames or timestamps, we can achieve the desired accuracy.
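A sketch of what such a deterministic error count could look like in practice is below; the regular expressions are simplified and the checks are illustrative rather than Expel’s actual rules.

```python
import re

# Simplified patterns for illustration only.
TIMESTAMP_PAT = r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?"
HOSTNAME_PAT = r"\b[a-z0-9-]+(?:\.[a-z0-9-]+){2,}\b"

def count_factual_errors(close_comment: str, evidence_text: str) -> int:
    """Count timestamps and hostnames that appear in the close comment
    but not in the source evidence."""
    errors = 0
    for pattern in (TIMESTAMP_PAT, HOSTNAME_PAT):
        for match in re.findall(pattern, close_comment, flags=re.IGNORECASE):
            if match.lower() not in evidence_text.lower():
                errors += 1
    return errors
```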
Finally, a post-adoption metric was developed using semantic BERT similarity. This metric quantifies the extent to which analysts adopted the generated close comments, which is especially valuable because it reinforces positive model behaviors and informs future enhancements.
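One way to compute such an adoption score is with the open-source bert-score package, comparing the generated comment to the comment the analyst ultimately saved. This is a sketch under that assumption, not necessarily how Expel computes the metric.

```python
from bert_score import score  # pip install bert-score

def adoption_score(generated_comment: str, final_analyst_comment: str) -> float:
    """Semantic BERT F1 similarity between the generated close comment and
    the analyst's final comment, scaled to 0-100."""
    _, _, f1 = score([final_analyst_comment], [generated_comment], lang="en")
    return round(f1[0].item() * 100, 2)
```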
Consider two example model outputs for a particular alert. In the first, the comment demonstrates tunnel vision, resulting in inaccurate situational awareness. In the second, the model fabricates information, leading to a correctness grade of 1.
Beyond automated metrics, we implemented annotation queues, enabling human analysts to grade model outputs and express preferences. This critical technique engages human experts in LLM evaluation, connecting them with the applications and fostering ownership and oversight of the final product. Analysts used the same rubric and instructions as the LLM judges, assessing correctness, completeness, and situational awareness, and could add comments and corrections as needed.
Experiments
We performed several experiments using defined metrics, focusing on two objectives:
- Prompt design: Our initial prompt designs aimed for accuracy and correctness in the closing comments. We evaluated all options to arrive at the most suitable hybrid prompt.
- Model selection: A key comparison was made between reasoning and non-reasoning models, with clear results in favor of reasoning models. However, comparisons between different types of reasoning models were less conclusive.
LangSmith, a platform for debugging, testing, and monitoring AI applications, proved invaluable for our experimentation, particularly its pairwise experiment feature, which allowed for comprehensive comparisons across all prompt combinations. Its robust software development kit (SDK) enabled us to automate the entire experimental setup, eliminating the need for manual configuration.
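For context, a rough sketch of how one of these experiments might be wired up with the LangSmith Python SDK is shown below. The dataset name, the target function, and the evaluator bodies are hypothetical placeholders, reduced to their bare shape.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()  # assumes LANGSMITH_API_KEY is set in the environment

def generate_close_comment(inputs: dict) -> dict:
    """Target under test. Placeholder body; in practice this calls the LLM
    with one of the prompt designs being compared."""
    return {"comment": f"Benign: {inputs.get('rule_name', 'alert')} explained by evidence."}

def correctness_judge(run, example) -> dict:
    """Evaluator. Placeholder body; in practice this calls an LLM judge
    with the 1-3 correctness rubric."""
    grade = 3 if run.outputs["comment"] else 1
    return {"key": "correctness", "score": grade}

evaluate(
    generate_close_comment,
    data="benign-alert-eval-set",   # hypothetical dataset name
    evaluators=[correctness_judge],
    experiment_prefix="cot-plus-sr",
)
```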
Prompt design
We evaluated the following prompt design techniques:
| Type of prompt | Definition | Abbreviated example |
|---|---|---|
| Chain of thought (CoT) | Guides the model through reasoning steps and asks the model to show its reasoning | Follow these instructions if they’re applicable to the data: … |
| Self-reflection (SR) | The model is asked to reflect on and rewrite its response | …Then, fact-check your own analysis by: … |
| Meta prompt (Meta) | Prompt rewritten by another LLM | (Similar to the above prompts, just rewritten by an LLM) |
| Constrained reasoning (CR) | Constrains model generation to the given data through specific instructions | Based only on the alert data, provide reasons why the alert was closed as benign. For each reason: … |
After numerous experiments, we observed consistently good accuracy and completeness when using both CoT and SR. Consequently, we opted for a hybrid approach combining the two; a simplified sketch of such a hybrid prompt follows the results table. To ensure a fair comparison, the same model was used for all prompts in this evaluation. The table below shows the averaged final results from hundreds of experiments, with the bolded measurements indicating the best-performing prompt engineering techniques.
| Metric | CoT | SR | Meta | CR |
|---|---|---|---|---|
| Grade | **3.92** | 3.85 | 3.72 | 3.83 |
| Completeness | **3** | 2.8 | **3** | 2.9 |
| BERT F1 | **78** | 77 | 76.53 | 77 |
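For illustration, a hybrid CoT-plus-SR prompt might be structured along the lines of the sketch below. The instructions are heavily condensed and are not Expel’s production prompt.

```python
# Condensed, illustrative hybrid prompt: chain of thought plus self-reflection.
HYBRID_PROMPT = """\
You are a SOC analyst writing a close comment for a benign alert.

Step 1 (chain of thought): reason through the evidence.
- Describe what the detection triggered on.
- Walk through the IOCs, IOAs, and investigation findings that explain why
  the activity is benign in this organization's context.

Step 2: draft the close comment using only the evidence above.

Step 3 (self-reflection): fact-check your draft against the provided data,
remove any claim the evidence does not support, then output the final comment.

Alert data:
{alert_features}
"""

def build_hybrid_prompt(alert_features: dict) -> str:
    """Fill the template with the structured feature dictionary."""
    return HYBRID_PROMPT.format(alert_features=alert_features)
```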
AI responsibly
Ensuring the security and integrity of a system like this requires diligently addressing several critical points, even if they sound like recurring themes. In our case, we’re protecting sensitive customer data and system integrity. To this end, our goal is to prevent the following vulnerabilities:
- Data leakage: This means implementing stringent controls to prevent sensitive information from being unintentionally or maliciously exposed outside of its intended boundaries. This problem is solved with standard security practices that we adopt: properly configured systems and secure API access.
- Data cross-pollination: This refers to the strict segregation of data, particularly when dealing with multiple customers, projects, or sensitive categories of information within a single system. Data from one context should never inadvertently mix with or be accessible by another. We prevent this with strict access control lists (ACLs) and data isolation techniques.
- Prompt hacking: This is a critical security vulnerability where malicious actors attempt to manipulate an AI model’s behavior by injecting unauthorized commands or data into its prompts or inputs. This could lead to a variety of undesirable outcomes, such as data extraction, denial of service, or the generation of harmful content. This application doesn’t offer an open chat mechanism, so prompt hacking isn’t feasible.
We created a reliable AI assistant—AI Resolutions—by transforming security data through meticulous feature engineering, robust evaluation, and hybrid prompt design. This demonstrates that building trustworthy LLM applications requires a deep, data-driven approach, prioritizing precise context, rigorous experimentation, and human oversight. Our system justifies benign alerts clearly and improves through analyst feedback, while maintaining data security.

