This week we published a new open-source project: Bud Symbolic AI, a framework designed to bridge traditional pattern matching (like regex and Cucumber expressions) with semantic understanding driven by embeddings. It delivers a unified expression framework that intelligently handles single words, multi-word phrases, dynamic parameters, and context-aware validation by leveraging FAISS for efficient similarity search and a flexible registry system for parameter types, making it ideal for advanced guardrails, caching layers, and NLP applications.
Why Use It?
Bud Symbolic AI brings together the precision of traditional patterns with the flexibility of modern embeddings, all under one roof. Instead of forcing you to choose between brittle regexes and "black-box" vector searches, it lets you:
- Write templates with named slots (e.g. {date}, {device}, {sku}) and get back rich, typed objects, complete with start/end indices, data-typed values, and similarity scores (see the sketch after this list). Your downstream code can immediately consume match.value as a date, number, phrase, etc., without extra parsing or validation.
- Apply hard constraints first (regex, enums, exact matches) and only fall back to FAISS-powered semantic matching when you need fuzziness, meaning no more noisy hits or missed edge cases.
- Disambiguate by context, so “bank” near “loan” resolves to financial institution, while “bank” near “river” is ignored—without training a giant contextual LLM.
- Manage dynamic vocabularies in‑process, adding thousands of phrases or SKUs at runtime while still hitting sub‑millisecond lookups on CPU‑only machines.
- Mix and match parameter behaviors (:regex, :quoted, :phrase, :semantic, or your own custom types) in a single expression, rather than stitching together multiple tools.
- Trace and debug every match, with clear logs like "regex passed, semantic score 0.87 ≥ 0.8, context check OK," so you know exactly why a value was accepted or rejected.
- Train with a handful of examples, going from ~75% zero-shot accuracy to ~90%+ by supplying just 20–50 positive/negative samples per slot, with no manual hyperparameter tuning required.
Under the hood, Symbolic AI batches embedding calls, caches prototypes and query vectors, and auto-builds FAISS indices, delivering:
- Cold‑start latencies around 0.03 ms
- Warm‑cache lookups as low as 0.002 ms
- Optimal FAISS + cache runs near 0.001 ms
This makes it ideal for anything from LLM guardrails and conversational interfaces to high‑volume NLP pipelines—anywhere you need both structure and semantic recall, without sacrificing performance or interpretability.
Extensibility & Hybrid Logic
Symbolic AI gives you a middle ground between traditional rules and full-blown large language models. You can mix :regex, :quoted, :semantic, :phrase, and even custom parameter types in one expression. For example:
Schedule a meeting on {date:regex} about {topic:semantic}
A rule in Symbolic AI like:
Remind me to {task} at {time}
will match:
- “Remind me to email Bob at 10am”
- “Remind me to workout at 6am”
It also understands similar phrasing:
- “Can you remind me to call Mom at 8pm?”
- “Set a reminder to call Mom tonight”
Use Cases
It can be applied across a range of natural language tasks where both flexibility and structure are important. It is particularly useful in scenarios that require interpreting varied user input and converting it into actionable intent. Key use cases include:
- Conversational Interfaces: Enables chatbots and virtual assistants to understand diverse phrasing in user commands and respond appropriately.
- AI Guardrails: Detects sensitive, out-of-scope, or special-case prompts to ensure safe and controlled model behavior.
- Semantic Search & Retrieval: Supports phrase-level matching based on meaning, improving the relevance of results.
- Caching & Deduplication: Matches similar queries and reuses prior responses, enhancing efficiency in repeated interactions (see the sketch after this list).
- Form Filling & Task Routing: Extracts structured values such as dates, names, or actions from unstructured text inputs for backend automation.
Core Features
1. Intelligent Phrase Boundaries
The engine can automatically detect the most appropriate boundaries between words and phrases. This allows it to isolate meaningful expressions within a sentence without relying on hardcoded rules, making it better at identifying what parts of a sentence are actually relevant. For phrase libraries exceeding 1,000 entries, Symbolic AI integrates with FAISS, resulting in a 5–10x speedup in phrase similarity lookups. This allows the engine to scale to large vocabularies without degrading performance.
2. Semantic Phrase Categories
It can group phrases based on their semantic category. For example, it can recognize that a phrase like “iPhone 15 Pro Max” falls under the broader category of smartphones. This ability to map phrases to conceptual categories improves its performance in classification, filtering, and context tagging tasks. It uses FAISS-enhanced phrase embeddings to understand variations in user input.
3. Flexible Length Handling
Language is unpredictable, and user input often varies in length. The engine handles this variability gracefully by supporting phrase extraction across short and long inputs alike. Whether it’s a brief command or a multi-part instruction, Symbolic AI adapts without requiring strict formatting.
4. Adaptive Matching
Context matters — and Symbolic AI takes that into account through adaptive matching. It considers the surrounding content and overall sentence structure to validate whether a phrase truly matches the intended pattern. This reduces false positives and improves reliability in complex language scenarios. The engine supports Cucumber-style expressions and modular rule definitions that are easier to read and update. Developers can define patterns in natural, human-readable form rather than cryptic regex strings.
5. High Performance
Designed with production environments in mind, the engine delivers sub-millisecond matching latency and can process more than 50,000 operations per second. This level of performance enables it to support real-time applications where speed and responsiveness are essential.
6. Backward Compatibility
Symbolic AI integrates smoothly with existing systems built on Bud’s Expression syntax, ensuring that teams can adopt the new engine without needing to rewrite their existing rule sets. This makes it easier to experiment with advanced features while maintaining compatibility with current workflows.
7. Regex Compilation Cache
Symbolic AI maintains an internal cache for compiled regular expressions, achieving 99%+ hit rates. This eliminates the need for repeated recompilation of expressions and improves runtime efficiency, particularly in systems with large or frequently reused rule sets.
8. Prototype Embedding Pre-computation
To enable instant similarity checks, the engine pre-computes embeddings for prototype phrases. This reduces the need for on-the-fly vector generation, allowing the matcher to respond more quickly when comparing incoming user inputs to stored patterns.
9. Batch Embedding Computation
When dynamic embeddings are necessary (e.g. for user queries), the system supports batch processing, which reduces model invocation overhead by 60–80%. This is particularly useful in high-volume applications such as conversational agents and logging pipelines.
10. Multi-level Caching Architecture
The engine uses a multi-tiered caching strategy to speed up different stages of the matching process: an L1 cache for compiled expressions, an L2 cache for embedding vectors, and an L3 cache for semantic prototypes. This layered design ensures that repeated requests are served with minimal recomputation.
11. Optimized Semantic Types
Embeddings associated with frequently used semantic types—such as time expressions, device names, or locations—are shared across matches, reducing both latency and memory usage.
12. Thread-Safe Architecture
All caching layers and core matching logic are designed to be thread-safe, using appropriate locking mechanisms. This makes the engine safe for concurrent use in multi-threaded applications or environments with parallel request processing.
Example: Basic Multi-Word Phrase Matching
from semantic_bud_expressions import UnifiedBudExpression, EnhancedUnifiedParameterTypeRegistry
# Initialize enhanced registry with FAISS support
registry = EnhancedUnifiedParameterTypeRegistry()
registry.initialize_model()
# Create phrase parameter with known car models
registry.create_phrase_parameter_type(
    "car_model",
    max_phrase_length=5,
    known_phrases=[
        "Tesla Model 3", "BMW X5", "Mercedes S Class",
        "Rolls Royce Phantom", "Ferrari 488 Spider"
    ]
)
# Match multi-word phrases intelligently
expr = UnifiedBudExpression("I drive a {car_model:phrase}", registry)
match = expr.match("I drive a Rolls Royce Phantom")
print(match[0].value) # "Rolls Royce Phantom"
Example: Semantic Phrase Matching
# Create semantic phrase categories
registry.create_semantic_phrase_parameter_type(
    "device",
    semantic_categories=["smartphone", "laptop", "tablet"],
    max_phrase_length=6
)
expr = UnifiedBudExpression("I bought a {device:phrase}", registry)
match = expr.match("I bought iPhone 15 Pro Max") # Matches as smartphone
print(match[0].value) # "iPhone 15 Pro Max"
Example: Context-Aware Matching
from semantic_bud_expressions import ContextAwareExpression
# Match expressions based on semantic context
expr = ContextAwareExpression(
    expression="I {emotion} {vehicle}",
    expected_context="cars and automotive",
    context_threshold=0.5,
    registry=registry
)
# Only matches in automotive context
text = "Cars are amazing technology. I love Tesla"
match = expr.match_with_context(text) # ✓ Matches
print(f"Emotion: {match.parameters['emotion']}, Vehicle: {match.parameters['vehicle']}")
Performance
Benchmarked on Apple M1 MacBook Pro:
Expression Type | Avg Latency | Max Throughput | FAISS Speedup
--- | --- | --- | ---
Simple | 0.020 ms | 50,227 ops/sec | N/A
Semantic | 0.018 ms | 55,735 ops/sec | 2x
Multi-word Phrase | 0.025 ms | 40,000 ops/sec | 5-10x
Context-Aware | 0.045 ms | 22,000 ops/sec | 3x
Mixed Types | 0.027 ms | 36,557 ops/sec | 4x
FAISS Performance Benefits:
- Small vocabulary (<100 phrases): 2x speedup
- Medium vocabulary (100-1K phrases): 5x speedup
- Large vocabulary (1K+ phrases): 10x speedup
- Memory efficiency: 60% reduction for large vocabularies
- Automatic optimization: FAISS is enabled automatically based on vocabulary size
With All Optimizations Enabled:
- Cold start: ~0.029 ms (first match)
- Warm cache: ~0.002 ms (cached match) – 12x speedup
- FAISS + cache: ~0.001 ms (optimal case) – 25x speedup
- Throughput: 25,000+ phrase matches/second
Real-world Performance:
- API Guardrails: 580,000+ RPS capability
- Semantic Caching: 168,000+ RPS capability
- Phrase Matching: 25,000+ RPS with 1000+ phrases
- Context Analysis: 22,000+ RPS capability
Real-World Training Results
Based on comprehensive testing across different domains:
Domain | Target Context | Untrained Accuracy | Trained Accuracy | Improvement
--- | --- | --- | --- | ---
Healthcare | “your health medical help” | 72% | 91% | +19%
Financial | “your banking money” | 75% | 94% | +19%
E-commerce | “shopping buy purchase” | 68% | 88% | +20%
Legal | “legal contract law” | 70% | 89% | +19%
Technical | “support help assistance” | 73% | 92% | +19%
Average Performance:
- Untrained: 71.6% accuracy, 0.68 F1 score
- Trained: 90.8% accuracy, 0.89 F1 score
- Improvement: +19.2% accuracy, +0.21 F1 score
Higher Accuracy Through Domain-Specific Training Support
For domain-specific applications that require higher accuracy, the library also provides a training system. This system can optimize context-aware matching using your own examples, allowing it to adapt more precisely to your language patterns and use cases. While the engine performs well out of the box, training can further enhance its precision in specialized environments.
Key Training Features
- Zero-Training Readiness: Symbolic AI is usable out of the box with no setup, offering competitive accuracy for general-purpose matching.
- Training Enhancement: When example data is provided, accuracy typically improves to 85–95%, especially in specialized use cases.
- Automatic Optimization: The training system intelligently tunes thresholds, window sizes, and chunking strategies to optimize for your context (see the sketch after this list).
- Context Length Handling: It balances comparisons between longer user inputs and shorter target expressions, maintaining match quality even in imbalanced cases.
- Perspective Normalization: It learns to normalize phrases that vary by speaker perspective, for instance matching "your help" with "patient needs assistance" in a healthcare setting.
- False Positive/Negative Reduction: A multi-strategy optimization approach helps minimize misclassifications, improving the reliability of guardrails and decision logic.
Getting Started
You can explore the project on GitHub. Documentation, usage examples, and planned features are available in the repository.
🔗 https://github.com/BudEcosystem/symbolicai
For teams working on language interfaces, conversational systems, or semantic pipelines, Symbolic AI may offer a useful building block to explore.