Don't Add AI Chatbots to Your Product. Integrate AI Into Features and Workflows.

Yev Spektor
January 16, 2024

Everyone is rushing to jam AI into their projects in one of a few ways.

Add a chatbot. Turn your app into a glorified ChatGPT wrapper. Give the AI access to docs and settings so it can help users navigate your app and solve problems on their own, then tell customers they can ask the AI anything—“How many issues did my team close last week?”, say. It’s the quickest way to check AI off the feature list. Generally, we’re not fans of this approach.

Our goal is to build a load test platform that makes it as easy as possible for developers and engineering teams to test their products—and if GPT could help do that better, building it in would be a clear fit. Yet every time we tried to build test suites with an LLM, something, inevitably, would be wrong. The AI’s output was generally not even close to what we needed, in part because the chat was too open-ended. It was faster for me to write the test suite myself than to write a prompt that would get the AI to do it.

We wanted AI to play a small, limited role that helped us build better software, faster. So we gave AI individual, specific, narrow jobs. Multiple doesn’t have an open-ended chatbot. Instead, we’ve tactically applied AI to specific use cases that play to AI’s strengths and avoid its shortcomings, tasks that would have been difficult to accomplish with traditional, rule-based programming.

The AI Trust Dilemma

LLMs have taken computing from always right to mostly right.

We trust code because it is binary. It works, or it doesn't. LLMs are anything but binary; they’re full of errors, hallucinations, questionable judgments, and flawed recommendations.

Chatbots are a lazy way to build interfaces, putting the hard decisions on the users instead of on the design and development teams. Users don’t know what to ask from an open-ended chat or what to expect from a new app’s AI feature. So they poke around to see what’s possible. But it’s difficult to get the same response every time with the same prompt, harder to get chatbots to solve your specific problem.

Every time the LLM is incorrect—which it will be, inevitably, as it’s writing answers by predicting the next word based on the corpus of human knowledge—users lose trust in the AI. In the software where the AI lives. In computing, in general. The excitement of a new way to interact with an app is quickly replaced by a distrust of everything in it.

Building AI Guardrails

Imagine if Siri were sold only as a way to set timers by voice. It’d be hard for it to get things wrong, harder to complain about its capabilities.

Instead of giving AI a blank canvas, we’re giving it a coloring book. We’re giving the user a coloring book, too. AI doesn’t know the limitations of the task or where to stop; users don’t know the limitations of the input or what the output could be. With guardrails and guided, carefully defined inputs and outputs, both the user and the AI are set up for success.

Users will be more forgiving of mistakes when they’re involved in the output. They’re emotionally committed to the creation process, as they are when coding with GitHub Copilot. Even when something breaks, the value of the AI is clear, as it saved them 95% of the work. They only need to review the output and add final tweaks.

But what if you took that idea to its logical conclusion and gave the LLM a task small enough to set it up for success? We’ve narrowed the scope so much that the result is far less likely to be wrong. And when it is wrong, our guardrails and validation ensure the output format is still correct, and we’re honest with our users about where things didn’t work. That lets us treat the LLM like any other dependency: a single object for a single, narrow task.

The Fuzziest Logic

We tried, at first, to get LLMs to write entire load tests—the core reason one might want to use an AI in a tool like Multiple. We asked OpenAI’s GPT-4 to extract data from examples, and it made what, at first, seemed like a usable script. Dig deeper, and it’s unusable. Inconsistent. Phantom URLs. Irrelevant code. Close, and yet so far from deployable.

It’d waste our time in debugging and correcting, in doing the robot’s homework for it.

Consistently creating synthetic data with an LLM was also an exercise in frustration. It could create synthetic data, just not consistently.

The question was where an LLM is better than traditional rule-based programming. After I’d spent enough time working with OpenAI’s and other LLM APIs, it hit me: they’re not AI, they’re the fuzziest logic APIs.

Rule-based programming is better at math. Better at facts. Better at branching, at logic, at following precise instructions.

Sequencing endpoints with Multiple's TestGen
Multiple automatically sequences endpoints from HAR files, and GPT-4 extracts the base API URL.

Fuzziest logic APIs are better at, well, thinking. Better at ambiguity. Better at classification, categorization, inference. Just as fuzzy matching makes it possible to search messy, complex files for approximate text, AI’s fuzziest logic pulls together connections and what feel like insights extracted from the corpus of human knowledge.

Rule-based programming is better at creating a fake birth date in YYYY/MM/DD format. Fuzziest logic APIs are better at guessing that 1980/01/13 is a birthday, at handling the edge cases, at guessing correctly even when a date is formatted slightly differently.

The first way we use AI in Multiple is in our load test generation. We wanted to automatically generate load tests from HAR and Swagger files. Parts of that task are easy to accomplish with traditional programming, but two parts are incredibly difficult: Generating synthetic data for API requests, and identifying the most relevant base URL from HAR files. 

The latter was the more straightforward of the two. It’s difficult to use regex or rule-based programming to pick the base URL out of all the analytics and tracking noise in a HAR. GPT-4 proved surprisingly adept at picking the base URL with near-perfect accuracy, without us having to write tons of code to handle every edge case imaginable. And with a single output, it’s easy to get users to verify that the URL is correct—far less cognitive load than asking them to verify an entire load test.
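A rough sketch of what that kind of single, scoped call can look like, using the official openai Node SDK. The prompt wording, helper name, and validation rule here are illustrative, not Multiple’s actual implementation:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: ask GPT-4 to pick the base API URL out of the unique
// origins found in a HAR file, then verify the answer against that same list
// so a hallucinated URL can never slip through.
async function pickBaseUrl(harOrigins: string[]): Promise<string | null> {
  const response = await client.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You are given a list of origins from a HAR file. " +
          "Reply with only the origin most likely to be the application's own API. " +
          "Ignore analytics, ads, CDNs, and tracking domains.",
      },
      { role: "user", content: harOrigins.join("\n") },
    ],
  });

  const answer = response.choices[0]?.message?.content?.trim() ?? "";
  // Rule-based guardrail: the answer must be one of the origins we sent.
  return harOrigins.includes(answer) ? answer : null;
}
```

In a sketch like this, an answer that isn’t in the list we sent simply comes back as null, and the deterministic code around it can fall back to asking the user—which keeps that single output easy to verify.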

The former problem—generating synthetic data to test APIs—needed to be scoped down further. Rule-based programming, with a JavaScript library like Faker, is great for generating all sorts of data: Names, emails, addresses, etc. It’s a solved problem, as long as you know what type of data you need.
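For reference, that rule-based half really is just a few library calls. A minimal example with the @faker-js/faker package (these are standard v8-era calls; the exact functions Multiple uses aren’t specified):

```typescript
import { faker } from "@faker-js/faker";

// Once you know what a field is, generating a plausible value is trivial.
const fakeUser = {
  name: faker.person.fullName(),           // e.g. "Robin Bergstrom"
  email: faker.internet.email(),           // e.g. "robin.bergstrom@example.net"
  address: faker.location.streetAddress(),
  birthDate: faker.date.birthdate().toISOString().slice(0, 10), // ISO "1980-01-13" style
};
```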

The challenge is in classifying the types of data. Swagger or HAR files tell you whether fields are a string, number, boolean, and so on. They don’t tell you whether that string is a name or an address, or whether a number is actually a date and how that date should be formatted. Those need to be inferred from context or approximated from other similar instances. This is where AI shines. It would be impractical to cover every situation with code. Instead, we make one small request to GPT-4, and it does a nearly perfect job of defining fields.

We scoped the synthetic data problem down as much as possible. Multiple sends each request to OpenAI, to balance efficiency and contextual responses. With the full request, the LLM can decide from context whether a field is, say, a company name or a person’s name, something that would be impossible to infer without the context. We also tell the AI how to use Faker, and which Faker functions are available. The AI then matches the available Faker functions to the API payload, and it can usually identify the fields where synthetic data is not needed.
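A minimal sketch of that classification step, assuming a small allowlist of Faker function names. The prompt, the helper, and the tiny allowlist are illustrative; the production prompt and function set aren’t published here:

```typescript
import OpenAI from "openai";
import { faker } from "@faker-js/faker";

const client = new OpenAI();

// The only Faker calls the model is allowed to choose from.
const ALLOWED: Record<string, () => string> = {
  "person.fullName": () => faker.person.fullName(),
  "company.name": () => faker.company.name(),
  "internet.email": () => faker.internet.email(),
  "location.streetAddress": () => faker.location.streetAddress(),
  "string.uuid": () => faker.string.uuid(),
};

// Ask GPT-4 to map each field of a captured request body to one of the
// allowed Faker functions, or to "keep" if the original value should stay.
async function classifyFields(
  body: Record<string, unknown>,
): Promise<Record<string, unknown>> {
  const response = await client.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "For each top-level field of the JSON request body, pick one of: " +
          `${Object.keys(ALLOWED).join(", ")}, or "keep" if the original ` +
          "value should be reused. Reply with only a JSON object mapping " +
          "field names to choices.",
      },
      { role: "user", content: JSON.stringify(body) },
    ],
  });

  let mapping: Record<string, unknown> = {};
  try {
    mapping = JSON.parse(response.choices[0]?.message?.content ?? "{}");
  } catch {
    // Garbage out: fall back to the captured values rather than guessing.
    return { ...body };
  }

  const synthetic: Record<string, unknown> = {};
  for (const field of Object.keys(body)) {
    const choice = mapping[field];
    // Rule-based guardrail: only allow-listed Faker calls ever run.
    synthetic[field] =
      typeof choice === "string" && choice in ALLOWED
        ? ALLOWED[choice]()
        : body[field];
  }
  return synthetic;
}
```

The shape of the guardrail is the point: the model only ever chooses from a fixed menu, and anything outside the menu falls back to the original captured value.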

Multiple's TestGen automatically generates synthetic data payloads
Multiple asks users to check the AI and code-generated request bodies—relying on users to add query strings for IDs and other data points it’s impossible to infer.

Then we involve the user. HAR files have sequences embedded, but Swagger files only list endpoints, so users need to sequence them. AI doesn’t know what needs to be tested; only the user does. And when a field needs something specific, like a user ID, we don’t try to get the AI to guess—we get the user to fill it in.

We pair the AI calls and user feedback loop with traditional, rule-based programming to validate every step of the way and keep the AI on the right track, then assemble the final test script with traditional backend code to verify the format and syntax.
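That last validation pass is plain, deterministic code. A toy sketch of the kind of check that can sit after the AI step; the field names and type model here are placeholders, not Multiple’s schema:

```typescript
// Before a generated body goes into a test script, rule-based code re-checks
// it against the types declared in the Swagger/HAR source. Anything that
// fails is flagged for the user instead of being silently shipped.
type FieldType = "string" | "number" | "boolean";

function validateBody(
  body: Record<string, unknown>,
  schema: Record<string, FieldType>,
): string[] {
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in body)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof body[field] !== expected) {
      problems.push(`wrong type for ${field}: expected ${expected}`);
    }
  }
  return problems; // an empty array means the AI's output passed every check
}
```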

The AI is not the feature. The feature, instead, is a business process, and the AI helps with some of the steps. The magic is in the workflow and the prompting.

We gave the AI very specific, scoped tasks, tasks that couldn’t be solved as easily with logic and rules. We built the tasks specifically where the LLM’s nondeterminism could be constrained and its inference would be an asset. That approach resulted in a better product than would have been possible without AI.

As a side benefit, our LLM-powered features are more future-proofed than a chatbot could ever be. Focused, scoped tasks are far more likely to work on an alternative large language model than larger, ambiguous tasks would be. We’re using OpenAI’s GPT-4 model today, but we could switch to Gemini, Anthropic’s Claude, or an open-source model, if needed, with little change.

It’s cheaper to run, too—no wasted credits on wide-ranging chat conversations. And more consistent—hallucinations are far less worrisome with a carefully defined prompt and rules-based validations. Best of all, it’s trustworthy—no more worrying that your AI chat will say something absurd, promising users something you cannot fulfill.

Start With Intent

If you think AI/LLMs can solve a specific problem in your app, you don't need a chatbot. Build the function in the backend and make it a Point A-to-Point B workflow so users (and the LLM) don't go off track. And if you don't have a specific problem you want to solve, adding AI/LLMs will only create confusion and dead ends.

The problem, perhaps, is in expectations. Give people a chatbot, an open-ended conversational interface, and they expect it can answer any question or complete any task. Give them a calculator, with digits and operators and little else, and they’ll expect to calculate values and nothing else. Form factors set expectations.

Calling an LLM a fuzziest logic API, by itself, changed how I think about AI. LLMs tend to overpromise and underdeliver, but they are better than traditional coding at problems with small amounts of ambiguity. Classification. Categorization. Filtering through what’s signal, what’s noise. Noticing deviations from the norm.

The thorniest problems for rule-based programming are worth retrying with that fuzziest logic API. Don’t give it open-ended scenarios, don’t feed it free text input. Put rules-based guardrails on both input and output; no garbage in, and catch any garbage output that shows up. 

Give AI a single task. Never trust; validate everything. Involve the user; ask them questions, then get them to verify output. Build the rest of your product as normal, with the fuzziest logic API as just another tool in your toolbelt.
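In code, that discipline can be as small as a wrapper. A generic sketch of the pattern (the names and shape are mine, not any particular library’s):

```typescript
// Fuzziest logic as just another dependency: one narrow question in, one
// validated answer out, and an explicit fallback when the answer doesn't
// pass the rule-based check.
async function askFuzzy<T>(
  ask: () => Promise<string>,        // the single, scoped LLM call
  parse: (raw: string) => T | null,  // rule-based validation of the output
  fallback: () => T,                 // e.g. ask the user, or use a default
): Promise<T> {
  try {
    const parsed = parse(await ask());
    if (parsed !== null) return parsed;
  } catch {
    // Treat API errors the same as invalid output: never trust, always validate.
  }
  return fallback();
}
```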

That, perhaps, is how LLMs start giving productivity gains to software. Not with open-ended chat, arguing with a robot over the output, but one silent, thoughtful, smarter feature at a time.

It’s a more thoughtful, deliberate way to design a product, with a fuzziest logic API.
