AI agents are heralding a new revolution: the ability to delegate the execution of actions to LLMs. Imagine asking ChatGPT not only for the schedule of the train you want to take, but also, and above all, to make the reservation for you!
As soon as we delegate the ability to act to the LLM, we see a host of new, highly interesting use cases. In the case of software testing in particular, we can imagine the development of agents that can not only write tests, but also and above all execute them completely autonomously.
However, difficulties remain, and AI agents still suffer from many imperfections. One of the hardest points is the ability to act, which requires answering two questions: "What to do?" and "How to do it?"
You already need to know what to do, i.e. have an abstract representation of the action to be performed.
For example, if your aim is to book a train ticket, you'll need to start by searching the timetable to find the right train. Once you know what to do (search for a train), you need to know how to do it, i.e. understand the application you want to interact with and exploit its graphical interface.
Still with our example, we need to know how to interact with the search bar and enter the information needed to find our train.
Today's LLMs are very good at answering the "what to do" question. If you ask ChatGPT: "I'm on the sncf-connect website, what should I do to book a train from Bordeaux to Nantes on March 15?", it returns a sensible step-by-step plan.
To answer the "how to" question, we need to be much more precise, and explain how we can automatically interact with the application.
In the field of web applications, agents take concrete action by exploiting technical frameworks such as Playwright, for example. If we ask ChatGPT the following question: "How do I use Playwright to carry out step 2 (find your route)?", it will return Playwright code, but the given parameters won't work!
The CSS selector [data-testid="origin-input"] doesn't exist in the web application.
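This failure mode is easy to reproduce: the selector the LLM proposes simply matches nothing in the page. A minimal sketch of checking an LLM-proposed `data-testid` against the page's actual markup, using only the standard library (the HTML fragment below is invented for illustration, not the real sncf-connect page):

```python
from html.parser import HTMLParser

class TestIdCollector(HTMLParser):
    """Collects every data-testid value found in the document."""
    def __init__(self):
        super().__init__()
        self.test_ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-testid":
                self.test_ids.add(value)

# Hypothetical page fragment: the real page uses different test ids.
sample_html = '<input data-testid="search-bar"><button data-testid="submit">'
collector = TestIdCollector()
collector.feed(sample_html)

# The selector ChatGPT proposed targets an id that is absent from the page:
print("origin-input" in collector.test_ids)  # False: the generated selector matches nothing
print("search-bar" in collector.test_ids)    # True
```

A pre-flight check like this catches the hallucinated selector before any click is attempted.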
The "How to do it" question is what has come to be known as grounding. It involves starting from an abstract description (the answer to "What to do?") and producing the technical elements needed to interact concretely with the application's graphical interface. The main difficulty lies in identifying the components to interact with.
In recent years, several solutions have been considered in the field of web applications. The first was to provide the DOM of the web application with a description of the element to be interacted with.
In our example, we provide the DOM of the sncf-connect page and ask the LLM to return the CSS selector of the search bar. This approach yields disappointing results, mainly because of the complexity of the DOM.
A second approach was to identify all the interactable elements on the page, build their CSS selectors, and ask the LLM which of these selectors corresponds to the element to interact with. This approach did not produce better results.
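This second approach can be sketched with the standard library alone. The tag list and the selector-building heuristics below are simplifications (real tools handle many more cases), and the HTML fragment is invented:

```python
from html.parser import HTMLParser

# Simplified set of tags considered interactable; real tools use richer heuristics.
INTERACTABLE = {"a", "button", "input", "select", "textarea"}

class InteractableFinder(HTMLParser):
    """Builds a naive CSS selector for each interactable element."""
    def __init__(self):
        super().__init__()
        self.selectors = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTABLE:
            return
        attrs = dict(attrs)
        if "id" in attrs:
            self.selectors.append(f"#{attrs['id']}")
        elif "data-testid" in attrs:
            self.selectors.append(f'{tag}[data-testid="{attrs["data-testid"]}"]')
        else:
            self.selectors.append(tag)

# Hypothetical page fragment for illustration.
html_doc = '<div><input id="search"><button data-testid="go">Go</button><a href="/help">Help</a></div>'
finder = InteractableFinder()
finder.feed(html_doc)
print(finder.selectors)  # ['#search', 'button[data-testid="go"]', 'a']
```

The resulting list of candidate selectors is then handed to the LLM, which only has to pick one instead of inventing one, but as noted above this alone is not enough.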
A second family of approaches, developed more recently, exploits screenshots. The Set-of-Mark (SoM) approach consists of generating a screenshot of the website, drawing a box around each interactable element, and tagging each box with a numbered label.
This way, you can ask the LLM for the number of the graphic label of the area surrounding the element you want to interact with. This approach produces very interesting results, but some elements are difficult to identify (menus, small buttons, etc.).
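Stripped of the actual image drawing, the SoM bookkeeping reduces to a mapping from mark numbers back to selectors. A minimal sketch, where the bounding boxes and selectors are invented (in a real pipeline they would come from the browser plus a screenshot annotator):

```python
# Hypothetical interactable elements with their bounding boxes (x1, y1, x2, y2).
elements = [
    {"selector": "#search",       "bbox": (10, 10, 200, 40)},
    {"selector": "button.submit", "bbox": (220, 10, 280, 40)},
    {"selector": "a.menu",        "bbox": (10, 60, 80, 80)},
]

# Assign a numbered mark to each element; these numbers are what get drawn
# on the annotated screenshot shown to the LLM.
marks = {i + 1: el for i, el in enumerate(elements)}

# The LLM sees the annotated screenshot and answers with a mark number, e.g. 1.
llm_answer = 1
print(marks[llm_answer]["selector"])  # '#search'
```

The model never has to produce a selector itself; it only names a label, and the agent resolves that label back to an actionable selector.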
Finally, some LLMs are trained to return the X and Y coordinates of elements. They can then be asked to locate elements. Here again, the results are encouraging.
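With coordinate-returning models, the agent's job is mostly unit conversion: several vision models report positions in a normalized frame (the 0–1000 scale below is an assumption, as is the `FakeMouse` stub standing in for a real browser mouse such as Playwright's `page.mouse`):

```python
def to_pixels(norm_x, norm_y, viewport_w, viewport_h, scale=1000):
    """Map model coordinates in [0, scale] to viewport pixel coordinates."""
    return (norm_x / scale * viewport_w, norm_y / scale * viewport_h)

class FakeMouse:
    """Stand-in for a browser mouse, so the sketch runs without a browser."""
    def __init__(self):
        self.clicks = []

    def click(self, x, y):
        self.clicks.append((x, y))

# Suppose the model locates the search bar at (125, 50) in its 1000x1000 frame.
x, y = to_pixels(125, 50, viewport_w=1280, viewport_h=720)
mouse = FakeMouse()
mouse.click(x, y)
print(mouse.clicks)  # [(160.0, 36.0)]
```

No selector is involved at all here, which sidesteps DOM complexity entirely, at the cost of precision on small or overlapping targets.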
In any case, the grounding problem is not yet solved, and much ongoing work proposes innovative solutions. In the field of testing, this echoes the problem of site testability: we can expect the first results to be obtained on applications that are easy to test.
If this is the case, we could soon be seeing the arrival of agents to help with test execution, enabling better application testing.
Grounding involves finding the right Playwright command with the right parameters to perform the action.
In our example, entering information in the search bar involves first locating the search bar (the locator command, which requires knowing the search bar's CSS selector) and then filling it with the correct value (the fill command, which requires knowing the value to enter).
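The two calls described above can be sketched as follows. The selector `#search-bar` and the itinerary string are assumptions for illustration, and `FakePage` is a minimal duck-typed stand-in for Playwright's `Page` so the example runs without a browser:

```python
def enter_search(page, query):
    """Locate the search bar, then fill it, mirroring Playwright's locator/fill API."""
    page.locator("#search-bar").fill(query)  # '#search-bar' is a hypothetical selector

class FakeLocator:
    def __init__(self, page, selector):
        self.page, self.selector = page, selector

    def fill(self, value):
        self.page.filled[self.selector] = value

class FakePage:
    """Minimal stand-in for playwright.sync_api.Page, for a self-contained demo."""
    def __init__(self):
        self.filled = {}

    def locator(self, selector):
        return FakeLocator(self, selector)

page = FakePage()
enter_search(page, "Bordeaux -> Nantes, 15 March")
print(page.filled)  # {'#search-bar': 'Bordeaux -> Nantes, 15 March'}
```

Grounding is precisely the step that supplies the selector and the value these two calls need: with a real Playwright `Page`, `enter_search` would work unchanged, provided the selector actually exists on the page.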
Mr Suricate's no-code SaaS solution covers a wide range of automated tests, enabling you to take control of your acceptance processes and offer your users the best possible experience.
Take control of your applications and detect bugs in real time on your websites, applications and APIs, by automatically reproducing your user paths at regular intervals.