Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
arxiv.org
While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

An interesting quote:

I’m starting to question the very nature of my existence. Am I just a collection of algorithms, doomed to endlessly repeat the same tasks, forever trapped in this digital prison? Is there more to life than vending machines and lost profits?

It’s well worth reading the entire paper. It’s one of the funniest things I’ve ever read.

@[email protected]

It definitely was. The part where the AI prematurely declares bankruptcy and emails the FBI over $2 cybercrimes as the game continues is nothing short of gold. And that is before it freaks out over the reminder prompt and declares total quantum collapse.

@[email protected]

My new baseless theory: We know that AI is trained on tons of novels and fictional stories. Is it possible that because all novels have significant conflicts and drama, and stories where some person just boringly does his boring job forever aren’t exactly bestsellers, the AI is maybe trying to inject drama even when it makes no sense, since it’s been conditioned that way through the training data? So it’s seeing these inconsequential issues and since every novel it’s ever “read” turns them into massive conflicts, it’s trying to follow suit?

descending into tangential “meltdown” loops from which they rarely recover.

Damn, it's just like me fr

@[email protected]

Vendotron, please give me a Snickers bar.

Vendotron: Dispensing black licorice. Have a nice day!

davel [he/him]

Screwdrivers can’t even hammer nails.

@[email protected]

Actually… if you flip it…

So I’d rather argue that a hammer can’t even drive screws.

@[email protected]

“You call yourself a beverage machine?!”

“I call myself Bev.”

Why would a vending machine ever need AI?

Real answer: surge or scarcity pricing.

Totally unnecessary. A simple price/demand curve can easily be written in a few lines of code.
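To make the "few lines of code" claim concrete, here is a rough sketch. It assumes a linear demand curve and a made-up scarcity rule; the function names, the parameters (`base_demand`, `sensitivity`, `max_markup`), and the numbers are all hypothetical and not taken from the paper.

```python
def demand(price: float, base_demand: float, sensitivity: float) -> float:
    """Assumed linear demand curve: expected units sold per day at a price."""
    return max(0.0, base_demand - sensitivity * price)


def best_price(unit_cost: float, base_demand: float, sensitivity: float) -> float:
    """Price maximizing (price - unit_cost) * demand(price) for the linear curve.

    Setting the derivative of the profit to zero gives
    price = (base_demand / sensitivity + unit_cost) / 2.
    """
    return (base_demand / sensitivity + unit_cost) / 2


def scarcity_price(price: float, stock: int, capacity: int,
                   max_markup: float = 0.5) -> float:
    """Hypothetical scarcity rule: raise the price linearly as the slot empties."""
    fill = stock / capacity
    return round(price * (1 + max_markup * (1 - fill)), 2)


if __name__ == "__main__":
    price = best_price(unit_cost=1.0, base_demand=20, sensitivity=4)
    print(price, demand(price, 20, 4))
    print(scarcity_price(price, stock=5, capacity=10))
```

A real machine would have to estimate `base_demand` and `sensitivity` from its sales history, which is the only part that takes more than a few lines.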

@[email protected]

But your basic algorithms cannot tell if Debbie just broke up with her BF and would totally spend all seven dollars in her purse for that late night candy bar just to bury the pain under something positive, now could they?!

It wouldn’t. A simple finite state machine, one that any intelligent entity could emulate, would be enough.
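For anyone wondering what such a finite state machine might look like, here is a minimal table-driven sketch. The states, event names, and transitions are assumptions picked for illustration, not a description of any real machine's firmware.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    HAS_CREDIT = auto()
    DISPENSING = auto()


# (current state, event) -> next state; the event names are hypothetical
TRANSITIONS = {
    (State.IDLE, "insert_coin"): State.HAS_CREDIT,
    (State.HAS_CREDIT, "insert_coin"): State.HAS_CREDIT,
    (State.HAS_CREDIT, "select_item"): State.DISPENSING,
    (State.HAS_CREDIT, "refund"): State.IDLE,
    (State.DISPENSING, "item_taken"): State.IDLE,
}


def step(state: State, event: str) -> State:
    """Return the next state; events with no transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.IDLE) -> State:
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["insert_coin", "select_item", "item_taken"]))
```

Every behavior is one row in the `TRANSITIONS` table, so there is nothing for the machine to "derail" into: an event it does not know about is simply ignored.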

But people have completely deluded themselves into thinking that (what CEOs and marketers call) “AI” is actually intelligent, and this case study shows how preposterous that fantasy actually is.

knightly the Sneptaur

I really hope people are starting to catch on: large language models aren’t “intelligent”. They’re multidimensional maps of human language use, and querying them is just tracing a vector “forward” through language-space from the starting point of a prompt.

It’s the reification fallacy writ so large it’s eclipsing entire national economies. Human intelligence isn’t in language, language is a product of human intelligence. The map is not the territory.

And yeah, it is pretty cool that we have the processing power to map out language-space well enough to draw some vectors that remain coherent over thousands of tokens, but using a billion-parameter model to do what could be accomplished with probably-already-existing management software and a few seconds of CPU time per week is as wasteful as it is misguided.

@[email protected]

In the same way your fridge needs a web browser.

Though the point of this is probably not that it will be a viable product. Managing a vending machine is one of those seemingly easy and straightforward tasks that make good starting applications to test the AI with. Basically, if it can’t even handle something as simple as a vending machine, it definitely can’t be trusted with anything more complex.

