AI Model and Structure

Structure

There are four main components:

  • Game engine
  • BMD processor and database
  • Two language models:
    • Fast model
    • Good model

Fast and Good Language Models

Both models are autoregressive language models pre-trained on English-language texts and on the Game Book itself. Both are read-only during gameplay and can be swapped for a newer version without compromising any saved game state.

The Good model is a general-purpose language model that prioritizes output quality. It is used to generate text responses for the player, for example when the player talks to NPCs or when new story details are narrated. It will be either GPT-3 or some other commercially available model that can be licensed or accessed through an API. This model is queried whenever the game needs to produce text in proper English. It is also queried as a low-priority background task during important game events, where it provides an “alternative” output to that of the Fast model.

The Fast model is a fine-tuned language model that prioritizes small size and speed. It is based on the GPT-2 model and is programmed and trained by us on a dataset focused on human actions and responses. Its output is never presented directly to the player. It is queried for every action the player takes and for every event that happens in the world, usually several times per second. It is used to gradually change values in the BMD database based on the state of the world and the player’s actions.

BMD processor and database

BMD is short for Beliefs, Moods and Drives. It is a collection of attributes describing beliefs, mood, drives, itineraries, plans, opinions, and other variables that together define every “conscious” being in the game (NPCs, enemies, companions) – let’s call them “people” in the documentation.

The BMD processor is directly connected to the Game engine. It receives all events involving people and the world status (day-night cycle, environment, and map changes). These events are translated into simple English sentences (and/or tokens), combined with the appropriate data from the database and fed into the Fast language model. The output of the model is used to modify values in the database and, optionally, is translated into events that are fed back into the engine. If the events happening in the engine are important, the same steps are repeated with the Good language model, and its output can override the output of the Fast model.

It sounds strange, right? Maybe an example will help (this is an extremely, I mean EXTREMELY, simplified example):

  1. Player character enters the field of view of NPC X
  2. BMD processor is notified by the Game engine
  3. BMD processor pulls NPC X attributes from the database
  4. BMD processor generates a query: “NPC X is tired, working, outside house, Rea first time see, no information about Rea. How NPC X reacts?” and pushes it into the Fast language model.
  5. Fast language model returns: “NPC X working continues, ignore Rea”.
  6. BMD processor modifies the database by changing NPC X column regarding Rea from “first time see” to “ignore”.

And that’s basically the end of this event. NPC X in game will continue working as if nothing happened. Next time Rea comes near NPC X and enters its field of view, the query will contain “previously ignore” instead of “first time see”. Maybe the result will be the same or this time NPC X will greet Rea or attack Rea or something completely different.
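The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary layout of the database, the function names (build_query, apply_response), the set of known relations and the stubbed fast_model are all hypothetical and are not taken from the actual BMD processor.

```python
# Hypothetical sketch of steps 3-6; all names and the data layout are assumptions.

def build_query(npc, attributes, player, relation):
    """Combine NPC attributes and the stored relation to the player into a query."""
    parts = list(attributes) + [f"{player} {relation}"]
    return f"{npc} is {', '.join(parts)}. How {npc} reacts?"


def fast_model(query):
    """Stand-in for the Fast language model; returns a canned answer."""
    return "NPC X working continues, ignore Rea"


def apply_response(database, npc, player, response):
    """Store the first known relation mentioned in the response, if any."""
    known_relations = ("ignore", "greet", "attack")  # assumed value set
    words = response.replace(",", " ").split()
    for relation in known_relations:
        if relation in words:
            database[npc]["relations"][player] = relation
            return relation
    return None  # nothing recognizable: leave the database untouched


database = {
    "NPC X": {
        "attributes": ["tired", "working", "outside house"],
        "relations": {"Rea": "first time see"},
    }
}

row = database["NPC X"]
query = build_query("NPC X", row["attributes"], "Rea", row["relations"]["Rea"])
apply_response(database, "NPC X", "Rea", fast_model(query))
```

After this runs, the stored relation for Rea is “ignore”, so the next query built for NPC X would carry that value instead of “first time see”.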

The query in this example was extremely simplified compared to the query that is actually needed to get a usable result. Generated queries usually contain much more information about the NPC from the database, the time of day, the weather and other variables, so the reactions become more unique and, hopefully, more realistic.

The next example shows an NPC reacting to the player. NPC Y is a soldier guarding the city entrance. Rea is in enemy territory and is a wanted criminal (this information is stored in the database at that point in the game). Something like this happens when she enters the soldier’s field of view:

  1. Player character enters the field of view of NPC Y
  2. BMD processor is notified by the Game engine
  3. BMD processor pulls NPC Y attributes from the database
  4. BMD processor generates a query: “NPC Y is guard, guarding city entrance, at city entrance, Rea first time see, Rea wanted criminal, Rea enemy. How NPC Y reacts?” and pushes it into the Fast language model.
  5. Fast language model returns: “NPC Y shouts, get sword, Rea attack, Rea kill”.
  6. BMD processor modifies the database by changing the NPC Y row, column regarding Rea, from “first time see” to “attack”.
  7. BMD processor starts a background task in the Good language model with a query: “Guard attacks enemy at the city entrance with a sword with intent to kill. Enemy is a woman, wanted fugitive, dangerous. What guard shouts?”
  8. BMD processor sends an event to the AI controlling NPC Y’s movement, telling it to target Rea with a sword with intent to kill.
  9. Eventually the Good language model returns: “Stop her. Kill her at all costs.”
  10. BMD processor sends an event to the AI controlling NPC Y to shout.

As you can see, this time the NPC reacts to the presence of the player’s character. This example also illustrates the different purposes of the two language models. While the Fast model is used to control actions and status changes, the Good model is used to generate human-like dialogue. If you think this all sounds far-fetched, try putting those queries into a publicly available language model, for example GPT-3, and you will be surprised.
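The split between the two models in the guard example can be sketched as follows. This is a minimal illustration, not the actual implementation: both models are replaced by stubs that return the canned answers from the example, and handle_event and its parameters are hypothetical names. The point is only the control flow: the Fast model is called synchronously, while the Good model runs as a background task whose result arrives later.

```python
from concurrent.futures import ThreadPoolExecutor


def fast_model(query):
    """Stand-in for the Fast model (assumption: returns comma-separated actions)."""
    return "NPC Y shouts, get sword, Rea attack, Rea kill"


def good_model(query):
    """Stand-in for the Good model, which in the real game would be a slow remote call."""
    return "Stop her. Kill her at all costs."


def handle_event(fast_query, good_query, important, executor):
    """Return (actions, pending_dialogue).

    Actions are available immediately. pending_dialogue is a Future for
    important events and None otherwise.
    """
    actions = [part.strip() for part in fast_model(fast_query).split(",")]
    pending = executor.submit(good_model, good_query) if important else None
    return actions, pending


with ThreadPoolExecutor(max_workers=1) as pool:
    actions, pending = handle_event(
        "NPC Y is guard, ... How NPC Y reacts?",
        "Guard attacks enemy ... What guard shouts?",
        important=True,
        executor=pool,
    )
    # The engine can already act on `actions` here (steps 6 and 8).
    shout = pending.result()  # blocks until the Good model answers (step 9)
```

A thread pool is used here only because it is the shortest way to show a background task in Python; a real queue with priorities would be needed to honor the “low-priority” requirement.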

Figure: Example of OpenAI GPT-3 davinci-003 model - response is highlighted green

If you got this far, you probably have several questions.

For example, why complicate everything by using a human language "API" between program components? The BMD processor must be able to translate events happening in the game into English sentences, combine them with the attributes from the database and then – the most complicated part – parse the result, again in the form of English sentences, and perform the proper actions. Isn’t this a little bit unnecessary? Isn’t it possible to feed the game events directly into some different model?

Of course it is. Using actual English sentences as a form of API may seem strange and an extreme overhead. But it gives us one opportunity that we just couldn’t ignore: the ability to add new stories, quests, and lore just by writing them down in our favorite text editor. The language models are trained on real books, articles, and other English texts. If you have some new texts, you can re-train both models and use the new version immediately. Since both models used in the game are read-only, they can be swapped out whenever a new version is out.

Imagine the possibilities: to mod the game, you just need to write a short story or novel, train the model and that’s it. You will be constrained by the game rules and the existing objects and maps, but the story itself, new characters, new interactions, new extended lore, all of this can be created just by writing a story. This is something we really need to build so that we can explore the possibilities.

Will I really be able to role-play, or do I just need to follow some script? You should be able to role-play as much as you want. Both language models will be trained on the Game Book, so the past cannot be changed, and it will forever be a part of your character. But once the game starts, you can “use” the world as you see fit. You can be good or evil or something in between. You can fight tyrants or join them.

People in the game start with their own beliefs, goals, itineraries, and drives. But all these attributes can change based on interactions in the world, not only with you but with other people as well. We plan to use the language models to the fullest, so all attributes are also stored as text. This way, we are able to feed them into the language models and modify the database based on the results.

At this point, we must point out the strength of modern language models. For example, you can create a query: “John is a medieval blacksmith working. John is tired, angry, poor, hungry. It is Friday, evening, 6 p.m., sun is setting, snow is starting to fall, it is 6 degrees Celsius outside. What will John do next? How will John be?” and the result will be “John stops working, wash himself and go home. John will be very angry and hungry. John will eat food.” With this result, it is possible to perform some actions with the character in the game and update the database.

If you, as a player, interrupt John and, for example, give him a lot of money, some good food or some medicine that will help him rest, he will remember this (it will be stored in the database), and maybe in the future he will do something for you. This is just one example of how it is possible to shape the world by playing and interacting with people in the game. Maybe you will become so popular that when you decide you want to become the ruler, peasants will rise to support you. Or maybe you will become evil and kill everyone. In the end, maybe the whole continent will fight and despise you.

Even if you don’t do anything, the world will keep changing, because people will be interacting and doing stuff. Don’t forget that NPCs also communicate with each other using the Good language model and can easily spread rumors and new information, and even start a revolution all by themselves. No one can predict how the story will evolve. We just set the starting point and create good tools.

The important part is that none of this is scripted. All interactions are fed into language models that contain knowledge from millions of books and can predict how real humans would react to different stimuli. You only need to know how to ask properly. And this is the “magic” of our BMD processor: asking the right questions.

What we have written here only scratches the surface. Building queries is an extremely complex problem, especially if you need to include all important information about the current state of the world and fit it into several hundred words. We have experimented with it a lot and we consider this our know-how. How to create queries and how to filter and interpret results is what makes us unique.

Tokens vs. human language

Until now, we haven’t addressed an important question about tokens. Language models do not work on raw text: the text is first converted into tokens, the tokens are used as the input and output of the model, and the output tokens are converted back into human language. Every token is a unique number assigned to a word or a part of a word. Information about the position of the token in the sentence is usually supplied alongside it. There are several approaches to this, but in general it is a mapping between short strings and numbers.

In our BMD processor, we use both forms, tokens and human language, and mix them together. For example, all actions generated by the engine and fed into the language model are already in the form of tokens. The NPC conversations are, of course, in language form. We just didn’t want to complicate the description above by introducing this concept earlier.
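To make the “mapping between a short string and a number” concrete, here is a toy vocabulary in Python. It is a deliberately naive sketch and an assumption on our part: it splits on whitespace and assigns IDs in order of first appearance, whereas real tokenizers such as the byte-pair encoding used by GPT-2 split words into sub-word pieces and use a fixed, pre-built vocabulary. The class name ToyVocabulary is hypothetical.

```python
class ToyVocabulary:
    """Naive word-level tokenizer: one integer ID per distinct lowercase word."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def encode(self, text):
        """Convert text to a list of token IDs, growing the vocabulary as needed."""
        ids = []
        for word in text.lower().split():
            if word not in self.token_to_id:
                self.token_to_id[word] = len(self.id_to_token)
                self.id_to_token.append(word)
            ids.append(self.token_to_id[word])
        return ids

    def decode(self, ids):
        """Convert token IDs back to (lowercased) text."""
        return " ".join(self.id_to_token[i] for i in ids)


vocabulary = ToyVocabulary()
action_ids = vocabulary.encode("Rea attack Rea kill")  # [0, 1, 0, 2]
```

An engine action that is already expressed as IDs like these can skip the text step entirely, which is the mixing of the two forms described above.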

Challenges

Voice acting

Obviously, since all interactions with the NPCs and companions will be generated on the spot, it won’t be possible to record any dialogue in advance. Either the game will be text-only, or a speech generator must be used. If the speech generator is not perfect, it can break immersion for some players. This will need further research and testing.

Storage space

The Good language model will probably require around 60 GiB to 100 GiB of storage space. Its queries can also be demanding in terms of time, memory and CPU. It won’t be possible to use a local copy, especially on consoles. That means the game will require an Internet connection and will run its queries in the cloud. This will also mean a monthly subscription to cover the costs, but it shouldn’t be expensive.

Consoles and non-high-end PCs

Our aim is to run all components except the Good language model locally. That may prove impossible on consoles and perhaps on most PCs. For this reason, we will design all components with a communication interface that works over local sockets as well as TCP sockets. This way, it will be possible to offload more components to the cloud. The downside is that some components need to exchange quite a lot of data. Here is a graph with our estimate:

Figure: Components and required data rates

Every component except the game itself can be offloaded. We will work on lowering data transfer requirements.
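One way to keep a component indifferent to where its peer runs is to describe the peer by an endpoint string and choose the socket family from it. The sketch below assumes a made-up endpoint syntax ("unix://" for local sockets, "tcp://" for remote ones) and hypothetical function names; it is not the project’s actual interface. Note that local (AF_UNIX) sockets are available in Python on Unix-like systems.

```python
import socket
from urllib.parse import urlparse


def parse_endpoint(endpoint):
    """Turn 'unix:///path' or 'tcp://host:port' into (kind, address)."""
    parsed = urlparse(endpoint)
    if parsed.scheme == "unix" and parsed.path:
        return "unix", parsed.path
    if parsed.scheme == "tcp" and parsed.hostname and parsed.port:
        return "tcp", (parsed.hostname, parsed.port)
    raise ValueError(f"unsupported endpoint: {endpoint!r}")


def connect(endpoint):
    """Open a stream socket to a component, local or remote."""
    kind, address = parse_endpoint(endpoint)
    # AF_UNIX is only looked up when a local endpoint is requested.
    family = socket.AF_UNIX if kind == "unix" else socket.AF_INET
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.connect(address)
    return sock
```

With this shape, moving the BMD processor from the player’s machine to the cloud would be a configuration change, for example from "unix:///tmp/bmd.sock" to "tcp://bmd.example:7000" (both values are placeholders).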

BMD database corruption

The Fast model can, under some circumstances, return nonsensical output. If the output is a proper English sentence that can be parsed, it will be used to update the database. This can corrupt an NPC’s attributes, and the NPC will then behave strangely or even downright psychotically. If this happens too often, or to important NPCs like rulers or companions, the game experience will be ruined. A sanity check for the results of the Fast model should be designed and implemented, but right now it sits deep in the R&D queue and seems extremely complex. It may be omitted from the first releases.
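A full sanity check is an open problem, but a very partial first line of defense could be a table of allowed value transitions, so that a parsed but absurd update is rejected before it reaches the database. The sketch below is only that: the relation values, the transition table and the function name are assumptions made for illustration, and it would not catch output that is plausible in form but wrong in context.

```python
# Assumed relation values and the changes we are willing to accept in one step.
ALLOWED_TRANSITIONS = {
    "first time see": {"ignore", "greet", "attack"},
    "ignore": {"ignore", "greet", "attack"},
    "greet": {"greet", "ignore"},
    "attack": {"attack", "ignore"},
}


def is_sane_update(current, proposed):
    """Accept an update only if it is a known value reachable from the current one."""
    if proposed not in ALLOWED_TRANSITIONS:
        return False  # value the game does not know at all
    return proposed in ALLOWED_TRANSITIONS.get(current, set())


def update_relation(relations, player, proposed):
    """Apply the update if sane; otherwise keep the old value and report failure."""
    if is_sane_update(relations[player], proposed):
        relations[player] = proposed
        return True
    return False
```

In this toy table an NPC cannot jump from “greet” straight to “attack”; a rejected update like that could be dropped or escalated to the Good model, in line with its role of overriding the Fast model on important events.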

Recursive queries

When creating queries for the language models, all context should be included in the query. For example: “Mary is a widow, has four small children, is very popular, people help her, give her money. Mary’s house burned down. What should Mary do next?” And the answer could be something like: “Mary should ask for donations and buy a new house.” This is an answer that almost every English speaker can understand. But for the algorithm, it is really challenging. It is possible to parse the result and extract two actions: collect donations and buy a new house. The problem is that the programmer probably didn’t program a “collect donations” action between NPCs. Thankfully, this can be solved with the language model by asking further questions, like “define asking for donations”. After a while, Mary will be going from NPC to NPC asking them to give her money for the new house, and the NPCs will give her some money. Both actions, NPCs talking to each other and NPCs exchanging money, are, thankfully, programmed and understood by the game.

The second challenge is buying the house. The act of buying a house is quite easy and can be implemented in the game. But what type of house should Mary buy? Large, small, cheap, expensive, near the city center or in the suburbs? Again, we need to query a language model and ask about every house and whether it is suitable for Mary.

In the end, we will get the answer and Mary will buy a suitable house, but it can take dozens of queries to the Good language model, each asking it to describe an action, until we get to a part the engine can understand. And this can happen for hundreds of NPCs at a time, causing the queue to be minutes or even hours long.

The only way to mitigate this is to program hundreds of different actions into the game, do a lot of testing, cache the most common answers and put limits on recursion. There is only one way to get this right: testing, testing and more testing.
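Two of these mitigations, caching common answers and limiting recursion, can be sketched together. Everything here is illustrative: the set of engine actions, the canned “definitions” that stand in for Good model answers, the depth limit and the function names are assumptions, not the game’s real data.

```python
from functools import lru_cache

ENGINE_ACTIONS = {"talk to npc", "give money", "buy house"}  # assumed built-in actions
MAX_DEPTH = 3  # assumed recursion limit

# Canned decompositions standing in for answers to "define <action>" queries.
_CANNED_DEFINITIONS = {
    "ask for donations": ["talk to npc", "give money"],
    "rebuild life": ["ask for donations", "buy house"],
}


@lru_cache(maxsize=1024)
def define_action(action):
    """Stand-in for a Good model query; identical questions are answered from cache."""
    return tuple(_CANNED_DEFINITIONS.get(action, ()))


def resolve(action, depth=0):
    """Expand an action into engine-level actions, giving up past MAX_DEPTH."""
    if action in ENGINE_ACTIONS:
        return [action]
    if depth >= MAX_DEPTH:
        return []  # too deep: drop this branch instead of querying further
    steps = []
    for sub_action in define_action(action):
        steps.extend(resolve(sub_action, depth + 1))
    return steps


plan = resolve("rebuild life")  # ["talk to npc", "give money", "buy house"]
```

Because define_action is cached, a second NPC that needs “ask for donations” costs no additional model query, and define_action.cache_info() shows how often that happens.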

FROZEN BY THE SUN - Single player story-rich AI generated game

Disclaimer: This site and product is in no way affiliated with OpenAI or Midjourney