There are four main components:
Both models are designed specifically to understand and generate English text, as they were pre-trained on English-language texts and on the game book itself. They are read-only during gameplay: they are only used to generate text responses and are never modified. They can, however, be swapped out for newer versions without affecting the saved game state.
The Good model is a general-purpose language model focused on producing high-quality output. It is used when the game needs to generate text responses for the player, such as when interacting with non-player characters (NPCs) or when new story details are narrated. It will be based either on the GPT-3.5 (or ChatGPT) model or on some other commercially available model accessible through an API. This model will be queried when the game needs to generate text in proper English, and it will also be queried in the background during important game events to provide alternative output to compare against the Fast model.
The Fast model, on the other hand, is a fine-tuned language model designed to be smaller and faster. It will be based on the GPT-2 model and will be built and trained by the game developers on a dataset focused specifically on human actions and responses. Its output is never presented directly to the player, but it is queried for every action the player takes or that occurs in the game world, usually several times per second. It is used to gradually change values in the BMD database based on the state of the game world and the player's actions.
The BMD, or Beliefs, Moods, and Drives, is a system that describes the characteristics and attributes of all conscious beings in the game, including NPCs, enemies, and companions. It holds information about their beliefs, moods, drives, plans, opinions, and other variables that define their behavior. The BMD processor is connected to the game engine and receives all events involving these conscious beings and the game world, including changes in the day-night cycle, environment, and map. These events are translated into simple English sentences, combined with data from the database, and fed into the Fast language model. The model's output is used to modify values in the database and, if necessary, is translated into events that are fed back into the game engine. If the events happening in the game are particularly important, the same process is repeated with the Good language model, which can override the Fast model's output if necessary.
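The loop just described can be sketched in a few lines. This is a minimal illustration only: the function names (`translate_event`, `build_query`, `process_event`) and the field layout are assumptions for the sketch, not the actual engine API, and the model is stubbed out with a lambda.

```python
# Minimal sketch of the BMD processing loop: game event -> English
# sentence -> query with NPC attributes -> Fast model -> raw reply.
# All names and fields here are illustrative assumptions.

def translate_event(event: dict) -> str:
    """Turn a raw game-engine event into a simple English sentence."""
    return f"{event['subject']} {event['verb']} {event['object']}."

def build_query(npc: dict, sentence: str) -> str:
    """Combine the event sentence with the NPC's stored attributes."""
    attrs = ", ".join(f"{k}: {v}" for k, v in npc.items())
    return f"NPC state ({attrs}). Event: {sentence} What happens next?"

def process_event(npc: dict, event: dict, query_model) -> str:
    """Feed one event through the model and return its raw text reply.
    query_model stands in for a call to the fine-tuned Fast model."""
    sentence = translate_event(event)
    return query_model(build_query(npc, sentence))

# Example with a stubbed-out model:
npc = {"name": "NPC X", "mood": "calm", "drive": "guard the road"}
event = {"subject": "Rea", "verb": "enters", "object": "the field of view"}
print(process_event(npc, event, lambda q: "NPC X ignores Rea."))
```

The real processor would then parse the reply and write any resulting attribute changes back to the BMD database.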
It sounds strange, right? Maybe an example will help (this is an extremely, and I mean EXTREMELY, simplified example):
And that’s basically the end of this event. NPC X in the game will continue working as if nothing happened. The next time Rea comes near NPC X and enters its field of view, the query will contain “previously ignore” instead of “first time see”. Maybe the result will be the same, or maybe this time NPC X will greet Rea, attack Rea, or do something completely different.
The query used in this example was greatly simplified compared to the actual query required to obtain a useful result. Generated queries typically include a vast amount of data about the NPC from the database, plus the time of day, weather conditions, and various other variables. This allows the NPC's reactions to be more unique and realistic.
The next example shows an NPC reacting to the player. NPC Y is a soldier guarding the city entrance. Rea is in enemy territory and is a wanted criminal (this information is stored in the database at that point in the game). Something like this happens when she enters the soldier’s field of view:
As you can see, this time the NPC reacts to the presence of the player’s character. This example also illustrates the different purposes of the two language models: while the Fast model controls actions and status changes, the Good model generates human-like dialogue. If you think this all sounds far-fetched, try putting those queries into some publicly available language model, for example GPT-3, and you will be surprised.
Figure: Example of the OpenAI GPT-3 text-davinci-003 model; the response is highlighted in green
If you got this far, you probably have several questions. For example, why complicate everything by using a human-language "API" between program components? The BMD processor must be able to translate events happening in the game into English sentences, combine them with the attributes from the database, and then, the most complicated part, parse the result, again in the form of English sentences, and perform the proper actions. Isn’t this a little unnecessary? Isn’t it possible to feed the game events directly into some other kind of model? Of course it is. Using actual English sentences as a form of API may seem strange and like extreme overhead. But it gives us one opportunity we just couldn’t ignore: the ability to add new stories, quests, and lore just by writing them down in our favorite text editor.

The language models are trained on real books, articles, and other English texts. If you have new texts, you can re-train both models and use the new version immediately. Since both models used in the game are read-only, they can be swapped out whenever a new version is ready. Imagine the possibilities: to mod the game, you just need to write a short story or novel, train the model, and that’s it. You will be constrained by the game rules and the existing objects and maps, but the story itself, new characters, new interactions, new extended lore, all of this can be created just by writing a story. This is something we simply have to build so we can explore the possibilities.
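The "most complicated part", parsing an English reply back into engine actions, can be illustrated with a toy keyword matcher. The action names and keyword lists below are invented for the sketch; the real BMD processor's parsing is far more involved.

```python
# Toy parser mapping a model's English reply back onto engine actions.
# KNOWN_ACTIONS and its keyword rules are illustrative assumptions.

KNOWN_ACTIONS = {
    "greet": ["greet", "say hello", "welcome"],
    "attack": ["attack", "draw sword", "fight"],
    "ignore": ["ignore", "pay no attention", "look away"],
}

def parse_reply(reply: str) -> list:
    """Extract engine-understood actions from a free-text model reply."""
    reply = reply.lower()
    return [action
            for action, keywords in KNOWN_ACTIONS.items()
            if any(kw in reply for kw in keywords)]

print(parse_reply("NPC X will ignore Rea and look away."))   # ['ignore']
print(parse_reply("The guard attacks the intruder!"))        # ['attack']
```

Anything that matches a known action is forwarded to the engine; anything that does not is a candidate for the follow-up "define ..." queries discussed later.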
Will I be able to really role-play, or do I just need to follow some script? You should be able to role-play as much as you want. Both language models will be trained on the Game Book, so the past cannot be changed and will forever be a part of your character. But once the game starts, you can “use” the world as you see fit. You can be good, evil, or something in between. You can fight tyrants or join them. People in the game all have their starting beliefs, goals, itineraries, and drives, but all these attributes can change based on interactions in the world, not only with you but with other people as well.

We plan to use the language models to the fullest, so all attributes are also stored as text. This way, we are able to feed them into the language models and modify the database based on the results. At this point, we must point out the strength of modern language models. For example, you can create a query: “John is a medieval blacksmith working. John is tired, angry, poor, hungry. It is Friday, evening, 6 p.m., sun is setting, snow is starting to fall, it is 6 degrees Celsius outside. What will John do next? How will John be?” and the result will be something like “John stops working, washes himself, and goes home. John will be very angry and hungry. John will eat food.” With this result, it is possible to perform some actions with the character in the game and update the database.

If you, as a player, interrupt John and, for example, give him a lot of money, some good food, or some medicine that will help him rest, he will remember this (it will be stored in the database) and maybe in the future he will do something for you. This is just one example of how it is possible to shape the world by playing and interacting with the people in the game. Maybe you will become so popular that when you decide you want to become the ruler, peasants will rise to support you. Or maybe you will become evil and kill everyone.
In the end, maybe the whole continent will fight and despise you. Even if you don’t do anything, the world will keep changing, because people will keep interacting and doing things. Don’t forget that NPCs also communicate with each other using the Good language model and can easily spread rumors and new information, or even start a revolution all by themselves. Basically, no one can predict how the story will evolve. We just set the starting point and create good tools.
The important part is that none of this is scripted. All interactions are fed into the language models, which contain knowledge from millions of books and can predict how real humans would react to different stimuli. You only need to know how to ask properly. And this is the “magic” of our BMD processor: asking the right questions.
What we have written here only scratches the surface. Building queries is an extremely complex problem, especially if you need to include all the important information about the current state of the world and fit it into several hundred words. We have experimented with it a lot, and we consider this our know-how. How to create queries and how to filter and interpret results: this is what makes us unique.
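To make the query-building step concrete, here is a toy assembler in the spirit of the "John the blacksmith" query quoted earlier. The field names and the sentence template are assumptions for illustration; a real query would pack in far more of the world state.

```python
# Toy query builder: serialize stored NPC and world attributes into a
# natural-language question. Field names and template are assumptions.

def build_npc_query(npc: dict, world: dict) -> str:
    """Assemble a query string from database attributes."""
    facts = f"{npc['name']} is a {npc['occupation']} working."
    state = f"{npc['name']} is {', '.join(npc['moods'])}."
    scene = (f"It is {world['day']}, {world['time']}, {world['weather']}, "
             f"it is {world['temp_c']} degrees Celsius outside.")
    question = f"What will {npc['name']} do next? How will {npc['name']} be?"
    return " ".join([facts, state, scene, question])

john = {"name": "John", "occupation": "medieval blacksmith",
        "moods": ["tired", "angry", "poor", "hungry"]}
world = {"day": "Friday", "time": "evening, 6 p.m.",
         "weather": "snow is starting to fall", "temp_c": 6}
print(build_npc_query(john, world))
```

Because the attributes already live in the database as text, assembling the query is mostly a matter of choosing which facts fit into the word budget.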
Obviously, since all interactions with NPCs and companions will be generated on the spot, it won’t be possible to record any dialogue. Either the game will be text-only, or one of the voice generators must be used. If the speech generator is not perfect, it can break immersion for some players. This will need further research and testing.
The Good language model will probably require around 60 GiB to 100 GiB of storage space. The queries can also be time-, memory-, and CPU-consuming. It won’t be possible to use a local copy, especially on consoles. That means the game will require an Internet connection, with queries running in the cloud. This will also mean a monthly subscription to cover the costs, but it shouldn’t be expensive.
Our aim is to run all components except the Good language model locally. That may prove impossible on consoles and perhaps on most PCs. For this reason, we will design all components with a communication interface that works over local sockets as well as TCP sockets. This way, it will be possible to offload more components to the cloud. The downside is that some components need to exchange quite a lot of data. Here is a graph with our estimates:
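A transport layer like this can be sketched with the standard `socket` module. The endpoint string format and the length-prefixed framing below are assumptions for the sketch, not the actual protocol; the point is that a component connects the same way whether its peer is local or in the cloud.

```python
# Sketch of a transport layer usable over either a local Unix socket or
# TCP, so a component can be offloaded to the cloud unchanged.
# Endpoint format and message framing are illustrative assumptions.
import socket

def connect(endpoint: str) -> socket.socket:
    """endpoint is either 'unix:/path/to.sock' or 'tcp:host:port'."""
    if endpoint.startswith("unix:"):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(endpoint[len("unix:"):])
    elif endpoint.startswith("tcp:"):
        _, host, port = endpoint.split(":")
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, int(port)))
    else:
        raise ValueError(f"unknown endpoint: {endpoint}")
    return s

def send_query(sock: socket.socket, query: str) -> None:
    """Length-prefixed framing so queries of any size survive the wire."""
    data = query.encode("utf-8")
    sock.sendall(len(data).to_bytes(4, "big") + data)
```

With this split, moving a component to the cloud is just a matter of changing its endpoint string in the configuration.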
Figure: Components and required data rates
Every component except the game itself can be offloaded. We will work on lowering data transfer requirements.
The Fast model can, under some circumstances, return nonsensical output. If that output is a proper English sentence that can be parsed, it will still be used to update the database. This can corrupt an NPC's attributes, making it behave strangely or even outright psychotically. If this happens too often, or to important NPCs like rulers or companions, the game experience will be ruined. A sanity check for the Fast model's results should be designed and implemented, but right now it is deep in the R&D queue and looks extremely complex. It may be omitted from the first releases.
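A crude first-pass version of such a sanity check is easy to imagine, even if the full problem is hard. The rules below (allowed mood values, value ranges, string limits) are invented for illustration; the real check would be much richer.

```python
# Minimal sanity filter applied to a proposed database update parsed
# from Fast-model output. All rules here are illustrative assumptions.

ALLOWED_MOODS = {"calm", "angry", "happy", "afraid", "tired"}

def sane(update: dict) -> bool:
    """Reject updates with unknown mood values, out-of-range numbers,
    or obviously malformed text fields."""
    mood = update.get("mood")
    if mood is not None and mood not in ALLOWED_MOODS:
        return False
    aggression = update.get("aggression")
    if aggression is not None and not (0.0 <= aggression <= 1.0):
        return False
    for value in update.values():
        if isinstance(value, str) and (len(value) > 200 or not value.isascii()):
            return False
    return True

print(sane({"mood": "angry", "aggression": 0.7}))      # True
print(sane({"mood": "psychotic", "aggression": 9.0}))  # False
```

Rejected updates would simply be dropped, leaving the NPC's attributes untouched until the next query.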
When creating queries for the language models, all relevant context should be included in the query. For example: “Mary is a widow, has four small children, is very popular, people help her, give her money. Mary’s house burned down. What should Mary do next?” And the answer could be something like: “Mary should ask for donations and buy a new house.” This is an answer almost every English speaker can understand, but for an algorithm it is really challenging. It is possible to parse the result and extract two actions: collect donations and buy a new house. The problem is that the programmer probably didn’t implement a “collect donations” action between NPCs. Thankfully, this can be solved using the language model by asking further questions, like “define asking for donations”. After a while, Mary will be going from NPC to NPC asking them to give her money for a new house, and the NPCs will give her some. Both actions, NPCs talking to each other and NPCs exchanging money, are, thankfully, programmed and understood by the game.
The second challenge is buying the house. The act of buying a house is quite easy to implement in the game. But what type of house should Mary buy? Large, small, cheap, expensive, near the city center or in the suburbs? Again, we need to query a language model about every house and ask whether it is suitable for Mary.
In the end, we will get the answer and Mary will buy a suitable house, but it can take dozens of queries to the Good language model, each asking it to describe an action, before we reach something the engine can understand. And this can happen for hundreds of NPCs at a time, causing the queue to grow minutes or even hours long.
The only ways to mitigate this are to program hundreds of different actions into the game, to cache the most common answers, and to put restraints on the recursion. There is only one way to get this right: testing, testing, and more testing.
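The recursion restraint can be sketched as a depth-limited decomposition loop, in the spirit of the "Mary asks for donations" example. Here `ask_model` stands in for a Good-model query, and the set of primitive engine actions is an assumption for the sketch.

```python
# Sketch of recursion-limited action decomposition: keep asking the
# model to "define" an unknown action until everything is an engine
# primitive, or the depth limit is hit. Names are assumptions.

PRIMITIVES = {"talk_to_npc", "give_money", "buy_house"}

def decompose(action, ask_model, depth=0, max_depth=3):
    """Break an unknown action into engine primitives by repeatedly
    asking the model to 'define' it, with a hard recursion limit."""
    if action in PRIMITIVES:
        return [action]
    if depth >= max_depth:
        return []  # give up rather than recurse forever
    result = []
    for sub in ask_model(f"define {action}"):
        result += decompose(sub, ask_model, depth + 1, max_depth)
    return result

# Stubbed model: "ask for donations" decomposes into two primitives.
fake_model = lambda q: (["talk_to_npc", "give_money"]
                        if "ask for donations" in q else [])
print(decompose("ask for donations", fake_model))
# ['talk_to_npc', 'give_money']
```

Caching the most common "define X" answers (even a simple dict keyed on the query) would then cut down the repeated Good-model calls that make the queue grow.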