The science behind the technology.
We are often asked about our approach to emotional AI. In essence, we use a modular architecture inspired by the natural intelligence that emerges in living beings. Without going into too much technical detail, here is a brief discussion of what we mean by that.
Through our earlier research into the neuro-aesthetics of written language, we knew that quite a lot of emotional information is carried by language itself, particularly through word choice and phrasing. This is true even when facial expressions, tone of voice, and body language are absent. Our first question was "how do people do this?" Once we had enough of an answer, our next question was "can we simulate the process in a computer?"
Answering the first question took quite a bit of study of the research literature, because the answer did not come from any single academic field. It drew on neuroscience, evolutionary biology, psychology, sociology, anthropology, linguistics, and a good many crossover disciplines, such as developmental psychology. These had to be balanced with ideas from artificial intelligence, information theory, complex systems, and even game theory.
Surveying this research even superficially would fill a large book, but briefly: a human reacts using a number of more-or-less specialized systems or brain regions, which aggregate signals, transmit signals of their own, release neurotransmitters and other chemicals, and trigger endocrine and other physical reactions; these effects in turn cause other systems to react, and so on. None of these systems is ever completely offline, although levels of activation vary widely, as do levels of coordination.
A simple analogy would be a group of specialists, each trained to detect and respond to a specific feature of an incoming message and to forward its response to the others. For example, anything threatening might be picked up by the amygdala, which would pass that information on but also trigger the release of noradrenaline. A soothing reference to a loved one, on the other hand, might cause a different system to trigger the release of oxytocin, inducing a different set of bodily responses and a pleasant reaction. Interestingly, most of these early systems linked to basic feeling states react much more quickly than the higher, neocortical areas. That means emotional content is detected earlier than semantic content and shapes its interpretation.
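To make the analogy concrete, here is a minimal sketch in Python of how such specialists might work. Everything in it (the word lists, the Signal type, the class names) is invented purely for illustration; it is a toy, not our production code.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    source: str      # which specialist produced this
    kind: str        # e.g. "threat", "affiliation", "arousal"
    strength: float  # 0.0 (absent) to 1.0 (overwhelming)

class ThreatDetector:
    """Crude stand-in for an amygdala-like specialist."""
    THREAT_WORDS = {"hate", "kill", "hurt", "attack"}

    def react(self, message: str) -> list:
        hits = len(set(message.lower().split()) & self.THREAT_WORDS)
        if hits == 0:
            return []
        # A threat signal is passed on, and an arousal signal stands in
        # for triggering the release of noradrenaline.
        return [Signal("threat_detector", "threat", min(1.0, hits / 2)),
                Signal("threat_detector", "arousal", 0.8)]

class AffiliationDetector:
    """Stand-in for a system soothed by references to loved ones."""
    WARM_WORDS = {"love", "friend", "dear", "thanks"}

    def react(self, message: str) -> list:
        hits = len(set(message.lower().split()) & self.WARM_WORDS)
        if hits == 0:
            return []
        # Loosely analogous to an oxytocin-linked calming response.
        return [Signal("affiliation_detector", "affiliation", min(1.0, hits / 2))]

def broadcast(message: str, specialists: list) -> list:
    """Every specialist sees the message; their signals are pooled."""
    signals = []
    for s in specialists:
        signals.extend(s.react(message))
    return signals
```

Calling `broadcast("I will hurt you", [ThreatDetector(), AffiliationDetector()])` pools a threat signal and an arousal signal for downstream systems to sort out, which is the part we turn to next.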
Among our set of experts are also those that attend to context, which matters in multiple ways. There is the social context of an exchange, the physical context (including location), the relationship context, and the immediate context of the present conversation. Any or all of these can affect the interpretation of a message. "Raise your hands over your head" means something very different when you are being measured by a tailor than when you encounter a police officer on the street.
Of course, each piece of language also needs to be interpreted for its grammatical features. Is it a question? A statement? An exclamation? A command? This creates a sense of the speaker's intent and changes the response posture. What is the register of the language? Is it formal or casual? Friendly or hostile?
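Here is a hedged sketch combining both ideas: a context object holding the frames just described, and some deliberately crude heuristics for grammatical posture. The field names, verb list, and response labels are all illustrative assumptions, not how our system actually represents these things.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """The frames that can shift a message's interpretation."""
    social: str = "unknown"         # e.g. "tailor's fitting", "traffic stop"
    physical: str = "unknown"       # location and setting
    relationship: str = "stranger"  # who is speaking to whom
    recent_turns: list = field(default_factory=list)  # the present conversation

def sentence_type(utterance: str) -> str:
    """Very rough heuristics for the grammatical posture of an utterance."""
    text = utterance.strip()
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    first_word = text.split()[0].lower() if text.split() else ""
    # Imperatives often open with a bare verb; a tiny illustrative list.
    if first_word in {"raise", "stop", "give", "tell", "put"}:
        return "command"
    return "statement"

def interpret(utterance: str, ctx: Context) -> str:
    """The same words call for different responses under different contexts."""
    if sentence_type(utterance) == "command":
        if ctx.social == "traffic stop":
            return "comply-with-caution"
        if ctx.social == "tailor's fitting":
            return "cooperate-routinely"
    return "respond-normally"
```

With this toy, `interpret("Raise your hands over your head.", Context(social="traffic stop"))` and the same utterance in a `"tailor's fitting"` context produce very different response postures, which is exactly the point of the example above.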
There are also neurons in brain regions that react to specific inputs, including linguistic tags. That is why you can't help noticing when someone says your name, even in a room full of people all talking at once. Other tags get your attention as well: the names of your loved ones, your friends, your company, your town, and anything else that holds importance for you.
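A minimal sketch of that "tag" effect might look like the following, where certain tokens carry personal significance and break through even in a noisy stream. The names and weights are made up for illustration.

```python
# Hypothetical personal-salience table; in reality this would be
# learned per user, not hard-coded.
PERSONAL_TAGS = {
    "alice": 1.0,        # your own name: near-maximal salience
    "bob": 0.8,          # a loved one
    "acme": 0.6,         # your company
    "springfield": 0.5,  # your town
}

def tag_salience(message: str) -> float:
    """Return the strongest personal-tag hit found in the message."""
    tokens = message.lower().replace(",", " ").replace(".", " ").split()
    return max((PERSONAL_TAGS.get(t, 0.0) for t in tokens), default=0.0)
```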
Finally, there are several types of memory, which we might think of as a crew of file clerks, each offering their own inputs. There is the fast-shuffling juggler of the working set, the inbox monitor of recent short-term memory, and the archivist of long-term memory. There are clerks that hold procedural details, including how to hold formal and informal conversations, put together stories, or tell jokes, and one that keeps track of which step you are currently on.
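The clerk metaphor maps onto data structures fairly naturally. Below is one hedged way to sketch it; the store sizes, script names, and steps are arbitrary placeholders, chosen only to show the division of labor.

```python
from collections import deque

class MemoryCrew:
    """Toy model of the 'file clerks': working, short-term, and long-term
    stores, plus procedural scripts and a pointer into the current one."""

    def __init__(self):
        self.working = deque(maxlen=4)      # the juggler: tiny, fast-turnover set
        self.short_term = deque(maxlen=50)  # the inbox monitor: recent items
        self.long_term = {}                 # the archivist: keyed, durable store
        self.scripts = {                    # procedural know-how
            "joke": ["setup", "misdirection", "punchline"],
            "story": ["scene", "complication", "resolution"],
        }
        self.script_step = {}               # which step of each script we are on

    def notice(self, item: str):
        """New input enters both fast stores at once."""
        self.working.append(item)
        self.short_term.append(item)

    def archive(self, key: str, item: str):
        """Hand something durable to the archivist."""
        self.long_term[key] = item

    def advance(self, script: str) -> str:
        """Return the current step of an ongoing procedure, then move on."""
        steps = self.scripts[script]
        i = self.script_step.get(script, 0)
        self.script_step[script] = min(i + 1, len(steps) - 1)
        return steps[i]
```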
While this is indeed a vast oversimplification, it is still a useful one. Imagine this crowd of experts all putting in their two cents' worth at the same time, pushing for one sort of response or another, some trying to suppress others, some trying to modify the outputs of others, and a few, like our friend the amygdala, ready to ring the alarm and hijack the whole show. Some way of ordering the confusion is needed. This takes a few more systems, covering salience, attention, and executive decision-making.
Salience is essentially a threshold problem. How strong does a signal of any particular type have to be to break through and be noticed? Of those things that are worth noticing, which should get the most attention? After something has captured attention, how should the contextual and memory frames be used to guide a reaction? Of course, some of these functions are much faster than others, and many do not require merging of inputs to trigger a reaction, creating a sort of hierarchical system of shortcuts and heuristics.
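Putting the last two paragraphs together, a salience gate and an arbiter might be sketched as follows. The signal format, thresholds, and response labels are assumptions for the sake of the example, and the real interplay is of course far messier than one function.

```python
def arbitrate(signals, thresholds, default="deliberate"):
    """Sketch of salience gating plus arbitration over competing signals.

    signals:    list of (kind, strength) pairs from the expert modules
    thresholds: per-kind minimum strength needed to break through
    """
    # 1. Salience gate: discard anything below its type's threshold.
    salient = [(kind, strength) for kind, strength in signals
               if strength >= thresholds.get(kind, 0.5)]
    if not salient:
        return default

    # 2. Fast path: a strong alarm signal hijacks the pipeline before
    #    any slower merging or deliberation can happen.
    for kind, strength in salient:
        if kind == "threat" and strength >= 0.9:
            return "alarm-response"

    # 3. Otherwise, attention goes to the strongest surviving signal,
    #    and the slower executive systems shape the response to it.
    winner = max(salient, key=lambda pair: pair[1])
    return f"attend-to-{winner[0]}"
```

For instance, `arbitrate([("threat", 0.95), ("affiliation", 0.4)], {"threat": 0.3})` returns `"alarm-response"` without ever weighing the weaker signals, which mirrors the hierarchy of shortcuts described above.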
For example, we know that when a human or other animal receives input from the world, that input is parsed in different ways and signals go to many different areas to be combined and interpreted. As noted, these are not looking for the same things. If you put a bit of food into your mouth, parts of your brain will parse the input for taste, temperature, moisture, and texture. If the food is rotten, you will reflexively spit it out. That is because another part of your brain, sensitive to repellent stimuli, is also getting the information and can trigger a reflex to expel the food. This is a fast reaction, evolutionarily older than intelligence, that works without the need for salience, attention, or decision-making. Interestingly, the brain reuses this machinery to produce sensations of moral disgust, such as the revulsion triggered by hearing of cruelty or encountering hate speech, which can likewise provoke an immediate response.
On the other hand, if you take a bite of something tasty, a pleasurable feeling will be induced by another part of the brain. And yet another may release dopamine to make you want more. And yet another may check your semantic memory to recall its name. And yet another may begin recording a new memory of this food. And yet another may compare this experience to a previous memory, as with Proust and his famous madeleines. In this case, as with the rotten food, your higher brain regions find out about your reaction after the fact, interpret it, and then decide what to do next.
What all this suggests is that intelligence, like awareness itself, depends on a modular architecture. For awareness, there need to be things to be aware of (internal signals, since those are the only things a brain can be aware of) and something that can interpret those signals and so be aware of them. Likewise, intelligence in any real sense is not simply a reaction to inputs, including internal associations and memory fetches, but a system that can mediate those inputs and shape a reaction.
So, rather than trying to fill a machine-learning model with trillions of parameters, our method is concerned with quickly detecting the important features that language can convey and using them to direct generative AI in shaping a response. In the case of conversational AI, this means translating the spoken language into text, sending it to the various modules to get their reactions, and using a proprietary system to make use of those reactions. As for the generative AI itself, we favor small language models tuned for their specific purposes. This, we believe, makes them computationally less expensive and easier to direct appropriately.
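To tie the pieces together, the overall flow can be sketched in a few lines. Every function here is a placeholder (our aggregation system is proprietary and not shown); the sketch conveys only the shape of the pipeline, not its implementation.

```python
def respond(audio, transcribe, modules, aggregate, small_lm):
    """Hypothetical conversational pipeline, shape only.

    transcribe: a speech-to-text function
    modules:    the feature detectors (emotion, context, salience, memory, ...)
    aggregate:  merges the modules' reactions into a response directive
    small_lm:   a small language model tuned for this conversational purpose
    """
    text = transcribe(audio)                # 1. spoken language -> text
    reactions = [m(text) for m in modules]  # 2. fan out to the experts
    directive = aggregate(reactions)        # 3. merge reactions into guidance
    return small_lm(text, directive)        # 4. directed generation
```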