One of my few childhood dreams was to make video games. Before that came playing them. I am no different from many other gamers in how I judge a good game: you live in the game’s world, and you are not alone in it. Whether it is a first-person shooter, a role-playing game, or a real-time strategy title, a great game immerses the player in its storyline and narrative, although stunning graphics and a delicately crafted synth soundtrack are no less important. Because I am a fan of history, when I play games inspired by history I often find myself imagining how ancient people lived, breathed, thought, and fought in their own time. I want to be a part of the story, and I want the story to have a part of me. This depends very much on how non-player characters (NPCs) “live”, think, act, and interact with human players.
Any real-time strategy (RTS) gamer would agree with me that multiplayer mode, human versus human, is most of the time more exciting than single-player mode, in which the human player faces the computer. You may object, and I know there are great single-player campaigns that I have enjoyed too, but this opinion has its rationale. Pro gamers often find playing against computer players, built with classical artificial intelligence (AI) techniques, much less rewarding than playing against their peers. Not only are the implemented difficulty levels inadequate to challenge them, but the experience of playing against the computer is also quite poor compared with playing against a human. Lines of reasoning, rational thinking, creative strategies, dirty tricks, and so on all contribute to a wondrous experience. Wouldn’t it be thrilling to play a campaign in which you lead your kingdom to war against heathens whose wicked king, an NPC simulated by game AI, has tons of tricks ready to stab you in the dark? And with open-ended stories, where your decisions from time to time drive the storyline to turning points that were never designed in advance (think of the Choose Your Own Adventure books, or the interactive film Bandersnatch on Netflix)? What is missing here and now? Game AI lacks a true character behind its decisions, a character that possesses senses, a mentality, reasoning, and emotion.
A great article really opened my eyes to how AI is used in the game industry: what the industry values is not so much unprecedented levels of superhuman but narrowly focused intelligence. Remember the renowned success of DeepMind’s AlphaGo over Go master Lee Sedol? There were times when I played only board games, and although they are undoubtedly challenging, those games are confined by rules that dictate an abstract world of square grids. The truth is that game AI in general does not require such extreme levels of intelligence as in Go or chess; that is only half of the recipe. The missing ingredients include smooth and satisfying human-machine interaction, in which players perceive the AI in ways that only the real world used to be capable of; a narrative that is open-ended and changes to suit the player’s decisions and actions; and NPC behaviors defined by flexible storylines as well as by the characters’ own psychology and traits. A good game AI would adapt to its players so that the same game is never played twice, nor do two players ever play the same game. The number of possible reactions of the game toward the player grows enormously: the game plays the player as much as the player plays the game.
In fact, what the game industry wants is an intelligent agent that essentially passes the Turing test: a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. It is then easy to see why the adoption of advances in AI and deep learning has been slow: artificial (general) intelligence is not there yet. If not now, then what makes us believe, and keep working, until that utopia can be built?
Connectionism
The revitalization of artificial neural networks, also known as connectionism and now branded as deep learning, has made bold steps and given new hope that artificial intelligence may be within reach within a lifetime. The connectionist approach relies on the idea that the brain is a computational device whose smallest computing units are biological neurons, interconnected so that the excitation of one neuron is among the inputs of another; complex, smart behaviors result from stacking many layers of neurons, thereby increasing the complexity of the outputs. The computational model of connectionism is the neural network: conceptually a computational graph whose nodes are essentially perceptrons, which were invented by Frank Rosenblatt.
What gives neural networks their learnability is the differentiability of the loss with respect to each perceptron’s weights. Back-propagation (or backprop) is the algorithm built upon this principle to update every perceptron according to a measurement of the prediction error made by the network. This error feedback, technically called the loss and measured as the difference between the predicted output and the true output, is differentiated with respect to every perceptron’s weights. The result is a gradient vector whose opposite direction guides the perceptron away from making the same error again, conditioned on the data seen. Backprop works from the topmost layer down to the bottom of the network, where raw signals are first received.
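To make the mechanics concrete, here is a minimal sketch of backprop on a tiny two-layer network, written in plain Python. The XOR data, the tanh hidden units, the layer sizes, the learning rate, and the function names are all assumptions chosen for illustration; real systems rely on automatic differentiation libraries rather than hand-written gradients.

```python
import math
import random

# Toy data (an assumption for this sketch): XOR, which a single
# perceptron cannot solve but a two-layer network can.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]


def forward(x, W1, b1, W2, b2):
    """Bottom-up pass: raw input -> hidden tanh units -> linear output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, y


def train(epochs=2000, lr=0.1, hidden=4, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in DATA:
            h, y = forward(x, W1, b1, W2, b2)
            err = y - target          # dLoss/dy for Loss = 0.5 * err**2
            total += 0.5 * err * err
            # Top-down pass: the output error flows back to each hidden unit.
            for j in range(hidden):
                dz = err * W2[j] * (1.0 - h[j] ** 2)  # chain rule through tanh
                W2[j] -= lr * err * h[j]              # step against the gradient
                b1[j] -= lr * dz
                for i in range(len(x)):
                    W1[j][i] -= lr * dz * x[i]
            b2 -= lr * err
        losses.append(total / len(DATA))
    return losses


losses = train()
print(f"mean loss, first epoch: {losses[0]:.4f}")
print(f"mean loss, last epoch:  {losses[-1]:.4f}")
```

Each weight moves a small step in the direction opposite to its gradient, and the error is computed at the output first and only then passed down to the hidden layer, which is the top-to-bottom order described above.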
Iterating over the data, often more than once, is like a student working through an assignment several times to really understand the problem. Backprop is often viewed as casting the learning problem as an optimization problem. Its generalization from seen (training) data to unseen (test) data, like the student’s ability to understand and solve related assignments, is not guaranteed by the algorithm itself; it rests on the assumption that training and test data are independent and identically distributed (i.i.d.), which lies at the core of statistical learning theory.
There is more than one theory of how the mind learns, and predictive coding is one such theory besides backprop. It proposes a model of a predictive mind in which the brain first predicts what the raw input received from a lower level should be; the difference between its prediction and the actual input is then used to correct the representation at higher levels. This is essentially different from backprop, which employs a bottom-up cognitive process: sensory inputs are first received at the lowest levels and then forwarded to the highest levels to deliver a prediction. Predictive coding is supported by evidence from neuroscience research as well as from Gestalt psychology, and it is promising in explaining, with some psychological evidence, how learning happens in humans and animals. There has been research connecting backprop with predictive coding, which offers the potential of parallelizing the weight updates layer by layer.
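The predict-compare-correct loop can be illustrated with a deliberately tiny sketch: a single higher-level value `r` tries to predict a single input `x` through a fixed weight `w`. The function name `settle`, the step size, and the numbers are assumptions made for illustration only; actual predictive coding models stack many such units in a hierarchy and also learn the weights.

```python
def settle(x, w, steps=50, rate=0.1):
    """Adjust a higher-level representation r until w * r predicts x."""
    r = 0.0
    errors = []
    for _ in range(steps):
        prediction = w * r        # top-down prediction of the input
        error = x - prediction    # mismatch between input and prediction
        r += rate * w * error     # correct the higher-level representation
        errors.append(abs(error))
    return r, errors


r, errors = settle(x=3.0, w=2.0)
print(f"representation: {r:.4f}")
print(f"first error: {errors[0]:.4f}, last error: {errors[-1]:.2e}")
```

Only the prediction error travels upward here, not the raw input itself, which is the contrast with the purely bottom-up forward pass drawn above.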
Connectionism has enjoyed plenty of success across different domains: visual recognition, image generation from text, natural language processing and text generation, audio synthesis, and playing Go and StarCraft. Do these achievements really mean that the machine truly comprehends what it was trained on? Does it make sense of the meaning of the answers it gives us?
As soon as visual recognition achieved outstanding classification accuracy on the public ImageNet dataset in 2012, the stability of its predictions was seriously questioned. Researchers conducted a very simple experiment in which small, carefully chosen perturbations, imperceptible to human vision, were added to images; a pre-trained neural net then made predictions on two versions of each image: the original and its “adversarial” version with the perturbation added. The pre-trained model proved vulnerable to adversarial examples: it not only predicted wrong labels but was also very confident in its mistakes.
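A small sketch can show why tiny per-pixel changes add up. It uses a hand-made linear classifier instead of a deep network, and a gradient-sign step (the idea behind the fast gradient sign method, swapped in here for simplicity and not the procedure of the experiment described above). The weights, the input, and the step size `eps` are made-up values.

```python
import math
import random


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def confidence(w, x):
    """Probability that x belongs to the positive class (logistic model)."""
    return 1.0 / (1.0 + math.exp(-score(w, x)))


def sign(v):
    return (v > 0) - (v < 0)


n = 100
w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # made-up weights
x = [0.02 * wi for wi in w]                          # made-up input, class +1
eps = 0.05                                           # max change per coordinate

# Gradient-sign step: for a true label of +1, the loss grows fastest when
# every coordinate moves against the sign of its weight.
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

# A random perturbation of the same per-coordinate size, for comparison.
rng = random.Random(0)
x_rand = [xi + eps * rng.choice([-1.0, 1.0]) for xi in x]

print(f"original:    {confidence(w, x):.3f}")
print(f"adversarial: {confidence(w, x_adv):.3f}")
print(f"random:      {confidence(w, x_rand):.3f}")
```

No coordinate moves by more than `eps`, yet because all hundred small changes push the score in the same direction, the model swings from a confident correct answer to a confident wrong one. Random changes of the same size mostly cancel each other out.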
Language
Not limited to vision problems, adversarial examples are also a matter of controversy in language processing. Recent GPT models from OpenAI, which are essentially transformer neural nets with attention mechanisms trained on billions of English sentences collected from the Internet, have demonstrated spectacular quality in generated text; at the same time, it has been found that the generated text carries no meaning intended by the algorithm. In other words, GPT is mindless. GPT models apparently do not pass the Turing test. In practice, however, the quality of the generated text is more than adequate for commercial use, perhaps with a human in the loop.
Speaking of the practical use of language models, would GPT’s generated text be sufficient, and sensible enough, to create at least a brief moment of human-level AI for NPCs? Would it take game AI to the next level? It has been shown to be possible, at least to some extent, by AI Dungeon, a text adventure game that uses artificial intelligence to generate story content. I tried AI Dungeon briefly to help generate the beginning of a sci-fi story about space exploration, and here is what I found.
The story started with an auto-generated paragraph, followed by a prompt asking the human player to select among three modes: Say, to start or continue a conversation; Do, for actions the player performs; and Story, a narrative given by the player either to clarify the ongoing context or to steer the storytelling. All of the generated text is in italics. First, AI Dungeon generated a short opening that gives a context and a character, shown in bold text. I then continued the story from where the AI left off. After each of my sentences, the story was handed back to the AI, and so on.
You're the commander of the spaceship Harbinger. You've been traveling through the Magellanic Clouds when you find a strange planet near a dying star. While most of your crew is in hibernation, you and a few of the flight crew are keeping things running. As you scan the planet to add it to your charts, your navigator detects a signal coming from the planet. It's weak but it's clearly not natural. The radio signal is in a very low frequency and sent in short bursts. It's also very wide. It could be a distress signal or some kind of code.
> You say "Maybe some spaceship sending distress signal ?"
The pilot of the ship, a large reptilian looking creature, listens to the signal for a few minutes. Then he turns to you and says "This is a distress beacon, we have to investigate."
> You say "Prepare for landing, but scramble the signal so that they won't reveal our location."
The pilot nods and prepares the ship to land while you hide the ship's transponder so that it can't be tracked.
It is not surprising that GPT-3 has been so much praised: the generated story is of high quality and consistent with human inputs. A vast amount of training data and the clever design of the Transformer neural network open up exciting experiments. The ongoing debates among scientists and critics, arguing that the algorithm is purely based on the statistics of the data (“curve fitting”, in other words) and not on understanding, are far from over. Major limitations of deep learning are currently acknowledged; nevertheless, it raises hopes, giving us a hunch that AI could be around the corner, or at least that we will not have to wait until the 2100s.
If I am asked to give an opinion, the phenomenon of deep language models and their implications for AI are too good to be overlooked. It was not when a convolutional network accurately classified a plethora of fauna categories from ImageNet, which even a layperson cannot do as well, nor when DeepMind’s AlphaGo beat a Go master, that people really started to think AI is somewhere near. Making sense of language, a loosely defined medium of communication, is the sign of intelligence, the sign that computers are starting to have the slightest idea of common sense.
Why is it so important for machines to understand and use natural languages? Seeing a photo and reading a sentence are both, technically, processes of perceiving raw signals and translating them into mental concepts. The latter, however, deals more with symbols, namely letters and words, and is thus closer to “understanding”. That is not to say that visual cognition is all about low-level processing, but it certainly involves more raw signal perception.
Throughout the history of human civilization on all continents, languages and the invention of writing have marked turning points. On the one hand, languages facilitate communication and the exchange of ideas; on the other hand, language is like an external thinking device for human beings. The invention of written languages helps us not only to keep knowledge and teach it to others but also to consolidate our thinking and develop our logic and reasoning. The development of languages is driven by human intelligence, whose sophistication is unmatched in other species. We have every reason to believe that the human way of thinking is embodied in languages, not just that of individual minds but that of the collective mind.
English, for example, is based on an alphabet of 26 Latin letters, which are combined in different orders and repetitions to form about 470,000 words; once structured, these words offer unlimited ways of expressing feelings, stories, and ideas, not just in one sentence or a few, but beautifully sequenced into paragraphs, chapters, books, and so on. The possibilities are immense. A much simpler form of language, the Egyptian hieroglyphs, illustrates some of the earliest attempts in human history to express thoughts or tell stories using a writing system.
Languages reflect our thinking, sharpen our reasoning, and are the utmost symbol of intelligence. Early concepts of AI always included language comprehension and the use of language as standards. Early AI systems were conceived in terms of logic, a rigorous and simplified form of language, to express common sense: something very simple for humans to understand but really daunting for machines. At the beginning of his 1959 paper, John McCarthy stated that
Interesting work is being done in programming computers to solve problems which require a high degree of intelligence in humans. However, certain elementary verbal reasoning processes so simple that they can be carried out by any non-feeble minded human have yet to be simulated by machine programs.
For better or worse, his argument still stands more than fifty years later. What, then, is common sense, and what makes it so hard for machines to model and harness? Its modern meaning in English, in use since the 18th century, is defined as
"Those plain, self-evident truths or conventional wisdom that one needed no sophistication to grasp and no proof to accept precisely because they accorded so well with the basic (common sense) intellectual capacities and experiences of the whole social body."
Humans, unlike computers, which are instructed with imperatives (think of programming languages such as C or Python), most of the time use and understand declaratives well. McCarthy emphasized that imperative instructions do not assume previous knowledge, are explicit, and are thus carried out faster. Declarative sentences rely more on previous knowledge: the machine has available the logical consequences of what it is told and what it previously knew, and their implicitness makes afterthoughts easier. It is common sense that enables such great flexibility of declaratives in natural languages. For a machine to exhibit artificial intelligence, it must learn and comprehend common sense in order to bridge the gap between the declaratives it is told and the imperatives left unsaid.
I find this really exciting to think about: if a machine has common sense, does it necessarily pass the Turing test? Imagine a Turing test in which a machine has a conversation with a human; if it is an “equal” conversation between the two, then the machine understands what the human says, which also means it understands the common sense hidden in human language. Having common sense in the machine’s mind is very much equivalent to passing the Turing test. In other words, common sense is something that machines lack and need to achieve.
One of the prerequisites is to make common-sense knowledge explicit. Though such an effort may never come to an end, there have been considerable attempts to encode vast amounts of common-sense facts, including the Open Mind Common Sense project, started in 1999, from which ConceptNet grew; Cyc, started in 1984 and now part of commercial developments; and more recently ATOMIC, a common-sense knowledge graph. McCarthy also suggested using formal logic to encode that knowledge and to reason over it. The topic, however, goes beyond the scope of this article, and I will come back to it in later research.
Returning to the case of GPT, can one say that to some extent it has learned common sense from language? From the example of AI Dungeon’s story generation, it looks like the AI program understands human inputs well enough to come up with sensible responses. There have been, however, counter-examples on GPT-3 suggesting that it neither understands nor reasons but merely memorizes training data. At the end of the day, a deep neural net with many billions of weights can memorize a lot of data, so how can we tell whether a trained model has discovered at least the slightest trace of how to make sense of the data? A recent evaluation reveals that language models like GPT or BERT learn common sense superficially rather than truly comprehending it. Nevertheless, pursuing this question would reveal lots of brave new ideas.
The early days of AI were much influenced by philosophy. Thomas Reid, a Scottish philosopher, in his series of works related to common sense, notably An Inquiry into the Human Mind, dated as early as 1764, also viewed language as a materialization of common sense. If common sense is present in English, it must be present in French, Greek, Russian, or Chinese, too. Taking into account the cultural drift between languages, does common sense retain a universal list of rules, a common way of speaking and understanding for the human species? Reid held that the relevant features are those found in the structure of all languages: essentially common rules of linguistics, including active and passive voices, parts of speech, and subject-verb-object syntax (though the order depends on the language), which are indications of common-sense beliefs. Reid’s arguments at the time were probably deduced from observations across half a dozen European languages; does his claim hold when other language families are taken into account? Is common sense cross-lingual? Though there has not yet been an attempt to study universals across hundreds of languages, this research looks for such empirical evidence by going beyond English to test the common-sense reasoning of multilingual language models on 15 other languages. Does transferability between languages say something about the universals of common sense? It certainly does.
Before language
Is being embodied in language, and by language, the only way to perceive common sense? The short answer is no. Common sense is more universal than that, and it can be found where language is not spoken, or was not yet spoken. Newton knew that apples fall to the ground (before he even thought about why); we learn that a rainbow is likely to appear when there are showers on a sunny day; a toddler tends to be afraid of stepping over a sewage drain; a baby learns to use cups and buckets as containers to carry water or sand. Common sense works not only in the lines of reasoning of language but also in space, in time, and in our knowledge of physical rules. On the Common Sense Problem page, understanding language is just one aspect among others: planning, physical reasoning, spatial reasoning, naive psychology, and more.
At the dawn of a human lifetime, the neurons in a newborn’s brain are wired with some unknown inductive biases, which could be described as a core knowledge system: what babies know prior to language, and what language builds upon. Moving on to the next stage, toddlers start learning words and basic sentences and use them to explore the surrounding world. Beyond this age, kids speak more fluently, enrich their vocabulary, and learn phonics, writing, and reading. These are vehicles for their thoughts, including common sense, to grow more sophisticated. There are research questions about lifelong learning models directly inspired by the development of natural intelligence in humans: such models should start with a “minimal set” of common sense, which acts as a bridge between perception, action, and language. One element of this set is intuitive physics, or naive physics.
For readers who are parents, evidence could come from playing “peekaboo” with their babies: parents cover their faces with their palms and then suddenly reveal them to surprise the child; the baby’s brain expects that the parent’s face has not simply disappeared but is only occluded. This concept is called object permanence: an object continues to exist even though it can no longer be seen. Can we rule out the opposite, that the baby does not expect the parent’s face to still be there and is happy simply because the parent has come back? Another game can be played with babies to confirm that infants really expect objects they have seen to still be there despite occlusion. Put a marble on the floor, let the infant see it, then cover it with a cup; if the cup is then removed and the marble is gone too, the infant is confused and may look around to find the missing marble. In other words, the infant understands that objects maintain their location in space and do not simply disappear.
Even without playing games with infants, adult minds can be tricked in the same way: remember watching magicians pull pigeons and bunnies out of their top hats, or make cards disappear from their hands. Your jaw-dropping moments are reactions to the impossibility of those events. We are confused because everything seems to contradict our intuition of physics. Pigeons do not nest in hats, and bunnies live in burrows; is this knowledge innate or learned? The answer is trivial, because an infant has no idea what birds or rabbits are; this common-sense knowledge is learned from the environment we live in.
Artificial common-sense models can be “trained” to learn this physics in the same way we humans perceive impossibility: in a setting of learning with guidance, examples come in pairs, in which one illustrates a principle of intuitive physics and the other shows an impossible version of that principle. In this research, such a dataset was created to let deep learning models learn intuitive physics from scratch; the IntPhys benchmark provides a rich source of data for intuitive physics reasoning by rendering in 3D four principles: object permanence, shape constancy, spatio-temporal continuity, and energy conservation.
The unrealistically simple world of IntPhys, though useful as a set of toy examples for building up naive physics, is nowhere near the sophistication of reality. For compositional physical interactions, there is more than one possible event and so many impossible events that explicitly rendering all such scenarios is hardly feasible. A clever approach has been developed to address this challenge: rather than presenting two contradicting views of the modeled world to the artificial agent, a model of the world is rendered, and so is the agent, which inhabits that world. The model of the world, also called the environment, is interactive, obeys the laws of physics, and is furnished with objects of many kinds. The agent can take actions and perceive their consequences.
Given a goal, for example that a robot must navigate through a room and pick up a yellow duckling at the other side, it can try many different paths, avoiding obstacles, turning left and right, and looking back and forth, until the final goal is achieved. It may discover and learn common sense during the trials, for instance by learning that it cannot pass through a solid obstacle, or that its view can be occluded by objects so that it must look around them.
Besides the goal, the agent must know whether it is getting closer to the goal or further from it; otherwise, in the worst case, it performs an exhaustive search, and the agent is not very smart (or the object cannot be found). That kind of “reward”, the evaluation of the agent’s current state, could in our example be the following: the duckling is recognized in the field of view, and the bigger the duck appears, the closer the agent is to the goal. The learning process is hence reinforced by some guidance but not wholly supervised: the reward does not tell the agent how to succeed at the task, nor how to fail it, nor which actions to take, because there are probably many, or even endless, alternatives that lead to the same outcome.
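The duckling reward can be written down in a few lines. This is only a sketch under assumptions: the function name `duckling_reward`, the idea of measuring the duck by the pixel area of a bounding box, and the 640 by 480 frame are all hypothetical, and a real agent would need a vision model to produce that area in the first place.

```python
def duckling_reward(duck_area, frame_area):
    """Reward in [0, 1]: the fraction of the camera frame the duckling covers.

    duck_area is None when the duckling is not recognized in the view.
    """
    if duck_area is None:
        return 0.0
    return min(duck_area / frame_area, 1.0)


frame = 640 * 480                          # hypothetical camera resolution
unseen = duckling_reward(None, frame)      # duckling not in view
far = duckling_reward(20 * 20, frame)      # small in view: far away
near = duckling_reward(200 * 200, frame)   # large in view: close by
print(unseen, far, near)
```

The function says nothing about which way to turn or which path to take; it only scores where the agent currently stands, which is exactly the kind of guidance without supervision described above.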
The reward is used more like a compass than a map: a traveler with a map may find the shortest and easiest way back to his tent, while a compass only vaguely tells him which paths to follow and helps him avoid getting completely lost. Neither the artificial agent nor the traveler may succeed at first; nonetheless, they can follow the trial-and-error approach more than once, until their sense of navigation through the environment becomes meaningful and stable. In terms of mathematics and computer science, is there a tool that best models the aforementioned learning strategies?
It is reinforcement learning (RL), which originated in the psychology of animal learning. The concept of “reinforcement” comes from Ivan Pavlov’s research on conditioned reflexes, where reinforcement means the strengthening of a pattern of behavior due to an animal receiving a stimulus, also called a reinforcer, or a reward in our context. This concept was eventually intertwined with optimal control, whose Bellman equation and Markov decision processes provide the flesh and bones that lay the foundation of the field. The reputation of modern-day RL rests on successes such as Deep Q-Learning playing Atari games, AlphaGo, which played millions of games against itself and defeated the best human Go players, and self-driving agents maneuvering cars for tens of thousands of hours in simulation at Waymo, Uber, Tesla, and so on. The three pillars of RL, reward, agent, and environment, interact with one another in a closed cycle, so that the agent learns from the feedback the environment gives to its actions while reward drives the agent’s reasoning and decision making. David Silver and coauthors hypothesized that
…reward is enough to drive behavior that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalization and imitation.
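The closed cycle of reward, agent, and environment can be sketched with tabular Q-learning, one of the simplest RL algorithms built on the Bellman equation. Everything specific here is an assumption made for illustration: the five-cell corridor, the reward of 1 at the right end, the purely random exploration, and the parameter values. It is far simpler than the deep RL systems mentioned above.

```python
import random

N_STATES = 5          # corridor cells 0..4; the goal is the last cell
GOAL = N_STATES - 1
ACTIONS = (-1, +1)    # step left, step right


def step(state, action):
    """Environment: move within the corridor; reward 1 only at the goal."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(200):                   # cap on episode length
            a = rng.randrange(2)               # explore uniformly at random
            next_state, reward, done = step(state, ACTIONS[a])
            # Bellman backup: reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


q = q_learning()
policy = ["left" if row[0] > row[1] else "right" for row in q[:GOAL]]
print(policy)
```

The agent is never told that walking right is the answer. It only ever sees a reward at the far end, and the value of that reward propagates backward through the table until every cell prefers the step that leads toward it.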
But what if reward is not a real concept in natural intelligence? That surely happens in reality, as when one wanders through the old quarter of a historic town out of curiosity rather than to look for a particular building. Is reward merely a technique invented for soulless robots? While the concept of an external reward function looks alien to natural intelligence, learning may operate on more than one level. If there is such a thing as meta-learning, the learning about learning, then reward can itself be learned, or predicted, so as to facilitate the agent’s learning. Interdisciplinary research in neuroscience and AI found that dopamine neurons in the brain signal different levels of pessimism or optimism, thus characterizing a complex reward distribution and allowing reward to be modeled, learned, and predicted; in the absence of external reward, or when reward is scarce, predicted reward can guide the agent through a melancholic plateau.
For better or worse, reward is still a much simplified version of learning incentives, while the secrets of the human mind remain cryptic. A reward function may represent animal behaviors well, but it is very limited when applied to human intelligence. Another major hurdle that reinforcement learning has to overcome is its trial-and-error fashion, with millions of learning iterations spent searching for the best parameters of the agent’s model. By contrast, learning in infants requires no more than a fraction of that. If the artificial agent had the core common sense of a human, would it solve its problems much more efficiently? Well, it depends on the task: innate reasoning certainly cannot beat world-champion chess players, but it is more than capable of mastering seemingly trivial tasks such as grasping objects or opening and closing doors, which are unreasonably hard for the most intelligent robots out there.
This is a land of promise, and since the rise of deep learning, interdisciplinary research has been pursued to get the most from psychology, neuroscience, robotics, and machine learning. If the definitions of common sense given above are rather inanimate, the one cited by this research is more actionable:
common sense as a set of “obvious deductions” made quickly and easily by most members of a population when behaving in their environment.
Obvious deductions mean that little to no deliberation is required. In other words, it is a straightforward computational concept, as efficient as a reward function but with much greater versatility. Could it be a pre-trained neural network? Why not: first, because a feed-forward computation in a neural net requires only polynomial time, and second, because common sense can be learned from data. Databrary is among the datasets suitable for this purpose; it consists of thousands of hours of realistic videos capturing infants playing and learning across the continents. A curriculum of model learning and reasoning can be designed upon this dataset, offering an opportunity to model the development of common sense in infants, with neural nets as its computational representation and inference mechanism.
Seemingly orthogonal to common sense in machines, this research aims to teach robots to understand human-machine verbal communication so that they can be instructed to complete tasks. The novelty of the research is that it puts virtual agents at the crossroads of language, visual perception, navigation, and planning; the agents are also capable of commanding and listening to each other, and they show abilities beyond what they were trained on. The research mentions neither teaching common sense to robots nor encoding common-sense priors into the robots’ minds. But can we say that those trained robots somehow learn a representation of so-called common sense, so that they can act on what they are told?
Common sense is as elusive as artificial intelligence. I have argued that for AI to pass the Turing test is equivalent to saying that AI has common sense. What if pursuing common sense is an AI-within-AI problem? As an analogy, common sense is like the silhouette of an unknown figure, which is AI; even if the silhouette were known, it would be inadequate to reconstruct the complete appearance of the figure, because most of the details are lost in the projection. What makes it worse is that we do not even know the shape of that silhouette, and what else is needed to know the silhouette if not the figure itself? Put another way, common sense, though it feels like a very grounded concept, may be just a product of human wisdom. Is it a figment of the mind’s imagination? Is there no such thing as a core of common sense, but only a name for the projection of our intelligence onto the trivial daily tasks that our species shares? This may explain why some research directions can use symbolic approaches to represent common sense but can never get to the bottom of it: one simply cannot have an exhaustive list of all possible common-sense rules. How could your shadow become yourself? You may find similar conclusions in Marvin Minsky’s Society of Mind.
I would argue further that if there is a way to make common sense an explicit piece of knowledge usable in intelligent robots, then it has to be a highly compressed representation of the world, universal to various tasks across multiple data channels and senses. It is forged and carved in the collisions between physical interaction and visual appearance, between language and touch, between playing games and learning things. It can give explanations, but it cannot itself be explained, because naming common sense out of a single task is no different from naming the task itself.
Epilogue
It is easy to see that deep learning has revolutionized AI, a movement in which we see not only hope but also fear, compliments but also criticism, ambitions and controversies. The hope is about stepping closer to the realm of true AI; the fear is that this is just another cycle that may stagnate in a valley of death for breakthrough ideas; the compliments are for the extraordinary things that happened within only ten years; the criticism is about the silly and trivial errors that DL algorithms still make; the ambitions are about creating more powerful machines to run even bigger neural models, with trillions of parameters, so as to approach the true complexity of a human brain; the controversies are between believers that connectionism could eventually learn and distill knowledge and critics who see it as just another fanciful curve-fitting.
Further readings
Thomas Reid, An Inquiry into the Human Mind
David Silver, Satinder Singh, Doina Precup, Richard S. Sutton, Reward is enough, Artificial Intelligence, 2021.
Buckner, Cameron and James Garson, "Connectionism", The Stanford Encyclopedia of Philosophy (Fall 2019 Edition), Edward N. Zalta (ed.)
What’s Next?
In the next issue I will write about memory, attention, and emotion as other stepping stones in AI.