What If Siri Could See?

Brad Folkens
B. Folkens|07.08.16

It is human nature to wonder what the future will look like. How will today's evolving technology affect society in the long run? In the '80s, we put our imagination on screen in movies such as Terminator and Short Circuit. These films personified machines, specifically robots. Their computers did more than interact with humans: they displayed true cognition, an understanding of context, by answering complex questions and even reading the emotions of their human counterparts.

This depiction of the future motivated society to innovate, ever striving to build the machine portrayed in the movies: one that looks, acts, and thinks like a human. That vision reflects what we believe technology has the potential to become - humanlike. So we keep developing new ways to engage with computers and, in doing so, we are actually teaching computers to engage with us. This goes beyond teaching them to react to commands and act accordingly; it means helping them to think for themselves with a true understanding.

We do this by teaching computers as we do children. We teach our kids from a young age to utilize each of their senses in order to make inferences about things within the world. From this they learn and develop a true understanding about people, places, and objects. As computers don't have senses to make inferences, we need to provide them with this ability. This means further developing technologies such as voice recognition, the ability for computers to hear, and image recognition, the ability for computers to see. These two "senses" are allowing computers to continue to learn about the world around them in order to gain understanding. With these advancements, computers have the potential to do extraordinary things for mankind.


Computers are very good at completing tasks by following a set of instructions down to every last detail. This is why we have "bugs": computers can't figure out what we intend for them to do, so they do exactly what we instruct them to. As a result, companies ran into many problems in the early '90s when they tried to make voice recognition a reality. They were trying to bridge the communication gap between humans and computers by opening a new channel, speech, but they had trouble programming computers to understand what humans meant by what they said.

For example, when a person said to early software, "Do you recognize speech?" the computer would hear, "Did you wreck a nice beach?" The smallest differences in inflection and voice easily confused the software, and correcting its mistakes was a painstaking process. Consider the words their, there, and they're. Even though they sound the same, they have totally different meanings. This is what computers struggled to comprehend and decipher.

These early attempts served as building blocks. It wasn't until Siri that a computer finally displayed the kind of voice recognition we had previously seen only in the movies. The creators of Siri did this by combining AI with voice recognition. Although Siri does make some mistakes, she can recognize speech and react appropriately to human commands. If you ask Siri, "What's the weather?" she responds with, "It is 80 degrees."

Her ability to genuinely understand the context of a request comes from the combination of voice recognition and AI. Although the technology isn't flawless, she understands what we are trying to tell her most of the time. She shows that it's not about recognizing bits of speech, but rather about understanding the meaning of what we are communicating.


As humans, we retain 80% of what we see. That is far more than the 20% we retain of what we read and the 10% of what we hear. On top of that, 90% of all information sent to the brain is visual, meaning that almost all of our perception is visual.

The brain is a complex, yet beautifully simple organ. It contains a vast network of connections between neurons that we can model mathematically. Knowing this, scientists are able to train the brain to recognize inputs from any source. This is called neuroplasticity. What we can draw from this is that the brain's neurons, when networked together, can "learn" anything. The same mathematical model can be programmed into a computer as an artificial neural network. Such a network learns by making mistakes and correcting them, which has been extremely helpful in giving computers the ability to see. As a result, today's image recognition software allows computers to identify objects in a photo.
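
To make the idea of learning by making mistakes and correcting them a little more concrete, here is a minimal sketch of a single artificial neuron (a perceptron) written in Python. It is an illustration only, not the method behind any particular image recognition product. The function names, the learning rate, and the toy task (learning the logical AND of two inputs) are all assumptions chosen to keep the example small.

```python
# A toy perceptron: one artificial neuron that learns from its own mistakes.
# All names and the AND task are illustrative assumptions, not from the article.

def predict(weights, bias, x1, x2):
    """Fire (1) if the weighted sum of the inputs is positive, else stay quiet (0)."""
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust the weights a little every time the neuron guesses wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    mistakes_per_epoch = []
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            error = target - predict(weights, bias, x1, x2)
            if error != 0:
                # Correct the mistake: nudge weights toward the right answer.
                mistakes += 1
                weights[0] += learning_rate * error * x1
                weights[1] += learning_rate * error * x2
                bias += learning_rate * error
        mistakes_per_epoch.append(mistakes)
    return weights, bias, mistakes_per_epoch

# Training data: the neuron should fire only when both inputs are 1.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias, mistakes_per_epoch = train_perceptron(AND_SAMPLES)
print("mistakes in each pass over the data:", mistakes_per_epoch)
for (x1, x2), target in AND_SAMPLES:
    print((x1, x2), "->", predict(weights, bias, x1, x2), "expected", target)
```

The neuron starts out knowing nothing, gets several answers wrong in its first passes over the data, and stops making mistakes once its weights settle. Real image recognition systems stack many such units into much larger networks, but the correct-your-errors loop is the same basic idea.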

With this capability, computers can technically see, and they are increasingly gaining the ability to understand images. That understanding is the difference between a computer seeing an apple and identifying it as a Golden Delicious, or seeing a plant and identifying it as aloe. It means going beyond surface-level recognition and using specific details and context to infer exactly what something is.

The future of this technology is depicted in the movie Her. In the movie, a man falls in love with his virtual assistant and uses his phone's camera to show her his surroundings. This helps her interpret the situations he's in, picture how things work, and gain a better understanding of the world. What can be learned from Her is that sight is a major component of a computer's ability to truly understand the world and keep learning. With this ability, there is no need to type in queries or give verbal commands to personal assistants. Machines would be able to see what is going on around them, gather information, and act accordingly.


Visual cognition, in which computer vision truly understands images rather than simply recognizing them, is bridging the gap between humans and technology. Imagine wearable computers that can see and interpret the world. The technology could be embedded in devices such as smart contact lenses. The lenses could automatically recognize everything you look at and instantly provide information: the breed of a dog passing you on the street, medical advice for a rash you picked up while hiking, or a recipe for homemade cookies as you walk through the grocery store.

This technology has the one crucial element missing from today's connected devices - the ability not only to see, but to understand. With it, the Roombas of the future could automatically identify and vacuum dirt piles without mapping the whole room, garden probes could alert people when their vegetables are ready to be picked or are showing the earliest signs of disease, and virtual assistants could alert parents when their child wanders away in a store. In the future, your home will be able to tell beforehand who your visitors are. Based on an understanding of how certain individuals dress, virtual assistants will know whether you just received a package from UPS or your next-door neighbor is coming to say hello.


Technology is advancing daily, and with each new innovation we see a change in human behavior and ultimately in society as a whole. The world is becoming more and more connected. Computers no longer have a home only on your desk, but in every room of your house, in your pocket, and even on your wrist.

As with every piece of assistive technology created throughout history, machines with visual cognition would help humans to focus their time and attention on other things. This means devoting energy to accelerate productivity in other areas of life and allowing machines to take care of smaller projects and other duties.

Biological evolution happens over the course of millennia. Computing power, as Moore's law is commonly summarized, roughly doubles every two years, so computers evolve at an ever-increasing, exponential rate. Once we start to merge human and artificial intelligence, evolution will begin to move at the pace of technology, outpacing organic evolution. Consider the things humans can do now that are changing the world, such as grafting healthy tissue on top of diseased tissue to make hearts beat again. Machines of the future will be able to learn this procedure and perform it more efficiently and precisely than a human ever could. Not only that, but as machines learn, they will be able to suggest better alternatives to the procedure and contribute to developments in the medical field by drawing on a vast wealth of resources.
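
To get a feel for how quickly a doubling every two years adds up, here is a small back-of-the-envelope sketch in Python. It simply takes the two-year doubling figure at face value; the function name and the chosen time spans are assumptions made for illustration, not a forecast.

```python
# Back-of-the-envelope arithmetic for "doubling every two years".
# The function name and the sample time spans are illustrative assumptions.

def growth_factor(years, doubling_period_years=2.0):
    """How many times larger something becomes if it doubles every period."""
    return 2 ** (years / doubling_period_years)

for years in (2, 10, 20, 40):
    print(f"after {years:2d} years: about {growth_factor(years):,.0f}x")
```

Ten years of steady doubling works out to five doublings, or a 32-fold increase, and twenty years to roughly a thousand-fold increase. That is the sense in which computers evolve on a timescale that biology cannot match.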

One of the reasons artificial intelligence is called artificial is that an element is missing, one that limits the technology: visual cognition. If machines could see with full understanding, they would be able to exhibit the human traits we portray in movies. Humans would no longer need to input commands using a keyboard or describe situations aloud to help machines understand; computers would interpret their surroundings on their own. Technology would learn over time to understand the world in order to think for itself. Ultimately, technology would engage with humans in a whole new way and provide assistance in an entirely different form. This is true machine cognition.