Do you need to see a new object 1,000 times to recognize it? And if you see it another 100,000 times, will you recognize it any better? I think not, yet this is exactly how image recognition systems based on Haar cascade classifiers or neural networks work.
SaraVision is an image recognition project that is conceptually completely different from the approach the whole world is pinning its hopes on: ever better neural networks learning from ever bigger and more available databases. We assume that one glance at an object, or just a few, is sometimes enough to remember and recognize it. It all started from the need to add the sense of sight and some intelligence to one of our sub-projects, SaraCam, which raises the current Google and Alexa assistants to level 2.0.
At first, to test some basic assumptions, we wrote a simple program that recognizes the MNIST character set, which I describe on our blog in a slightly provocative article, "About the nonsense of deep learning, neural networks in image recognition (using the MNIST kit)". Even there we managed to create a very universal program that recognizes characters regardless of their size, slant or font type, but it was only a programming "sandbox".
The next stage was to create something more universal that allows recognition of arbitrary objects, starting with quick detection of basic geometric figures. We also wanted to test the theory that our brain can "see" very well what our eyes cannot, literally drawing the missing elements in our imagination (see: visual perception and Gestalt psychology, the theory of reification), and that our system would work similarly:
It worked: as you can see in the amateur video below, the system behaves similarly, so you don't have to see the whole square to detect that the square is there.
It may seem that detecting simple figures is easy and any programmer can do it. You can use "ready-made" programs, or you can algorithmize everything, but we don't want to write an algorithm for every shape; we want to write one for all shapes, and most importantly, we don't want to teach the system with thousands of images.
The next step was to test whether the system could handle face detection in camera video. Importantly, the system is designed to detect a face very quickly: the face can be tilted left or right, turned slightly or fully sideways, poorly lit, visible in color or in IR. It worked. The system, despite still being under construction, was almost 20x faster than standard face detection systems, and most importantly, it coped where other systems could not cope at all (for example, a face illuminated from one side by the sun with the head slightly tilted):
(video recorded on a Raspberry Pi 4 microcomputer, using a single CPU core at 20-30%, with the image analyzed in real time for occlusion from a moving pan-tilt camera that tracks the user's face)
We are at the beginning of the road, but the results we are getting seem sensational, and we are already thinking about a 3D space recognition system based on our method.
Although at the moment we plan to apply our method to our other sub-project, SaraCam, the possibilities of this system are enormous, the main ones being:
1. We don't teach the system with thousands of samples, just a few or a dozen.
2. The angle at which we analyse the recognised image is not important.
3. Up to 20 times higher speed of object recognition.
4. Minimum computer power (Raspberry Pi microcomputer can detect a face in 10 ms, not in 500 ms).
5. No internet connection is needed.
I do not reject image recognition methods based on neural networks, nor the huge progress of these methods, especially in recent years; I think tools like TensorFlow are brilliant. But I also think that many things can be done differently, that not everything should be pushed into neural networks, and that if we do want to use them, we should give them the data on which they have the best chance to work well.
We already know what our voice assistant SaraCam, part of SaraAI, one of our artificial intelligence sub-projects, will look like.
As we wrote earlier, thanks to the funding we received, we have accelerated strongly.
The SaraCam project is about upgrading voice assistants to a higher level by adding sight and intelligence.
You can find more information on the project website, SaraAI.com/SaraCam; here I would like to present our journey from the first model to the final look.
The idea of creating Sara was born a long time ago, when the Internet was in its infancy, speech recognition didn't work, and there was no access to open knowledge bases. Fortunately, those limitations are behind us now, which allowed us to return to the project and start the first tests of our long-considered assumptions. In one of our first published videos you can see our first prototype assistant, made from a regular IP camera, where we show some aspects of the assistant that we would like to develop further. This one-and-a-half-minute video, although old and amateurish, shows some key solutions, like establishing a kind of bond with the device and continuity of dialogue, which seem crucial to us and which we already described in another article, "We are looking for Artificial Intelligence, and we get... a speaker."
After the initial tests, seeing the limitations of standard IP cameras, we developed our assistant further by adding a more powerful processor, a set of 6 microphones and fast motors, so that the camera could keep up with fast movement. The next, hybrid version of SaraCam was born:
At the same time, we also made our first video showing some of the functionality we want to include in the commercial version of SaraCam:
In late 2020, thanks to the funding we received for SaraCam and our collaboration with MindSailors Design Studio, we are finally creating the final shape and functionality of SaraCam, which we will soon present in action. For now, we can already reveal its design:
How do you like it?
Sara AI is implementing a project co-financed by European Funds under the Operational Programme Eastern Poland 2014-2020, Measure 1.1 "Starting platforms for new ideas", Sub-measure 1.1.2 "Development of start-ups in Eastern Poland", entitled "SaraCam - interactive voice assistant based on proprietary image recognition algorithms with the function of eye tracking and face recognition".
Project objective: The aim of the project is to bring to market the SaraCam device, an AI assistant with breakthrough functionality based on its camera and proprietary algorithms for activation and authentication by sight, including eye tracking.
Planned effects: The project will bring a new product innovation to market - the SaraCam hardware module.
Completion date: 01.12.2020 – 28.02.2022
Project value: PLN 1,232,608.08 (~$320,000)
Amount of funding: 937,279.70 PLN
We are pleased to announce that our artificial intelligence project won a distinction in the category START-UP WITH POTENTIAL POLAND - WORLD in the "Eagle of Innovation" competition of the daily newspaper "Rzeczpospolita", which promotes innovative, value-creating products and services.
We are pleased to announce that "SaraCam", an interactive voice assistant based on proprietary image recognition algorithms and one of the sub-projects of our Artificial Intelligence project SaraAI, has received EU funding from the Polish Agency for Enterprise Development.
Completion date: 01/12/2020 - 28/02/2022
Project value: PLN 1,232,608.08 (~$320,000)
Co-financing amount: PLN 937,279.70
In this article, I will demonstrate the nonsense of using Deep Learning and neural networks to recognize images, using the example of handwriting recognition on the MNIST character set, and I will prove it in practice.
I will also show that to recognize handwritten digits, or images more broadly, it is not necessary to teach the system with tens of thousands of samples: one, or at most a few, will be enough. It will not matter what size the training sample is or what sample is used to test effectiveness; nor will centering, size, rotation angle or line thickness matter, all of which are of paramount importance in machine learning and deep learning image recognition methods.
As we know, each image in a computer is made up of single points (Max Planck would say our whole world is made up of pixels) which, connected in an appropriate way, create an image. The same is true of the characters that form the digits in the MNIST database.
The MNIST character set and its classification is a kind of ABC for every student or anyone starting to play with deep learning or neural networks.
This collection was compiled by Yann LeCun (http://yann.lecun.com/exdb/mnist/) and contains 70,000 images at a resolution of 28x28 pixels in 256 shades of grey. By convention, 60,000 images form the training set and 10,000 the test set. Each image (here, a digit from 0 to 9) is represented as a 2D matrix of 28x28 pixels with grayscale values 0-255. A form easier for machine learning algorithms to process is the vector form, obtained by reading the pixel values from the image line by line. This gives a 784-dimensional vector (28*28=784) representing one digit image, so the whole training set is a 60000×784 matrix.
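The row-by-row flattening described above can be sketched in a few lines of Python; here a random array stands in for a real MNIST sample, since the point is only the shape of the data:

```python
import numpy as np

# A stand-in for one 28x28 grayscale MNIST digit, values 0-255.
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Read the pixels line by line to get the 784-dimensional vector form.
vector = image.reshape(-1)
print(vector.shape)  # (784,)

# 60,000 such vectors stacked together give the 60000x784 training matrix.
training_set = np.zeros((60000, 784), dtype=np.uint8)
print(training_set.shape)  # (60000, 784)
```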
Now all you have to do is use the TensorFlow tools available in the cloud, download a few Python files, read about the many clever ways of teaching networks, buy a few books on the subject, or just download a ready-made model, as most people do, and create your first neural network.
Wow: for one character consisting of 28 lines of 28 points each, specially prepared for recognition, properly centered and calibrated at the perfect resolution, the big heads came up with the idea that if our human brain, built of neurons, can read it, then a mini electronic brain made of artificial neurons will read it too. And it worked! The results are almost perfect, close to 100%, but...
Using neural networks to recognize digits is like replacing a stick for writing numbers in beach sand with a specialized robot that picks up individual grains of sand to form the same pattern!
Do you think that our brain, looking at a number, cuts it into strips, pixels and vectors and analyses it point by point? If we take Dr. Clark's estimate that the resolution of our eyesight is 576 megapixels (really no more than about 20 megapixels at the center of the gaze), then there is no energy drink that could help our brain with such a job.
What's more, is this what learning is about? Do we have to see 60,000 characters to learn them, or did our primary school teacher write each one once and make us remember it?
If she draws you a certain sign once, for example:
will you recognize the one below as different?
Of course it is a bit different, but if we have a certain set of digits and this sign among them, then everyone will recognize it, even seeing it for the first time. They will also have no problem drawing it a while later somewhere else.
And now, most importantly: how do we remember this sign? Pixel by pixel, all 784 points? And what if it is a 256 by 256 pixel mark, like a regular icon on our home computer, consisting of 65,536 points?
No. Our brain is great at simplifying all the data that reaches it, and all the more so when the data is as huge as image data.
Let's think for a moment how we would describe this sign in words to someone else, and we'll arrive at one of these simplified memorization methods. We'd probably describe it more or less like this:
Two segments, one horizontal and one vertical: the first segment is horizontal, and at its end it meets the vertical segment at 90 degrees. Note that in this case practically 4 pieces of information are enough to describe the sign, rather than information about 784 points in 256 shades of grey. What is more, a description written this way is independent of the size of the sign.
Exactly the same method can be used to remember every sign or digit, but also objects in an image, regardless of whether the object is in 2D or 3D space (in a nutshell, because recognizing an image in 2D, let alone 3D, requires additional steps; if you are interested, I will gladly describe some easy recognition methods in the next article).
So once again, what is this simple method?
We will remember:
1. the "starting points"
2. the "starting angle"
3. the "segment type"
4. the "angle of bending of the segment"
5. the "angle of the joint between segments".
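As a rough illustration, the five features above could be stored in a structure like the Python sketch below; the class and field names are my own invention for this example, not a description of any finished implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    kind: str           # "line" or "arc" (the "segment type")
    start_angle: float  # initial direction in degrees (the "starting angle")
    bend: float         # total curvature along the segment ("angle of bending")
    joint_angle: float  # angle at which it meets the next segment

@dataclass
class Shape:
    start_point: Tuple[float, float]  # where the stroke begins
    segments: List[Segment]

# The sign from the earlier example: a horizontal line meeting a vertical
# line at 90 degrees -- a handful of numbers instead of 784 pixel values.
sign = Shape(
    start_point=(0.0, 0.0),
    segments=[
        Segment(kind="line", start_angle=0.0, bend=0.0, joint_angle=90.0),
        Segment(kind="line", start_angle=90.0, bend=0.0, joint_angle=0.0),
    ],
)
print(len(sign.segments))  # 2
```

Note that nothing in this description depends on the size, position or line thickness of the drawn sign, which is exactly the property the text emphasizes.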
As you can see, I think the human brain remembers basic shapes and the way they are connected.
We easily detect straight lines, circles, and arcs more or less bent, along with their mutual positions and connections.
Let's take the number 6 as an example (I also recommend watching everything in the attached video: https://youtu.be/do9PM2PtW0M):
The number 6 can be recognized in several ways. In fact, it consists of several segments always connected at a positive angle: either a few segments forming a circle at the bottom, or one segment forming a kind of snail, in this case sweeping through 476 degrees from its beginning at the top to its end at the bottom.
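The "degrees swept" measurement for such a snail-like stroke can be approximated by summing the signed direction changes along the traced line. A minimal Python sketch, run on a synthetic spiral rather than real handwriting data:

```python
import math

def total_turning(points):
    """Sum of signed direction changes (in degrees) along a polyline.

    For a stroke that keeps turning one way, like the snail of a 6,
    this accumulates well past 360 degrees.
    """
    total = 0.0
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[i + 1]
        a1 = math.atan2(y1 - y0, x1 - x0)
        a2 = math.atan2(y2 - y1, x2 - x1)
        d = math.degrees(a2 - a1)
        # Normalize to (-180, 180] so each step is the smaller signed turn.
        d = (d + 180.0) % 360.0 - 180.0
        total += d
    return total

# Toy example: points sampled along an outward spiral of about 1.3 turns.
spiral = [(math.cos(t) * (1 + 0.1 * t), math.sin(t) * (1 + 0.1 * t))
          for t in [i * 0.2 for i in range(42)]]
print(round(total_turning(spiral)))  # well over 360 degrees
```

Because only relative direction changes are summed, rotating all the points by any fixed angle leaves the result unchanged, which is the invariance the article relies on.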
Such a method also has an additional advantage: it does not matter if the image is rotated by practically any angle, and neither line thickness, image size nor position matters, i.e. everything that has a very negative influence on deep learning methods.
After all, the same sign drawn at a 45-degree angle has all the parameters described above unchanged.
An attentive reader will probably catch it right away: what about the digits 6 and 9? Well, up to a certain angle a 6 is a six, and beyond a certain angle it becomes a nine.
See the attached video for how great the recognition problems of standard deep learning methods are when you write digits at an angle. Remember also that the MNIST character set is already specially prepared: the characters are properly calibrated, centered, always at the same 28x28 resolution, and so on.
At this point, more experienced programmers will probably point out one big problem.
The human brain simply sees the whole picture: lines, circles, everything at once. That's simple. But the computer at any one moment really sees only one point, one of those 784 points, so how do we program it, how do we detect those lines? An additional problem is that some digits are written in bold, so when we look at one point and check next to it, there are painted dots everywhere; how do we know in which direction the line runs?
(this is how the computer sees the line)
We can use two techniques for characters:
1. simple thinning, until we get a line one point thick (a point may be one pixel, but it may also be 4 pixels, depending on needs)
2. the "fast injection method": imagine that we need to fill a given digit with colour (like a flood-fill function); we inject the paint at one point and watch where it can spread, but as if under pressure, so that, for example, in the digit 8 it does not spill in all directions at a crossing, but shoots across to the opposite side of the "crossroads" and flows on.
After thinning, when we have a single line, we easily extract straight segments, curves and angles, and remember those.
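For the first technique, one standard choice is Zhang-Suen thinning; the sketch below is my own minimal implementation for illustration (not the project's code), reducing a thick stroke to a line about one point wide:

```python
import numpy as np

def zhang_suen_thin(img):
    """Classic Zhang-Suen thinning on a binary 0/1 image."""
    img = img.copy().astype(np.uint8)
    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the algorithm's two alternating sub-iterations
            to_delete = []
            for y in range(1, img.shape[0] - 1):
                for x in range(1, img.shape[1] - 1):
                    if img[y, x] != 1:
                        continue
                    # 8 neighbours, clockwise starting from the one above.
                    p = [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                         img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]
                    b = sum(p)  # number of foreground neighbours
                    # Number of 0->1 transitions going around the neighbourhood.
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((y, x))
            for y, x in to_delete:
                img[y, x] = 0
                changed = True
    return img

# A thick 3-pixel-wide vertical stroke thins down to a single line.
stroke = np.zeros((10, 7), dtype=np.uint8)
stroke[1:9, 2:5] = 1
thin = zhang_suen_thin(stroke)
print(int(thin.sum()), "pixels left of", int(stroke.sum()))
```

Once the stroke is one point thick, walking along it and recording direction changes gives exactly the segments, curves and angles described above.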
I devised this method myself over 20 years ago; before writing this article, I decided to write a simple program to see if I was right. After a few days I had a program, not perfect, because I wasn't making a commercial product, but enough to confirm that the method works, with character detection above 97%; and probably if I worked a few more days I would reach 99%.
On the other hand, one could say that the methods described all over the Internet, those using neural networks, achieve the same detection rate, so why do it differently? Yes and no. This method does not require training or supercomputers; there doesn't have to be a set of only 10 characters, there can be any number of them; the characters don't have to be 28x28, they can be 450x725, and in fact each character can be in any format, a different one each time; the characters can be rotated. Adding more characters does not require re-teaching a network, and such an algorithm has no problem recognizing additional characters printed in almost any font available on the computer, of course without any additional teaching.
Some people will probably write to me right away that MNIST is really just about showing how to teach neural networks, but I can't agree with that; it really doesn't make sense. After all, in shop class we don't learn to drive a nail with a house demolition ball.
Leave those signs alone, because there is no point in using such methods on them.
Throw out all those thick books about Deep Learning, leave this great tool in the clouds, sit down and think up better and better algorithms yourself; that is a programmer's ABC.
I was prompted to write this article because, while working on our strong Artificial Intelligence project SaraAI.com, I am constantly flooded with questions about which deep learning methods I use, which NLP methods, and so on. And when I answer that I use practically no well-known methods directly, I see only surprise and doubt on the faces of my interlocutors. So I wanted to show that we don't always have to follow the generally accepted line of solving programming problems; we can do something completely different and get much better results.
Alongside the article, I also recommend the proof-of-concept video: https://youtu.be/do9PM2PtW0M
"Throw out all those thick books about Deep Learning." After these words, I hope you already know that this article is deliberately written in a very provocative style, because what, if not emotions, leads to a lively discussion :-) (do not throw away those books!)
I don't mean that we should abandon Deep Learning, but that we should pay attention to using these methods where they are optimal. In addition, I think the data you feed into learning should be pre-prepared, as our brain does.
Although our brain receives an image from the eye made up of pixels (photons), before comparing it with a remembered pattern it processes and prepares it in the visual cortex, in areas V1-V5. Let's do the same.
Note that, if you read carefully, this is not about the MNIST character set at all: with this method I can add any number of characters, including all Chinese ones, and it will not change recognition. See in the video how I add the Roman numerals II and III alongside the digits 2 and 3, which is completely irrelevant to this method (I add them once and they are recognized as Arabic 2 and 3), yet is of paramount importance for methods based on neural networks.
Consider this method a basic ABC which, with further development, can easily be applied to face recognition, the Fashion-MNIST collection, or even ImageNet.
One of the accusations is that I spent a few days on this program, while with networks it can be done in 10 minutes. OK, but TensorFlow was not written in 10 minutes either. Now that I have the algorithm ready, I can add more alphabets in a few minutes.
"But there is few-shot learning (or zero-shot learning), where we don't have to train on 60,000 samples." Yes, but for now these methods give totally unsatisfactory results, so why use them, if we have here a simple, working algorithm requiring one or a few samples?
"But data augmentation is used (small rotations, scaling, stretching, elongation)." Yes, but that only multiplies the number of samples by further thousands.
"The 97% result you're giving has no measurable value, because no one but you can test it." The point is that now everyone can get such a result, because I've described the whole method, so a good programmer will not only reproduce this result but also improve it without any problem.
"Why use your algorithm and not neural networks?"
Out of curiosity, since I haven't done much research on this, I simply ask: do you know any program or algorithm that recognizes arbitrary characters of any alphabet, written or printed in any format, size, color or thickness, or on which you could perform a test like this:
one person draws any character, even an imaginary one
10 other people write the same sign, or a slightly different one
the system has to determine who wrote the wrong sign.
To make it harder, everyone draws the sign on a tablet at a different resolution, with a different brush thickness, in different colors and at different angles?
In addition, the algorithm has to work on a "calculator" without network access, adding more images must not require re-learning the entire database, and the resources used must be really negligible.
Man has always looked at the stars and asked whether he is alone. Even in the most pessimistic variant of the Drake equation, there are 250,000 highly developed civilizations somewhere in the infinite universe that would be able to visit us.
But we also know that the chance of getting in touch with such highly developed intelligence is close to zero, and maybe that's why we'd like to create our own artificial intelligence.
Artificial intelligence has been with us for a long time, practically from the beginnings... of cinema. Most often it is shown as an ominous robot or an animal-like creature. Why? Because we can't imagine something we've never seen, something unlike anything that already exists. This is a huge limitation of our brain, and it means our evolution is not rapid but rather slow.
We live in a time when science and knowledge seem to be exploding. According to Moore's law, the performance of computers has doubled every two years since the 1960s. Because of this law, in a few years the memory of your smartphone will be measured in terabytes, and a dozen or so years later in something that doesn't even have a name yet, because such large numbers have never been needed.
By all these laws, the performance of smartphones will soon exceed that of our brains, so I ask: what's going on?
It is 2019; we have powerful computers with unbelievable memory and computing power, we have huge IT companies with billion-dollar budgets, and what do we get in 2019?
A talking loudspeaker, an encyclopedia in a loudspeaker, a talking clock with completely zero intelligence.
While waiting for intelligence, it might be a good idea to hibernate for a few years.
Why did we get a speaker? Why are the products of Boston Dynamics, the maker of incredibly capable robots, in fact just fancy remote-controlled machines?
There is one "product", maybe you can guess which, that I think is worth a closer look.
This "product" does not have a very good speech synthesizer; it only produces strange noises, usually at the worst moments. It leaks terribly, especially in the initial stage of operation, and has practically no knowledge base; you will not learn from it who the president of the United States is. In fact, you will learn nothing from it.
It performs only a few voice commands, but the dog, for that is who we are talking about, is nevertheless man's greatest friend.
Why does such a "simple" being stir such great emotions in a person? Why can we talk to it for hours, even though it does not really answer?
Why are we so excited about the "loudspeaker with AI", yet the longer we have it, the more our enthusiasm fades, while with an ordinary dog, the longer it is with us, the more we like it?
I will tell you: because of contact, the invisible threads of understanding, nonverbal but very strong. And in this understanding, one of the most important features is eye contact (eyes, faces and head movements can often convey more than words).
Can't we build that now? Is it really enough to put out a "speaker" and hope that people will love it?
Well, we are able to do it. In our Sara AI project, we give Sara a personality, senses and an identity, but most importantly we give her intelligence: at the beginning only a little, as much as a dog, maybe a child of a few years. Is that not enough? Isn't a dog's intelligence enough to spend hours with it? Remember also that we give the intelligence of a dog or a child, but with the knowledge of the entire world's databases.
Without intelligence, however minimal, no natural language processing system will ever be able to pass for even slightly intelligent; it will always be just a talking speaker.
We give it the minimum: contact, a thread of understanding, surprise, unpredictability. Not three ready-made answers to previously programmed questions. Not that way.
You get simple, human answers to simple questions. If you share your impressions on a given topic, you can expect real interaction, not encyclopedic answers.
You get eye contact, a non-verbal way of communicating, so you don't have to use a wake word at the beginning of each sentence. You talk to Sara as to a human, so you don't have to say "Hey, Sara", then wait for her to activate, and only then keep talking. To achieve this, Sara has eyes (cameras, of course), and she also turns her head and thinks.
I think most of us have seen Metalhead (episode 5 of the 4th series of Black Mirror). In this black-and-white episode, dog-like machines take control of the Earth, terrorising people with their ruthlessness.
After watching this episode, probably every IT fan is immediately reminded of what Boston Dynamics creates. The resemblance of the film's killers to the real robots built by this company is striking. Interestingly, Boston Dynamics plans to start selling these nice dogs in 2019, at first 100-1,000 units per year.
Let's also look at the evolution taking place before our eyes in robotics, in which Boston Dynamics is undoubtedly the leader.
Imagine that these "dogs" are sold in large quantities over the following years and suddenly, due to a software bug or a hacker attack, the apocalypse from the Black Mirror episode begins...
Let me calm everyone down. NO, these are not "terminators" yet.
Robots of this type will be sold with a remote control with which we steer where they go. For now, they are only modern "remotes". Although they are equipped with an autonomous mode, it is still only about setting a straight path from point A to point B.
It's good that artificial intelligence is not available yet... ("it was not available so far," Sara AI will soon think to herself).
Browsing the internet, we see "artificial intelligence" practically everywhere, but do we really? Do we see artificial intelligence, or only two attractive marketing words?
We live in times when "AI" describes many products, from sharp knives to voice assistants. Is this artificial intelligence?
If we look at various pages describing what AI actually is, we find descriptions general enough that everything can fit. I have the impression that when a large company spends a few billion dollars on a new product, or even on any additional feature in a phone, it adds a further clause to the definition of AI, so the product can be fully promoted as "AI powered".
I also have the impression that most people nevertheless intuitively understand what real AI should be; a lot of Hollywood films probably have a huge impact on that.
But why is it still impossible to talk with even a simple AI on such powerful computers, with billions of dollars spent on research? Why do the best voice assistants become boring after a moment of use? Well, there are several "small" problems that have not been solved so far.
First of all, for programmers to write something, they need to understand what to write, and unfortunately our knowledge of how the human brain works, in terms relevant to AI, is almost none. We know how neurons work and how they communicate, we know which parts of the brain are responsible for which activities, and we already understand our behavior quite well from the psychological side, but we cannot combine this knowledge well enough to understand it, let alone describe and copy it.
The second "small problem" is that computers are really blind and deaf, and have no sense of touch, smell or taste. Imagine that a child is born without all the senses; what chance does it have to become intelligent to any degree? This is obviously an extreme case, but it is enough for a child to be born blind. Blind children develop well, but they start to talk and understand much later. Hearing and touch can quickly sharpen and help the development of intelligence, but it takes much longer than in people with functional vision, one of the most important senses for exploring the outside world.
Some of you are probably thinking now: but computers have cameras and microphones. They have, but...
The best image recognition systems available to everyone, such as Google Vision, analyze an image for a very long time, see little and make thousands of basic mistakes, while a child in every second of life watches what is effectively a 3D movie at dozens of frames per second, for many hours a day!
Microphones: here progress is greatest. The computer can capture sound direction, loudness and frequency, but speech recognition systems are still limping, unable to pick up a voice well and recognize it in a room disturbed by other sounds. Remember that even at 90% accuracy, every 10th word is lost or converted into another. Try communicating well while turning one word in ten into a random one unrelated to the topic...
Now, an explanation of what artificial intelligence is, in my opinion, and what it is not.
It would seem that Elon Musk's autonomous Tesla, which can take us home from work, is an example of developing artificial intelligence. No: it is a brilliant invention, the future of motoring, but there is no more artificial intelligence there than in any phone, i.e. none at all. These are simply extended algorithms operating on the principle of implementing programmed conditions, such as: when the red light is on, stop the car. Of course it's a simplification, but that is exactly how it works. You do not really want a car to make decisions based on its experience and learning from past events, because we would not be able to predict the car's behavior. It is better to write a set of rules into it than to wonder why the car suddenly turned left because it came up with such a brilliant idea. After all, we learn from mistakes, and we do not let children drive a car, because mistakes while driving could end tragically.
Voice assistants: ask one a simple question about some activity whose effect every child knows, e.g. "can I get into the fire?"
Voice assistants have an IQ of zero; I'll explain how they work and why there is no AI in them.
There are more developed systems that can, for example, summarize a text they have read. It would seem that to summarize a text one must understand it, know what it is about and know the context; only then can it be summarized. Nothing could be more wrong: it's just statistics and enormous knowledge bases.
So how does it all actually work and cheat us by pretending to be AI?
Understanding our speech is the job of systems based on Deep Learning, which are the basis of NLP (Natural Language Processing). This is not a scientific article, so I will summarize quickly: there are many better or worse methods (POS tagging, parsing, Named-Entity Recognition, Semantic Role Labeling, sentiment classification, question answering, dialogue systems, contextualized embeddings) which, in great summary, analyze big knowledge bases, e.g. a database of Twitter dialogues, find the most common words and assign values to different expressions; the greater the value, e.g. positive, the more the sentence is classified as positive in a given sense. Other manipulations of words, sounds or characters are also used.
It is all one big exercise in statistics that really can fool us a little.
The simplest example to understand how this works is predicting the completion of a sentence: "Hungry like..." - "a wolf"; "once upon..." - "a time", etc. I know I have simplified a great deal, but Google search is the same statistics. Enter a word and look at the suggestions: it's not AI that suggests the words, it's just statistics.
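This kind of completion can be mimicked with nothing more than word-pair counts. A toy Python sketch (the tiny corpus and the function names are mine, purely for illustration of the statistics at work):

```python
from collections import Counter, defaultdict

# A toy corpus standing in for the huge text databases mentioned above.
corpus = ("hungry like a wolf . once upon a time . hungry like a wolf . "
          "once upon a time . hungry like a bear").split()

# Count bigrams: how often does each word follow another?
following = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    following[w1][w2] += 1

def predict_next(word):
    """Return the statistically most frequent follower.

    No understanding of meaning is involved -- just counting,
    which is the whole point of the example.
    """
    return following[word].most_common(1)[0][0]

print(predict_next("hungry"))  # like
print(predict_next("like"))    # a
```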
If you want to know more from the technical side about NLP, read this article.
When it comes to voice assistants, it is even worse; I have the impression that there is a staff of people sitting there who, for the statistically most frequently asked questions, put in three different answers each.
This is the wrong way!
Identifying words in sentences and context and predicting statistical answers is by no means AI.
Real AI should work on completely different principles, in which NLP is not the goal but the means to the goal. How to solve the problems described above and create a real AI, I will describe in the next article.