If you submitted the above pair of images to a computer equipped for visual processing, the machine should be able to determine that the thing on the left is a hedgehog, whereas the object on the right is a wire brush. Initially, the computer would exploit the hints provided by the image captions, "animal" and "tool". Then it would search through its databases to examine hypotheses about various wiry-looking animals and wiry-looking tools. But suppose we were to submit the images without any captions, as follows:
In this case, the computer would no doubt find it much more difficult to identify the two images correctly.
The point I'm trying to make is that computers have a hard job trying to perform various tasks that are trivially easy for humans. And tasks that depend solely on visual cues, with no linguistic hints whatsoever, are particularly difficult for the computer. A simple glance informs us that the thing on the left looks like an animal, in that the dark "holes" are surely eyes, and the rectangular bit that protrudes in the foreground is surely a snout. As for the object on the right, its sharply-defined contours reveal instantly that it's a manufactured artifact. However, it remains an extremely arduous task to instill this kind of common-sense visual approach in computers.
As a child, I used to see my father shaving with an old-fashioned steel razor. One day, while being driven through the outback countryside by my grandparents, I saw a hillside whose trees had suffered recently in a bushfire, and I had the sudden impression that the dark stumps could be likened to my father's face when he was in need of a shave. I imagined that it might be possible to attach a giant steel razor to our automobile, enabling us to shave down the burnt trees. Some people might say kindly that I had a vivid imagination. In fact, I was reacting like a poorly-programmed computer, incapable of instantly making a clear distinction between hedgehogs and wire brushes. Since evolving into an experienced adult (?), I'm no longer inclined to associate an unshaved face with a hillside of scorched tree stumps. You might say that my childhood power of imagination has disappeared. On the other hand, I don't confuse hedgehogs with wire brushes.
When I started work in the research department of the ORTF [French Broadcasting System] in 1970, a TV producer asked me whether it would be possible to develop a computer program enabling the machine to "watch" old movie sequences, for days and nights on end, with the aim of extracting all the top-quality images, say, of trees. The fellow was concerned with the use of TV as an educational medium in certain African nations, and he felt that tons of existing images could be recycled to make excellent documentaries for African audiences. I disappointed him by pointing out that a computer would be incapable of separating images of people into males and females, so it was premature to talk about software capable of selecting attractive images of trees.
Google has been working for a long time on image-recognition algorithms, and two of their engineers have just presented a paper on this subject at a web conference in Beijing. Their experimental tool named VisualRank attempts to weight and rank web images that look alike... in much the same way that the familiar Google tool weights and ranks websites that would appear to talk about a given subject. It goes without saying that the practical potential of such a tool would be immense, and profitable. Users interested in a certain kind of object or article (such as clothes they would like to purchase, for example) could expect to start with a rough visual outline of their goal, and go on to access pictures of relevant items supplied by the search engine.
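For readers curious about what "weighting and ranking images that look alike" might mean in practice, here is a minimal sketch in Python of a PageRank-style calculation applied to images instead of websites. It is only an illustration of the general idea, not Google's actual VisualRank code. The function name `rank_by_similarity`, the damping value and the little table of similarity scores are all my own assumptions, and the sketch says nothing about the hard part, which is measuring how alike two images look in the first place.

```python
def rank_by_similarity(similarity, damping=0.85, iterations=100):
    """Score images with a PageRank-style iteration (hypothetical helper).

    similarity[i][j] is an assumed, made-up measure of how alike
    images i and j look (0 = nothing in common).
    """
    n = len(similarity)
    # Total similarity "leaving" each image, used to share out its score.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    scores = [1.0 / n] * n  # start by treating all images as equal
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Image i collects a share of the score of every image it resembles.
            flow = sum(
                similarity[i][j] / col_sums[j] * scores[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            new_scores.append(damping * flow + (1 - damping) / n)
        scores = new_scores
    return scores


# Invented data: images 0, 1 and 2 resemble one another (three hedgehogs,
# say), while image 3 (a wire brush) resembles none of them very much.
similarity = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.1],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

scores = rank_by_similarity(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy case, the image that most resembles its neighbours ends up at the top of the list, and the odd one out ends up at the bottom. Everything depends, of course, on where the similarity figures come from.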
I have the impression, however, that Google still has a long way to go before reaching that point. So, if you intend to use the web to purchase either a hedgehog or a wire brush, be wary about supplying nothing more than a vague image of what you're looking for. For the moment, it would be wiser to write down in words exactly what you want.