Search Engines

The internet now represents a huge database of information, much of it irrelevant. To find the best information, most people use search engines, or content managers like Yahoo. These tools are now our dictionaries for the world. Since 1997, the growth of the net has been incredible. A term that might have returned 100 hits on Altavista back then might return 6,000 on Altavista now, an increase of 60 times. Google is now the number one search engine, because its algorithm more closely matches our own thinking.

Total Number of Pages in the Internet

The total number of pages held in search engines may have grown from around 50 million in 1997 to around 3 billion in 2003 (again a ratio of 60 times). That is one web page for every two people on earth. Of course, not everyone on the planet has access to the internet; perhaps 10% do, or 600 million people, which means there are 5 web pages for every user of the internet.

Will there be 60 times more in 2010 - a total of 180 billion?
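The ratios above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the world population of 6 billion is an assumption implied by "10%, or 600 million people", and the 2010 figure simply reapplies the same 60-fold ratio rather than modelling real growth.

```python
# Rough estimates taken from the text; none of these are measured values.
pages_1997 = 50_000_000
pages_2003 = 3_000_000_000
world_population = 6_000_000_000   # assumption implied by "10%, or 600 million"
internet_users = world_population // 10

growth_ratio = pages_2003 // pages_1997          # 60-fold growth, 1997 to 2003
people_per_page = world_population // pages_2003  # one page per two people
pages_per_user = pages_2003 // internet_users     # pages per internet user
projected_2010 = pages_2003 * growth_ratio        # same ratio applied once more

print(growth_ratio, people_per_page, pages_per_user, projected_2010)
```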

Some simple calculations can be made. We can assume that most people using a search engine only look at the first few (3?) entries returned. Also, Google is very popular (maybe more than 50% of all searches?). If there are 500,000 possible single-word search terms, then a maximum of 1.5 million pages are ever hit, which is 1 in 2000 of the 3 billion pages indexed. So for every page that is found through a search engine, there may be 1999 that are indexed but never found.
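The coverage estimate can be written out as a short calculation. The variable names are illustrative, and the inputs are the guesses from the text (500,000 search terms, 3 results viewed, 3 billion indexed pages), so the result is only as good as those guesses.

```python
# Inputs are the rough guesses from the text, not measurements.
single_word_terms = 500_000
results_viewed_per_search = 3
indexed_pages = 3_000_000_000

# Upper bound: every term returns a different set of top results.
max_pages_hit = single_word_terms * results_viewed_per_search

# How many indexed pages there are for each page that is ever seen.
one_in = indexed_pages // max_pages_hit
unseen_per_seen = one_in - 1

print(max_pages_hit, one_in, unseen_per_seen)
```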

Generally users only read the first few hundred words of a web page. If we estimate this as 16 sentences, at 15 words per sentence, then the total is 240 words per page. 240 words x 1.5 million pages hit means that, whatever the exponential growth of the internet might be, the quantity of content delivered up by a search engine like Google or Altavista is likely to remain at about 24 million sentences, or 360 million words. This is still a fair amount, of course.
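As a quick check of this estimate, the sketch below multiplies the figures through. Note that 16 sentences of 15 words gives 240 words per page, which is the per-page figure consistent with the totals of 24 million sentences and 360 million words.

```python
# Reading estimate from the text: 16 sentences of 15 words on each page hit.
sentences_per_page = 16
words_per_sentence = 15
pages_hit = 1_500_000   # upper bound from the single-word search estimate

words_per_page = sentences_per_page * words_per_sentence
total_sentences = sentences_per_page * pages_hit
total_words = words_per_page * pages_hit

print(words_per_page, total_sentences, total_words)
```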

Of course, many searches are two-word or three-word combinations, which increases the number of possible searches, but this only matters if those searches return pages outside the 1.5 million in the original sample.

Total Number of Web Searches

There are about 500 million searches each day. If each search hit one page and the hits were spread evenly over the 3 billion pages indexed, the average page would get hit once every 6 days, which is 5 times a month.
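The same figure in code, assuming one page hit per search, an even spread over 3 billion indexed pages, and a 30-day month (all simplifying assumptions):

```python
searches_per_day = 500_000_000
indexed_pages = 3_000_000_000   # estimate used earlier in the text
days_per_month = 30             # simplifying assumption

# With hits spread evenly, each page waits this long between hits.
days_between_hits = indexed_pages / searches_per_day
hits_per_month = days_per_month / days_between_hits

print(days_between_hits, hits_per_month)
```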

Search Engines

http://www.google.com/

http://www.google-watch.org/

http://www.altavista.com/

http://searchenginewatch.com/

What content should a search engine hold?

As there is so much rubbish on the internet, the aim of a search engine is to wade through the rubbish and find what people are really looking for. Essentially we are looking for relevant information, so our search through the databases must somehow work out what people are looking for. It is fairly clear that one-word searches are never going to be ideal, since there is so much confusion over what a word means. For example, if I search for the word table, what kind of table am I looking for?

Definition of the web

The web consists of a series of links which interconnect pieces of information. In the extreme case, the web is simply a series of links and nothing more. In this case, we can look at the links themselves and the popularity of the links, which is what Google does with the PageRank algorithm.
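To make the idea of ranking by link popularity concrete, here is a minimal sketch of a PageRank-style calculation using simple power iteration. It is a textbook simplification, not Google's actual implementation: the function name, the damping factor of 0.85, the iteration count and the four-page toy web are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link popularity; links maps each page to the pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with equal rank everywhere.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web the page with the most incoming links ends up ranked highest, and the page that nobody links to ends up lowest, which is the sense in which the links themselves carry the information.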

Dynamic Web pages change

Web pages can be static, like basic HTML, or dynamic. Even static web pages change over time, hence the continuous re-indexing of the web by search engine robots.

Ideal Search Engine

Our ideal is another copy of ourselves, something that would have the time to search through all the web and find out what we want to know. This is the difference between simplicity and complexity. If our question and search were simple, then we would not need any help. However, many questions are not simple. We need to be able to convert our search from a complex one into a simple one by building a machine that can do the complex work for us. The only way this can happen is if we have a machine that really understands us and hence can look for exactly what we are looking for. Such a machine would have to be based on input from us in order to build up a mind map of how we think, so that when we put in what we see as an obvious simple search, it can try to interpret the search for us. Some work on interactive GUIs has already taken place in this area. AdSense in Google is another aspect of this approach.



2003-09-23

Altavista Worldwide All Languages

Very High
"a" 276,643,546
a 173,265,406
"us" 110,644,276
"computer" 45,589,742
"usa" 32,990,816
"thread" 29,984,783
"communication" 16,069,543
"london" 15,455,565
"interface" 12,387,039
"java" 11,233,643
"chicago" 10,310,925

High
"wireless" 7,387,789
"sony" 6,180,319
"XML" 4,305,951
"cambridge" 4,073,873
"oxford" 3,755,141
"socket" 2,679,130
"pizza" 2,380,706
"web services" 1,960,697
"hypertext" 1,246,522
"bluetooth" 1,012,234

Medium
"workflow" 813,979
"corba" 512,798
"SGML" 485,620
"RDF" 310,173
"merlot" 298,736
"encapsulation" 296,704
"namespaces" 294,690
"RMI" 253,277
"otago" 230,177
"haskell" 222,594
"operations management" 188,274
"zaurus" 135,082

Low
"semantic web" 79,752
"formal methods" 73,251
"ebXML" 62,318
"bytecode" 60,998
"grid computing" 57,534
"petri nets" 45,385
"blackbirds" 44,482
"requirements engineering" 30,273

Very Low
"lambda calculus" 19,606
"lingard" 14,166
"nonmonotonic" 13,859
"woolacombe" 9,366
"zymurgy" 6,628

Very Very Low
"agent communication language" 4,314
"web ontology language" 2,199
"defeasible reasoning" 1,663
"agile programming" 1,138
"kakistocracy" 897

Very Very Very Low
"pointed model" 72
"term interpretations" 60
"pointed interpretation" 11

None
jomajel
jomajeli