ICS 101: Lecture 4

Searching the World-Wide Web

Web Searches

We're going to look at a very important aspect of the World-Wide Web: web searching.

This is going to be quite a comprehensive examination. I'm assuming that you've already searched the web before. Anyone using the web has had to do that. I want to build on that experience and show you how you can improve your performance.

We'll go beyond that, however. I want to show you how you can anticipate that people will be searching for your web pages. You'll want to make sure that your pages are found. There are things that you can do to help.

The problem with many web searches is that you get a huge number of references returned. This is a problem because you can't be sure that the best ones are at the start of the list. Have the most appropriate sites "gone to the top" of the list?

Often, they haven't. There are some reasons why.

Much of the web remains unindexed. That is, the search engines that build the indexes have found only a relatively small share of all web sites. For quite a while we thought that the indexing was relatively complete. A recent study (Summer 1999) found that this wasn't the case. Only about 16% of the web pages are indexed.

Many of the pages that have recently been added to the web aren't in the indexes. Studies have shown that there may be a delay as long as six months before pages get into the indexes.

Sometimes the good pages are bumped off the top of the search lists by pages that have "spammed" the index engines. What has happened is that a page has been loaded with information meant to improve its ranking for specific search terms. Typically, the same keyword is repeated over and over in an attempt to get a higher ranking. Sometimes it works. That's unfortunate because it pushes down pages that deserve a higher ranking for legitimate reasons.
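To see why keyword repetition can work, here is a minimal sketch. It assumes a deliberately naive ranking rule (count how often the search word appears on the page); real search engines do not publish their ranking rules, so this is only an illustration. The two pages and the function name `naive_score` are invented for the example.

```python
from collections import Counter

def naive_score(page_text, query):
    """Score a page by how many times the query word appears in it.

    This is an assumed, oversimplified ranking rule, used only to show
    how repeating a keyword can inflate a page's position.
    """
    words = page_text.lower().split()
    return Counter(words)[query.lower()]

# Two hypothetical pages, invented for illustration.
pages = {
    "honest": "a guide to hiking trails with maps and safety tips for hiking",
    "spammer": "hiking hiking hiking hiking hiking buy our shoes hiking hiking",
}

# Rank the pages for the search word "hiking", highest score first.
ranked = sorted(pages, key=lambda name: naive_score(pages[name], "hiking"), reverse=True)
print(ranked)
```

Under this rule the page that merely repeats the keyword outranks the page that is actually about the topic.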

Some of the search engines are not very good. It is hard to tell their quality since this is a new field and there aren't any established standards.

And finally, a poor search list may simply be the result of a person not knowing how to use the search engine. Be warned: they are not quite as easy as they appear.

General Types of Web Indexes

I've been talking about search engines. They are actually just one of the three general ways to find web sites.

Search engines are automated processes. They handle both the collection of pages to be indexed and the actual searches, which use the huge databases that they have collected.

Web directories are manually produced lists of web sites. All the sites are well arranged in an organized set of categories. The people who do this tend to be librarians. They are trained to do indexing following a well-tested system of organization. The problem with such an index is that it takes a lot of work. It is hard to keep up with the new sites being added to the web, and it is equally hard to revise the directory as web sites change.

However, these are very good indexes for many areas of the web.

"Favorites" lists are the third category of web indexes. These are lists of web sites that individuals collect and post on their own web pages. Usually such lists are not very interesting. But occasionally you come across someone who shares your interests, who has put a lot of effort into finding good sites. These lists become invaluable.

In the early days of the web, the "favorites" lists were the only indexes we had. People were encouraged to put their "finds" on their own web pages. This practice continues to this day. Many people have web pages filled with their own "favorite" sites.

Search Engines: Indexing

Many of the pages that are added to a search engine's database come from the search engine looking at web traffic. As pages move across the web, computers at various junctions in the internet can look at which pages are being requested. These page requests can be compared to the pages already known in the search-engine database.

New pages (those that aren't in the database) have their addresses stored for later processing.

Later, often at night at the host site, the search engine asks for the new pages. This is just like a person asking for the page, but in this case a computer does it automatically. Each of these retrieved pages is automatically analyzed, with the appropriate words being added to the search-engine's database.
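The step of "adding the appropriate words to the database" can be sketched as follows. This is a minimal illustration, assuming the database is a simple word-to-pages lookup table (often called an inverted index); the notes don't say how any particular search engine stores its data. The page addresses and text are hypothetical.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for address, text in pages.items():
        for raw_word in text.lower().split():
            word = raw_word.strip(".,;:!?\"'()")  # drop surrounding punctuation
            if word:
                index[word].add(address)
    return index

def search(index, query):
    """Return the addresses of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical retrieved pages (address -> page text).
pages = {
    "http://example.edu/volcano": "Hawaii volcano field trip notes.",
    "http://example.edu/surf": "Surf report for Hawaii beaches.",
}

index = build_index(pages)
print(search(index, "hawaii"))
print(search(index, "hawaii volcano"))
```

The point of building the table ahead of time is that a search becomes a quick lookup instead of a scan through every stored page.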

One of the things examined on these new pages is the set of links to other pages. These are run through the same process. If one of the links is new, it gets stored so that at some point this new page can be retrieved and added to the database.

Search Engines: Crawling

This process of finding a page and looking at its links is called "crawling." 

You can see that as the search engine gets more and more pages, it is also getting many new links. In many cases, if all the links are followed, and all the new links found on those pages are followed, and so on, most of the pages on the web will be found. Since this is an automated process, all you need is a lot of computers to do the crawling.

This is, in fact, how most of the large web databases are built. They can start with a modest collection of sites and just follow all the links.
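The crawling idea can be sketched in a few lines. This is a simplified illustration, not how any particular search engine is implemented: instead of fetching real pages, it assumes a tiny made-up "web" stored in a dictionary (`LINKS`), and the function names are hypothetical.

```python
from collections import deque

# A tiny hypothetical "web": each address maps to the links found on that page.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
    "e.html": ["a.html"],  # no other page links to this one
}

def crawl(start_pages, fetch_links):
    """Follow links outward from the starting pages, visiting each page once."""
    known = set(start_pages)        # addresses already seen
    to_visit = deque(start_pages)   # addresses stored for later processing
    order = []
    while to_visit:
        page = to_visit.popleft()
        order.append(page)
        for link in fetch_links(page):
            if link not in known:   # only new addresses get stored
                known.add(link)
                to_visit.append(link)
    return order

found = crawl(["a.html"], lambda page: LINKS.get(page, []))
print(found)
```

Notice that `e.html` is never found. A crawler that only follows links cannot reach a page that nothing links to, which is one reason "most" pages is not the same as "all" pages.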

Major Search Engines: Coverage

The result of a search engine crawling across the web, collecting and indexing web pages, is a large database. How large depends on a number of factors.

This graphic shows the size estimates, expressed as the estimated percent coverage, for December 1997 for some of the major engines. It shows a great variation in relative coverage.

A research study in the Summer of 1999 showed that these were overestimates of the overall coverage. The relative sizes, however, are probably right.

When you use a particular search engine, it is good to know how well it covers the web. A larger database is better.

Major Search Engines: Freshness

The freshness of the database is also important. It is a measure of how frequently the web crawlers revisit pages. Revisiting is important because pages change.

Crawlers visit some pages frequently, while other pages are not examined again very often. This graphic shows the range in a crawler's visits. For example, Alta Vista revisits some web pages every day, while others are seen only once a month.

In this case, a smaller value is better. It would be good to revisit each page as often as possible.
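One way to picture revisiting is as a schedule: each page has a date of last visit and a revisit interval, and the crawler picks out the pages that are due. The sketch below is only an assumption about how such a schedule might be kept; the page names, dates, and intervals are invented for illustration.

```python
from datetime import date, timedelta

def pages_due(schedule, today):
    """Return addresses whose revisit interval has elapsed, oldest visit first."""
    due = [
        (last_visit, address)
        for address, (last_visit, interval_days) in schedule.items()
        if today - last_visit >= timedelta(days=interval_days)
    ]
    return [address for last_visit, address in sorted(due)]

# Hypothetical schedule: address -> (date of last visit, revisit interval in days).
schedule = {
    "news.html": (date(2000, 2, 12), 1),    # revisited daily
    "about.html": (date(2000, 1, 20), 30),  # revisited monthly, not yet due
    "old.html": (date(2000, 1, 1), 30),     # revisited monthly, overdue
}

print(pages_due(schedule, date(2000, 2, 13)))
```

A shorter interval means a fresher database, but also more pages to fetch on any given day.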

This gets harder as the database grows. Note that Web Crawler revisits each site once a week. That's very good, but remember that this is a relatively small database.

Web Directories

Web directories are the "hand built" indexes. Either the author of a page, or a trained editor, looks at web pages and creates a brief summary. This person also looks at the contents and figures out where the page fits into the immense set of index categories.

Yahoo was the first such index and is the largest.

These indexes work very well, particularly for web pages that don't change much. 

Also, they are very useful when a keyword or two doesn't capture the concept of what you are looking for.

An example would be a search for all the public colleges and universities that are in a region, such as a state. Inside Yahoo, you would click on the following chain of index categories:

Education > Higher Education > Colleges and Universities > United States > Public > Hawaii

This is an efficient way to get to such well-categorized information.
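A directory like this behaves like a set of nested folders. As a rough sketch, assuming the categories are stored as nested dictionaries (an illustration only, not a description of how Yahoo stores its directory), following the chain of categories above looks like this. The entries in the final list are placeholders, not real listings.

```python
# Nested categories; the final list holds placeholder entries, not real sites.
DIRECTORY = {
    "Education": {
        "Higher Education": {
            "Colleges and Universities": {
                "United States": {
                    "Public": {
                        "Hawaii": ["placeholder-site-1", "placeholder-site-2"],
                    }
                }
            }
        }
    }
}

def browse(directory, path):
    """Follow a chain of categories written as 'A > B > C' and return what is there."""
    node = directory
    for category in path.split(" > "):
        node = node[category]  # raises KeyError if the category does not exist
    return node

path = "Education > Higher Education > Colleges and Universities > United States > Public > Hawaii"
print(browse(DIRECTORY, path))
```

Each click narrows the choices, so you never need to guess a keyword. You only need to recognize the right category at each step.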

The problem with this approach is that it requires a lot of manual effort to get web pages indexed. That makes it expensive to build and difficult to maintain.

"Favorites" Lists

From the earliest days of the web, people have been building their own web pages that include a set of links to sites that they find useful. As you would expect, these are highly personal lists.

Some people have done an excellent job in compiling lists on narrow topics. When there are only a few dozen key web sites covering a particular topic, such a "favorites" list may be the best resource.

The point to remember is that there are different approaches to finding good web sites. You need to use them all.


Last Updated: 02/13/00

© 2000 by K. W. Bridges