From the issue dated January 21, 2000
Researchers now have it all on the World Wide Web: facts on virtually any topic, available from the far corners of the globe, unfiltered by reporters, editors, or publishers, and usually free. But sometimes we feel that we have too much information -- often way too much -- and that it may not be correct.
Despite the latest flurry of prime-time ads by search-engine vendors boasting that they can find anything you want online, search engines can't distinguish among Web pages based on their contents. The only way for researchers to pinpoint information on the Web is to learn how to search efficiently, and which engines are best for which purposes.
One important lesson is to understand the range of search tools now available. Many researchers don't realize that they can use hierarchical indexes, standard search engines, alternative search engines, meta search engines, and databases -- and that those tools are not all the same.
In a hierarchical index -- probably the best known is Yahoo (http://www.yahoo.com) -- people trained to categorize information, such as librarians and indexers, examine Web sites and put them in categories and subcategories. Thus, when you do a search on a hierarchical index, it is much more likely that what you find will be relevant to what you are looking for.
The drawback to hierarchical indexes is that they are extremely selective. Because they are created by human beings rather than by computers, they can include only a tiny portion of what is available on the Web. Of course, in these days of abundant information, that may not be such a bad thing.
Yahoo uses a standard search engine as well. For that reason, the results of a search on Yahoo are split into several sections. "Category matches" inform you if your topic matches one of Yahoo's existing categories. "Site matches" are the sites that have been indexed and categorized. "Web pages" provide links to pages located by the search engine. Yahoo also groups results into two other sections: "related news," for any news item it locates on your subject, and "Net events," which are mostly chat sites.
Yahoo is by no means the only hierarchical index, and some of the many others are aimed specifically at academic users. The latter group includes: AlphaSearch (http://www.calvin.edu/library/as), BUBL Link (http://www.bubl.ac.uk/link), and Infomine (http://infomine.ucr.edu).
Then there are the standard search engines. Popular ones include AltaVista (http://www.altavista.com), Excite (http://www.excite.com), Go Network (http://infoseek.go.com), and HotBot (http://hotbot.lycos.com). Unlike hierarchical indexes, standard search engines send out software "robots" or "spiders" to search the Web and index the pages in each site they encounter. The engines then calculate mathematically how relevant the pages are to your search terms; each engine uses its own algorithm to rank pages. Factors in the calculation include the frequency and placement of your keywords on a page, and their occurrence in the descriptions that owners write of their pages, which are invisible to users. The search engine puts the pages that get the highest score at the top of the list of results.
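To make the idea of a relevance calculation concrete, here is a minimal sketch in Python. It is not any real engine's algorithm, since each engine keeps its own formula; the function names and the weights (3 for a keyword in the title, 2 for one in the hidden description, 1 for one in the body) are assumptions chosen only for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def score_page(query, title, body, description=""):
    """Toy relevance score: keyword frequency, weighted by placement.

    The weights below are illustrative assumptions, not values used
    by any actual search engine.
    """
    title_words = tokenize(title)
    body_words = tokenize(body)
    description_words = tokenize(description)
    score = 0.0
    for term in tokenize(query):
        score += 1.0 * body_words.count(term)         # frequency in the page text
        score += 3.0 * title_words.count(term)        # placement: the title counts more
        score += 2.0 * description_words.count(term)  # the owner's hidden description
    return score


def rank_pages(query, pages):
    """Return the pages sorted so the highest score comes first."""
    return sorted(
        pages,
        key=lambda page: score_page(
            query, page["title"], page["body"], page.get("description", "")
        ),
        reverse=True,
    )
```

A page that repeats your keywords in its title and description floats to the top under this scheme, which also suggests why keyword counting alone can reward pages that merely mention a topic often.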
Savvy researchers will avoid standard search engines when they have a very broad subject. Instead, they will use a hierarchical index, to find just a few relevant, well-cataloged sites.
Alternative search engines, which take various approaches to ranking and sorting the pages that they find, are often more helpful than standard engines. Northern Light (http://www.northernlight.com), for instance, ranks Web pages as a standard search engine does. But instead of displaying all of its results in a single listing, it sorts pages into categories and groups the results into folders. As an example, a search for "alternative energy" creates folders with labels such as "solar power," "air pollution," and "National Technical Information Service," which includes documents from that agency. And the folders contain subfolders. Within the solar-power folder, for instance, are folders for "photovoltaic systems" and "government sites." That arrangement of material can help you determine which groups of pages are most likely to be relevant to your needs.
Ask Jeeves (http://www.askjeeves.com) takes an altogether different approach. You don't enter keywords, but type a question in plain English -- perhaps "Is there evidence of life on Mars?" Ask Jeeves has recorded millions of questions that users have asked it, and has found Web sites that answer those questions.
The first thing that Ask Jeeves does after getting your query is to scan its database of questions and answers. It then gives you a list of questions that it "thinks" you want the answer to. If you select one of them, it lists sites that contain the answers. Ask Jeeves doesn't always work, but it can save you time, and it is fun to use.
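Ask Jeeves does not publish how it matches your question to the ones in its database, so the following Python sketch is only a guess at the general idea: it compares the words in your question with the words in each stored question and suggests the closest matches. The function names and the word-overlap measure are assumptions made for illustration.

```python
import re


def word_set(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def suggest_questions(user_question, known_questions, limit=3):
    """Suggest stored questions that share the most words with the user's.

    Similarity here is shared words divided by total distinct words,
    an assumed stand-in for whatever the real service does.
    """
    asked = word_set(user_question)
    scored = []
    for question in known_questions:
        candidate = word_set(question)
        union = asked | candidate
        overlap = len(asked & candidate) / len(union) if union else 0.0
        if overlap > 0:
            scored.append((overlap, question))
    # Best overlap first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [question for _, question in scored[:limit]]
```

Called with "Is there evidence of life on Mars?" and a stored list that includes "Is there life on Mars?", the sketch puts that stored question first, because the two share most of their words.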
Google (http://www.google.com) takes yet another tack. Like other search engines, it first matches up your keywords to the pages it has collected in its index. Then, however, it ranks each page based on how many other pages link to it -- and how many link to those pages in turn. The pages you see at the top of your list of results are those with the highest number of links from other pages. The idea is that such popularity is meaningful, just as a diner that has many trucks parked in front probably serves better food than the diner whose parking lot is empty. The approach works. After several years of being a loyal AltaVista user, I am now a "googler."
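For readers curious about how ranking by links might be computed, here is a simplified Python sketch. It is not Google's actual implementation; it is a small iterative calculation in the spirit of the idea described above, in which a page's score depends on the scores of the pages that link to it. The function name, the damping value of 0.85, and the number of passes are assumptions for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links, counting links from popular pages more.

    `links` maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    count = len(pages)
    rank = {page: 1.0 / count for page in pages}

    for _ in range(iterations):
        # Every page starts each pass with a small baseline score.
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / count
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

In a tiny Web where pages "a," "b," and "d" all link to "c," and "c" links only to "a," the sketch scores "c" highest. Page "a" comes second even though only one page links to it, because that one page is the popular "c."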
Oingo (http://www.oingo.com) has an even more radical approach. The site's slogan is "We know what you mean," and Oingo conducts a "conceptual search" to make sure that it understands your request. Ask it to search for "china," for example, and it will ask you to choose "porcelain" or any of the various geographical Chinas. Once you make a selection, Oingo will display "directory hits" and "Web hits." The site combines a hierarchical index and a search engine (it uses AltaVista), although the conceptual search applies only to its directory results.
Search engines that search other engines are called meta search engines. Among the popular ones are Dogpile (http://www.dogpile.com), Inference Find (http://www.inferencefind.com), and MetaCrawler (http://www.metacrawler.com). The concept here is that because no single search engine indexes the entire Web, using a meta search engine allows a researcher to scan more sites. The downside is that such an engine needs to use a "lowest common denominator" search statement, so that all of the search engines that it searches understand the request. Therefore, meta search engines are not a very good choice for complex searches, involving, say, Boolean logic. (Dogpile does include some Boolean-search capabilities.)
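The trade-off can be illustrated with a short Python sketch. It does not reflect how Dogpile, Inference Find, or MetaCrawler are built; the engines here are hypothetical stand-in functions that take a query string and return a list of addresses, and the list of Boolean operators is an assumption. The sketch strips a query down to plain keywords, sends it to every engine, and merges the answers.

```python
import re

# Operators assumed to be unsupported by at least one underlying engine.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT", "NEAR"}


def simplify_query(query):
    """Reduce a query to plain keywords that every engine can accept."""
    words = re.findall(r"\w+", query)
    return " ".join(word for word in words if word not in BOOLEAN_OPERATORS)


def meta_search(query, engines):
    """Send one simplified query to several engines and merge the results.

    `engines` maps an engine name to a function that takes a query
    string and returns a list of addresses, best match first.
    """
    plain_query = simplify_query(query)
    merged = {}
    for name, search in engines.items():
        for position, address in enumerate(search(plain_query)):
            entry = merged.setdefault(address, {"engines": [], "best": position})
            entry["engines"].append(name)
            entry["best"] = min(entry["best"], position)
    # Addresses found by more engines come first, then by best position.
    return sorted(
        merged,
        key=lambda address: (
            -len(merged[address]["engines"]),
            merged[address]["best"],
            address,
        ),
    )
```

Note what the simplification costs: a query such as "solar NOT nuclear" becomes "solar nuclear," which asks for the very pages the researcher wanted to exclude. That is the "lowest common denominator" problem in miniature.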
A completely different strategy is to search a database on the Web. Hundreds of databases originally searchable on CD-ROM or through proprietary online dial-up services are now available on the Web, and new databases are continually being born there as well. That makes it possible to search rich databases with a standard Web browser, although in many cases, the researcher must pay a fee or be affiliated with a university that subscribes to the database. The fee-based sites typically filter the data they contain, increasing the likelihood that the results will be relevant to a search; many also offer superior search capabilities, so requests can be more precise.
The many new, free databases on the Web can also be helpful. A site that does an excellent job of identifying and sorting free databases is The BigHub (http://www.thebighub.com). Through its "specialty search categories," it allows you to search more than 1,500 databases on the Web, many of which are oriented toward academics.
What new tools for searching the Web are on the horizon? At a recent conference, I heard about "vortals," vertical portals that provide information from only a designated slice of the Web. For example, a vortal might search only those sites and pages that have to do with health care. VerticalNet (http://www.verticalnet.com) offers portals to industries including communications and advanced technologies. Although the concept is a good one, the jury is still out on vortals' usefulness.
Farther down the road are visual representations of search results. Those search tools display their results graphically, allowing you to see at a glance which items are the most relevant. A service called NewsMaps (http://www.newsmaps.com), for example, displays the results of your search as a thematic map. Topographical markers indicate clusters of similar documents -- the most similar ones are piled up into little hills. According to Cartia, the company behind the technology, the maps are created automatically by an algorithm that "reads documents, extracts the content, and organizes the collection into a map." You can view some sample maps at the site.
No matter which search tool you choose, you will get the best results if you know what information you need, know the advantages and disadvantages of the various ways to search the Web, and regularly practice doing research online. Despite technological innovation, the best research tool remains the human brain.
Robert Berkman is a member of the faculty of the graduate media-studies program at the New School University, and conducts workshops on searching the Internet. He is the author of Find It Fast: How to Uncover Expert Information on Any Subject, the fifth edition of which will be published by HarperCollins in May.