Diving Deep

The internet is a big place. Search engines like Google and Yahoo are the best tools we have for knowing what’s out there, but even they don’t capture everything.

A little background: Think of a search engine as an automated browser that clicks on every link it can find and saves every page it can find. Together all the saved pages make an “index”. Google claims to have an index of 8 billion pages, meaning it has saved 8 billion pages from the internet. And it re-saves them every week or two. When you search Google for “iPod earphones”, Google looks in its own index for those terms, then lets you know where the original content was. (So with Google, or any search engine, you’re not really searching the internet but searching a “copy” of the internet. For Google that’s an 8-billion-page copy, but still just a subset of the entire internet.)
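The save-everything-then-search-the-copy idea can be sketched in a few lines. This is a toy illustration, not how Google actually works: the `pages` dict stands in for saved web pages (the URLs and text are made up), and the "index" is a simple inverted index mapping each term to the pages that contain it.

```python
# Toy version of a search engine's index.
# "Crawling" here is just a hard-coded dict of saved pages (hypothetical URLs).
pages = {
    "example.com/a": "ipod earphones review",
    "example.com/b": "earphones for running",
    "example.com/c": "ipod battery tips",
}

# Build the inverted index: term -> set of pages containing that term.
index = {}
for url, text in pages.items():
    for term in text.split():
        index.setdefault(term, set()).add(url)

def search(query):
    """Return the saved pages that contain every term in the query."""
    results = None
    for term in query.lower().split():
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results or set()

print(search("ipod earphones"))  # only pages containing both terms
```

Searching "ipod earphones" looks up both terms in the index and intersects the results; note the search never touches the "live" pages at all, only the saved copy.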

While Google claims to index 8 billion pages, Yahoo claims 20 billion. Recent news pieces have asked whether Yahoo’s larger index makes it a better search engine, but they’ve found that Google gives more relevant results slightly more often, despite having fewer pages in its index. The challenge for search engines is to add more pages to their indexes without losing that relevance. And no search engine comes even close to finding everything on the internet.

For instance, take your local library website. At pac.provo.lib.ut.us you can search the Provo library for thousands of books. But type “site:pac.provo.lib.ut.us” into Google (that’s how you see what pages Google has indexed for that “site”) and you won’t find any books — just a couple hundred garbage pages. That means that while Google can help you find the library, it can’t help you find library books (maybe you already noticed).

Another example is the LDS Church’s “Gospel Library”: at library.lds.org you can browse or search hundreds of volumes of Church magazines and books, but when you type “site:library.lds.org” into Google, you get just 39 hits. And those 39 aren’t the least bit useful.

Tons of data is inaccessible to search engines because it’s found on sites like these — real estate listings on MLS websites, legal proceedings on court websites, and job listings on some company websites.

A startup company called Glenbrook Networks is hoping to change this. It is developing a search engine that dives into the “deep web”. I look forward to the day Glenbrook or Google helps us find information from these previously unavailable sources. It will mean billions more pages of relevant information available to the world.

In the meantime, websites like the LDS Gospel Library can use “rewrite engines” (for example, Apache’s mod_rewrite) to expose their database-driven pages at simple, crawlable URLs, making themselves more accessible to search engines.
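As a sketch of the idea, a rule like the following could map a clean, crawlable URL onto the query-string URL a database-backed catalog actually uses. The paths and parameter names here are hypothetical, not taken from any of the sites above:

```apache
# Hypothetical example: let crawlers reach /book/12345
# while the catalog script still receives its query string internally.
RewriteEngine On
RewriteRule ^book/([0-9]+)$ /catalog?action=view&id=$1 [L]
```

With links on the site pointing at the clean /book/12345 form, a search engine’s crawler can follow them like any ordinary page, even though the content still comes out of a database.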
