The Invisible Web

a presentation by Marylaine Block for LACONI-RASS on November 14, 2003 at Batavia Public Library.

NOTE: The most valuable single resource for understanding this topic is The Invisible Web, Information Today, 2001, by Gary Price and Chris Sherman.

Some Questions Not Answered by General Search Engines

  • Is my plane on time?
  • Track the path of an approaching hurricane
  • Check current road conditions
  • Check hotel availability and prices


  • Find a recent webcast.
  • Watch birds hatching and the mothers feeding them, or other processes observable through continuous webcams
  • What does a machine gun sound like?
  • Get a visual display of your search results that shows amount of data and relationships between ideas.


  • Does any local library have a copy of American Infidel: Robert G. Ingersoll?
  • Find out if anybody has patented your nifty invention
  • How much did this stock I inherited cost at the time my uncle purchased it in 1956?
  • Get a list of stocks that match my personal criteria
  • What concerts are going to be available in Boston next March?
  • Where can I buy an out-of-print book called Angels and Spaceships? Or, I have a copy of this book; is it worth anything?
  • Find out how to handle a chemical that may be toxic
  • Compare nursing homes in my area
  • Search for unclaimed property
  • Find the author of a quote and verify its exact wording and source
  • What are usenet groups saying about a particular topic?


  • Find articles in magazines, newspapers, and reference sources that aren't published free on the web
  • Find historical archives of magazine and newspaper articles
  • Find internal corporate documents
  • Search a national database for local obituaries and death notices


  • Find a snapshot of a web site that doesn't exist anymore


    Why Is It Invisible?

  • Data is too current, real-time, constantly changing -- current stock prices for specific companies, news (failure of search engines -- or searchers -- on Sept. 11). Google now respiders millions of sites daily; AlltheWeb (http://alltheweb.com/) automatically displays the two most recent stories first.

  • Format is difficult for crawlers -- PostScript documents, Flash, audio, streaming video, etc. (though search engines are rapidly adding these capabilities -- Google now gives access to Usenet groups, and other search engines are increasing the kinds of file formats they index)

  • Data is generated on the fly when the question is asked; to get the answer you have to fill in a form on a specific database -- patent searching, trip directions, etc.

  • Access is proprietary, forbidden to crawlers, and/or password-protected -- NY Times, commercial databases, intranets. "On the web" vs. "by way of the web"

  • Nobody linked to it

  • Sites that have the information may not be crawled in depth -- how many pages of the EPA's web site have been indexed by search engines? Try a search by topic plus domain -- "acid rain" site:epa.gov, "guru interview" site:marylaine.com

  • Search engines limit the number of viewable results. [Google only displays two results from any one site unless you specifically click on MORE]. When completeness counts, use multiple search engines.
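
    The topic-plus-domain search mentioned above can be put together in a few lines. This is only a minimal sketch: the helper names are hypothetical, and the search address and its "q" parameter are assumptions about how the engine accepts a query, not something stated in this handout.

```python
from urllib.parse import urlencode

# Assumption: the engine accepts the query in a "q" parameter at this address.
SEARCH_URL = "http://www.google.com/search"


def site_query(topic, domain):
    """Return a query that restricts an exact-phrase topic to one domain."""
    # There must be no space between "site:" and the domain name.
    return f'"{topic}" site:{domain}'


def search_url(topic, domain):
    """Return a URL that runs the domain-restricted search."""
    return SEARCH_URL + "?" + urlencode({"q": site_query(topic, domain)})


if __name__ == "__main__":
    print(site_query("acid rain", "epa.gov"))
    print(search_url("guru interview", "marylaine.com"))
```

    Running the same query through more than one engine only requires changing the assumed address, which fits the advice to use multiple search engines when completeness counts.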


    When To Use Invisible Web Resources

  • When you need real-time information -- flight-tracker, news, etc.

  • When you need dynamically-generated information from a database -- trip directions from here to there, sources for an out of print book, phone numbers and addresses for people and businesses.

  • When you need highly authoritative information from journals and other specialized sources (FindArticles.com http://www.findarticles.com/PI/index.jhtml, Bartleby http://bartleby.com/, EbscoHost, Making of America http://moa.umdl.umich.edu/, etc.)

  • When you need more control over the ways to limit the search -- in a history database, restrict by period or continent, for instance.

  • When you need a particular kind of content search engines don't do well with -- images, streaming video, etc.

  • When you already know who's likely to have produced the info you need and want to go directly there.

  • When you need to search a narrower, more selective universe -- kids' sites, news sites, science sites, gov docs, a detailed bibliography, info on Samuel Johnson (which one?), etc.

  • When it's not publicly available unless you use particular software -- RSS, peer-to-peer systems

  • When you need to do a particular kind of searching, or even to browse in alphabetical order -- Research Index, for instance, allows citation searching; just try to search for The Nation in a keyword-based system


    Finding Tools

    General

  • Complete Planet - discover and search 103,000 databases and specialty search engines http://www.completeplanet.com/

  • Direct Search http://www.freepint.com/gary/direct.htm

  • The Invisible Web -- companion site to the book by Chris Sherman and Gary Price http://www.invisible-web.net/

  • ProFusion http://www.profusion.com/ -- "Target your search by drilling into one of these vertical search groups"

    For more info on the scope and content of the invisible web, read: Bergman, M. K. (2001) The deep Web: surfacing hidden information. The Journal of Electronic Publishing, 7 (1) http://www.press.umich.edu/jep/07-01/bergman.html (18 January 2003)


    Search Engines Inside the Search Engines

  • Google: Google Uncle Sam http://www.google.com/unclesam, Google Groups, Google News, Google Images, directory, catalogs, Linux sites, university search, Google Answers http://answers.google.com/ [see http://www.google.com/advanced_search?hl=en], and more

  • AlltheWeb http://alltheweb.com/ -- news, pictures, video, audio, FTP.

  • AltaVista http://altavista.com/ Images, MP3/audio, video, directory, news, yellow pages.

  • Lycos http://www.lycos.com/ -- Images, shopping, yellow pages, Lycos topics like blogs, kids, family zone, etc.


    Directories and Specialized Search Engines

  • FIND DISCONTINUED SITES:

    Use the cache: command on Google to find discontinued pages, e.g., cache:www.____.____

    Internet Archive http://www.archive.org/ -- search by URL; for topical search (still in beta), use http://recall.archive.org/
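
    Both lookups start from the address of the vanished page, so they are easy to generate together. The sketch below is illustrative only: the function names are hypothetical, www.example.com stands in for a real site, and the web.archive.org address pattern is an assumption about how the Internet Archive lists captures by URL; it does not appear in this handout.

```python
def strip_scheme(url):
    """Drop a leading http:// or https:// so only host and path remain."""
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def google_cache_query(url):
    """Build the cache: search statement for a discontinued page."""
    return "cache:" + strip_scheme(url)


def archive_lookup_url(url):
    """Build an Internet Archive lookup by URL.

    Assumption: captures of a page are listed under this address pattern.
    """
    return "http://web.archive.org/web/*/" + strip_scheme(url)


if __name__ == "__main__":
    page = "http://www.example.com/old.html"  # placeholder address
    print(google_cache_query(page))
    print(archive_lookup_url(page))
```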



  • FIND IMAGES, VIDEOS, WEBCAMS, WEBCASTS, ETC.

    Finding Images and Sounds on the Web http://marylaine.com/images.html

    Kartoo http://kartoo.com/ -- a visual display search engine.

    FindSounds http://www.findsounds.com/

    Google Directory - Webcams - Directories http://directory.google.com/Top/Computers/Internet/On_the_Web/Webcams/Directories/?tc=1/

    WebCam Central http://www.camcentral.com/

    Classical Music Search http://la.znet.com/~iwamura/page2.html -- "When you know a melody and you do not know its title or composer, this melody search engine will help you."

    Singingfish - Find Audio and Video http://www.singingfish.com/

    Streaming News and Video http://www.freepint.com/gary/audio.htm -- another Gary Price gem.



  • SEARCH FOR BOOKS:

    AddALL Book Search and Price Comparison http://www.addall.com/ -- defaults to searching in-print titles; click on Used and Out of Print for OOP

    Finding Out of Print Books http://marylaine.com/bookbyte/getbooks.html

    RedLightGreen http://www.redlightgreen.com/ -- RLG's shared catalog of the 126 million item records of its member libraries



  • SOME MISCELLANEOUS FINDING TOOLS:

    Epinions http://www.epinions.com/ -- product reviews. Reviews themselves are rated by other readers, and the authors of the best-rated reviews become trusted reviewers

    Daypop http://www.daypop.com/ -- search blogs, RSS feeds and news

    Kids Click Search http://sunsite.berkeley.edu/KidsClick!/

    FindLaw LawCrawler http://lawcrawler.findlaw.com/ -- one of several legal search engines

    Medlineplus http://www.medlineplus.gov/ -- great starting place for vetted medical info.

    Search Systems - Largest Free Public Records Database Collection http://www.searchsystems.net/




  • Use Two-step Searching

    Use a general search engine to search for the likely source or database, then search inside that source. Sample search statements:

  • streaming video + search engine
  • diabetes + association (sometimes your best info comes from the primary professional or charitable association involved with the topic)
  • patents + database
  • "rock music" + encyclopedia -- the only way you're going to find anything about the artist known simply as E
  • Hispanics + demographics
  • word 6.0 + tutorial
  • plumbing + "how to"
  • "hp deskjet 5550" + product review
  • cataloging + listserv (or discussion)
  • whales + webcam
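
    The sample statements above all follow one pattern: a topic plus the kind of source likely to hold the answer. A minimal sketch of that pattern follows; the function name is hypothetical, and quoting every multi-word term is a choice made here for consistency, not a rule taken from the list.

```python
def two_step_queries(topic, source_types):
    """Pair a topic with each kind of source, in 'topic + source' form."""

    def quote(term):
        # Quote multi-word terms so the engine treats them as a phrase.
        return f'"{term}"' if " " in term else term

    return [f"{quote(topic)} + {quote(kind)}" for kind in source_types]


if __name__ == "__main__":
    for statement in two_step_queries("rock music", ["encyclopedia", "database"]):
        print(statement)
```

    Each generated statement is step one; step two is still done by hand, inside whatever source the first search turns up.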

    There are many ways to approach the needle in the haystack problem:

  • A known needle in a known haystack
  • A known needle in an unknown haystack
  • An unknown needle in an unknown haystack
  • Any needle in a haystack
  • The sharpest needle in a haystack
  • Most of the sharpest needles in a haystack
  • All the needles in a haystack
  • Affirmation of no needles in a haystack
  • Things like needles in any haystack
  • Let me know whenever a new needle shows up
  • Where are the haystacks?
  • Needles, haystacks -- whatever

    Matthew Koll. "Information Retrieval." http://www.asis.org/Bulletin/Jan-00/track_3.html

    [Image: model of search queries]






    Becoming More Visible All the Time

    Search engines are changing constantly, trying to give access to the invisible web. To keep up with search engine improvements you can have all of these mailed to you:

  • Research Buzz http://www.researchbuzz.com/

  • Resource Shelf http://www.resourceshelf.com/ -- daily tips from Gary Price.

  • Search Day http://searchenginewatch.com/searchday/

  • Search Engine Watch http://searchenginewatch.com/

  • Search Engine Showdown http://www.searchengineshowdown.com/

    For tips on getting the most out of Google, read Tara Calishain's Google Hacks, O'Reilly, 2003.

    For more on blogs, RSS, and site minders, read Steven Cohen's Keeping Current, ALA 2003.

    To find library weblogs to keep you current on developments in librarianship and technology, see Peter Scott's Library Weblogs http://www.libdex.com/weblogs.html



