Google just works magically. Searching is a simple magic
tool that works without any effort or thought—at least that’s what the
magicians who develop search engines want you to believe. Realistically, there
are some skills that you can learn to help your Web site be a better partner
with search engines and help users find the information that they need.
In this article, you’ll learn the basics of how to control
searching, along with techniques both to ensure that all of your pages get
indexed and to make indexing more valuable for the users trying to find your
content.
Search engines, whether an internal one built on a
commercial search tool designed to be deployed with a Web site or a public
indexing and searching service such as Google, obey the same basic rules.
Developing a site to return meaningful results for the internal search also
tends to produce more relevant results for external searches.
Meta tag it
The most basic thing that you have to do to control a search
engine, whether internal or external, is to write a META tag with a name
attribute of ROBOTS and a content attribute which contains INDEX or NOINDEX and
FOLLOW or NOFOLLOW. This simple tag tells a search engine what it should do
with your page. Both internal and external search engines obey this META tag’s
instructions.
<META NAME="robots" CONTENT="noindex">
INDEX means to include the page in the index the search
engine is creating. NOINDEX tells the search engine not to include the page in
the index. It is this index that the search engine uses to find user search
results. If the page isn’t added to the index then it will not be found. A good
example of where you might want to use the NOINDEX setting for the ROBOTS META
tag is when you have a discontinued product on your eCommerce site. You still
need to keep the product in the catalog so that users can review their orders;
however, you don’t want anyone stumbling across the product in a search.
Products in the catalog which have not been discontinued would typically have
an INDEX setting.
FOLLOW indicates the search engine should follow the links
on your page, and NOFOLLOW tells the search engine not to follow links found
on the page. NOFOLLOW is useful when, for example, you’re
indexing a discussion forum and you don’t want your internal search engine to
go off and index the links to other sites that might be contained within
postings. In other situations, described below, the whole purpose of the page
is to provide a set of links for the search engine. In this case, the content
will likely be NOINDEX, FOLLOW so that the search engine doesn’t index the page
itself but does follow the provided links.
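The four settings combine into a single content value. As a minimal sketch, assuming you generate your pages from code, a small helper (the name robots_meta is hypothetical, not part of any library) could build the tag from two yes/no decisions:

```python
def robots_meta(index: bool = True, follow: bool = True) -> str:
    """Build a ROBOTS META tag from two decisions.

    index  -- should the search engine add this page to its index?
    follow -- should the search engine follow the links on this page?
    """
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    content = ", ".join(directives)
    return f'<META NAME="robots" CONTENT="{content}">'


if __name__ == "__main__":
    # A discontinued product: keep it out of the index.
    print(robots_meta(index=False, follow=True))
    # A forum posting: index it, but do not chase outside links.
    print(robots_meta(index=True, follow=False))
```

Keeping the two decisions as separate arguments mirrors the way the tag itself treats indexing and link following as independent choices.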
Make a list
One of the key challenges in creating a search-friendly
site is helping the search engine know which pages it needs to add to its index.
Traditionally a search engine is pointed at the root page of a site and is
allowed to wind its way through the site until it has followed every link. This
works well for sites that always output their links as anchor (A) tags. It
breaks down when a site uses something other than plain anchor
links to connect one page to another. The result is that the search
crawler won’t be able to follow the links in the site, so the search index may
get only a handful of pages: the ones it was able to reach through normal links
on the home page.
The solution is to create a page which contains all of the
links that you would like the search engine to follow. This page might include
links to all of the products for an eCommerce site or every discussion in a
community site. The sole purpose of the page is to provide an HTML
page containing a large number of anchor (A) tags which lead to all of the
content on the site. The page does not need to be written in any
specific scripting language. It is special only in that it attempts to quickly
provide all of the links needed.
In some cases this technique can be a quick and dirty way of
enabling a site index even if the structure of the site itself doesn’t lend
itself to that. It is possible to create a program which creates a listing of
all of the files on the site which you want indexed by literally walking
through the file system or through the IIS virtual directories. By providing a
link to each, it’s possible to add every page to the search index. The drawback
is that the search index may include pages and files which were orphaned from
the main site a long time ago.
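The file system walk described above might be sketched as follows. This is only an illustration under stated assumptions: the function name list_site_urls, the base_url prefix, and the choice of .html and .htm as the extensions worth indexing are all hypothetical, and a real site would map folders to URLs according to its own server configuration.

```python
import os
from urllib.parse import quote


def list_site_urls(site_root, base_url="/", extensions=(".html", ".htm")):
    """Walk site_root and return a sorted list of URLs, one per matching file.

    Assumes files under site_root are served at base_url with the same
    relative path, which may not hold for every server setup.
    """
    urls = []
    for folder, _subfolders, files in os.walk(site_root):
        for name in files:
            # Skip images, scripts, and anything else we do not want indexed.
            if not name.lower().endswith(extensions):
                continue
            relative = os.path.relpath(os.path.join(folder, name), site_root)
            # Use forward slashes and percent-encode characters such as spaces.
            urls.append(base_url + quote(relative.replace(os.sep, "/")))
    return sorted(urls)


if __name__ == "__main__":
    for url in list_site_urls("."):
        print(url)
```

Because the walk lists whatever is on disk, orphaned files show up too; filtering by folder or by a list of known-dead paths is one way to limit that.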
The search crawler start page carries a ROBOTS META tag
which tells the search engine to follow links but not to index the page itself.
As we saw above, the content would be NOINDEX, FOLLOW. Because of this page,
the search engine will be able to index every listed page in the entire site.
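Putting the two ideas together, a crawler start page is just a ROBOTS META tag followed by a long run of anchor tags. The sketch below assumes you already have the list of URLs in hand; the function name render_crawler_page and the page title are hypothetical.

```python
from html import escape


def render_crawler_page(urls):
    """Return HTML for a page that lists every URL as an anchor (A) tag.

    The ROBOTS META tag asks the search engine to follow the links
    without adding this page itself to the index.
    """
    lines = [
        "<HTML>",
        "<HEAD>",
        '<META NAME="robots" CONTENT="noindex, follow">',
        "<TITLE>Site index</TITLE>",
        "</HEAD>",
        "<BODY>",
    ]
    for url in urls:
        # Escape characters such as & so the markup stays valid.
        safe = escape(url, quote=True)
        lines.append(f'<A HREF="{safe}">{safe}</A><BR>')
    lines += ["</BODY>", "</HTML>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_crawler_page(["/products/1", "/forum?id=2&page=3"]))
```

The output can be written to a static file on a schedule or served dynamically; either way the crawler sees plain anchor tags.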
Some search engines, particularly internal ones, will allow
you to point the search engine directly at this page of links. However, there
are cases where you don’t have the luxury of controlling the starting point for
the crawler. In this case, you need only create a link to your search crawler
page on your home page. Because your intent is only to allow a search engine to
follow the link, you do not need to put any text in the anchor tag. The end
result is a link that will help search engines reach your search crawler
page without users even knowing that it is present, since there is no
text to highlight inside of the tags. In HTML terms, that is an anchor (A) tag
whose HREF points at the crawler page and whose opening and closing tags have
nothing between them.
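A minimal sketch, assuming the home page is available as a string of HTML and that the crawler page lives at a path of your choosing (the function name add_crawler_link and the path /crawl.html are hypothetical):

```python
import re
from html import escape


def add_crawler_link(home_html, crawler_href):
    """Insert an anchor with no text just before the closing BODY tag.

    The anchor has nothing between its opening and closing tags, so a
    crawler can follow it but a visitor has nothing to see or click.
    """
    link = f'<a href="{escape(crawler_href, quote=True)}"></a>'
    match = re.search(r"</body\s*>", home_html, flags=re.IGNORECASE)
    if match is None:
        # No closing BODY tag found: append the link at the end.
        return home_html + link
    position = match.start()
    return home_html[:position] + link + home_html[position:]


if __name__ == "__main__":
    home = "<html><body><h1>Home</h1></body></html>"
    print(add_crawler_link(home, "/crawl.html"))
```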
Keep it off the page
Once you have managed to get all of your pages indexed, it’s
time to focus on making the search results more meaningful. The first step in
this process is to eliminate items on the page which are distracting for the
search engine. For instance, menus are not useful to a search engine since they
appear on every page and contain the same words. Another example is a
promotional item which is being recommended to customers based on their interest
in a specific product but whose text is not directly related to the focus of
the page. Inclusion of these things in the search index only makes searching
more difficult because the search term used may occur on every page where the
menu or promotion appears.
By inspecting the user agent of an incoming request and
doing a case-insensitive search for the string ROBOT in the user agent string,
it’s possible to determine whether a request is coming from a search engine or
not. Not every search engine includes ROBOT in its user agent string
(Googlebot, for example, contains BOT but not ROBOT), so check what the
crawlers you care about actually send. Once you’ve identified a request as
coming from a search engine, you can simply not draw the menus, promotions, and
other non-related information found on the page, which prevents that material
from being indexed.
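The check and the conditional drawing could be sketched like this. The function names and the three-part page layout are hypothetical; the default marker follows the ROBOT string used in this article, and the markers argument is there on the assumption that you will want to add others, since a crawler such as Googlebot identifies itself with "bot" rather than "robot".

```python
def is_search_robot(user_agent, markers=("robot",)):
    """Case-insensitive check for any marker inside the user agent string."""
    agent = (user_agent or "").lower()
    return any(marker.lower() in agent for marker in markers)


def render_page(content, menu, promotions, user_agent):
    """Return only the core content for robots, the full page for people."""
    if is_search_robot(user_agent):
        # Leave out menus and promotions so they are never indexed.
        return content
    return "\n".join([menu, content, promotions])


if __name__ == "__main__":
    print(render_page("BODY", "MENU", "PROMO", "ExampleRobot/1.0"))
    print(render_page("BODY", "MENU", "PROMO", "Mozilla/5.0"))
```

A missing user agent is treated as an ordinary visitor here, which is a design choice: it errs on the side of showing the full page.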
The net result is that the search engine gets only the
information that is relevant to the page. It doesn’t accidentally index
information which isn’t core to the page, and so it doesn’t return search
results which are not useful to the user.
As a parting note on search engine optimization, be sure to
include a specific, accurate title on all of your pages. Most search engines
display the title of the page when it appears in a set of search results.
Similarly, a META tag with a name of KEYWORDS can be useful in encouraging
higher ranks for pages when the user searches for those keywords.
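As a small sketch of both points, assuming pages are generated from code, a helper (hypothetically named render_head) might emit the title and the KEYWORDS META tag together so that neither is forgotten:

```python
from html import escape


def render_head(title, keywords):
    """Return a HEAD section with a page title and a KEYWORDS META tag."""
    keyword_list = ", ".join(keywords)
    return "\n".join([
        "<HEAD>",
        f"<TITLE>{escape(title)}</TITLE>",
        f'<META NAME="keywords" CONTENT="{escape(keyword_list, quote=True)}">',
        "</HEAD>",
    ])


if __name__ == "__main__":
    print(render_head("Blue Widgets & Gadgets", ["widgets", "blue widgets"]))
```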