Google just works, as if by magic. Searching seems like a simple tool that requires no effort or thought, or at least that's what the magicians who develop search engines want you to believe. Realistically, there are skills you can learn that make your Web site a better partner for search engines and help users find the information they need.
In this article, you'll learn the basics of controlling how search engines treat your site: techniques that ensure all of your pages get indexed, and techniques that make the index more valuable to the users trying to find your information.
Search engines obey the same basic rules, whether they are internal (a commercial search tool deployed with your Web site) or external (a public indexing and searching service such as Google). Developing a site so that its internal search returns meaningful results can therefore produce more relevant results for external searches as well.
Meta tag it
The most basic thing that you have to do to control a search engine, whether internal or external, is to write a META tag with a NAME attribute of ROBOTS and a CONTENT attribute that contains INDEX or NOINDEX and FOLLOW or NOFOLLOW, separated by a comma when you supply both. This simple tag tells a search engine what it should do with your page. Well-behaved internal and external search engines honor this META tag's instructions. For example, the following tag asks search engines not to index the page:
<META NAME="robots" CONTENT="noindex">
INDEX means to include the page in the index the search engine is creating. NOINDEX tells the search engine not to include the page. It is this index that the search engine consults to produce user search results, so a page that isn't added to the index will not be found. A good example of where you might want to use the NOINDEX setting in the ROBOTS META tag is a discontinued product on your eCommerce site. You still need to keep the product in the catalog so that users can review their orders; however, you don't want anyone randomly stumbling across it in a search. Products in the catalog that have not been discontinued would typically have an INDEX setting.
FOLLOW indicates that the search engine should follow the links on your page, and NOFOLLOW tells it not to follow them. Use NOFOLLOW to keep search engines away from links that you don't want them to follow. For example, when you're indexing a discussion forum, you probably don't want your internal search engine to go off and index the other sites linked from within postings. In other situations, described below, the whole purpose of the page is to provide a set of links for the search engine. In that case the content will likely be NOINDEX, FOLLOW so that the search engine doesn't index the page itself but does follow the links it provides.
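The two choices described above can be combined mechanically. The following is a minimal sketch, assuming pages are generated by a script; the function name robots_meta and the sample product list are made up for illustration and are not part of any particular search tool.

```python
def robots_meta(index: bool = True, follow: bool = True) -> str:
    """Return a ROBOTS META tag for the given indexing and link-following choices."""
    index_value = "index" if index else "noindex"
    follow_value = "follow" if follow else "nofollow"
    return f'<META NAME="robots" CONTENT="{index_value}, {follow_value}">'


# Hypothetical catalog entries: (product name, discontinued?)
products = [("Blue Widget", False), ("Old Widget", True)]
for name, discontinued in products:
    # Discontinued products stay viewable but are kept out of the index.
    print(name, robots_meta(index=not discontinued, follow=True))
```

A forum page that should be indexed without having its outbound links followed would use robots_meta(index=True, follow=False).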
Make a list
Search engines discover pages by following links, so pages that can be reached only through a search form, a script-driven menu, or a long chain of intermediate pages may never make it into the index. The solution is to create a page that contains all of the links you would like the search engine to follow. This page might include links to all of the products for an eCommerce site or to every discussion in a community site. The sole purpose of the page is to provide an HTML page containing a large number of anchor (A) tags that lead to all of the content on the site. The page does not have to be written in any specific scripting language; it is special only in that it provides all of the needed links quickly.
In some cases this technique can be a quick and dirty way of enabling a site index even if the structure of the site doesn't lend itself to crawling. It is possible to write a program that lists all of the files on the site that you want indexed by literally walking through the file system or through the IIS virtual directories. By providing a link to each file, you can add every page to the search index. The drawback is that the search index may then include pages and files that were orphaned from the main site a long time ago.
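Walking the file system might be sketched as follows. This is only an illustration under stated assumptions: the function name list_site_urls is hypothetical, the site is assumed to map file paths directly onto URL paths, and only .htm and .html files are treated as pages. A real site with virtual directories or scripted pages would need its own mapping.

```python
import os
from urllib.parse import quote


def list_site_urls(site_root: str, extensions: tuple = (".htm", ".html")) -> list:
    """Walk site_root and return a sorted list of site-relative URLs for page files."""
    urls = []
    for folder, _subfolders, filenames in os.walk(site_root):
        for filename in filenames:
            if not filename.lower().endswith(extensions):
                continue  # skip images, scripts, and other non-page files
            full_path = os.path.join(folder, filename)
            relative = os.path.relpath(full_path, site_root)
            # Convert the file system path into a URL path with forward slashes.
            urls.append("/" + quote(relative.replace(os.sep, "/")))
    return sorted(urls)
```

Because the walk visits every matching file, orphaned pages will show up in the list too, so it is worth reviewing the output before handing it to a search engine.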
This page of links, the search crawler start page, carries a ROBOTS META tag that tells the search engine to follow its links but not to index the page itself. As we saw above, the contents would be NOINDEX, FOLLOW. Starting from this page, the search engine will be able to index every page listed on it.
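Putting the pieces together, a search crawler start page could be generated along these lines. Treat it as a sketch: build_crawler_page and the sample product URLs are invented for the example, and a real site would pull its URLs from the catalog or discussion database.

```python
from html import escape


def build_crawler_page(urls: list) -> str:
    """Return an HTML page that links to every URL and asks not to be indexed itself."""
    links = "\n".join(
        f'<A HREF="{escape(url)}">{escape(url)}</A><BR>' for url in urls
    )
    return (
        "<HTML>\n<HEAD>\n"
        # Follow the links below, but keep this page out of the index.
        '<META NAME="robots" CONTENT="noindex, follow">\n'
        "<TITLE>Crawler start page</TITLE>\n"
        "</HEAD>\n<BODY>\n"
        f"{links}\n"
        "</BODY>\n</HTML>\n"
    )


# Hypothetical product URLs used only to show the output shape.
print(build_crawler_page(["/product.asp?id=1", "/product.asp?id=2&view=full"]))
```

The URLs are escaped so that characters such as the ampersand in a query string remain valid inside the HREF attribute.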
Some search engines, particularly internal ones, will allow you to point the search engine directly at this page of links. However, there are cases where you don't have the luxury of controlling the starting point for the crawler. In this case, you need only create a link to your search crawler page on your home page. Because your intent is to allow a search engine to follow the link, you do not need to put any text in the anchor tag. The end result is a link that helps search engines reach your search crawler page without users even knowing that it is present, since there is no text to highlight between the tags. In HTML, this is simply an A tag whose HREF attribute points at the search crawler page, followed immediately by its closing tag.
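As an illustration of the empty link, the sketch below emits such a tag from a page-generation script. The function name hidden_crawler_link and the path /crawler.html are placeholders, since the actual location of the search crawler page depends on your site.

```python
from html import escape


def hidden_crawler_link(crawler_page_url: str) -> str:
    """Return an anchor tag with no link text, so nothing is drawn for the user."""
    return f'<A HREF="{escape(crawler_page_url)}"></A>'


# "/crawler.html" is a placeholder for wherever the page of links lives.
print(hidden_crawler_link("/crawler.html"))  # prints <A HREF="/crawler.html"></A>
```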
Keep it off the page
Once you have managed to get all of your pages indexed, it's time to focus on making the search results more meaningful. The first step is to eliminate items on the page that distract the search engine. For instance, menus are not useful to a search engine because they appear on every page and contain the same words. Another example is a promotional item that is recommended to customers based on their interest in a specific product but whose text is not directly related to the focus of the page. Including these things in the search index only makes searching more difficult, because a search term that appears in the menu will match every page where the menu exists.
By inspecting the user agent string of the incoming request and doing a case-insensitive search for the string ROBOT, it's possible to determine whether a request is coming from a search engine. Not every search engine includes ROBOT in its user agent string (Googlebot, for example, contains only the shorter BOT), so check the strings sent by the crawlers you care about. Once you've identified a request as coming from a search engine, you can simply not draw the menus, promotions, and other unrelated information found on the page, thereby preventing it from being indexed.
The net result is that the search engine gets only the information that is relevant to the page. It doesn't accidentally index information that isn't core to the page, and so it doesn't return search results that are of no use to the user.
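The check and the conditional drawing might look something like the sketch below. The names is_search_engine and render_page, the markers parameter, and the menu and promotion fragments are all assumptions made for the example; how the user agent string reaches your code depends on your Web server or framework.

```python
def is_search_engine(user_agent: str, markers: tuple = ("robot",)) -> bool:
    """Return True when any marker appears in the user agent string, ignoring case."""
    lowered = (user_agent or "").lower()
    return any(marker.lower() in lowered for marker in markers)


def render_page(content: str, user_agent: str) -> str:
    """Return the page body, leaving out menus and promotions for search engines."""
    # Placeholder fragments standing in for the site's real menu and promotion.
    menu = "<DIV>Home | Products | Support</DIV>"
    promotion = "<DIV>You might also like...</DIV>"
    if is_search_engine(user_agent):
        return content  # crawlers see only the content that is core to the page
    return "\n".join([menu, content, promotion])
```

Passing a longer tuple, such as markers=("robot", "bot"), is one way to widen the match for crawlers that do not use the full word.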
As a parting note on search engine optimization, be sure to include a specific, accurate title on all of your pages. Most search engines display the title of the page when it appears in a set of search results. Similarly, a META tag with a name of KEYWORDS can encourage higher ranks for pages when the user searches for those keywords, at least in search engines that give the tag weight; not all of them do.
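For pages produced by a script, the title and keywords can be emitted together. The following is a small sketch with a hypothetical function name, head_tags, and made-up sample values.

```python
from html import escape


def head_tags(title: str, keywords: list) -> str:
    """Return TITLE and KEYWORDS META tags for a page's HEAD section."""
    # Trim each keyword and drop any that are empty.
    cleaned = [keyword.strip() for keyword in keywords if keyword.strip()]
    keyword_list = ", ".join(cleaned)
    return (
        f"<TITLE>{escape(title)}</TITLE>\n"
        f'<META NAME="keywords" CONTENT="{escape(keyword_list)}">'
    )


print(head_tags("Blue Widget", ["widget", "blue widget"]))
```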