eG Monitoring
 

Measures reported by FSWebCrawColTest

The FAST Search Web crawler collects content from a set of defined Web sites, which can be internal or external.

The FAST Search Web crawler works, in many ways, like a Web browser downloading content from Web servers. But unlike a Web browser, which responds only to user input via mouse or keyboard, the FAST Search Web crawler works from a set of configurable rules it must follow when requesting Web items. These rules include, for example, how long to wait between requests for items and how long to wait before checking for new or updated items.
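The "wait between requests" rule can be pictured as a per-site politeness timer. The sketch below is purely illustrative (it is not FAST crawler code, and the class and method names are invented for this example); it shows how a crawler could enforce a minimum delay between requests to the same host, in the spirit of the crawler's delay setting.

```python
# Illustrative sketch: a per-host politeness timer enforcing a minimum
# delay between requests, similar in spirit to the crawler's "delay" setting.
# Names here (PolitenessTimer, wait_time) are hypothetical, not FAST APIs.
import time


class PolitenessTimer:
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = {}  # hostname -> timestamp of last request

    def wait_time(self, hostname, now=None):
        """Seconds to wait before the next request to this host."""
        now = time.monotonic() if now is None else now
        last = self.last_request.get(hostname)
        if last is None:
            return 0.0  # host never visited, no need to wait
        return max(0.0, self.delay - (now - last))

    def record_request(self, hostname, now=None):
        self.last_request[hostname] = time.monotonic() if now is None else now


timer = PolitenessTimer(delay_seconds=60)
timer.record_request("example.com", now=0)
print(timer.wait_time("example.com", now=10))   # 50 seconds still to wait
print(timer.wait_time("example.com", now=120))  # 0.0, the delay has elapsed
```

A real crawler would consult such a timer before each fetch, sleeping for the returned interval so that no site is hit faster than the configured rate.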

The main configuration concept in the FAST Search Web crawler is a “collection”. Each crawl collection contains the configuration applicable to that particular collection, such as which start addresses and crawl rules to apply. A typical solution might have crawl collections such as Extranet or Blogs. The FAST Search Web crawler starts by comparing the start URL list against the include and exclude rules specified in the XML file containing the configuration of a crawl collection. The start URL list is specified with either the start_uris or start_uri_files setting, and the rules via the include_domains and exclude_domains settings. Valid URLs are then requested from their Web servers at a request rate determined by the delay setting.
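The filtering step can be sketched in a few lines. This is a simplified illustration of the idea, not the crawler's actual matching logic (the real include_domains/exclude_domains rules support richer forms of matching); the function name and sample URLs are invented for the example.

```python
# Simplified sketch of checking start URLs against include/exclude
# domain rules, in the spirit of include_domains / exclude_domains.
# The real crawler's rule matching is richer than this suffix check.
from urllib.parse import urlparse


def is_crawlable(url, include_domains, exclude_domains):
    host = urlparse(url).hostname or ""

    def matches(domain):
        # The host itself, or any subdomain of it, matches the rule.
        return host == domain or host.endswith("." + domain)

    if any(matches(d) for d in exclude_domains):
        return False  # exclude rules take precedence
    return any(matches(d) for d in include_domains)


start_uris = [
    "http://blogs.contoso.com/index.html",
    "http://extranet.contoso.com/login",
    "http://other.example.org/",
]
include_domains = ["contoso.com"]
exclude_domains = ["extranet.contoso.com"]

valid = [u for u in start_uris
         if is_crawlable(u, include_domains, exclude_domains)]
print(valid)  # only the blogs.contoso.com URL survives the filtering
```

Only the URLs that pass this filter are handed to the fetcher, which then paces its requests according to the delay setting.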

If fetched successfully, the Web item is parsed for hyperlinks and other meta-information, usually by an HTML parser built into the FAST Search Web crawler. The Web item's meta-information is stored in the FAST Search Web crawler meta-database, and the Web item content (the HTML body) is stored in the FAST Search Web crawler store. The hyperlinks are filtered against the crawl rules and used as the next set of URLs to be downloaded. This process continues until all reachable content has been gathered, until the refresh interval (refresh setting) has elapsed, or until another configuration parameter limiting the scope of the crawl is reached.
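The fetch–parse–filter cycle described above amounts to a breadth-first traversal of the link graph. The following minimal sketch, with a dictionary standing in for both the Web servers and the crawler store, shows the shape of that loop; it is an assumption-laden illustration, not the crawler's implementation.

```python
# Minimal breadth-first crawl loop: fetch a page, extract hyperlinks,
# filter them against the crawl rules, and enqueue the survivors.
# fetch() and allowed() are stand-ins for the real fetcher and rules.
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, allowed, max_docs=100):
    """fetch(url) -> HTML body; allowed(url) -> bool (crawl rules)."""
    queue = deque(u for u in start_urls if allowed(u))
    seen = set(queue)
    store = {}  # url -> content; stands in for the crawler store
    while queue and len(store) < max_docs:
        url = queue.popleft()
        body = fetch(url)
        store[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            if allowed(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return store


# Two tiny pages that link to each other; each is fetched exactly once.
pages = {
    "http://a/": '<a href="http://b/">b</a>',
    "http://b/": '<a href="http://a/">a</a>',
}
store = crawl(["http://a/"], fetch=pages.get, allowed=lambda u: u in pages)
print(sorted(store))  # ['http://a/', 'http://b/']
```

The `seen` set plays the role of the meta-database here (remembering which items have already been requested), while `max_docs` stands in for a scope-limiting configuration parameter.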

To determine how efficiently the Web crawler functions, you need to understand the current load generated by each crawl collection: the number and size of the documents crawled per collection, and the speed at which the crawler downloads these documents. The FSWebCrawColTest test provides these insights and helps you assess the Web crawler's efficiency.

The measures made by this test are as follows:

Measurement: FSActSitCraw
Description: Indicates the number of Web sites or Web links that are currently being crawled with this crawl collection.
Unit: Number
Interpretation: The sum of the value of this measure across collections serves as a good indicator of the crawler's current workload. If the number of Web sites or the total number of Web items to be crawled is large, the FAST Search Web crawler can be scaled by distributing it across multiple servers. Compare the value of this measure across collections to identify which collection is generating the highest load.

Measurement: FSCurDocDowRate
Description: Indicates the rate at which documents are currently being downloaded with this crawl collection.
Unit: Downloads/min
Interpretation: The crawler's overall download rate depends on the number of active sites that are busy.

Measurement: FSAvgDocSiz
Description: Indicates the average size of the documents downloaded with this crawl collection.
Unit: MB
Interpretation: This is another good measure of the current load on the crawler.

Measurement: FSDocCrawSto
Description: Indicates the number of documents downloaded with this crawl collection that are currently stored in the Web crawler store.
Unit: Number
Interpretation: The FAST Search Web crawler stores crawled content locally on disk during crawling. The content is divided into two types: Web item content and metadata.

Measurement: FSDocDel
Description: Indicates the number of documents downloaded with this crawl collection that have been deleted from the Web crawler store.
Unit: Number

Measurement: FSDocDow
Description: Indicates the number of documents that are currently downloaded with this crawl collection.
Unit: Number

Measurement: FSDocStoMod
Description: Indicates the number of stored documents that were modified with this crawl collection.
Unit: Number
Interpretation: The crawler periodically looks for changes in the Web sites/Web pages configured for crawling and writes these changes to the crawled items that already exist in the store. The FSDocStoMod measure reports the number of items in the store that were updated with changes; the FSDocWriCraw measure, on the other hand, reveals how many such changes were written to the store.

Measurement: FSDocWriCraw
Description: Indicates the number of current document writes to the FAST Search Web crawler store.
Unit: Number
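To make the rate and size measures concrete, the sketch below shows how such figures could in principle be derived from raw per-collection counters. The counter names and the derivation are illustrative assumptions for this example, not eG or FAST identifiers.

```python
# Hypothetical derivation of download-rate and average-size measures
# from raw per-collection counters; names here are illustrative only.

def download_rate(docs_downloaded, interval_minutes):
    """Downloads/min over the measurement interval."""
    return docs_downloaded / interval_minutes


def average_doc_size_mb(total_bytes, docs_downloaded):
    """Average downloaded document size in MB."""
    if docs_downloaded == 0:
        return 0.0  # avoid division by zero when nothing was downloaded
    return total_bytes / docs_downloaded / (1024 * 1024)


# 300 documents totalling 150 MB fetched in a 5-minute interval:
print(download_rate(300, 5))                        # 60.0 downloads/min
print(average_doc_size_mb(150 * 1024 * 1024, 300))  # 0.5 MB on average
```

Tracked per collection, figures like these let you compare workloads across collections and spot a collection whose document sizes or download rate dominate the crawler's load.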