Oracle Ultra Search Online Documentation
Release 9.2


About the Ultra Search Crawler and Data Sources


The Oracle Ultra Search crawler is a shared server Java application that gathers documents from the data sources you specify during configuration. The crawler stores the documents in a local file system cache. With the cached data, Oracle Ultra Search creates the index required for querying.

Crawler Settings

Before you can use the crawler, you must set its operating parameters, such as number of crawler threads, crawler timeout threshold, database connect string, and default character set. Some parameters, like the log file directory and the temporary directory, have no default value, so you must set them before crawling. To do so, use the Crawler Settings Page in the administration tool.
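The settings described above can be pictured as a merge of user-supplied values over defaults, with the no-default parameters treated as mandatory. The following is an illustrative Python sketch only; the parameter names mirror the text but are not the actual Ultra Search configuration API.

```python
# Hypothetical sketch of crawler settings resolution; names are
# illustrative, not the real Ultra Search parameter names.

DEFAULTS = {
    "crawler_threads": 5,        # number of crawler threads (assumed default)
    "timeout_seconds": 30,       # crawler timeout threshold (assumed default)
    "default_charset": "UTF-8",  # default character set (assumed default)
}

# These parameters have no default value and must be set before crawling.
REQUIRED = ("log_file_directory", "temporary_directory")

def resolve_settings(user_settings):
    """Merge user settings over defaults; reject missing required values."""
    settings = {**DEFAULTS, **user_settings}
    missing = [name for name in REQUIRED if not settings.get(name)]
    if missing:
        raise ValueError(f"settings with no default must be set: {missing}")
    return settings
```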

Crawler Data Sources

In addition to the Web access parameters, you can define specific data sources on the Data Sources Page in the administration tool; for example, Web sites, database tables, files, mail archives, or user-defined data sources.

Display URL and Access URL

A display URL is the URL used to present the document in a browser. For some data sources, the URL used to access the document can differ from the display URL (for example, to bypass access control); in that case, an access URL is provided. By default, the display URL is treated as the access URL unless a separate access URL is provided. The display URL must be unique within a data source, so two different access URLs cannot share the same display URL.
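The rules above (fallback to the display URL, uniqueness within a data source) can be sketched as a small registry. This is an illustrative Python sketch, not the Ultra Search schema.

```python
# Sketch only: enforce that a display URL is unique within a data source,
# and that the display URL doubles as the access URL when none is given.

class UrlRegistry:
    def __init__(self):
        self._access_by_display = {}

    def add(self, display_url, access_url=None):
        if display_url in self._access_by_display:
            raise ValueError(f"duplicate display URL: {display_url}")
        # By default, the display URL is treated as the access URL.
        self._access_by_display[display_url] = access_url or display_url

    def access_url(self, display_url):
        return self._access_by_display[display_url]
```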

About User-Defined Data Sources and Crawler Agents

If you are defining a user-defined data source to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Ultra Search crawler, which enqueues it for later crawling. For more information on defining a new data source type, see the User-Defined sub-tab on the Data Sources Page in the administration tool.
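Ultra Search agents are implemented as Java classes; the Python sketch below only illustrates the contract shape described above: the agent yields document URLs with their metadata, and the crawler records the metadata and enqueues the URLs for later crawling. The agent class and repository URLs are hypothetical.

```python
# Illustrative sketch of the agent/crawler handoff; NotesAgent and its
# URLs are hypothetical, and real agents are Java classes.

class NotesAgent:
    """Hypothetical agent for a proprietary document repository."""
    def fetch_documents(self):
        # A real agent would query the repository's own API here.
        yield ("notes://server/db/doc1", {"author": "jsmith", "version": "3"})
        yield ("notes://server/db/doc2", {"author": "akumar", "version": "1"})

def enqueue_from_agent(agent, url_queue):
    """Crawler side: record metadata and queue each URL for crawling."""
    metadata_table = {}
    for url, metadata in agent.fetch_documents():
        metadata_table[url] = metadata
        url_queue.append(url)
    return metadata_table
```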

Synchronizing Data Sources

Data sources are used by synchronization schedules you create. A synchronization schedule has one or more data sources attached to it. Synchronization schedules define the frequency at which the Ultra Search index is kept up to date with existing information in the associated data sources. To define a synchronization schedule, use the Schedules Page in the administration tool.
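The relationship above (a schedule owning one or more data sources plus an update frequency) can be sketched as a simple data structure. This is an illustrative Python sketch only; Ultra Search schedules are configured through the administration tool, not an API like this.

```python
# Sketch only: a synchronization schedule tying one or more data sources
# to an update frequency, as described above. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class SynchronizationSchedule:
    name: str
    frequency_hours: int                       # how often the index is refreshed
    data_sources: list = field(default_factory=list)

    def attach(self, source):
        """Attach a data source; a schedule has one or more sources."""
        if source not in self.data_sources:
            self.data_sources.append(source)
```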

Document Attributes

Document attributes, or metadata, describe document properties. If the document is a Web page, the attributes can come from the HTTP header or be embedded inside the HTML in meta tags. Document attributes can be used for many things, including document management, access control, or version control. Different data sources can have attributes with different names used for the same purpose; for example, "version" and "revision". They can also use the same attribute name for different purposes; for example, "language" can mean the natural language in one data source but the programming language in another.
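One way to reconcile attributes across sources, as described above, is a per-source mapping from source attribute names to one logical attribute. The sketch below is illustrative only; the source and attribute names are assumptions, not part of Ultra Search.

```python
# Sketch only: map (data source, source attribute) pairs to a logical
# attribute, covering both "different names, same purpose" and
# "same name, different purposes". All names here are hypothetical.

ATTRIBUTE_MAP = {
    # Different names, same purpose:
    ("docs_repo", "version"): "version",
    ("wiki", "revision"): "version",
    # Same name, different purposes:
    ("docs_repo", "language"): "natural_language",
    ("code_repo", "language"): "programming_language",
}

def logical_attribute(source, name):
    """Resolve a source attribute to its logical attribute name."""
    return ATTRIBUTE_MAP.get((source, name), name)
```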

Crawling Process for the Schedule

The first time the crawler runs, it must fetch Web pages, table rows, files, and so on, depending on the data source type. It then adds these documents to the Ultra Search index. The crawling process for the schedule is broken into two phases:

  1. Queuing and Caching Documents
  2. Indexing Documents

Queuing and Caching Documents

Figures 2a and 2b illustrate an instance of the crawling cycle in a sequence of eight steps. The example uses a Web data source, although the crawler can also crawl other data source types.

Figure 2a illustrates how the crawler and its crawling threads are activated. It also shows how the crawler queues hypertext links to control its navigation. This figure corresponds to Steps 1 to 5.

Figure 2b illustrates how the crawler caches Web pages. This figure corresponds to Steps 6 to 8.

The steps are the following:

  1. Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs. See Figure 2a.
  2. Crawler initiates multiple crawling threads.
  3. Crawler thread removes the next URL in the queue.
  4. Crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
  5. Crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
  6. Crawler caches the HTML file in the local file system. See Figure 2b.
  7. Crawler registers URL in the document table.
  8. Crawler thread starts over by repeating Step 3.

Fetching a document, as shown in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
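The queuing cycle above can be sketched as a frontier of URLs fetched by a pool of threads, with newly discovered links deduplicated against the document table. This is a minimal illustrative sketch, not the Ultra Search implementation: fetching is simulated by a link table instead of real HTTP, and the queue is processed level by level for clarity.

```python
# Minimal sketch of the queuing phase under stated assumptions:
# fetching is simulated via LINKS rather than the Web.
from concurrent.futures import ThreadPoolExecutor

LINKS = {  # simulated Web: page -> hypertext links it contains
    "http://seed/": ["http://seed/a", "http://seed/b"],
    "http://seed/a": ["http://seed/b"],  # duplicate link, will be discarded
    "http://seed/b": [],
}

def fetch(url):
    """Stand-in for Step 4: fetch the document, return its links."""
    return LINKS.get(url, [])

def crawl(seed_urls, threads=3):
    document_table = set(seed_urls)      # Step 7: registered URLs
    frontier = list(seed_urls)           # Step 1: queue seeded with URLs
    with ThreadPoolExecutor(max_workers=threads) as pool:  # Step 2
        while frontier:
            next_frontier = []
            for links in pool.map(fetch, frontier):   # Steps 3-4
                for link in links:                    # Step 5: scan links
                    if link not in document_table:    # duplicates discarded
                        document_table.add(link)
                        next_frontier.append(link)
            frontier = next_frontier                  # Step 8: repeat
    return document_table
```

The thread pool mirrors the throughput point above: multiple threads fetch pages at any given time, so slow sites do not stall the whole crawl.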

Figure 2a: Queuing URLs


(Illustration: isrch005.gif)

Figure 2b: Caching URLs


(Illustration: isrch006.gif)

Indexing Documents

When the file system cache is full (default maximum size is 20 megabytes), document caching stops and indexing begins. In this phase, Ultra Search augments the Oracle9i Text index using the cached files referred to by the document table. See Figure 3.
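The cache-then-index handoff above can be sketched as documents accumulating until a size limit is reached, at which point the batch is handed to the indexer. This is an illustrative Python sketch under assumptions; the 20-megabyte default comes from the text, but the function names are hypothetical.

```python
# Sketch only: cache (url, size) pairs until the cache is full,
# then flush the batch to the indexer. Sizes are in bytes.

CACHE_LIMIT = 20 * 1024 * 1024  # default maximum cache size (20 MB)

def cache_documents(documents, index_batch, limit=CACHE_LIMIT):
    """Cache documents; hand each full batch to index_batch()."""
    cache, cached_bytes = [], 0
    for url, size in documents:
        cache.append(url)
        cached_bytes += size
        if cached_bytes >= limit:        # cache full: indexing begins
            index_batch(cache)
            cache, cached_bytes = [], 0
    if cache:                            # index whatever remains at the end
        index_batch(cache)
```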

Figure 3: Indexing Documents


(Illustration: isrch004.gif)

Data Synchronization

After the initial crawl, a page is crawled and indexed only if it has changed since the last crawl. The crawler determines whether a page has changed using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked as such and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.
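The checksum comparison above can be sketched in a few lines. The text does not specify which checksum Ultra Search uses internally, so the MD5 digest below is an assumption for illustration.

```python
# Sketch only: detect changed pages by comparing checksums.
# MD5 is an assumed digest; the actual internal checksum is unspecified.
import hashlib

def checksum(page_bytes):
    """Checksum of a fetched page's raw bytes."""
    return hashlib.md5(page_bytes).hexdigest()

def needs_reindex(new_page, cached_checksum):
    """True when the fetched page differs from the cached copy."""
    return checksum(new_page) != cached_checksum
```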

The steps involved in data synchronization are the following:

  1. Oracle spawns the crawler according to the synchronization schedule you specify with the administration tool. The URL queue is populated with the data source URLs assigned to the schedule.
  2. Crawler initiates multiple crawling threads.
  3. Crawler thread removes the next URL in the queue.
  4. Crawler thread fetches the document from the Web. The page is usually an HTML file containing text and hypertext links.
  5. Crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksums are the same, then the page is discarded and the crawler returns to Step 3. Otherwise, the crawler moves to the next step.
  6. Crawler thread scans the document for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
  7. Crawler caches the document in the local file system. See Figure 2b.
  8. Crawler registers URL in the document table.
  9. If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over by repeating Step 3.
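Steps 3 to 9 above can be condensed into a loop that skips unchanged pages and marks the rest for reindexing. This is an illustrative sketch under assumptions: fetching and checksumming are passed in as plain functions, and the cache-full condition from Step 9 is omitted for brevity.

```python
# Condensed sketch of the synchronization cycle; function parameters
# are illustrative stand-ins, not the Ultra Search internals.

def synchronize(url_queue, fetch, cached_checksums, checksum):
    """Return the URLs whose content changed since the last crawl."""
    changed = []
    while url_queue:                       # Step 9: stop on an empty queue
        url = url_queue.pop(0)             # Step 3: next URL in the queue
        page = fetch(url)                  # Step 4: fetch the document
        digest = checksum(page)            # Step 5: compare checksums
        if cached_checksums.get(url) == digest:
            continue                       # unchanged: discard, next URL
        cached_checksums[url] = digest     # Steps 6-8: cache and register
        changed.append(url)                # marked for reindexing
    return changed
```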

Remote Crawler

To increase crawling performance, you can set up the Ultra Search crawler to run on one or more machines separate from your database. These machines are called remote crawlers. However, each machine must share cache, log, and mail archive directories with the database machine.

To configure a remote crawler, you must first install the Ultra Search middle tier components module on a machine other than the database host. During installation, the remote crawler is registered with the Ultra Search system, and a profile is created for the remote crawler. After installing the Ultra Search middle tier components module, you must log on to the Ultra Search administration tool and edit the remote crawler profile. You can then assign a remote crawler to a crawling schedule. To edit remote crawler profiles, use the Crawler Settings Page in the administration tool.


Copyright © 2002 Oracle Corporation.
All Rights Reserved.