Skip Headers

Oracle Text Application Developer's Guide
Release 9.2

Part Number A96517-01
Go To Documentation Library
Home
Go To Product List
Book List
Go To Table Of Contents
Contents
Go To Index
Index

Master Index

Feedback

Go to previous page Go to next page

1
Introduction to Oracle Text

This chapter introduces the main features of Oracle Text. It helps you to get started with indexing, querying, and document presentation.

The following topics are covered:

What is Oracle Text?

Oracle Text is a tool that enables you to build text query applications and document classification applications. Oracle Text provides indexing, word and theme searching, and viewing capabilities for text.

Types of Query Applications

You can build two types of applications with Oracle Text, discussed in the following sections:

Text Query Applications

The purpose of a text query application is to enable users to find text that contains one or more search terms. The text is usually a collection of documents. A good application can index and search common document formats such as plain text, HTML, XML, or Microsoft Word. For example, an application with a browser interface might enable users to query a company Web site consisting of HTML files, returning those files that match a query.

To build a text query application, you create either a CONTEXT or CTXCAT index and query the index with CONTAINS or CATSEARCH operators respectively.

See Also:

"Indexing Your Documents" in this chapter for more information about these indexes.



Document Classification Applications

A document classification application is one that classifies an incoming stream of documents based on its content. They are also known as document routing or filtering applications. For example, an online news agency might need to classify its incoming stream of articles as they arrive into categories such as politics, crime, and sports.

Oracle Text enables you to build these applications with the CTXRULE index. This index indexes the rules (queries) that define each class. When documents arrive, the MATCHES operator can be used to match each document with the rules that select it.

You can use the CTX_CLS.TRAIN procedure to generate rules on a document set. You can then create a CTXRULE index using the output from this procedure.

See Also:

Oracle Text Reference for more information on CTX_CLS.TRAIN.


Note:

Oracle Text supports document classification for only plain text, HTML, and XML documents.


Supported Document Formats

For text query applications, Oracle Text supports most document formats for indexing and querying, including plain text, HTML and formatted documents such as Microsoft Word.

For document classification application, Oracle Text supports classifying plain text, HTML, and XML documents.

Theme Capabilities

With Oracle Text, you can search on document themes if your language is English and French. To do so, you use the ABOUT operator in a CONTAINS query. For example, you can search for all documents that are about the concept politics. Documents returned might be about elections, governments, or foreign policy. The documents need not contain the word politics to score hits.

Theme information is derived from the supplied knowledge base, which is a hierarchical listing of categories and concepts. As the supplied knowledge base is a general view of the world, you can add to it new industry-specific concepts. With an augmented knowledge base, the system can process document themes more intelligently and so improve the accuracy of your theme searching.

With the supplied PL/SQL packages, you can also obtain document themes programatically.

See Also:

Oracle Text Reference to learn more about the ABOUT operator.

Themes in Other Languages

You can enable theme capabilities such as ABOUT queries in other languages besides English and French by loading a language-specific knowledge base.

See Also:

Adding a Language-Specific Knowledge Base in Chapter 7, "Working With a Thesaurus".

Query Language and Operators

To query, you use the SQL SELECT statement. Depending on your index, you can query text with either the CONTAINS operator, which is used with the CONTEXT index, or the CATSEARCH operator, which is used with the CTXCAT index.

You use these in the WHERE clause of the SELECT statement. For example, to search for all documents that contain the word oracle, you use CONTAINS as follows:

SELECT SCORE(1) title FROM news WHERE CONTAINS(text, 'oracle', 1) > 0;

To classify single documents, use the MATCHES operator with a CTXRULE index.

For text querying with the CONTAINS operator, Oracle Text provides a rich query language with operators that enable you to issue variety of queries including simple word queries, ABOUT operator queries, logical queries, and wildcard and thesaural expansion queries.

The CATSEARCH operator also supports some of the operations available with CONTAINS.

See Also:

Chapter 3, "Querying"

Document Services and Using a Thesaurus

You can use the supplied Oracle Text PL/SQL packages for advanced features such as document presentation and thesaurus maintenance. Document presentation is how your application presents to the user documents in a query result set. You can maintain a thesaurus to expand queries and enhance your application.

See Also:

Prerequisites For Building Your Query Application

To build an Oracle Text query application, you must have the following:

The following sections describe these prerequisites and also describe the main features of a generic text query application.

Introduction to Loading Your Text Table

The basic prerequisite for an Oracle Text query application is to have a populated text table. The text table is where you store information about your document collection and is required for indexing.

You can populate rows in your text table with one of the following elements:

Figure 1-1, "Different Ways of Storing Text" illustrates these different methods.

Figure 1-1 Different Ways of Storing Text

Text description of ccapp002.gif follows
Text description of the illustration ccapp002.gif


By default, the indexing operation expects your document text to be directly loaded in your text table, which is the first method above.

However, you can specify the other ways of identifying your documents such as with filenames or with URLs by using the corresponding data storage indexing preference.

Storing Text in the Text Table

You can store documents in your text table in different ways.

You can store documents in one column using the DIRECT_DATASTORE data storage type or over a number of columns using the MULTI_COLUMN_DATASTORE type. When your text is stored over a number of columns, Oracle concatenates the columns into a virtual document for indexing.

You can also create master-detail relationships for your documents, where one document can be stored across a number of rows. To create master-detail index, use the DETAIL_DATASTORE data storage type.

In your text table, you can also store short text fragments such as names, descriptions, and addresses over a number of columns and create a CTXCAT index. A CTXCAT index improves performance for mixed queries.

You can also store your text in a nested table using the NESTED_DATASTORE type.

Oracle Text supports the indexing of the XMLType datatype which you use to store XML documents.

Storing File Path Names

In your text table, you can store path names to files stored in your file system. When you do so, use the FILE_DATASTORE preference type during indexing.

Storing URLs

You can store URL names to index web-sites. When you do so, use the URL_DATASTORE preference type during indexing.

Storing Associated Document Information

In your text table, you can create additional columns to store structured information that your query application might need, such as primary key, date, description, or author.

Format and Character Set Columns

If your documents are of mixed formats or of mixed character sets, you can create the following additional columns:

When you create your index, you must specify the name of the format or character set column in the parameter clause of CREATE INDEX.

Supported Column Types

With Oracle Text, you can create a CONTEXT index with columns of type VARCHAR2, CLOB, BLOB, CHAR, BFILE, XMLType, and URIType.


Note:

The column types NCLOB, DATE and NUMBER cannot be indexed.


Supported Document Formats

Because the system can index most document formats including HTML, PDF, Microsoft Word, and plain text, you can load any supported type into the text column.

When you have mixed formats in your text column, you can optionally include a format column to help filtering during indexing. With the format column you can specify whether a document is binary (formatted) or text (non-formatted such as HTML).

See Also:

Oracle Text Reference for more information about the supported document formats.

Loading Methods

The following sections describe different methods of loading information into a text column.

Loading Text with the INSERT Statement

You can use the SQL INSERT statement to load text to a table.

The following example creates a table with two columns, id and text, by using the CREATE TABLE statement. This example makes the id column the primary key. The text column is VARCHAR2:

CREATE TABLE docs (id NUMBER PRIMARY KEY, text VARCHAR2(80));

To populate this table, use the INSERT statement as follows:

INSERT into docs values(1, 'this is the text of the first document');
INSERT into docs values(12, 'this is the text of the second document');

Loading Text from File System

In addition to the INSERT statement, Oracle enables you to load text data (this includes documents, pointers to documents, and URLs) into a table from your file system by using other automated methods, including:

Indexing Your Documents

To query your document collection, you must first index the text column of your text table. Indexing breaks your text into tokens, which are usually words separated by spaces. Tokens can also be numbers, acronyms and other strings that are whitespace separated in the document.

A CONTEXT index records each token and the documents that contain it. An inverted index as such allows for querying on words and phrases. Figure 1-2 shows a text table within Oracle9i and its associated Oracle Text index.

Figure 1-2 Text table and associated Oracle Text index

Text description of ccapp003.gif follows
Text description of the illustration ccapp003.gif


Type of Index

Oracle Text supports the creation of three types of indexes depending on your application and text source. You use the CREATE INDEX statement to create all types of Oracle Text indexes.

The following table describes these indexes and the type of applications you can build with them. The third column shows which query operator to use with the index.

Type of Index Description Query Operator

CONTEXT

Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as Microsoft Word, HTML, XML, or plain text.

You can customize your index in a variety of ways.

CONTAINS

CTXCAT

Use this index type to improve mixed query performance. Suitable for querying small text fragments with structured criteria like dates, item names, and prices that are stored across columns.

CATSEARCH

CTXRULE

Use to build a document classification application. You create this index on a table of queries, where each query has a classification.

Single documents (plain text, HTML, or XML) can be classified by using the MATCHES operator.

MATCHES



When to Create a CONTEXT Index

Once your text data is loaded in a table, you can use the CREATE INDEX statement to create a CONTEXT index. When you create an index and specify no parameter clause, an index is created with default parameters.

For example, the following command creates a CONTEXT index with default parameters called myindex on the text column in the docs table:

CREATE INDEX myindex ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT;

Defaults for All Languages

When you use CREATE INDEX to create a context index without explicitly specifying parameters, the system does the following for all languages by default:

You can always change the default indexing behavior by creating your own preferences and specifying these custom preferences in the parameter clause of CREATE INDEX.

Customizing Your CONTEXT Index

By using the parameter clause with CREATE INDEX, you can customize your CONTEXT index. For example, in the parameter clause, you can specify where your text is stored, how you want it filtered for indexing, and whether sections should be created.

For example, to index a set of HTML files loaded in the text column htmlfile, you can issue the CREATE INDEX statement, specifying datastore, filter and section group parameters as follows:

CREATE INDEX myindex ON doc(htmlfile) INDEXTYPE IS ctxsys.context PARAMETERS 
('datastore ctxsys.default_datastore filter ctxsys.null_filter section group 
ctxsys.html_section_group');
See Also:

When to Create a CTXCAT Index

A CTXCAT index is an index optimized for mixed queries. You can create this type of index when you store small documents or text fragments with associated structured information. To query this index, you use the CATSEARCH operator and specify a structured clause, if any. Query performance with a CTXCAT index is usually better for structured queries than with a CONTEXT index. To achieve better performance, your CTXCAT index must be configured correctly.

See Also:

"Creating a CTXCAT Index" in Chapter 2, "Indexing" for a complete example

When to Create a CTXRULE Index

You create a CTXRULE index to build a document classification application in which an incoming stream of documents is routed according content. You define the classification rules as queries which you index. You use the MATCHES operator to classify single documents.

See Also:

"Creating a CTXRULE Index" in Chapter 2, "Indexing" for a complete example

Index Maintenance

Index maintenance is necessary after your application inserts, updates, or deletes documents in your base table. Index maintenance involves synchronizing and optimizing your index.

If your base table is static, that is, your application does no updating, inserting or deleting of documents after your initial index, you do not need to synchronize your index.

However, if your application performs DML operations (inserts, updates, or deletes) on your base table, you must synchronize your index. You can synchronize your index manually with the CTX_DDL.SYNC_INDEX PL/SQL procedure.

The following example synchronizes the index myindex with 2 megabytes of memory:

begin


ctx_ddl.sync_index('myindex', '2M');
end;

If you synchronize your index regularly, you might also consider optimizing your index to reduce fragmentation and to remove old data.

See Also:

"Managing DML Operations for a CONTEXT Index" in Chapter 2, "Indexing" for more information about synchronizing and optimizing the index.

Simple Text Query Application

A typical query application enables the user to enter a query. The application executes the query and returns a list of documents, called a hit list usually ranked by relevance, that satisfy the query. The application enables the user to view one or more documents in the returned hitlist.

For example, an application might index URLs (HTML files) on the World Wide Web and provide query capabilities across the set of indexed URLs. Hit lists returned by the query application are composed of URLs that the user can visit.

Figure 1-3 Flowchart of a simple query application

Text description of cdefaul6.gif follows
Text description of the illustration cdefaul6.gif


Figure 1-3 illustrates the flowchart of how a user interacts with a simple query application. The figure shows the steps required to enter the query and to view the results. Oval boxes indicate user-tasks and rectangular boxes indicate application tasks.

As shown, a query application can be modeled according to the following steps:

  1. User enters query
  2. Application executes query
  3. Application presents hitlist
  4. User selects document from hitlist
  5. Application presents document to user for viewing

The rest of this chapter explains how you can accomplish these steps with Oracle Text.

See Also:

Appendix A, "CONTEXT Query Application" for a description of a simple Web query application.



Understanding How to Query Your Index

With Oracle Text, you use the CONTAINS operator to query a CONTEXT index. This is the most common operator and index used to build query applications.

For more advanced applications, you use the CATSEARCH operator to query a CTXCAT index, and you use the MATCHES operator to query the CTXRULE index.

Understanding How to Query with CONTAINS

You can use the CONTAINS operator to retrieve documents that contain a word or phrase. Your document must be indexed before you can issue a CONTAINS query.

Use the CONTAINS operator in a SELECT statement. With CONTAINS, you can issue two types of queries:

You can also optimize queries for better response time to obtain a specified number of the highest ranking (top n) hits. The following sections give an overview of these query scenarios.

Understanding Word Queries

A word query is a query on the exact word or phrase you enter between the single quotes in the CONTAINS or CATSEARCH operator.

The following example finds all the documents in the text column that contain the word oracle. The score for each row is selected with the SCORE operator by using a label of 1:

SELECT SCORE(1) title FROM news WHERE CONTAINS(text, 'oracle', 1) > 0;

In your query expression, you can use text operators such as AND and OR to achieve different results. You can also add structured predicates to the WHERE clause.

See Also:

Oracle Text Reference for more information about the different operators you can use in queries

You can count the hits to a query by using the SQL COUNT(*) statement, or the CTX_QUERY.COUNT_HITS PL/SQL procedure.

Understanding ABOUT Queries

You issue ABOUT queries with the ABOUT operator in the CONTAINS clause.

In all languages, ABOUT queries increases the number of relevant documents returned by a query.

In English and French, ABOUT queries can use the theme component of the index, which is created by default. As such, this operator returns documents based on the concepts of your query, not only the exact word or phrase you specify.

For example, the following query finds all the documents in the text column that are about the subject politics, not just the documents that contain the word politics:

SELECT SCORE(1) title FROM news WHERE CONTAINS(text, 'about(politics)', 1) > 0;
See Also:

Oracle Text Reference to learn more about the ABOUT operator



Optimizing Query for Response Time

You can optimize any CONTAINS query (word or ABOUT) for response time in order to retrieve the highest ranking hits in a result set in the shortest possible time. Optimizing for response time is useful in a Web-based search application.

See Also:

"Optimizing Queries for Response Time" in Chapter 5, "Performance Tuning".

Understanding Structured Field Searching

Your application interface can give the user the option of querying on structured fields related to the text such as item description, author, or date as a means of further limiting the search criteria.

You can issue structured searches with the CONTAINS operator by using a structured clause in the SELECT statement. However, for optimal performance, consider creating a CTXCAT index, which gives better performance for structured queries with the CATSEARCH operator.

Your application can also present the structured information related to each document in the hit list.

See Also:

"Creating a CTXCAT Index" in Chapter 2, "Indexing" for more information about creating a CTXCAT index to improve structured queries with CATSEARCH.

Thesaural Queries

Oracle Text enables you to define a thesaurus for your query application.

Defining a custom thesaurus enables you to process queries more intelligently. Since users of your application might not know which words represent a topic, you can define synonyms or narrower terms for likely query terms. You can use the thesaurus operators to expand your query to include thesaurus terms.

See Also:

Chapter 7, "Working With a Thesaurus"

Document Section Searching

Section searching enables you to narrow text queries down to sections within documents.

Section searching can be implemented when your documents have internal structure, such as HTML and XML documents. For example, you can define a section for the <H1> tag that enables you to query within this section using the WITHIN operator.

You can set the system to automatically create sections from XML documents.

You can also define attribute sections to search attribute text in XML documents.


Note:

Section searching is supported for only word queries with a CONTEXT index.


See Also:

Chapter 6, "Document Section Searching"

Other Query Features

In your query application, you can use other query features such as proximity searching. Table 1-1 lists some of these features.

Table 1-1 Oracle Text Query Features
Feature Description Implement With

Case Sensitive Searching

Case-sensitive searches.

BASIC_LEXER when you create the index

Base Letter Conversion

Queries words with or without diacritical marks such as tildes, accents, and umlauts. For example, with a Spanish base-letter index, a query of energía matches documents containing both energía and energia.

BASIC_LEXER when you create the index

Word Decompounding

(German and Dutch)

Enables searching on words that contain specified term as sub-composite.

BASIC_LEXER when you create the index

Alternate Spelling

(German, Dutch, and Swedish)

Searches on alternate spellings of words

BASIC_LEXER when you create the index

Proximity Searching

Searches for words near one another

NEAR operator when you issue the query

Stemming

Searches for words with same root as specified term

$ operator at when you issue the query

Fuzzy Searching

Searches for words that have similar spelling to specified term

FUZZY operator when you issue the query

Query Explain Plan

Generates query parse information

CTX_QUERY.EXPLAIN PL/SQL procedure after you index

Hierarchical Query Feedback

Generates broader term, narrower term and related term information for a query

CTX_QUERY.HFEEDBACK PL/SQL procedure after you index.

Browse index

Browses the words around a seed word in the index

CTX_QUERY.BROWSE_WORDS PL/SQL after you index.

Count hits

Counts the number of hits in a query

CTX_QUERY.COUNT_HITS PL/SQL procedure after you index.

Stored Query Expression

Stores a query expression

CTX_QUERY.STORE_SQE PL/SQL procedure after you index.

Thesaural Queries

Uses a thesaurus to expand queries.

Thesaurus operators such as SYN and BT as well as the ABOUT operator.

Use CTX_THES package to maintain thesaurus.

Presenting the Hit List

After executing the query, query applications typically present a hit list of all documents that satisfy the query along with a relevance score. This list can be a list of document titles or URLs depending on your document set.

Your application presents a hitlist in one or more of the following ways:

Hitlist Example

Figure 1-4 is a screen shot of a query application presenting the hit list to the user.

Figure 1-4 Query Application Presenting Hit List

Text description of fig3.gif follows.

Text description of the illustration fig3.gif

Presenting Structured Fields

Structured columns related to the text column can help to identify documents. When you present the hit list, you can show related columns such as document titles or author or any other combination of fields that identify the document.

You specify the name of the structured columns that you want to include in the hit list in the SELECT statement.

Ordering the Hit List

When you issue either a text query or a theme query, Oracle returns the hit list of documents that satisfy the query with a relevance score for each document returned. You can use these scores to order the hitlist to show the most relevant documents first.

The score for each document is between 1 and 100. The higher the score, the more relevant the document to the query.

Oracle calculates scores when you use the CONTAINS or CATSEARCH operators. You obtain scores using the SCORE operator.

See Also:

Chapter 3, "Querying"

Presenting Document Hit Count

You can present the number of hits the query returned alongside the hit list, using SELECT COUNT(*). For example:

SELECT COUNT(*) FROM docs WHERE CONTAINS(text, 'oracle', 1) > 0;

To count hits in PL/SQL, you can also use the CTX_QUERY.COUNT_HITS procedure.

Document Presentation and Highlighting

Typically, a query application enables the user to view the documents returned by a query. The user selects a document from the hit list and then the application presents the document in some form.

With Oracle Text, you can display a document in different ways. For example, you can present documents with query terms highlighted. Highlighted query terms can be either the words of a word query or the themes of an ABOUT query in English.

You can also obtain gist (document summary) and theme information from documents with the CTX_DOC PL/SQL package.

Table 1-2 describes the different output you can obtain and which procedure to use to obtain each type.

Table 1-2 CTX_DOC Output
Output Procedure

Plain text version, no highlights

CTX_DOC.FILTER

HTML version of document, no highlights

CTX_DOC.FILTER

Highlighted document, plain text version

CTX_DOC.MARKUP

Highlighted document, HTML version

CTX_DOC.MARKUP

Highlight offset information for plain text version

CTX_DOC.HIGHLIGHT

Highlight offset information for HTML version

CTX_DOC.HIGHLIGHT

Theme summaries and gist of document.

CTX_DOC.GIST

List of themes in document.

CTX_DOC.THEMES

See Also:

Chapter 4, "Document Presentation"

Highlighting Example

Figure 1-5 is a screen shot of a query application presenting a document with the query terms heart and lungs highlighted.

Figure 1-5 Query Application Presenting Highlighted Document

Text description of highligh.gif follows.

Text description of the illustration highligh.gif

Document List of Themes Example

Figure 1-6 is a screen shot of a query application presenting a list of themes for a document.

Figure 1-6 Query Application Displaying Document Themes

Text description of themes.gif follows.

Text description of the illustration themes.gif

Gist Example

Figure 1-7 is a screen shot of a query application presenting a gist for a document.

Figure 1-7 Query Application Presenting Document Gist

Text description of gist.gif follows.

Text description of the illustration gist.gif


Go to previous page Go to next page
Oracle
Copyright © 2000, 2002 Oracle Corporation.

All Rights Reserved.
Go To Documentation Library
Home
Go To Product List
Book List
Go To Table Of Contents
Contents
Go To Index
Index

Master Index

Feedback