Oracle Text Reference
Release 9.2

Part Number A96518-01

6
CTX_CLS Package

This chapter provides reference information for using the CTX_CLS PL/SQL package to generate CTXRULE rules for a set of documents.

Name    Description
------  --------------------------------------------------
TRAIN   Generates rules that define document categories. The output is based on the input training document set.




TRAIN

Use this procedure to generate query rules that select document categories. You must supply a training set consisting of categorized documents. Each document must belong to one or more categories. This procedure generates the queries that define the categories and then writes the results to a table.

This procedure requires that your document table have an associated populated context index. For best results, the index should be synchronized before running this procedure.
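A minimal sketch of synchronizing the index before training, assuming a context index named docx (the name used in the example later in this chapter; substitute your own). The CTX_USER_PENDING view lists rows awaiting synchronization for the current user's indexes, and CTX_DDL.SYNC_INDEX processes them.

```sql
-- Assumption: the context index is named DOCX; replace with your index name.
-- Count rows that have changed but are not yet synchronized into the index.
select count(*) pending_rows
  from ctx_user_pending
 where pnd_index_name = 'DOCX';

-- Process the pending rows so the index reflects all committed documents.
exec ctx_ddl.sync_index('docx');
```

If the count is zero, the index is already current and the sync call has nothing to do.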

You must also have a document table and a category table. The documents can be in any format supported by Oracle Text.

For example, your document and category tables can be defined as:

create table trainingdoc(
docid number primary key,
text varchar2(4000));

create table category (
docid CONSTRAINT fk_id REFERENCES trainingdoc(docid),
categoryid number);
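
Because each training document must belong to at least one category, it can help to look for uncategorized documents before calling TRAIN. The following sketch runs against the trainingdoc and category tables defined above. It is a suggested sanity check, not a step the procedure requires.

```sql
-- List training documents that have no row in the category table.
select d.docid
  from trainingdoc d
 where not exists (select 1
                     from category c
                    where c.docid = d.docid);
```

An empty result means every document is assigned to at least one category.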

Syntax

CTX_CLS.TRAIN(
index_name in varchar2,
doc_id in varchar2,
cattab in varchar2,
catdocid in varchar2,
catid in varchar2,
restab in varchar2,
rescatid in varchar2,
resquery in varchar2,
resconfid in varchar2,
preference_name in varchar2 DEFAULT NULL
);

index_name

Specify the name of the context index associated with your document training set.

doc_id

Specify the name of the document id column in the document table. This column must contain unique document ids and must be of type NUMBER.

cattab

Specify the name of the category table. You must have SELECT privilege on this table.

catdocid

Specify the name of the document id column in the category table. The document ids in this table must also exist in the document table. This column must be of type NUMBER.

catid

Specify the name of the category ID column in the category table. This column must be of type NUMBER.

restab

Specify the name of the result table. You must have INSERT privilege on this table.

rescatid

Specify the name of the category ID column in the result table. This column must be of type NUMBER.

resquery

Specify the name of the query column in the result table. This column must be of type VARCHAR2, CHAR, CLOB, NVARCHAR2, or NCHAR.

The queries generated in this column connect terms with the AND (&) or NOT (~) operators, such as:

'T1 & T2 ~ T3'

Terms can also be theme tokens and be connected with the ABOUT operator, such as:

'about(T1) & about(T2) ~ about(T3)'

resconfid

Specify the name of the confidence column in the result table. This column contains the probability, estimated from the training data, that a document is relevant if it satisfies the query.
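
As an illustration, once TRAIN has populated a result table, the confidence column can be used to review the generated rules for each category. This sketch assumes the restab table and its cat_id, query, and conf columns as defined in the Example section of this chapter.

```sql
-- Show each category's rules, most confident first.
select cat_id, query, conf
  from restab
 order by cat_id, conf desc;
```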

preference_name

Specify the name of the classifier preference, such as one of type RULE_CLASSIFIER. This parameter is optional and defaults to NULL. For attributes, see "Classifier Types" in Chapter 2, "Indexing".

Example

The CTX_CLS.TRAIN procedure requires that your document table have an associated context index. For example, your document table can be defined and populated as follows:

set serverout on
exec dbms_output.put_line(TO_CHAR(SYSDATE,'MM-DD-YYYY HH24:MI:SS')||':start');

create table doc (id number primary key, text varchar2(2000));
insert into doc values(1,'In 2002, Europe changed its currency to the EURO');
insert into doc values(2,'The NASDAQ rose today in heavy stock trading.');
insert into doc values(3,'The EURO lost 1 cent today against the US dollar');
insert into doc values(4,'Salt Lake City hosts the winter Olympic games');
insert into doc values(5,'ESPN broadcasts World Cup Soccer games.');
insert into doc values(6,'Soccer champion Diego Maradona retires.');

Create the CONTEXT index:

exec ctx_ddl.drop_preference('my_lexer');
exec ctx_ddl.create_preference('my_lexer','BASIC_LEXER');
exec ctx_ddl.set_attribute('my_lexer','INDEX_THEMES','NO');
exec ctx_ddl.set_attribute('my_lexer','INDEX_TEXT','YES');
CREATE INDEX docx on doc(text) INDEXTYPE IS ctxsys.context
PARAMETERS('LEXER my_lexer');

You must also create a category table as follows to relate the documents to categories:

create table category (doc_id number, cat_id number, cat_name varchar2(100));
insert into category values (1,1,'Finance');
insert into category values (2,1,'Finance');
insert into category values (3,1,'Finance');
insert into category values (4,2,'Sports');
insert into category values (5,2,'Sports');
insert into category values (6,2,'Sports');

CTX_CLS.TRAIN writes to a result table, which can be defined as follows:

create table restab (cat_id number, query VARCHAR2(400), conf number);

To populate the result table for later CTXRULE indexing, set your RULE_CLASSIFIER preference attributes and call CTX_CLS.TRAIN as follows:

exec ctx_ddl.drop_preference('my_classifier');
exec ctx_ddl.create_preference('my_classifier','RULE_CLASSIFIER');
exec ctx_ddl.set_attribute('my_classifier','MAX_TERMS','20');
exec ctx_ddl.set_attribute('my_classifier','THRESHOLD','40');
exec ctx_ddl.set_attribute('my_classifier','NT_THRESHOLD','0.02');
exec ctx_ddl.set_attribute('my_classifier','MEMORY_SIZE','200');
exec ctx_ddl.set_attribute('my_classifier','TERM_THRESHOLD','20');
exec ctx_output.start_log('mylog');

begin
  ctx_cls.train('docx','id','category','doc_id','cat_id',
                'restab','cat_id','query','conf','my_classifier');
end;
/
exec ctx_output.end_log();

create table catname as (select distinct cat_id, cat_name from category);

set termout on
select rpad(id,6) doc_id , rpad(cat_name,8) cat_name, rpad(text,50) text
 from doc, category where id=doc_id;
select rpad(a.cat_id,8) cat_id, rpad(cat_name,8) cat_name,  rpad(query,30) rule
 from restab a, catname b where b.cat_id=a.cat_id;


The training set is:

DOC_ID CAT_NAME TEXT
------ -------- --------------------------------------------------
1      Finance  In 2002, Europe changed its currency to the EURO
2      Finance  The NASDAQ rose today in heavy stock trading.
3      Finance  The EURO lost 1 cent today against the US dollar
4      Sports   Salt Lake City hosts the winter Olympic games
5      Sports   ESPN broadcasts World Cup Soccer games.
6      Sports   Soccer champion Diego Maradona retires.

6 rows selected.

The generated rules for the Finance and Sports categories are as follows:


CAT_ID   CAT_NAME RULE
-------- -------- ------------------------------
1        Finance  EURO
1        Finance  TODAY ~ EURO
2        Sports   GAMES
2        Sports   SOCCER ~ GAMES
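
The rules in the result table can then be indexed with CTXRULE and used to classify new text with the MATCHES operator. The following is a sketch only. The index name restabx and the sample sentence are hypothetical, and reusing my_lexer assumes you want rule indexing to tokenize text the same way the training index did.

```sql
-- Hypothetical index name; builds a CTXRULE index over the generated queries.
create index restabx on restab(query)
  indextype is ctxsys.ctxrule
  parameters('lexer my_lexer');

-- Return the categories whose rules match an incoming document (sample text).
select distinct cat_id
  from restab
 where matches(query, 'ESPN will broadcast the soccer games') > 0;
```

Joining the result to the catname table created earlier would return category names instead of ids.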



Copyright © 1998, 2002 Oracle Corporation.

All Rights Reserved.