Apache Solr tutorial in a relational way

Many questions to be answered when starting with Scotas products is the lack of knowledge on how to implement typical Solr operations in a SQL way or vice-versa.

The idea of this tutorial is to do the same thing explained into the Apache Solr tutorial but using relational concepts using Scotas OLS.

5 minutes tutorial with ElasticSearch Push Connector

Continuing with the 5 minutes tutorial started in previous post this is about how to start working with Scotas ElasticSearch Push Connector technology.

After you install the Push Connector and your Elastic Search (ES) server, you could do a simple test using examples included into the script located at your distribution directory – db/tutorial/esTutorial.sql script.

Assuming that the ES server is installed by default we can start them using -f flag to see the log output at the console, for example:

[mochoa@localhost elasticsearch-0.19.4]$ export JAVA_HOME=/usr/local/jdk1.6
[mochoa@localhost elasticsearch-0.19.4]$ export PATH=$JAVA_HOME/bin:$PATH
[mochoa@localhost elasticsearch-0.19.4]$ bin/elasticsearch -f
[2012-06-22 11:00:40,089][INFO ][node ] [Grim Hunter] {0.19.4}[4061]: initializing …

[2012-06-22 11:00:46,899][INFO ][http ] [Grim Hunter] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.100.55:9200]}
[2012-06-22 11:00:46,900][INFO ][node ] [Grim Hunter] {0.19.4}[4061]: started
[2012-06-22 11:00:46,907][INFO ][gateway ] [Grim Hunter] recovered [0] indices into cluster_state

As you see above, ES server is listening on 9200 TCP/IP port which is the default value used by the Push Connector (PC).

First, on another console we need to connect with an user with DBA role to grant the PCUSER role to scott. Then, we are able to start working with the connector examples using the traditional scott user, let see:

-bash-4.2$ sqlplus “/ as sysdba”

SQL*Plus: Release 11.2.0.3.0 Production on Fri Jun 22 11:01:10 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> grant PCUSER TO scott;

Grant succeeded.

Now we are ready to start creating a table to store a simple tweet structure as is shown in many ES examples.

CREATE TABLE ES_TUTORIAL (
USER_ID VARCHAR2(30) PRIMARY KEY,
MESSAGE VARCHAR2(140),
POST_DATE TIMESTAMP);

and populate it with some rows:

insert into ES_TUTORIAL values (
‘kimchy’,
‘trying out Elastic Search’,
TO_TIMESTAMP(‘2009-11-15T14:12:12′,’YYYY-MM-DD”T”HH24:MI:SS’));

insert into ES_TUTORIAL (
select ‘kimchy-‘||rownum,’trying out Elastic Search ‘||rownum,TO_TIMESTAMP(sysdate+rownum,’YYYY-MM-DD”T”HH24:MI:SS’) from dual connect by rownum <= 100);

insert into ES_TUTORIAL (
select ‘kimchy-‘||(rownum+100),’trying out Elastic Search ‘||rownum,TO_TIMESTAMP(sysdate+rownum,’YYYY-MM-DD”T”HH24:MI:SS’) from dual connect by rownum <= 100);

insert into ES_TUTORIAL (
select ‘kimchy-‘||(rownum+200),’trying out Elastic Search ‘||rownum,TO_TIMESTAMP(sysdate+rownum,’YYYY-MM-DD”T”HH24:MI:SS’) from dual connect by rownum <= 100);

commit; 

Note that we are using TO_TIMESTAMP SQL function to format an Oracle timestamp value to a Java compatible timestamp format.

Now we are ready to index the above table:

[mochoa@localhost pc]$ sqlplus scott/tiger@test

SQL*Plus: Release 11.2.0.3.0 Production on Fri Jun 22 11:38:21 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set define off timing on serverout on
SQL> CREATE INDEX TWEET_ESIDX ON ES_TUTORIAL(MESSAGE) INDEXTYPE IS PC.ES
PARAMETERS(‘{Settings:{“number_of_shards” : 1},SyncMode:OnLine,LockMasterTable:false,IncludeMasterColumn:false,ExtraCols:”user_id \”user\”,to_char(post_date,”YYYY-MM-DD\”T\”HH24:MI:SS”) \”post_date\”,message \”message\””}’);

Index created.

Elapsed: 00:00:00.74

we can see the output at the ES console telling us that a new index has been created:

[2012-06-22 11:38:47,837][INFO ][cluster.metadata ] [Grim Hunter] [scott.tweet_esidx] creating index, cause [api], shards [1]/[1], mappings []
[2012-06-22 11:38:48,404][INFO ][cluster.metadata ] [Grim Hunter] [scott.tweet_esidx] update_mapping [row] (dynamic)

Note that the index was created at ES server with the sintax [schema].[index_name] in lower case.

Here the explanation of the parameters used during index creation time:

  • Settings, any ES settings parameter in the example the Sharding configuration
  • SyncMode, OnLine means Near Real Time (NRT) mode for inserts/updates and Real Time (RT) mode for deletes, if you need information about this configuration please read the Scotas OLS White Paper, section Working modes replacing Solr by ES.
  • LockMasterTable, false means that during the table scan for selecting rows there is no row level locking, when this parameter is true rows are selected with “for update” mode.
  • IncludedMasterColumn, false means that the values of the master column (defined in create index syntax, MESSAGE) will not be included in the index structure, please see next parameter.
  • ExtraCols, is the list of the columns, formats and field names used at ES index structure. Above example shows that ES index will include three fields user, post_date and message. Note that field are case sensitive and now we are lower casing it.

Now the index is ready to use and has been populated with the 301 rows which are into the table structure. To check if all rows were properly indexed we can query for the rowid field, this field is always indexed and works as id into the ES index, let see:

SQL> select ElasticSearch.countHits(‘TWEET_ESIDX’,’rowid:*’) from dual;
301
Elapsed: 00:00:00.04

also we can select rows that match with a specific string, for example look for the 53 string:

SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:53‘,1)>0;
2.327857                      trying out Elastic Search 53
2.327857                      trying out Elastic Search 53
2.327857                      trying out Elastic Search 53
Elapsed: 00:00:00.14

a few comments about the above syntax for querying ES:

  • econtains, operator is used to query the ES using the Push Connector,
    first argument is the master column of the index, the one which is included into the create index syntax
    second argument is the query string using ES q parameter in the _search? url.
    third argument is the ancillary operator identifier (used by escore(), see next item)
    return value is >0 when the row match against the input query
  • escore, is used to get the score computed by econtains() operator at the specific row, note the input parameter must be correlated to the last argument in econtains().

Now we are trying to explain what the NRT means in a few queries, let see what happens during an update:

SQL> update ES_TUTORIAL set message=message||’-mod‘ where rownum<=10;
10 rows updated.

the RDBMS tell us that 10 rows where updated but NRT deferred this update into the ES server until we commit the transaction, we can easily check using above select to see if it returns mod as positive hit:

SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod‘,1)>0;
no rows selected

now what happens when we commit pending changes at the current transaction:

SQL> commit;
Commit complete.
SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod‘,1)>0;
no rows selected
SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod‘,1)>0;

1.6282115     trying out Elastic Search 76-mod
1.6282115     trying out Elastic Search 77-mod
1.6282115     trying out Elastic Search 78-mod
1.6282115     trying out Elastic Search 79-mod
1.6282115     trying out Elastic Search 80-mod
1.6282115     trying out Elastic Search 81-mod
1.6282115     trying out Elastic Search 82-mod
1.6282115     trying out Elastic Search 83-mod
1.6282115     trying out Elastic Search 84-mod
1.6282115     trying out Elastic Search 85-mod
10 rows selected.

after a few milliseconds the ES layer receives the update and rows are returned as positive hits.

Now we can test RT deletes, for example:

SQL> delete from es_tutorial where rowid in (select rowid from ES_TUTORIAL where econtains(message,’message:mod‘,1)>0);
10 rows deleted.
SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod‘,1)>0;
no rows selected

as you can see after 10 rows where deleted, these rows where not returned as positive hits immediately (Real Time), for the current transaction these 10 rows are not available except if we do a rollback.

To checks what happens at the ES side we can open another concurrent transaction in parallel and check if these rows are visible for other users (remember that the transaction is not committed yet), let see a second connection:

SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod’,1)>0;
1.6282115     trying out Elastic Search 76-mod
…..
1.6282115     trying out Elastic Search 85-mod
10 rows selected.

now we will commit deleted rows at  first connection:

SQL> commit;
Commit complete.
SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod’,1)>0;
no rows selected

and see what happens at the second connection:

SQL> select escore(1),message from ES_TUTORIAL where econtains(message,’message:mod’,1)>0;
no rows selected

Second connection gets no hits because rows where deleted definitely and the ES server receives the changes.

Finally by dropping the index, the ES server drops the index structure too, here the SQL command:

SQL> drop index TWEET_ESIDX;
Index dropped.

and the ES server console:

[2012-06-22 14:52:59,460][INFO ][cluster.metadata ] [Grim Hunter] [scott.tweet_esidx] deleting index

and that’s all, as you can see Elastic Search Push Connector works as all Scotas searching products, functionality not covered into this blog post where faceting, sorting using score or others fields, detecting changes on multiples columns, etc. but they are supported by the connector.

 

5 minutes tutorial with Scotas OLS

If you already have Scotas OLS installed this is a 5 minutes guide for using Oracle-Solr integration.

Scotas OLS binary distribution have a sample configuration files located at db/tutorial directory which are used to simulate all operations described into Solr Tutorial but using Scotas OLS.

Here a complete list of steps:

1) grant LUCENEUSER role to SCOTT (this step could be avoided if was already done):

-bash-4.2$ sqlplus “/ as sysdba”
SQL*Plus: Release 11.2.0.3.0 Production on Thu May 10 08:07:51 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> grant LUCENEUSER to SCOTT;
Grant succeeded.

2) upload Solr related files to XMLDB repository (assuming that we will create an index named SCOTT.TUTORIAL_SIDX):

# cd db/tutorial/LUCENE.TUTORIAL_SIDX/conf
# ftp localhost 2100
Connected to localhost.localdomain.
220- localhost.localdomain
Unauthorised use of this FTP server is prohibited and may be subject to civil and criminal prosecution.
220 localhost.localdomain FTP Server (Oracle XML DB/Oracle Database) ready.
Name (localhost:mochoa): scott
331 pass required for SCOTT
Password: *****
230 SCOTT logged in
Remote system type is Unix.
ftp> cd public/solr
250 CWD Command successful
ftp> mkdir SCOTT.TUTORIAL_SIDX
250 CWD Command successful
ftp> cd SCOTT.TUTORIAL_SIDX
250 CWD Command successful
ftp> mkdir conf
250 CWD Command successful
ftp> cd conf
250 CWD Command successful
ftp> mput *
mput elevate.xml [anpqy?]? a
………
ftp> dir
227 Entering Passive Mode (127,0,0,1,249,27)
150 ASCII Data Connection
-rw-r–r– 1 SCOTT oracle 1274 MAY 10 11:15 elevate.xml
-rw-r–r– 1 SCOTT oracle 78514 MAY 10 11:15 mapping-FoldToASCII.txt
-rw-r–r– 1 SCOTT oracle 2868 MAY 10 11:15 mapping-ISOLatin1Accent.txt
-rw-r–r– 1 SCOTT oracle 367 MAY 10 11:15 mb_attributes.txt
-rw-r–r– 1 SCOTT oracle 121 MAY 10 11:15 mb_letterBuckets.txt
-rw-r–r– 1 SCOTT oracle 873 MAY 10 11:15 protwords.txt
-rw-r–r– 1 SCOTT oracle 30717 MAY 10 11:15 schema.xml
-rw-r–r– 1 SCOTT oracle 921 APR 13 13:46 scripts.conf
-rw-r–r– 1 SCOTT oracle 13267 MAY 10 11:15 solrconfig.xml
-rw-r–r– 1 SCOTT oracle 13 MAY 10 11:15 spellings.txt
-rw-r–r– 1 SCOTT oracle 1171 MAY 10 11:15 stopwords.txt
-rw-r–r– 1 SCOTT oracle 1132 MAY 10 11:15 synonyms.txt
226 ASCII Transfer Complete
ftp> pwd
257 “/public/solr/SCOTT.TUTORIAL_SIDX/conf” is current directory.
ftp> quit

3) Create and populate a sample table. This table have a relational version of the Solr tutorial documents (these SQL statements are at db/tutorial/schemaTutorial.sql file).

[mochoa@localhost tutorial]$ sqlplus scott/tiger@orcl
SQL*Plus: Release 11.2.0.3.0 Production on Thu May 10 08:22:44 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 – Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> set define off
SQL> CREATE TABLE OLS_TUTORIAL ( ID VARCHAR2(30) PRIMARY KEY,
NAME VARCHAR2(400),
MANU VARCHAR2(4000),
CAT VARCHAR2(400),
FEATURES CLOB,
INCLUDES VARCHAR2(4000),
WEIGHT NUMBER,
PRICE NUMBER,
POPULARITY NUMBER,
INSTOCK CHAR(5), — true or false
MANUFACTUREDATE_DT TIMESTAMP,
PAYLOADS VARCHAR2(4000),
STORE VARCHAR2(200)) ;
Table created.
SQL> insert into ols_tutorial values ( ‘SP2514N’, ‘Samsung SpinPoint P120 SP2514N – hard drive – 250 GB – ATA-133′, ‘Samsung Electronics Co. Ltd.’, ‘electronics hard-drive’, ‘7200RPM, 8MB cache, IDE Ultra ATA-133 NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor’, NULL, NULL, 92, 6, ‘true’, TO_TIMESTAMP(‘2006-02-13T15:26:37Z’,’YYYY-MM-DD”T”HH24:MI:SS”Z”‘), NULL, ‘35.0752,-97.032′) ;
…….
1 row created.
SQL> commit;
Commit complete.
SQL> select count(*) from OLS_TUTORIAL;
COUNT(*)
———-
16

4) Indexing using Scotas OLS.

SQL> CREATE INDEX TUTORIAL_SIDX ON OLS_TUTORIAL(ID) INDEXTYPE IS LUCENE.SOLRINDEX
parameters(‘Updater:localhost@9099;Searcher:localhost@9099;SyncMode:OnLine;LockMasterTable:false;IncludeMasterColumn:false;LogLevel:ALL;MltColumn:title;HighlightColumn:name,features;DefaultColumn:text;ExtraCols:id “id”,cat “cat”,name “name”,features “features”,manu “manu”,includes “includes”,price “price”,popularity “popularity”,inStock “inStock”,to_char(manufacturedate_dt,”YYYY-MM-DD”T”HH24:MI:SS”Z””) “manufacturedate_dt”‘);
Index created.

5) Doing some queries (equivalent to the section  …and now you can search for all sorts of Solr Tutorial, there are more query examples at db/tutorial/testTutorial.xml file in a form o SQLUnit syntax):

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ id FROM OLS_TUTORIAL T where scontains(id,’video’)>0;
ID
——————————
MA147LL/A
100-435805
EN7800GTX/2DHTV/256M

SQL> SELECT id FROM OLS_TUTORIAL T WHERE SCONTAINS(ID,’name:video’)>0;
ID
——————————
MA147LL/A
100-435805

SQL> SELECT id,price FROM OLS_TUTORIAL WHERE SCONTAINS(ID,’video AND price:[* TO 400]’)>0;
ID PRICE
—————————— ———-
MA147LL/A                           399

6) Other complex queries (Faceted Search example):

SQL> SELECT FIELD,SJOIN(T.FACETS) F FROM TABLE(SFACETS(USER||’.TUTORIAL_SIDX’,’*:*’,’facet.field=cat&facet.field=inStock’)) T;
cat
electronics(14),memory(3),connector(2),graphics-card(2),hard-drive(2),monitor(2) ,search(2),software(2),camera(1),copier(1),multifunction-printer(1),music(1),pri nter(1),scanner(1)
inStock
true(12),false(4)

7) More queries (Text Analysis)

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ SSCORE(1),name FROM OLS_TUTORIAL T WHERE SCONTAINS(ID,’power-shot‘,1)>0;
.683216989
Canon PowerShot SD500

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ SSCORE(1),NAME,features FROM OLS_TUTORIAL T WHERE SCONTAINS(ID,’features:recharging‘,1)>0;
.336813927
Apple 60 GB iPod with Video Playback Black iTunes, Podcasts, Audiobooks Stores up to 15,000 songs, 25,000 photos, or 150 hours of video 2.5-inch, 320×240 color TFT LCD display with LED backlight Up to 20 hours of battery life Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in g ames, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication

And that’s all. Here some notes on the index parameters defined above:

  • IncludeMasterColumn:false, defined that the column ID which is the master column of the index defined when the SQL DDL is used as CREATE INDEX TUTORIAL_SIDX ON OLS_TUTORIAL(ID)… is not included during index construction as Solr field.
  • MltColumn:title, defines which column is used with More Like This ancillary operator mlt(), not showed at the examples.
  • HighlightColumn:name,features, columns used for highlighting with shighlight() ancillary operator, not showed at the examples.
  • DefaultColumn:text, which column is used as default for searching when using Solr Query Syntax at scontains() operator, note that text is not table column, is a virtual column defined at schema.xml with the Solr functionality copyField and works as bag including all Solr fields (cat, name, manu, features, includes).
  • ExtraCols:id “id“,cat “cat“,name “name“,features “features“,manu “manu“,includes “includes“,price “price“,popularity “popularity“,inStock “inStock“,to_char(manufacturedate_dt,”YYYY-MM-DD”T”HH24:MI:SS”Z””) “manufacturedate_dt“, these are table columns defined as Solr field, note that we are using double quoted (“) to defined case sensitive Solr Field names and an special to_char mask to format dates includes in manufacterdate_dt column in Solr date format.