<?xml version="1.0"?>
<document>
<header>
<title>
Apache Lucene - Basic Demo Sources Walk-through
</title>
</header>
<properties>
<author email="acoliver@apache.org">Andrew C. Oliver</author>
</properties>
<body>
<section id="About the Code"><title>About the Code</title>
<p>
In this section we walk through the sources behind the command-line Lucene demo: where to find them,
their parts and their function. This section is intended for Java developers wishing to understand
how to use Lucene in their applications.
</p>
</section>
<section id="Location of the source"><title>Location of the source</title>
<p>
NOTE: to examine the sources, you need to download and extract a source distribution of
Lucene (<code>lucene-{version}-src.zip</code>).
</p>
<p>
Relative to the directory created when you extracted Lucene, you
should see a directory called <code>lucene/contrib/demo/</code>. This is the root for the Lucene
demo. Under this directory is <code>src/java/org/apache/lucene/demo/</code>. This is where all
the Java sources for the demo live.
</p>
<p>
Within this directory you should see <code>IndexFiles.java</code>, the source for the <code>IndexFiles</code> class we executed earlier.
Bring it up in <code>vi</code> or your editor of choice and let's take a look at it.
</p>
</section>
<section id="IndexFiles"><title>IndexFiles</title>
<p>
As we discussed in the previous walk-through, the <a
href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class creates a Lucene
index. Let's take a look at how it does this.
</p>
<p>
The <code>main()</code> method parses the command-line parameters, then in preparation for
instantiating <a href="api/core/org/apache/lucene/index/IndexWriter.html">IndexWriter</a>, opens a
<a href="api/core/org/apache/lucene/store/Directory.html">Directory</a> and instantiates
<a href="api/module-analysis-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html"
>StandardAnalyzer</a> and
<a href="api/core/org/apache/lucene/index/IndexWriterConfig.html">IndexWriterConfig</a>.
</p>
<p>
The value of the <code>-index</code> command-line parameter is the name of the filesystem directory
where all index information should be stored. If <code>IndexFiles</code> is invoked with a
relative path in the <code>-index</code> parameter, or if <code>-index</code> is omitted so that
the default relative index path "<code>index</code>" is used, the index directory is created
(if it does not already exist) as a subdirectory of the current working directory. On some
platforms, the index directory may instead be created in a different directory (such as the
user's home directory).
</p>
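<p>
To make the relative-path behaviour concrete, here is a minimal sketch using only the standard
<code>java.nio.file</code> API. It is not taken from the demo sources: the class name
<code>IndexPathSketch</code> and the method <code>resolveIndexPath</code> are hypothetical, and the
"<code>index</code>" default is assumed from the description above.
</p>
<source><![CDATA[
```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Lucene demo.
final class IndexPathSketch {

    /** Returns the absolute location that a given -index value would refer to. */
    static Path resolveIndexPath(String indexArg) {
        // Assumed default: fall back to "index" when no -index value was supplied.
        String name = (indexArg == null) ? "index" : indexArg;
        // toAbsolutePath() typically resolves a relative path against the
        // JVM's current working directory; absolute paths are left as they are.
        return Paths.get(name).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        String indexArg = (args.length > 0) ? args[0] : null;
        System.out.println(resolveIndexPath(indexArg));
    }
}
```
]]></source>
<p>
Running this from different working directories shows why the same relative <code>-index</code>
value can end up pointing at different locations.
</p>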
<p>
The <code>-docs</code> command-line parameter value is the location of the directory containing
files to be indexed.
</p>
<p>
The <code>-update</code> command-line parameter tells <code>IndexFiles</code> not to delete the
index if it already exists. When <code>-update</code> is not given, <code>IndexFiles</code> will
first wipe the slate clean before indexing any documents.
</p>
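<p>
The three parameters described above could be parsed roughly as follows. This is a simplified
sketch rather than the demo's actual <code>main()</code> code; the class
<code>IndexFilesArgsSketch</code> and its field names are assumptions made for illustration.
</p>
<source><![CDATA[
```java
// Hypothetical option holder, not part of the Lucene demo.
final class IndexFilesArgsSketch {
    String indexPath = "index"; // default relative index path
    String docsPath;            // stays null until -docs is supplied
    boolean create = true;      // start from a clean index unless -update is given

    static IndexFilesArgsSketch parse(String[] args) {
        IndexFilesArgsSketch opts = new IndexFilesArgsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-index":
                    opts.indexPath = requireValue(args, ++i, "-index");
                    break;
                case "-docs":
                    opts.docsPath = requireValue(args, ++i, "-docs");
                    break;
                case "-update":
                    opts.create = false; // keep the existing index
                    break;
                default:
                    throw new IllegalArgumentException("Unknown parameter: " + args[i]);
            }
        }
        return opts;
    }

    // Fails with a clear message when a parameter is missing its value.
    private static String requireValue(String[] args, int i, String flag) {
        if (i >= args.length) {
            throw new IllegalArgumentException(flag + " requires a value");
        }
        return args[i];
    }

    public static void main(String[] args) {
        IndexFilesArgsSketch o = parse(new String[] {"-docs", "src", "-update"});
        System.out.println(o.indexPath + " " + o.docsPath + " create=" + o.create);
    }
}
```
]]></source>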
<p>
Lucene <a href="api/core/org/apache/lucene/store/Directory.html">Directory</a>s are used by the
<code>IndexWriter</code> to store information in the index. In addition to the
<a href="api/core/org/apache/lucene/store/FSDirectory.html">FSDirectory</a> implementation we are using,
there are several other <code>Directory</code> subclasses that can write to RAM, to databases, etc.
</p>
<p>
Lucene <a href="api/core/org/apache/lucene/analysis/Analyzer.html">Analyzer</a>s are processing pipelines
that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these
tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc. The <code>Analyzer</code>
we are using is <code>StandardAnalyzer</code>, which creates tokens using the Word Break rules from the
Unicode Text Segmentation algorithm specified in <a href="http://unicode.org/reports/tr29/">Unicode
Standard Annex #29</a>; converts tokens to lowercase; and then filters out stopwords. Stopwords are
common words such as articles (a, an, the, etc.) and other tokens that may have less value for
searching. Tokenization and stopword rules differ from language to language, so you should use the
analyzer appropriate to each. Lucene currently provides Analyzers for a number of different languages (see
the javadocs under
<a href="api/all/org/apache/lucene/analysis/"
>lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis</a>).
</p>
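<p>
The three stages just described (tokenize, lowercase, remove stopwords) can be illustrated with a
toy pipeline in plain Java. This sketch does not use Lucene and does not implement the Unicode
Word Break rules: it splits on runs of non-letter, non-digit characters and uses a three-word
stopword list, both of which are simplifying assumptions. The class name
<code>AnalyzerPipelineSketch</code> is hypothetical.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer; not Lucene's StandardAnalyzer.
final class AnalyzerPipelineSketch {
    // Tiny illustrative stopword list; real analyzers ship their own.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Stage 1: crude tokenization on anything that is not a letter or digit.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            // Stage 2: lowercase the token.
            String lower = token.toLowerCase(Locale.ROOT);
            // Stage 3: drop stopwords, keep everything else as a term.
            if (!STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick brown fox, and an Apple."));
    }
}
```
]]></source>
<p>
For the sample sentence, "The" and "an" are dropped and the remaining tokens come out lowercased.
</p>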
<p>
The <code>IndexWriterConfig</code> instance holds all configuration for <code>IndexWriter</code>. For
example, we set the <code>OpenMode</code> to use here based on the value of the <code>-update</code>
command-line parameter.
</p>
<p>
Looking further down in the file, after <code>IndexWriter</code> is instantiated, you should see the
<code>indexDocs()</code> code. This recursive method crawls the directories and creates
<a href="api/core/org/apache/lucene/document/Document.html">Document</a> objects. A
<code>Document</code> is simply a data object that represents the text content of a file along with
its last-modified time and location. These instances are added to the <code>IndexWriter</code>. If
the <code>-update</code> command-line parameter is given, the <code>IndexWriter</code>
<code>OpenMode</code> will be set to <code>OpenMode.CREATE_OR_APPEND</code>, and rather than
adding documents to the index, the <code>IndexWriter</code> will <strong>update</strong> them
in the index by attempting to find an already-indexed document with the same identifier (in our
case, the file path serves as the identifier); deleting it from the index if it exists; and then
adding the new document to the index.
</p>
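<p>
The difference between adding and updating can be modelled with a small in-memory list. This is
a conceptual sketch only: it is not how <code>IndexWriter</code> is implemented, and the class
<code>UpdateByPathSketch</code>, its nested <code>Doc</code> type and its method names are
hypothetical.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Toy model of add-versus-update semantics; not Lucene's IndexWriter.
final class UpdateByPathSketch {

    /** Toy document: an identifier (the file path) plus its text. */
    static final class Doc {
        final String path;
        final String contents;

        Doc(String path, String contents) {
            this.path = path;
            this.contents = contents;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Plain add: appends unconditionally, so re-indexing a file duplicates it. */
    void add(Doc doc) {
        docs.add(doc);
    }

    /** Update: delete any document with the same path, then add the new one. */
    void update(Doc doc) {
        docs.removeIf(d -> d.path.equals(doc.path));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateByPathSketch index = new UpdateByPathSketch();
        index.update(new Doc("docs/a.txt", "first version"));
        index.update(new Doc("docs/a.txt", "second version"));
        System.out.println(index.size()); // a single entry for docs/a.txt
    }
}
```
]]></source>
<p>
Calling <code>add</code> twice with the same path would leave two entries, which is why an
identifier-based update is the safer choice when re-indexing files that may already be present.
</p>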
</section>
<section id="Searching Files"><title>Searching Files</title>
<p>
The <a href="api/contrib-demo/org/apache/lucene/demo/SearchFiles.html">SearchFiles</a> class is
quite simple. It primarily collaborates with an
<a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>,
<a href="api/modules-analysis-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html"
>StandardAnalyzer</a> (which is used in the
<a href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class as well)
and a <a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a>. The
query parser is constructed with an analyzer so that it interprets your query text in the same way the
documents were interpreted: finding word boundaries, downcasing, and removing stopwords such as
'a', 'an' and 'the'. The
<a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> produces a
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> object, which is passed
to the searcher. Note that it's also possible to programmatically construct a rich
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> object without using the query
parser. The query parser just enables decoding the <a href="queryparsersyntax.html">Lucene query
syntax</a> into the corresponding <a href="api/core/org/apache/lucene/search/Query.html">Query</a>
object.
</p>
<p>
<code>SearchFiles</code> uses the <code>IndexSearcher.search(query,n)</code> method, which returns
<a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a> containing at most <code>n</code> hits.
The results are printed in pages, sorted by score (i.e. relevance).
</p>
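<p>
The idea of "at most <code>n</code> hits, sorted by score, shown in pages" can be sketched in
plain Java as follows. This does not use <code>IndexSearcher</code> or <code>TopDocs</code>; the
class <code>TopHitsSketch</code>, the <code>Hit</code> type and the scores are made up for
illustration.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of top-n selection and paging; not Lucene's search API.
final class TopHitsSketch {

    /** Toy search hit: a document path and its relevance score. */
    static final class Hit {
        final String path;
        final float score;

        Hit(String path, float score) {
            this.path = path;
            this.score = score;
        }
    }

    /** Returns at most n hits, highest score first. */
    static List<Hit> topN(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    /** Returns one zero-based page of an already sorted hit list. */
    static List<Hit> page(List<Hit> sorted, int pageIndex, int hitsPerPage) {
        int from = Math.min(pageIndex * hitsPerPage, sorted.size());
        int to = Math.min(from + hitsPerPage, sorted.size());
        return sorted.subList(from, to);
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("docs/a.txt", 0.2f));
        hits.add(new Hit("docs/b.txt", 0.9f));
        hits.add(new Hit("docs/c.txt", 0.5f));
        // Print the first page of two hits, best match first.
        for (Hit h : page(topN(hits, 3), 0, 2)) {
            System.out.println(h.path + " score=" + h.score);
        }
    }
}
```
]]></source>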
</section>
</body>
</document>