Summarizer.java revision 1185
/*
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// modified by Lubos Kosco 2010 to upgrade lucene to 3.0.0
// TODO : rewrite this to use Highlighter from lucene contrib ...

/** Implements hit summarization. */

/** The number of context terms to display preceding and following matches. */

/** The total number of terms to display in a summary. */

/** Converts text to tokens. */

/**
 * Class Excerpt represents a single passage found in the
 * document, with some appropriate regions highlighted.
 */

/** Return how many unique tokens we have. */

/** How many fragments we have. */

/** Add a fragment to the list. */

/** Return an Enum for all the fragments. */

/** Returns a summary for the given pre-tokenized text. */

// Simplistic implementation. Finds the first fragments in the document
// containing any query terms.
// @TODO: check that phrases in the query are matched in the fragment

// Create a SortedSet that ranks excerpts according to
// how many query terms are present. An excerpt is
// a List full of Fragments and Highlights.

// Iterate through all terms in the document

// If we find a term that's in the query...

// Start searching at a point SUM_CONTEXT terms back,
// and move SUM_CONTEXT terms into the future.

// Iterate from the start point to the finish, adding
// terms all the way. The end of the passage is always
// SUM_CONTEXT beyond the last query-term.
// Iterate through as long as we're before the end of
// the document and we haven't hit the max-number-of-items

// Now grab the hit-element, if present

// We found the series of search-term hits and added
// them (with intervening text) to the excerpt. Now
// we need to add the trailing edge of text.
// So if (j < tokens.length) then there is still trailing
// text to add. (We haven't hit the end of the source doc.)
// Add the words since the last hit-term insert.

// Remember how many terms are in this excerpt

// Store the excerpt for later sorting

// Start SUM_CONTEXT places away. The next
// search for relevant excerpts begins at i-SUM_CONTEXT

// If the target text doesn't appear, then we just
// excerpt the first SUM_LENGTH words from the document.

// Now choose the best items from the excerpt set.
// Stop when our Summary grows too large.

// Don't add fragments if it takes us over the max-limit

// FIXME: somehow integrate the cycle below into getSummary to save the cloning and memory;
// also, creating Tokens is suboptimal with 3.0.0. This whole class could be replaced by Highlighter.

/** Gets the terms from a query and adds them to the highlight set. */
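
The comments above describe the excerpt-selection steps without the code that carried them out. The following is a minimal sketch of those steps, not the original implementation. It works on plain string tokens rather than Lucene tokens, and the names `SummarySketch`, `Excerpt`, `findExcerpts`, and `summarize`, the `<b>` highlight markup, and the `" ... "` separator are all hypothetical. The parameters `context` and `maxLength` stand in for SUM_CONTEXT and SUM_LENGTH. As simplifying assumptions, each excerpt starts no earlier than the end of the previous one, and `maxLength` must exceed `context`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for the summarizer; not the original class. */
final class SummarySketch {

    /** Token window [start, end) and the count of distinct query terms inside it. */
    static final class Excerpt {
        final int start, end, uniqueHits;
        Excerpt(int start, int end, int uniqueHits) {
            this.start = start; this.end = end; this.uniqueHits = uniqueHits;
        }
        int length() { return end - start; }
    }

    /** Collects one excerpt per cluster of query-term hits. */
    static List<Excerpt> findExcerpts(String[] tokens, Set<String> query,
                                      int context, int maxLength) {
        if (context < 0 || maxLength <= context) {
            throw new IllegalArgumentException("need 0 <= context < maxLength");
        }
        List<Excerpt> excerpts = new ArrayList<>();
        int floor = 0; // assumption: an excerpt never overlaps the previous one
        int i = 0;
        while (i < tokens.length) {
            if (!query.contains(tokens[i])) { i++; continue; }
            // Start `context` terms back; the end is pushed forward on every hit.
            int start = Math.max(floor, i - context);
            int end = Math.min(i + context + 1, tokens.length);
            Set<String> seen = new HashSet<>();
            int j = start;
            while (j < end && j - start < maxLength) {
                if (query.contains(tokens[j])) {
                    seen.add(tokens[j]);
                    end = Math.min(j + context + 1, tokens.length);
                }
                j++;
            }
            excerpts.add(new Excerpt(start, j, seen.size()));
            floor = j;
            i = j; // resume scanning after this excerpt
        }
        return excerpts;
    }

    /** Picks the best-ranked excerpts that fit in maxLength tokens. */
    static String summarize(String[] tokens, Set<String> query,
                            int context, int maxLength) {
        List<Excerpt> ranked = new ArrayList<>(findExcerpts(tokens, query, context, maxLength));
        if (ranked.isEmpty()) {
            // No query term found: fall back to the first maxLength tokens.
            int n = Math.min(maxLength, tokens.length);
            return String.join(" ", Arrays.copyOfRange(tokens, 0, n));
        }
        // Most distinct query terms first; the stable sort keeps document order on ties.
        ranked.sort(Comparator.comparingInt((Excerpt e) -> e.uniqueHits).reversed());
        List<Excerpt> chosen = new ArrayList<>();
        int used = 0;
        for (Excerpt e : ranked) {
            if (used + e.length() > maxLength) continue; // would exceed the limit
            chosen.add(e);
            used += e.length();
        }
        chosen.sort(Comparator.comparingInt((Excerpt e) -> e.start));
        StringBuilder sb = new StringBuilder();
        for (Excerpt e : chosen) {
            if (sb.length() > 0) sb.append(" ... ");
            for (int k = e.start; k < e.end; k++) {
                if (k > e.start) sb.append(' ');
                String t = tokens[k];
                sb.append(query.contains(t) ? "<b>" + t + "</b>" : t);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox jumps over the lazy dog".split(" ");
        System.out.println(summarize(tokens, new HashSet<>(Arrays.asList("fox")), 1, 5));
    }
}
```

The sketch ranks with a sorted list rather than the SortedSet mentioned in the comments. This keeps excerpts with equal hit counts from needing a tie-breaking comparator.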