Summarizer.java revision 1318
 * Copyright 2005 The Apache Software Foundation
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.

// modified by Lubos Kosco 2010 to upgrade lucene to 3.0.0
// TODO : rewrite this to use Highlighter from lucene contrib ...

/** Implements hit summarization. */

  /** The number of context terms to display preceding and following matches. */
  /** The total number of terms to display in a summary. */
  /** Converts text to tokens. */

   * Class Excerpt represents a single passage found in the
   * document, with some appropriate regions highlighted.

   * Return how many unique tokens we have

   * How many fragments we have.

   * Add a frag to the list.

   * Return an Enum for all the fragments

  /** Returns a summary for the given pre-tokenized text. */
    // Simplistic implementation. Finds the first fragments in the document
    // containing any query terms.
    // @TODO: check that phrases in the query are matched in the fragment

    // Create a SortedSet that ranks excerpts according to
    // how many query terms are present. An excerpt is
    // a List full of Fragments and Highlights

    // Iterate through all terms in the document

    // If we find a term that's in the query...

    // Start searching at a point SUM_CONTEXT terms back,
    // and move SUM_CONTEXT terms into the future.

    // Iterate from the start point to the finish, adding
    // terms all the way. The end of the passage is always
    // SUM_CONTEXT beyond the last query-term.

    // Iterate through as long as we're before the end of
    // the document and we haven't hit the max-number-of-items

    // Now grab the hit-element, if present

    // We found the series of search-term hits and added
    // them (with intervening text) to the excerpt. Now
    // we need to add the trailing edge of text.
    // So if (j < tokens.length) then there is still trailing
    // text to add. (We haven't hit the end of the source doc.)
    // Add the words since the last hit-term insert.

    // Remember how many terms are in this excerpt

    // Store the excerpt for later sorting

    // Start SUM_CONTEXT places away. The next
    // search for relevant excerpts begins at i-SUM_CONTEXT

    // If the target text doesn't appear, then we just
    // excerpt the first SUM_LENGTH words from the document.

    // Now choose the best items from the excerpt set.
    // Stop when our Summary grows too large.

    // Don't add fragments if it takes us over the max-limit

  //FIXME somehow integrate below cycle to getSummary to save the cloning and memory,
  //also creating Tokens is suboptimal with 3.0.0, this whole class could be replaced by highlighter

   * Gets the terms from a query and adds them to highlight
   * a stream of tokens
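
The comments above describe the excerpting pass in words only, so the following is a minimal sketch of that idea, not the original implementation. It assumes whitespace-separated string tokens and a plain `Set<String>` of query terms. The names `ExcerptSketch`, `Excerpt`, `findExcerpts`, `summarize`, the `<b>` highlight markup, and the `" ... "` separator are all hypothetical. `sumContext` and `sumLength` stand in for `SUM_CONTEXT` and `SUM_LENGTH`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; names and output format are assumptions, not the original class. */
final class ExcerptSketch {

    /** Token range [start, end) plus the number of distinct query terms inside it. */
    static final class Excerpt {
        final int start;
        final int end;
        final int uniqueTerms;

        Excerpt(int start, int end, int uniqueTerms) {
            this.start = start;
            this.end = end;
            this.uniqueTerms = uniqueTerms;
        }

        int length() {
            return end - start;
        }
    }

    static List<Excerpt> findExcerpts(String[] tokens, Set<String> query,
                                      int sumContext, int sumLength) {
        List<Excerpt> excerpts = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!query.contains(tokens[i])) {
                i++;
                continue;
            }
            // Open a window sumContext tokens before the hit.
            int start = Math.max(0, i - sumContext);
            int end = Math.min(tokens.length, i + sumContext + 1);
            Set<String> seen = new HashSet<>();
            int j = start;
            while (j < end && j - start < sumLength) {
                if (query.contains(tokens[j])) {
                    seen.add(tokens[j]);
                    // Each hit pushes the end to sumContext tokens past itself.
                    end = Math.min(tokens.length, j + sumContext + 1);
                }
                j++;
            }
            excerpts.add(new Excerpt(start, j, seen.size()));
            // Resume far enough ahead that the next window cannot overlap this one.
            i = Math.max(i + 1, j + sumContext);
        }
        return excerpts;
    }

    static String summarize(String[] tokens, Set<String> query,
                            int sumContext, int sumLength) {
        List<Excerpt> excerpts = findExcerpts(tokens, query, sumContext, sumLength);
        StringBuilder out = new StringBuilder();
        if (excerpts.isEmpty()) {
            // No query term found: fall back to the first sumLength tokens.
            appendRange(out, tokens, 0, Math.min(sumLength, tokens.length), query);
            return out.toString();
        }
        // Rank by distinct query terms, best first.
        excerpts.sort((a, b) -> Integer.compare(b.uniqueTerms, a.uniqueTerms));
        int used = 0;
        for (Excerpt e : excerpts) {
            if (used + e.length() > sumLength) {
                continue; // adding this excerpt would exceed the length budget
            }
            if (out.length() > 0) {
                out.append(" ... ");
            }
            appendRange(out, tokens, e.start, e.end, query);
            used += e.length();
        }
        return out.toString();
    }

    private static void appendRange(StringBuilder out, String[] tokens,
                                    int from, int to, Set<String> query) {
        for (int k = from; k < to; k++) {
            if (k > from) {
                out.append(' ');
            }
            boolean hit = query.contains(tokens[k]);
            out.append(hit ? "<b>" + tokens[k] + "</b>" : tokens[k]);
        }
    }
}
```

Like the approach the comments call simplistic, this sketch does not check that query phrases match as phrases. A hit that falls in the gap skipped after an excerpt is also missed.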