Summarizer.java revision 434
/*
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/** Implements hit summarization. */

/** The number of context terms to display preceding and following matches. */

/** The total number of terms to display in a summary. */

/** Converts text to tokens. */

/**
 * Class Excerpt represents a single passage found in the
 * document, with some appropriate regions highlighted.
 */

/** Returns how many unique tokens we have. */

/** How many fragments we have. */

/** Returns an Enum for all the fragments. */

/** Returns a summary for the given pre-tokenized text. */

// Simplistic implementation. Finds the first fragments in the document
// containing any query terms.
// TODO: check that phrases in the query are matched in the fragment

// Create a SortedSet that ranks excerpts according to
// how many query terms are present. An excerpt is
// a List full of Fragments and Highlights

// Iterate through all terms in the document

// If we find a term that's in the query...

// Start searching at a point SUM_CONTEXT terms back,
// and move SUM_CONTEXT terms into the future.
// Iterate from the start point to the finish, adding
// terms all the way. The end of the passage is always
// SUM_CONTEXT beyond the last query-term.

// Iterate through as long as we're before the end of
// the document and we haven't hit the max-number-of-items

// Now grab the hit-element, if present

// We found the series of search-term hits and added
// them (with intervening text) to the excerpt. Now
// we need to add the trailing edge of text.
// So if (j < tokens.length) then there is still trailing
// text to add. (We haven't hit the end of the source doc.)
// Add the words since the last hit-term insert.

// Remember how many terms are in this excerpt

// Store the excerpt for later sorting

// Start SUM_CONTEXT places away. The next
// search for relevant excerpts begins at i-SUM_CONTEXT

// If the target text doesn't appear, then we just
// excerpt the first SUM_LENGTH words from the document.

// Now choose the best items from the excerpt set.
// Stop when our Summary grows too large.

// Don't add fragments if it takes us over the max-limit

/** Gets the terms from a query and adds them to highlight. */

/**
 * Tests Summary-generation. User inputs the name of a
 * text file and a query string.
 */
System.out.println(
    "Usage: java org.apache.nutch.searcher.Summarizer <textfile> <queryStr>");

// Load the text file into a single string.

// Convert the query string into a proper Query
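
Only the comments of the summarization routine survive in this view, so the following is a minimal sketch of the algorithm those comments describe. It is not the original Nutch code. The class name `ExcerptSketch`, the method names, the values chosen for `SUM_CONTEXT` and `SUM_LENGTH`, and the `[term]` highlight markup are all assumptions made for illustration. The sketch also simplifies one step: the next search resumes at the end of the previous excerpt rather than SUM_CONTEXT terms before it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the excerpt-window algorithm; not the Nutch class. */
public class ExcerptSketch {
    /** Context terms kept on each side of a hit (assumed value). */
    public static final int SUM_CONTEXT = 2;
    /** Maximum number of terms in a summary (assumed value). */
    public static final int SUM_LENGTH = 8;

    /** A token window [start, end) and the number of unique query terms inside it. */
    public static final class Excerpt {
        final int start;
        final int end;
        final int uniqueHits;

        Excerpt(int start, int end, int uniqueHits) {
            this.start = start;
            this.end = end;
            this.uniqueHits = uniqueHits;
        }

        int length() {
            return end - start;
        }
    }

    /** Scans the tokens and builds one window around each run of query-term hits. */
    public static List<Excerpt> findExcerpts(String[] tokens, Set<String> query) {
        List<Excerpt> excerpts = new ArrayList<>();
        int lastEnd = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!query.contains(tokens[i])) {
                continue;
            }
            // Start SUM_CONTEXT terms back, but never overlap the previous excerpt.
            int start = Math.max(lastEnd, i - SUM_CONTEXT);
            int end = Math.min(i + SUM_CONTEXT + 1, tokens.length);
            Set<String> seen = new HashSet<>();
            // Each further hit pushes the end to SUM_CONTEXT terms beyond it,
            // until the window reaches SUM_LENGTH terms or the document ends.
            for (int j = i; j < end && j - start < SUM_LENGTH; j++) {
                if (query.contains(tokens[j])) {
                    seen.add(tokens[j]);
                    end = Math.min(j + SUM_CONTEXT + 1, tokens.length);
                }
            }
            end = Math.min(end, start + SUM_LENGTH);
            excerpts.add(new Excerpt(start, end, seen.size()));
            lastEnd = end;
            i = end - 1; // resume scanning right after this excerpt
        }
        return excerpts;
    }

    /** Picks the excerpts with the most unique hits until SUM_LENGTH is used up. */
    public static String summarize(String[] tokens, Set<String> query) {
        List<Excerpt> excerpts = findExcerpts(tokens, query);
        if (excerpts.isEmpty()) {
            // No query term appears: excerpt the first SUM_LENGTH words.
            excerpts.add(new Excerpt(0, Math.min(SUM_LENGTH, tokens.length), 0));
        }
        // Stable sort: excerpts with equal hit counts keep document order.
        excerpts.sort(Comparator.comparingInt((Excerpt e) -> e.uniqueHits).reversed());
        StringBuilder out = new StringBuilder();
        int used = 0;
        for (Excerpt e : excerpts) {
            if (used + e.length() > SUM_LENGTH) {
                break; // stop once the next excerpt would exceed the limit
            }
            if (used > 0) {
                out.append(" ... ");
            }
            for (int k = e.start; k < e.end; k++) {
                if (k > e.start) {
                    out.append(' ');
                }
                out.append(query.contains(tokens[k]) ? "[" + tokens[k] + "]" : tokens[k]);
            }
            used += e.length();
        }
        return out.toString();
    }
}
```

With the assumed constants, summarizing the tokens of "the quick brown fox jumps over the lazy dog and runs far away from here" for the single query term "fox" yields `quick brown [fox] jumps over`. A query term that never occurs falls back to the first eight words.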