Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.indexer.anchor |
An indexing plugin for inbound anchor text.
|
org.apache.nutch.indexer.basic |
A basic indexing plugin that adds basic fields: url, host, title, content, etc.
|
org.apache.nutch.indexer.html |
Index raw HTML content.
|
org.apache.nutch.indexer.jsoup.extractor |
Indexing filter for the jsoup-extractor plugin.
|
org.apache.nutch.indexer.metadata |
Indexing filter to add document metadata to the index.
|
org.apache.nutch.indexer.more |
An indexing plugin that adds "more" index fields:
last modified date, MIME type, content length.
|
org.apache.nutch.indexer.subcollection |
Indexing filter to assign documents to subcollections.
|
org.apache.nutch.indexer.tld |
Top Level Domain Indexing plugin.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.apache.nutch.parse |
The
Parse interface and related classes. |
org.apache.nutch.parse.html |
An HTML document parsing plugin.
|
org.apache.nutch.parse.js |
Parser and parse filter plugin to extract all (possible) links
from JavaScript files and embedded JavaScript code snippets.
|
org.apache.nutch.parse.jsoup.extractor |
Parse filter based on Jsoup.
|
org.apache.nutch.parse.metatags |
Parse filter to extract meta tags: keywords, description, etc.
|
org.apache.nutch.parse.tika |
Parse various document formats with help of
Apache Tika.
|
org.apache.nutch.protocol |
Classes related to the Protocol interface;
see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file |
Protocol plugin which supports retrieving local file resources.
|
org.apache.nutch.protocol.ftp |
Protocol plugin which supports retrieving documents via the ftp protocol.
|
org.apache.nutch.protocol.http |
Protocol plugin which supports retrieving documents via the http protocol.
|
org.apache.nutch.protocol.http.api |
Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.sftp |
Protocol plugin which supports retrieving documents via the sftp protocol.
|
org.apache.nutch.scoring |
The
ScoringFilter interface. |
org.apache.nutch.scoring.link |
Link analysis scoring filter.
|
org.apache.nutch.scoring.opic |
Scoring filter implementing a variant of the Online Page Importance Computation
(OPIC) algorithm.
|
org.apache.nutch.scoring.tld |
Top Level Domain Scoring plugin.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
Modifier and Type | Class and Description |
---|---|
class |
HTMLLanguageParser
Adds metadata identifying the language of a document, if found. We could also
run statistical analysis here, but we'd miss all other formats.
|
class |
LanguageIndexingFilter
An
IndexingFilter that adds a
lang (language) field to the document. |
Modifier and Type | Interface and Description |
---|---|
interface |
IndexCleaningFilter
Extension point for indexing.
|
interface |
IndexingFilter
Extension point for indexing.
|
Modifier and Type | Class and Description |
---|---|
class |
AnchorIndexingFilter
Indexing filter that offers an option to either index all inbound anchor text
for a document or deduplicate anchors.
|
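
The two modes described for AnchorIndexingFilter (index every inbound anchor, or deduplicate them) can be illustrated with a small standalone sketch. This is not the plugin's source: the class `AnchorDedupSketch`, the method `selectAnchors`, and the choice of case-insensitive comparison are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names and the case-insensitive rule are assumptions,
// not taken from the Nutch AnchorIndexingFilter implementation.
class AnchorDedupSketch {

    /**
     * Returns the anchors to index, in first-seen order. When deduplicate is
     * true, an anchor that differs from an earlier one only by letter case
     * is skipped.
     */
    static List<String> selectAnchors(List<String> inboundAnchors, boolean deduplicate) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inboundAnchors) {
            // Set.add returns false when the lowercased anchor was already seen.
            if (deduplicate && !seen.add(anchor.toLowerCase(Locale.ROOT))) {
                continue;
            }
            kept.add(anchor);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Apache Nutch", "apache nutch", "Nutch home");
        System.out.println(selectAnchors(anchors, false)); // all three anchors
        System.out.println(selectAnchors(anchors, true));  // repeat dropped
    }
}
```

Keeping the first-seen spelling preserves the original anchor text in the index while still collapsing repeats.
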
Modifier and Type | Class and Description |
---|---|
class |
BasicIndexingFilter
Adds basic searchable fields to a document.
|
Modifier and Type | Class and Description |
---|---|
class |
HtmlIndexingFilter
Add raw HTML content of a document to the index.
|
Modifier and Type | Class and Description |
---|---|
class |
JsoupIndexingFilter |
Modifier and Type | Class and Description |
---|---|
class |
MetadataIndexer
Indexer which can be configured to extract metadata from the crawldb, parse
metadata or content metadata.
|
Modifier and Type | Class and Description |
---|---|
class |
MoreIndexingFilter
Add (or reset) a few metaData properties as respective fields (if they are
available), so that they can be accurately used within the search index.
|
Modifier and Type | Class and Description |
---|---|
class |
SubcollectionIndexingFilter |
Modifier and Type | Class and Description |
---|---|
class |
TLDIndexingFilter
Adds the top-level domain extensions to the index.
|
Modifier and Type | Class and Description |
---|---|
class |
RelTagIndexingFilter
An
IndexingFilter that adds tag
field(s) to the document. |
class |
RelTagParser
Adds microformat rel-tags of document if found.
|
Modifier and Type | Interface and Description |
---|---|
interface |
ParseFilter
Extension point for DOM-based parsers.
|
interface |
Parser
A parser for content generated by a
Protocol implementation. |
Modifier and Type | Class and Description |
---|---|
class |
HtmlParser |
Modifier and Type | Class and Description |
---|---|
class |
JSParseFilter
This class is a heuristic link extractor for JavaScript files and code
snippets.
|
Modifier and Type | Class and Description |
---|---|
class |
JsoupHtmlParser |
Modifier and Type | Class and Description |
---|---|
class |
MetaTagsParser
Parse HTML meta tags (keywords, description) and store them in the parse
metadata so that they can be indexed with the index-metadata plugin with the
prefix 'metatag.'.
|
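
As a rough illustration of the 'metatag.' prefix convention described for MetaTagsParser, the sketch below copies selected meta tags into a plain map standing in for the parse metadata. The class `MetaTagPrefixSketch`, the method `toParseMetadata`, the `wanted` set, and the lowercasing of tag names are all assumptions for illustration; the real plugin works on Nutch's own metadata types.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; a plain Map stands in for the parse metadata.
class MetaTagPrefixSketch {

    // Prefix named in the MetaTagsParser description.
    static final String PREFIX = "metatag.";

    /**
     * Copies the meta tags whose lowercased name is in wanted into a new map,
     * keyed by PREFIX + lowercased name. Lowercasing is an assumption here.
     */
    static Map<String, String> toParseMetadata(Map<String, String> metaTags, Set<String> wanted) {
        Map<String, String> parseMetadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> tag : metaTags.entrySet()) {
            String name = tag.getKey().toLowerCase(Locale.ROOT);
            if (wanted.contains(name)) {
                parseMetadata.put(PREFIX + name, tag.getValue());
            }
        }
        return parseMetadata;
    }
}
```

With tags `Keywords` and `viewport` and a wanted set of `keywords` and `description`, only the key `metatag.keywords` ends up in the result, which an indexing step could then map to an index field.
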
Modifier and Type | Class and Description |
---|---|
class |
TikaParser
Wrapper for Tika parsers.
|
Modifier and Type | Interface and Description |
---|---|
interface |
Protocol
A retriever of url content.
|
Modifier and Type | Class and Description |
---|---|
class |
File
This class is a protocol plugin used for file: scheme.
|
Modifier and Type | Class and Description |
---|---|
class |
Ftp
This class is a protocol plugin used for ftp: scheme.
|
Modifier and Type | Class and Description |
---|---|
class |
Http |
Modifier and Type | Class and Description |
---|---|
class |
HttpBase |
Modifier and Type | Class and Description |
---|---|
class |
Sftp
This class uses the JSch package to fetch content using the SFTP protocol.
|
Modifier and Type | Interface and Description |
---|---|
interface |
ScoringFilter
A contract defining behavior of scoring plugins.
|
Modifier and Type | Class and Description |
---|---|
class |
ScoringFilters
Creates and caches
ScoringFilter implementing plugins. |
Modifier and Type | Class and Description |
---|---|
class |
LinkAnalysisScoringFilter |
Modifier and Type | Class and Description |
---|---|
class |
OPICScoringFilter
This plugin implements a variant of an Online Page Importance Computation
(OPIC) score, described in this paper:
Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive
On-Line Page Importance Computation.
|
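
For readers unfamiliar with OPIC, the core idea is that a page's accumulated "cash" is split equally among its outlinks when the page is processed. The sketch below shows only that distribution step in a simplified form. It is not the OPICScoringFilter implementation (which the description calls a variant), it omits the history bookkeeping of the original algorithm, and the names `OpicCashSketch` and `distribute` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified sketch of OPIC-style cash distribution;
// not the scoring logic used by the Nutch plugin.
class OpicCashSketch {

    /**
     * Moves all cash held by page to its outlinks in equal shares and
     * returns the share each outlink received. A page without outlinks
     * keeps its cash and 0.0 is returned.
     */
    static double distribute(String page, List<String> outlinks, Map<String, Double> cash) {
        if (outlinks.isEmpty()) {
            return 0.0;
        }
        double available = cash.getOrDefault(page, 0.0);
        // Reset first so that a self-link still receives its share below.
        cash.put(page, 0.0);
        double share = available / outlinks.size();
        for (String target : outlinks) {
            cash.merge(target, share, Double::sum);
        }
        return share;
    }

    public static void main(String[] args) {
        Map<String, Double> cash = new HashMap<>();
        cash.put("a", 1.0);
        distribute("a", Arrays.asList("b", "c"), cash);
        System.out.println(cash);
    }
}
```

The total cash in the map is conserved by each call, which is the property that lets importance be estimated incrementally while crawling.
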
Modifier and Type | Class and Description |
---|---|
class |
TLDScoringFilter
Scoring filter to boost TLDs.
|
Modifier and Type | Class and Description |
---|---|
class |
CCIndexingFilter
Adds basic searchable fields to a document.
|
class |
CCParseFilter
Adds metadata identifying the Creative Commons license used, if any.
|
Copyright © 2019 The Apache Software Foundation