Package | Description |
---|---|
org.apache.nutch.api | REST API to run and control crawl jobs. |
org.apache.nutch.api.impl | Implementations of REST API interfaces. |
org.apache.nutch.api.impl.db | |
org.apache.nutch.api.misc | |
org.apache.nutch.api.model.request | |
org.apache.nutch.api.model.response | |
org.apache.nutch.api.resources | |
org.apache.nutch.api.security | |
org.apache.nutch.core.jsoup.extractor | Core package of jsoup-extractor, containing the XML configuration parser and document structure. |
org.apache.nutch.core.jsoup.extractor.normalizer | Normalizers for jsoup-extractor. |
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.host | Host database to store metadata per host. |
org.apache.nutch.indexer | Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index. |
org.apache.nutch.indexer.html | Index raw HTML content. |
org.apache.nutch.indexer.jsoup.extractor | Indexing filter for the jsoup-extractor plugin. |
org.apache.nutch.indexer.solr | |
org.apache.nutch.indexwriter.elastic | Index writer plugin for Elasticsearch. |
org.apache.nutch.indexwriter.hbase | Index writer plugin for Apache HBase. |
org.apache.nutch.metadata | A multi-valued metadata container and a set of constant fields for Nutch metadata. |
org.apache.nutch.net | Web-related interfaces: URL filters and normalizers. |
org.apache.nutch.net.protocols | Helper classes related to the Protocol interface, see also org.apache.nutch.protocol. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.parse.jsoup.extractor | Parse filter based on Jsoup. |
org.apache.nutch.plugin | The Nutch Plugin System. |
org.apache.nutch.protocol | Classes related to the Protocol interface, see also org.apache.nutch.net.protocols. |
org.apache.nutch.scoring | The ScoringFilter interface. |
org.apache.nutch.storage | Representation (web pages, host metadata) of data in abstracted storage. |
org.apache.nutch.tools | Miscellaneous tools. |
org.apache.nutch.tools.arc | Tools to read the Arc file format. |
org.apache.nutch.tools.proxy | Proxy to benchmark the crawler. |
org.apache.nutch.util | Miscellaneous utility classes. |
org.apache.nutch.util.domain | Classes for domain name analysis. |
org.apache.nutch.webui | Provides classes and interfaces for the Web UI. |
org.apache.nutch.webui.client | Provides client classes and interfaces for the Web UI. |
org.apache.nutch.webui.client.impl | Contains implementations of client classes and interfaces for the Web UI. |
org.apache.nutch.webui.client.model | Contains client model classes for the Web UI. |
org.apache.nutch.webui.config | Contains config classes for the Web UI. |
org.apache.nutch.webui.model | Contains model classes for the Web UI. |
org.apache.nutch.webui.pages | Provides page classes and interfaces for the Web UI. |
org.apache.nutch.webui.pages.assets | Contains asset classes for the Web UI. |
org.apache.nutch.webui.pages.auth | Contains authorization classes for the Web UI. |
org.apache.nutch.webui.pages.components | Contains component classes for the Web UI. |
org.apache.nutch.webui.pages.crawls | Contains crawl page classes for the Web UI. |
org.apache.nutch.webui.pages.instances | Contains instance page classes for the Web UI. |
org.apache.nutch.webui.pages.menu | Contains menu page classes for the Web UI. |
org.apache.nutch.webui.pages.seed | Contains seed page classes for the Web UI. |
org.apache.nutch.webui.pages.settings | Contains settings page classes for the Web UI. |
org.apache.nutch.webui.service | Provides service classes and interfaces for the Web UI. |
org.apache.nutch.webui.service.impl | Contains service implementation classes for the Web UI. |
Package | Description |
---|---|
org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.urlfilter.api | Generic URL filter library, abstracting away from regular expression implementations. |
Package | Description |
---|---|
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the http protocol. |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. |
org.apache.nutch.protocol.sftp | Protocol plugin which supports retrieving documents via the sftp protocol. |
Package | Description |
---|---|
org.apache.nutch.urlfilter.automaton | URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™. |
org.apache.nutch.urlfilter.domain | URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names. |
org.apache.nutch.urlfilter.prefix | URL filter plugin to include only URLs which match one of a given list of URL prefixes. |
org.apache.nutch.urlfilter.regex | URL filter plugin to include and/or exclude URLs matching Java regular expressions. |
org.apache.nutch.urlfilter.suffix | URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes. |
org.apache.nutch.urlfilter.validator | URL filter plugin that validates given URLs. |
Package | Description |
---|---|
org.apache.nutch.net.urlnormalizer.basic | URL normalizer performing basic normalizations: remove default ports and dot segments in path. |
org.apache.nutch.net.urlnormalizer.pass | URL normalizer dummy which does not change URLs. |
org.apache.nutch.net.urlnormalizer.regex | URL normalizer with configurable rules based on regular expressions (Pattern). |
Package | Description |
---|---|
org.apache.nutch.scoring.link | Scoring filter |
org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. |
Package | Description |
---|---|
org.apache.nutch.parse.html | An HTML document parsing plugin. |
org.apache.nutch.parse.js | Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets. |
org.apache.nutch.parse.tika | Parse various document formats with help of Apache Tika. |
Package | Description |
---|---|
org.apache.nutch.parse.metatags | Parse filter to extract meta tags: keywords, description, etc. |
Package | Description |
---|---|
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin, adds basic fields: url, host, title, content, etc. |
org.apache.nutch.indexer.metadata | Indexing filter to add document metadata to the index. |
org.apache.nutch.indexer.more | A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length. |
org.apache.nutch.indexer.subcollection | Indexing filter to assign documents to subcollections. |
org.apache.nutch.indexer.tld | Top Level Domain Indexing plugin. |
Package | Description |
---|---|
org.apache.nutch.indexwriter.solr | Index writer plugin for Apache Solr. |
Package | Description |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.collection | Subcollection is a subset of an index. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Apache Nutch 2.X is a branch of the Apache Nutch open source web-search software project. It builds on Apache Gora for data persistence and Apache Solr for indexing, adding web-specifics such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.
Copyright © 2019 The Apache Software Foundation