Overview (apache-nutch 2.4 API)

Core
Package	Description
org.apache.nutch.api	REST API to run and control crawl jobs.
org.apache.nutch.api.impl	Implementations of REST API interfaces.
org.apache.nutch.api.impl.db
org.apache.nutch.api.misc
org.apache.nutch.api.model.request
org.apache.nutch.api.model.response
org.apache.nutch.api.resources
org.apache.nutch.api.security
org.apache.nutch.core.jsoup.extractor	Core package of jsoup-extractor, containing the XML configuration parser and document structure.
org.apache.nutch.core.jsoup.extractor.normalizer	Normalizers for jsoup-extractor
org.apache.nutch.crawl	Crawl control code and tools to run the crawler.
org.apache.nutch.fetcher	The Nutch robot.
org.apache.nutch.host	Host database to store metadata per host.
org.apache.nutch.indexer	Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.html	Index raw HTML content.
org.apache.nutch.indexer.jsoup.extractor	Indexing filter for jsoup-extractor plugin
org.apache.nutch.indexer.solr
org.apache.nutch.indexwriter.elastic	Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.hbase	Index writer plugin for Apache HBase.
org.apache.nutch.metadata	A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.net	Web-related interfaces: URL `filters` and `normalizers`.
org.apache.nutch.net.protocols	Helper classes related to the `Protocol` interface, see also `org.apache.nutch.protocol`.
org.apache.nutch.parse	The `Parse` interface and related classes.
org.apache.nutch.parse.jsoup.extractor	Parse filter based on Jsoup
org.apache.nutch.plugin	The Nutch `Plugin` System.
org.apache.nutch.protocol	Classes related to the `Protocol` interface, see also `org.apache.nutch.net.protocols`.
org.apache.nutch.scoring	The `ScoringFilter` interface.
org.apache.nutch.storage	Representation (`web pages`, `host metadata`) of data in abstracted storage.
org.apache.nutch.tools	Miscellaneous tools.
org.apache.nutch.tools.arc	Tools to read the Arc file format.
org.apache.nutch.tools.proxy	Proxy to `benchmark` the crawler.
org.apache.nutch.util	Miscellaneous utility classes.
org.apache.nutch.util.domain	Classes for domain name analysis.
org.apache.nutch.webui	Provides classes and interfaces for Web UI
org.apache.nutch.webui.client	Provides client classes and interfaces for Web UI
org.apache.nutch.webui.client.impl	Contains implementation of client classes and interfaces for Web UI
org.apache.nutch.webui.client.model	Contains model classes of client for Web UI
org.apache.nutch.webui.config	Contains config classes for Web UI
org.apache.nutch.webui.model	Contains model classes for Web UI
org.apache.nutch.webui.pages	Provides classes and interfaces of pages for Web UI
org.apache.nutch.webui.pages.assets	Contains asset classes for Web UI
org.apache.nutch.webui.pages.auth	Contains authorization classes for Web UI
org.apache.nutch.webui.pages.components	Contains component classes for Web UI
org.apache.nutch.webui.pages.crawls	Contains crawl page classes for Web UI
org.apache.nutch.webui.pages.instances	Contains instance page classes for Web UI
org.apache.nutch.webui.pages.menu	Contains menu page classes for Web UI
org.apache.nutch.webui.pages.seed	Contains seed page classes for Web UI
org.apache.nutch.webui.pages.settings	Contains settings page classes for Web UI
org.apache.nutch.webui.service	Provides service classes and interfaces for Web UI
org.apache.nutch.webui.service.impl	Contains service implementation classes for Web UI

Plugins API
Package	Description
org.apache.nutch.protocol.http.api	Common API used by HTTP plugins (`http`, `httpclient`)
org.apache.nutch.urlfilter.api	Generic `URL filter` library, abstracting away from regular expression implementations.

Protocol Plugins
Package	Description
org.apache.nutch.protocol.file	Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp	Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http	Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient	Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.sftp	Protocol plugin which supports retrieving documents via the sftp protocol.

URL Filter Plugins
Package	Description
org.apache.nutch.urlfilter.automaton	URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain	URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.prefix	URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex	URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix	URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator	URL filter plugin that validates given URLs.
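
The prefix-based filtering described in the table above can be illustrated with a minimal, self-contained sketch. This is not the plugin's implementation: the class `PrefixFilterSketch` and its `filter` method are hypothetical names chosen for illustration, and the convention of returning `null` for a rejected URL is an assumption made here to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Nutch API.
class PrefixFilterSketch {
    private final List<String> prefixes;

    PrefixFilterSketch(List<String> prefixes) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.prefixes = new ArrayList<>(prefixes);
    }

    /** Returns the URL unchanged if it starts with an accepted prefix, otherwise null. */
    String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null; // assumption: null signals "rejected"
    }

    public static void main(String[] args) {
        PrefixFilterSketch f =
            new PrefixFilterSketch(Arrays.asList("http://example.com/docs/"));
        System.out.println(f.filter("http://example.com/docs/index.html"));
        System.out.println(f.filter("http://example.org/"));
    }
}
```

A linear scan is enough for a handful of prefixes; a large prefix list would probably call for a trie or a sorted structure instead.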

URL Normalizer Plugins
Package	Description
org.apache.nutch.net.urlnormalizer.basic	URL normalizer performing basic normalizations: remove default ports and dot segments in path.
org.apache.nutch.net.urlnormalizer.pass	URL normalizer dummy which does not change URLs.
org.apache.nutch.net.urlnormalizer.regex	URL normalizer with configurable rules based on regular expressions (`Pattern`).
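
As a rough illustration of the two basic normalizations named above (removing default ports and dot segments), the following sketch uses only `java.net.URI` from the JDK. It is not the plugin's code: `BasicNormalizeSketch`, `normalize`, and `defaultPort` are hypothetical names, the list of default ports is an assumption, and the sketch assumes an absolute URL with a host.

```java
import java.net.URI;

// Hypothetical illustration only; not part of the Nutch API.
class BasicNormalizeSketch {

    /** Returns the default port for a scheme, or -1 if the scheme is not listed here. */
    static int defaultPort(String scheme) {
        if ("http".equalsIgnoreCase(scheme)) return 80;
        if ("https".equalsIgnoreCase(scheme)) return 443;
        if ("ftp".equalsIgnoreCase(scheme)) return 21;
        return -1;
    }

    /** Assumes an absolute URL with a host, e.g. http://host:port/path. */
    static String normalize(String url) {
        // URI.normalize() removes "." segments and resolves ".." segments in the path.
        URI uri = URI.create(url).normalize();

        int port = uri.getPort();
        if (port == defaultPort(uri.getScheme())) {
            port = -1; // drop a port that matches the scheme's default
        }

        // Rebuild from the raw (still percent-encoded) components.
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://");
        if (uri.getRawUserInfo() != null) sb.append(uri.getRawUserInfo()).append('@');
        sb.append(uri.getHost());
        if (port != -1) sb.append(':').append(port);
        sb.append(uri.getRawPath());
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/c.html"
        System.out.println(normalize("http://example.com:80/a/./b/../c.html"));
    }
}
```

Non-default ports are left in place, so `https://example.com:8443/x` passes through unchanged.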

Scoring Plugins
Package	Description
org.apache.nutch.scoring.link	Scoring filter
org.apache.nutch.scoring.opic	Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.tld	Top Level Domain Scoring plugin.

Parse Plugins
Package	Description
org.apache.nutch.parse.html	An HTML document parsing plugin.
org.apache.nutch.parse.js	Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.tika	Parse various document formats with help of Apache Tika.

Parse Filter Plugins
Package	Description
org.apache.nutch.parse.metatags	Parse filter to extract meta tags: keywords, description, etc.

Indexing Filter Plugins
Package	Description
org.apache.nutch.indexer.anchor	An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic	A basic indexing plugin that adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.metadata	Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more	An indexing plugin that adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.subcollection	Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld	Top Level Domain Indexing plugin.

Indexer Plugins
Package	Description
org.apache.nutch.indexwriter.solr	Index writer plugin for Apache Solr.

Misc. Plugins
Package	Description
org.apache.nutch.analysis.lang	Text document language identifier.
org.apache.nutch.collection	Subcollection is a subset of an index.
org.apache.nutch.microformats.reltag	A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.creativecommons.nutch	Sample plugins that parse and index Creative Commons metadata.

Apache Nutch 2.X is a branch of the Apache Nutch open source web-search software project. It builds on Apache Gora for data persistence and Apache Solr for indexing, adding web-specifics such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.