
apache-nutch 2.4 API

Apache Nutch 2.X is a branch of the Apache Nutch open source web-search software project.


Core 
Package Description
org.apache.nutch.api
REST API to run and control crawl jobs.
org.apache.nutch.api.impl
Implementations of REST API interfaces.
org.apache.nutch.api.impl.db  
org.apache.nutch.api.misc  
org.apache.nutch.api.model.request  
org.apache.nutch.api.model.response  
org.apache.nutch.api.resources  
org.apache.nutch.api.security  
org.apache.nutch.core.jsoup.extractor
Core package of jsoup-extractor, containing the XML configuration parser and document structure.
org.apache.nutch.core.jsoup.extractor.normalizer
Normalizers for jsoup-extractor
org.apache.nutch.crawl
Crawl control code and tools to run the crawler.
org.apache.nutch.fetcher
The Nutch robot.
org.apache.nutch.host
Host database to store metadata per host.
org.apache.nutch.indexer
Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.html
Index raw HTML content.
org.apache.nutch.indexer.jsoup.extractor
Indexing filter for jsoup-extractor plugin
org.apache.nutch.indexer.solr  
org.apache.nutch.indexwriter.elastic
Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.hbase
Index writer plugin for Apache HBase.
org.apache.nutch.metadata
A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.net
Web-related interfaces: URL filters and normalizers.
org.apache.nutch.net.protocols
Helper classes related to the Protocol interface, see also org.apache.nutch.protocol.
org.apache.nutch.parse
The Parse interface and related classes.
org.apache.nutch.parse.jsoup.extractor
Parse filter based on Jsoup
org.apache.nutch.plugin
The Nutch Plugin System.
org.apache.nutch.protocol
Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.scoring
The ScoringFilter interface.
org.apache.nutch.storage
Representation (web pages, host metadata) of data in abstracted storage.
org.apache.nutch.tools
Miscellaneous tools.
org.apache.nutch.tools.arc
Tools to read the Arc file format.
org.apache.nutch.tools.proxy
Proxy to benchmark the crawler.
org.apache.nutch.util
Miscellaneous utility classes.
org.apache.nutch.util.domain
Classes for domain name analysis.
org.apache.nutch.webui
Provides classes and interfaces for Web UI
org.apache.nutch.webui.client
Provides client classes and interfaces for Web UI
org.apache.nutch.webui.client.impl
Contains implementation of client classes and interfaces for Web UI
org.apache.nutch.webui.client.model
Contains model classes of client for Web UI
org.apache.nutch.webui.config
Contains config classes for Web UI
org.apache.nutch.webui.model
Contains model classes for Web UI
org.apache.nutch.webui.pages
Provides classes and interfaces of pages for Web UI
org.apache.nutch.webui.pages.assets
Contains asset classes for Web UI
org.apache.nutch.webui.pages.auth
Contains authorization classes for Web UI
org.apache.nutch.webui.pages.components
Contains component classes for Web UI
org.apache.nutch.webui.pages.crawls
Contains crawl page classes for Web UI
org.apache.nutch.webui.pages.instances
Contains instance page classes for Web UI
org.apache.nutch.webui.pages.menu
Contains menu page classes for Web UI
org.apache.nutch.webui.pages.seed
Contains seed page classes for Web UI
org.apache.nutch.webui.pages.settings
Contains settings page classes for Web UI
org.apache.nutch.webui.service
Provides service classes and interfaces for Web UI
org.apache.nutch.webui.service.impl
Contains service implementation classes for Web UI
Plugins API 
Package Description
org.apache.nutch.protocol.http.api
Common API used by HTTP plugins (http, httpclient)
org.apache.nutch.urlfilter.api
Generic URL filter library, abstracting away from regular expression implementations.
Protocol Plugins 
Package Description
org.apache.nutch.protocol.file
Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp
Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http
Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient
Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.sftp
Protocol plugin which supports retrieving documents via the sftp protocol.
URL Filter Plugins 
Package Description
org.apache.nutch.urlfilter.automaton
URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain
URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.prefix
URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex
URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix
URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator
URL filter plugin that validates given URLs.
URL Normalizer Plugins 
Package Description
org.apache.nutch.net.urlnormalizer.basic
URL normalizer performing basic normalizations: remove default ports and dot segments in path.
org.apache.nutch.net.urlnormalizer.pass
Dummy URL normalizer which does not change URLs.
org.apache.nutch.net.urlnormalizer.regex
URL normalizer with configurable rules based on regular expressions (Pattern).
Scoring Plugins 
Package Description
org.apache.nutch.scoring.link
Scoring filter
org.apache.nutch.scoring.opic
Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.tld
Top Level Domain Scoring plugin.
Parse Plugins 
Package Description
org.apache.nutch.parse.html
An HTML document parsing plugin.
org.apache.nutch.parse.js
Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.tika
Parse various document formats with help of Apache Tika.
Parse Filter Plugins 
Package Description
org.apache.nutch.parse.metatags
Parse filter to extract meta tags: keywords, description, etc.
Indexing Filter Plugins 
Package Description
org.apache.nutch.indexer.anchor
An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic
A basic indexing plugin that adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.metadata
Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more
An indexing plugin that adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.subcollection
Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld
Top Level Domain Indexing plugin.
Indexer Plugins 
Package Description
org.apache.nutch.indexwriter.solr
Index writer plugin for Apache Solr.
Misc. Plugins 
Package Description
org.apache.nutch.analysis.lang
Text document language identifier.
org.apache.nutch.collection
Subcollection is a subset of an index.
org.apache.nutch.microformats.reltag
A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.creativecommons.nutch
Sample plugins that parse and index Creative Commons metadata.

Apache Nutch 2.X is a branch of the Apache Nutch open source web-search software project. It builds on Apache Gora for data persistence and Apache Solr for indexing, adding web-specifics such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.


Copyright © 2019 The Apache Software Foundation