p4-search Copyright (c) 2013, Perforce Software
---
p4-search is built on the following technology stack:
Apache Solr - combines Apache Lucene (fast indexing) and Apache
Tika (file content/metadata parsing) to form the
base of the index service
p4-search - Java code that
* optionally scans an existing Perforce Server for
content on startup
* indexes changelist content via Solr as changes are
  committed (via a server trigger)
* optionally indexes all file revisions (except
deletes) and tracks which is the most recent
* provides an API for searching the Solr index via
GET or POST requests
* provides a minimal web UI for searching via a web
browser
Jetty/Tomcat/etc. web application server - hosts both the Solr
indexer application and the p4-search application
Operational theory
---
Initialization: p4-search walks through any depots it can see,
produces lists of files to index, places them on a queue, and
pulls files off the queue for indexing. The service can be
configured to scan all revisions or just the head revision, skip
certain file types, index only metadata, etc. The time this takes
depends on the size of your installation; while scanning, the
service can already be searched for whatever it has indexed so
far. Once scanning is complete, a key is set on the Perforce
Server so that future startups do not rescan; the same key is
used to resume interrupted scans.
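You can check the scan state directly on the Perforce Server by
inspecting these keys. A minimal sketch, assuming the key names
below are the ones you configured for fileScannerTokenKey and
changelistCatchupKey in search.config (the names are
illustrative, not defaults):

    # List search-related keys; "p4search*" is a hypothetical name pattern
    $ p4 keys -e "p4search*"
    # Show the last changelist the service has processed
    $ p4 key p4searchCatchup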
Changelist trigger: When the search-queue.sh script is installed
on the Perforce Server, it will send a curl request with the
changelist number to p4-search. The data is written into a
temporary location for processing later. Changelist processing
is largely the same as the initialization phase: turn the
changelist into a list of files and queue them up for later
processing.
Indexing: Each file revision is retrieved with a p4 print and
sent to the Apache Solr indexing service, which extracts whatever
data it can and indexes it. p4-search also indexes its own
Perforce-specific metadata; see schema.xml for the field
definitions.
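For each queued file revision, the service effectively performs
the equivalent of the following (the depot path is just an
example):

    # Print the file content without the one-line header
    $ p4 print -q //depot/project/README#3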
Search: p4-search first queries Apache Solr for results, then
runs p4 files as the requesting user to limit the results to what
that user would normally be able to see in the Perforce Server
via any other Perforce client application.
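The effect is the same filtering you would see from the command
line. For example, assuming a hypothetical user "alice" who is
logged in, only revisions her protections permit are listed:

    # Output is limited by alice's protections
    $ p4 -u alice files //depot/project/...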
Build and install
---
JDK 1.7 is required to build the project, and Gradle
(www.gradle.org) must be on the $PATH. JDK 1.7, Jetty, and Apache
Solr are required to run the service, and the p4 command line
client must be on the $PATH when using install.sh. The only
platform this process has been tested on is Ubuntu Linux.
$ROOT/search/build.sh - builds and tests the project (war file
and tgz package)
Once the project is successfully built, the $ROOT/search/tmp
directory contains both a build output directory ($OUTDIR) and
a tgz containing all of its contents. At the root of the output
directory is the install script (./install.sh) which will
* prompt for some basic installation information
* download Jetty (8.1) and Solr (4.5.1) if the required
tarballs are not in the directory
* unpack them and configure both for use by p4-search
* create a start.sh and stop.sh for starting and stopping
the services
The installation location is $OUTDIR/install by default. Run
$OUTDIR/install/start.sh to start both services and stop.sh to
stop them. To remove the installation, simply run
"rm -rf $OUTDIR/install".
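A typical build-and-install session looks roughly like the
sketch below; the output directory name is a placeholder for
your actual $OUTDIR:

    $ $ROOT/search/build.sh        # build and test; produces the war and tgz
    $ cd $ROOT/search/tmp/<OUTDIR> # placeholder for the real output directory
    $ ./install.sh                 # prompts, fetches Jetty/Solr, configures
    $ ./install/start.sh           # start Jetty and Solr
    $ ./install/stop.sh            # stop both services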
The p4-search tool uses a Perforce Server trigger (see scripts/
search-queue.sh) to receive new changelists, so the trigger must be
installed on the Perforce Server or no indexing past the initial
scan will take place.
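For reference, the triggers table entry might look like the
sketch below (run "p4 triggers" as a superuser to edit the
table). The script path and arguments are illustrative, so check
the comments in scripts/search-queue.sh for the exact invocation:

    Triggers:
        search.queue change-commit //... "/path/to/search-queue.sh %change%"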
Configuration
---
The install script creates a search.config properties file in
the $OUTDIR/install/jetty-.../resources directory. Configuration
changes currently require the service to be restarted. Except for
server configuration (serverProtocol, serverHost, serverPort,
indexerUser, indexerPassword), reasonable defaults are assumed
for the other properties:
com.perforce.search...
(general Solr configuration)
searchEngine: URL of the solr installation, e.g. http://localhost:8983
searchEngineToken: magic token matching the Perforce Server trigger
collectionName: solr collection that contains the indexed data
(general processing configuration)
queueLocation: location of queued changelist files to be indexed
maxSearchResults: maximum results returned by the service
maxStreamFileSize: largest file size for which content indexing
is attempted
ignoredExtensions: file containing a CRLF-delimited list of
extensions for which content processing is skipped
neverProcessList: file containing a CRLF-delimited list of
extensions that are never indexed at all
indexMethod: ALL_REVISIONS | HEAD_REVISIONS; HEAD_REVISIONS means
only keep the index up to date with the latest revision
blackFstatRegex: for Perforce attributes, which p4 attr to
skip (empty means do not index fstat data)
whiteFstatRegex: for Perforce attributes, which p4 attr to
include (empty means anything not in the blacklist)
changelistCatchupKey: key name where the latest processed changelist
is located. On startup the service will try to "catch up" based
on this value.
(file scanner config)
fileScannerTokenKey: key name to indicate when the initialization
is complete, empty implies "do not scan"
fileScannerTokenValue: key value to indicate when the initialization
is complete, empty implies "do not scan"
scanPaths: comma-separated list of paths in the Perforce Server
to scan
fileScannerThreads: number of threads handling the processing
fileScannerSleep: sleep interval used to throttle the scanner
fileScannerDepth: how many revisions deep to scan; 0
implies all revisions, 1 is head only, etc.
maxScanQueueSize: used to throttle the amount of files in
the scan queue
(GUI config)
commonsURL, swarmURL, p4webURL: URLs of services that can show
the files via links in the web browser. The Swarm and P4Web
settings are mutually exclusive, with preference given to the
Swarm URL.
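Putting it together, a minimal search.config might look like the
sketch below. All values are illustrative; check the file that
install.sh generated for the accepted forms:

    # Perforce Server connection (values are examples only)
    com.perforce.search.serverProtocol=p4java
    com.perforce.search.serverHost=perforce.example.com
    com.perforce.search.serverPort=1666
    com.perforce.search.indexerUser=indexer
    com.perforce.search.indexerPassword=secret
    # Solr connection
    com.perforce.search.searchEngine=http://localhost:8983
    com.perforce.search.collectionName=p4search
    com.perforce.search.searchEngineToken=some-shared-secret
    # indexing behavior
    com.perforce.search.indexMethod=HEAD_REVISIONS
    com.perforce.search.maxSearchResults=500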
API
---
See the API documentation included in the installation for how
to gain programmatic access to search results and how to send
specific queries to the search service. The p4-search web UI
exercises the underlying HTTP API using JavaScript XHR.
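For example, a query can be issued with a plain GET request. The
endpoint path and parameter name below are hypothetical; consult
the bundled API documentation for the real ones:

    # Hypothetical URL; substitute your host, port, and documented path
    $ curl "http://localhost:8080/search?query=main.c"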
Notes
---
* If you want to restrict the Apache Solr access to certain IP
addresses, you must add a handler to the etc/jetty.xml file in
the Solr installation, e.g.
org.eclipse.jetty.server.handler.IPAccessHandler. See the
install.sh script for a hint on how to do this.
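A minimal sketch of such a handler in etc/jetty.xml, assuming you
only want to allow localhost (adapt the surrounding elements to
what install.sh generated):

    <!-- Wrap the existing handler tree in an IPAccessHandler (sketch) -->
    <Set name="handler">
      <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
        <Call name="addWhite"><Arg>127.0.0.1</Arg></Call>
        <Set name="handler">
          <!-- the previously configured handler tree goes here -->
        </Set>
      </New>
    </Set>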
* If you suspect that the service has missed some changelists,
one easy way to fix this is to script a series of curl requests
to the server, forcing it to re-index the files in those
changelists. Re-indexing files will not corrupt the integrity
of the index; at worst it simply duplicates work.
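A sketch of such a script, assuming changelists 1000 through 1050
were missed. The script path and argument are assumptions; copy
the exact request from scripts/search-queue.sh, since the
endpoint and token are installation-specific:

    #!/bin/sh
    # Re-queue a range of changelists by replaying the trigger script.
    for change in $(seq 1000 1050); do
        /path/to/search-queue.sh "$change"
    done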
* While Apache Solr has the ability to parse many types of files,
you may find it useful to look through the Solr logs and
determine if you need additional extractors, e.g. xmpcore.jar for
media files.
* This version installs with Apache Solr 4.5.1 by default. When
testing against mp3 files, I found that some mp3 files cause
the Apache Tika parser in this version to hang, spinning the
CPU. If you have similar problems or expect to index a lot of
media files, you might consider an earlier version of Solr
(Solr 4.3.1 used Apache Tika 1.3, which worked for me),
replacing the Tika jar with version 1.5 (unreleased), or using
the ...ignoredExtensions configurable to exclude problematic
files.
* If you're curious about how the initial scan is doing, or about
other things the server is doing, the easiest way to check is
to tail the log file, e.g. run "tail -f start.log".