P4Search Copyright (c) 2014, Perforce Software
---

P4Search is built on the following technology stack:

Apache Solr - combines Apache Lucene (fast indexing) and Apache
              Tika (file content/metadata parsing) to form the
              base of the index service
P4Search -   Java code that
              * optionally scans an existing Perforce Server for
                content on startup
              * indexes the content of each committed change via
                Solr as it happens (using a server trigger)
              * optionally indexes all file revisions (except
                deletes) and tracks which is the most recent
              * provides an API for searching the Solr index via
                GET or POST requests
              * provides a minimal web UI for searching via a web
                browser
jetty/Tomcat/etc. web application server - hosts both the Solr
              indexer application and the P4Search application

Operational theory
---
Initialization: P4Search will walk through any depots it can,
produce lists of files to index for a queue, and pull things off
of the queue for indexing.  The service can be configured to scan
all versions or just the head revision, skip certain file types,
only index metadata, etc.  How long the scan takes depends on the
size of your installation.  While scanning, the service can still
search whatever it has indexed so far.  Once scanning is complete,
a key is set on the server; it prevents future scan attempts and
allows interrupted scans to be resumed.
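
The scan-then-queue flow above can be sketched roughly as follows.
This is an illustration only, not the P4Search Java code: the
function name, the "scan-done" key name, and the use of a plain
dict in place of the keys stored on the Perforce Server are all
assumptions made for the example.

```python
from collections import deque

def run_initial_scan(depot_files, index_file, keys,
                     token_key="scan-done", token_value="true"):
    """Queue every file found in each depot, index them, then mark the scan done.

    depot_files: dict mapping a depot name to its list of file paths (hypothetical input)
    index_file:  callable invoked once per queued file
    keys:        dict standing in for keys stored on the Perforce Server
    """
    # If the completion key is already set, do not scan again.
    if keys.get(token_key) == token_value:
        return 0

    # Produce the list of files to index and put it on a queue.
    queue = deque()
    for files in depot_files.values():
        queue.extend(files)

    # Pull files off the queue one at a time for indexing.
    indexed = 0
    while queue:
        index_file(queue.popleft())
        indexed += 1

    # Record completion so later startups skip the scan.
    keys[token_key] = token_value
    return indexed
```

Calling run_initial_scan a second time with the same keys dict
returns 0 without indexing anything, which mirrors the "do not
rescan" behavior described above.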

Changelist trigger: When the search-queue.sh script is installed
on the Perforce Server, it will send a curl request with the
changelist number to P4Search.  The data is written into a
temporary location for processing later.  Changelist processing
is largely the same as the initialization phase: turn the 
changelist into a list of files and queue them up for later 
processing.
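
As a sketch of "write now, process later", the snippet below
records each incoming changelist number as a small file in a queue
directory and later reads them back in order.  The file naming
scheme (".change" suffix) and both function names are hypothetical;
the real on-disk format used by P4Search may differ.

```python
import os
import tempfile

def enqueue_changelist(queue_dir, changelist):
    """Record a changelist number as a small file for later processing."""
    number = int(changelist)
    path = os.path.join(queue_dir, f"{number}.change")
    with open(path, "w") as handle:
        handle.write(str(number))
    return path

def pending_changelists(queue_dir):
    """Return the queued changelist numbers in ascending order."""
    numbers = []
    for name in os.listdir(queue_dir):
        stem, ext = os.path.splitext(name)
        if ext == ".change" and stem.isdigit():
            numbers.append(int(stem))
    return sorted(numbers)

# Usage: changelists may arrive out of order but are processed in order.
with tempfile.TemporaryDirectory() as queue_dir:
    for change in (1205, 1203, 1204):
        enqueue_changelist(queue_dir, change)
    queued = pending_changelists(queue_dir)
```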

Indexing: A p4 print of the file is performed and the output is
sent to the Apache Solr indexing service, which extracts whatever
data it can and indexes it.  The P4Search-specific metadata that
is indexed is as follows (see schema.xml):

   <field name="depotrevision" type="string" indexed="false" stored="true"/>
   <field name="filename" type="text_en" indexed="true" stored="true"/>
   <!-- path broken into parts, so you can efficiently search on e.g. a valued folder name -->
   <field name="depotpath" type="string" indexed="true" stored="true" multiValued="true" />
   <!-- for indexing all revisions, if this doc is the head rev -->
   <field name="headrevision" type="boolean" indexed="true" stored="true" multiValued="false" />
   <field name="modifiedby" type="string" indexed="true" stored="true" />
   <field name="modifiedtime" type="date" indexed="true" stored="true" />
   <field name="filesize" type="long" indexed="true" stored="true" />
   <!-- if you are configured to index attributes, digest comes along for free -->
   <field name="digest" type="string" indexed="true" stored="true" />
   <!-- file attributes go here, p4attr_ + the raw name to avoid potential conflicts -->
   <dynamicField name="p4attr_*" type="text_general" indexed="true" stored="true" />
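
To make the field list more concrete, here is a sketch of how a
document using these field names might be assembled.  The field
names come from the schema above; the splitting rule for depotpath,
the "path#rev" form of depotrevision, and the function names are
assumptions for illustration and may not match what P4Search
actually sends to Solr.

```python
def depot_path_parts(depot_path):
    """Split a depot path such as //depot/main/src/a.c into its components."""
    return [part for part in depot_path.split("/") if part]

def build_document(depot_path, revision, user, size, is_head, attributes=None):
    """Build a dict keyed by the P4Search schema field names."""
    parts = depot_path_parts(depot_path)
    doc = {
        "depotrevision": f"{depot_path}#{revision}",  # assumed format
        "filename": parts[-1],
        "depotpath": parts,        # multiValued: one entry per path component
        "headrevision": is_head,
        "modifiedby": user,
        "filesize": size,
    }
    # File attributes go into the p4attr_* dynamic fields.
    for name, value in (attributes or {}).items():
        doc["p4attr_" + name] = value
    return doc
```

For example, build_document("//depot/main/src/a.c", 3, "alice",
120, True, {"reviewed": "yes"}) yields a depotpath list of depot,
main, src, a.c and a p4attr_reviewed field.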

Search: First P4Search queries Apache Solr for the results, then
it runs p4 files as the supplied user to limit results to what
that user would normally be able to see in the Perforce Server
via any other Perforce client application.
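
The two-step search can be pictured as a filter over the Solr
results.  In this sketch, solr_hits stands for the results returned
by Solr and visible_paths for the depot paths reported by p4 files
when run as the searching user; both inputs and the "path" key are
hypothetical stand-ins.

```python
def filter_visible(solr_hits, visible_paths):
    """Keep only the hits the user may see, preserving Solr's result order."""
    allowed = set(visible_paths)
    return [hit for hit in solr_hits if hit["path"] in allowed]

# Usage: the second hit is dropped because the user cannot see it.
hits = [
    {"path": "//depot/main/a.c", "score": 2.1},
    {"path": "//depot/secret/b.c", "score": 1.7},
    {"path": "//depot/main/c.c", "score": 0.9},
]
visible = ["//depot/main/a.c", "//depot/main/c.c"]
results = filter_visible(hits, visible)
```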

Build and install
---
JDK 1.7 is required to build the project, and gradle (www.gradle.org)
must be in the $PATH.  JDK 1.7, jetty, and Apache Solr are
required to run the service, and the p4 command line client must
be in the path when using install.sh.  The only platform this
process has been tested on is Ubuntu Linux.

$ROOT/search/build.sh - builds and tests the project (war file
                        and tgz package)

Once the project is successfully built, the $ROOT/search/tmp
directory contains both a build output directory ($OUTDIR) and
a tgz containing all of its contents.  At the root of the output
directory is the install script (./install.sh), which will:
* prompt for some basic installation information
* download jetty (8.1) and Solr (4.5.1) if the required
  tarballs are not in the directory
* unpack them and configure both for use by P4Search
* create a start.sh and stop.sh for starting and stopping
  the services

The installation location is $OUTDIR/install by default.  You can
run $OUTDIR/install/start.sh to start both services and stop.sh to
stop them.

To remove the installation, simply run "rm -rf $OUTDIR/install".

The P4Search tool uses a Perforce Server trigger (see
scripts/search-queue.sh or install/search-queue.sh) to receive new
changelists, so the trigger must be installed on the Perforce
Server or no indexing past the initial scan will take place.

Configuration
---
The install script creates a search.config properties file in
the $OUTDIR/install/jetty-.../resources directory.  Configuration
changes currently require the service to be restarted.  Except for
server configuration (serverProtocol, serverHost, serverPort,
indexerUser, indexerPassword), reasonable defaults are assumed
for the other properties:

com.perforce.search...
(general Solr configuration)
searchEngine: URL of the solr installation, e.g. http://localhost:8983
searchEngineToken: magic token matching the Perforce Server trigger
collectionName: solr collection that contains the indexed data

(general processing configuration)
queueLocation: location of queued changelist files to be indexed
maxSearchResults: maximum results returned by the service
maxStreamFileSize: largest file size to attempt to index content
ignoredExtensions: file with a CRLF-separated list of extensions
  for which content processing is skipped
neverProcessList: file with a CRLF-separated list of extensions to
  never index
indexMethod: ALL_REVISIONS | HEAD_REVISIONS; HEAD_REVISIONS means
  to only keep the index up to date with the latest revision
blackFstatRegex: for Perforce attributes, which p4 attr to
  skip (empty means do not index fstat data)
whiteFstatRegex: for Perforce attributes, which p4 attr to
  include (empty means anything not in the blacklist)
changelistCatchupKey: key name where the latest processed changelist
  is located.  On startup the service will try to "catch up" based
  on this value.
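
The difference between ignoredExtensions and neverProcessList can
be sketched as below.  The parsing rules (case-insensitive, an
optional leading dot) and the three mode names are assumptions for
illustration; in particular, treating an ignored extension as
"metadata only" is an interpretation of "skip content processing".

```python
def parse_extension_list(text):
    """Parse a line-separated list of extensions (CRLF or LF endings)."""
    return {
        line.strip().lower().lstrip(".")
        for line in text.splitlines()
        if line.strip()
    }

def extension_of(depot_path):
    """Return the lower-cased extension of the file name, or '' if none."""
    name = depot_path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""

def processing_mode(depot_path, ignored, never):
    """Decide how a file is handled based on the two extension lists."""
    ext = extension_of(depot_path)
    if ext in never:
        return "skip"            # neverProcessList: do not index at all
    if ext in ignored:
        return "metadata-only"   # ignoredExtensions: skip content processing
    return "content"

# Usage with the contents of two hypothetical list files.
ignored = parse_extension_list("mp3\r\n.MP4\r\n\r\n")
never = parse_extension_list("tmp\r\n")
```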

(file scanner config)
fileScannerTokenKey: key name to indicate when the initialization
  is complete, empty implies "do not scan"
fileScannerTokenValue: key value to indicate when the initialization
  is complete, empty implies "do not scan"
scanPaths: CSV paths in the Perforce server to scan
fileScannerThreads: number of threads handling the processing
fileScannerSleep: used to throttle the scanner back
fileScannerDepth: how many revisions down to go when scanning; 0
  implies all revisions, 1 is head only, etc.
maxScanQueueSize: used to throttle the number of files in
  the scan queue
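
The fileScannerDepth rule (0 for all revisions, 1 for head only,
and so on) can be expressed as a small function.  This is a sketch
of the rule as described above, not the scanner's actual code; the
function name is made up for the example.

```python
def revisions_to_scan(head_revision, depth):
    """Return the revision numbers to scan, newest first.

    depth 0 means every revision; depth N means the N most recent ones.
    """
    if depth < 0:
        raise ValueError("depth must be 0 or greater")
    count = head_revision if depth == 0 else min(depth, head_revision)
    return list(range(head_revision, head_revision - count, -1))
```

For a file whose head revision is 5, a depth of 0 selects revisions
5 through 1, a depth of 1 selects only revision 5, and a depth of 2
selects revisions 5 and 4.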

(GUI config)
commonsURL, swarmURL, p4webURL: URLs of services that can show
  the files via links in the web browser.  Swarm and P4Web
  settings are mutually exclusive, with preference given to the
  Swarm URL.

API
---
See the API documentation included in the installation for how
to gain programmatic access to search results and how to send
specific queries to the search service.  The P4Search web UI
exercises the underlying HTTP API using JavaScript XHR.

Notes
---
* If you want to restrict the Apache Solr access to certain IP
addresses, you must add a handler to the etc/jetty.xml file in 
the Solr installation, e.g. 
org.eclipse.jetty.server.handler.IPAccessHandler.  See the
install.sh script for a hint on how to do this.

* If you suspect that the service has missed some changelists,
one easy way to fix it is to script sending a number of curl
requests to the server to force it to re-index files in those
changelists.  Re-indexing files will not corrupt the integrity
of the index; at worst it will simply duplicate the work.

* While Apache Solr has the ability to parse many types of files,
you may find it useful to look through the Solr logs and
determine if you need additional extractors, e.g. xmpcore.jar for
media files.

* This version installs with Apache Solr 4.5.1 by default.  When
testing against mp3 files I found that some mp3 files cause
the Apache Tika parser in this version to hang the CPU.  If
you have similar problems or expect to index a lot of media
files, you might consider using an earlier version of Solr (Solr
4.3.1 used Apache Tika 1.3, which worked for me), replacing the
Tika jar with version 1.5 (unreleased), or using the
...ignoredExtensions configuration property to exclude
problematic files.

* If you're curious about how the initial scan is doing or about
other things the server is doing, the easiest way to check is to
tail the log file, e.g. run
"tail -f start.log"
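
For the note above about re-sending missed changelists, a script
could start by generating one request URL per changelist in the
missed range.  The base URL and the "change" and "token" parameter
names below are placeholders, not the real P4Search API; check
search-queue.sh for the request the trigger actually sends.

```python
from urllib.parse import urlencode

def reindex_urls(base_url, first_change, last_change, token):
    """Build one request URL per changelist in the inclusive range."""
    urls = []
    for change in range(first_change, last_change + 1):
        # Hypothetical parameter names; the real trigger may use others.
        query = urlencode({"change": change, "token": token})
        urls.append(f"{base_url}?{query}")
    return urls

# Usage: print the URLs so they can be fed to curl one at a time.
for url in reindex_urls("http://localhost:8088/queue", 100, 102, "secret"):
    print(url)
```

Because re-indexing only duplicates work, it is safe to pick a
range slightly wider than the changelists you believe were missed.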