Full text search with Sphinx

  1. Sphinx installation
    1. Get the source

        wget http://www.sphinxsearch.com/downloads/sphinx-0.9.7.tar.gz
        tar xvzf sphinx-0.9.7.tar.gz
        

    2. Compile & install

        cd sphinx-0.9.7
        ./configure --prefix="$HOME/sphinx"
        make
        make install
        

    3. Add to PATH

      Open your .bash_profile file and add this snippet:
      
        if [ -d ~/sphinx/bin ]; then
          PATH=~/sphinx/bin:"${PATH}"
        fi
        
      Note: reload the shell (for example, log out and back in, or run source ~/.bash_profile) to apply the new setting.

  2. Sphinx configuration

    Configure Sphinx to search coffee houses by their name and by company name.

    Create a Sphinx configuration file named sphinx.conf in the coffeehouse/config directory.

    Caution:
    • You need to define an SQL query that fetches all the data that goes into the full text index. The first column returned by this query must be a unique numeric identifier.
    • You need to specify the port number on which your search daemon will listen. The port must not be used by anyone else. Suggested port numbering scheme for the shared student host: 20000 + user id (run id -u to print your numeric user id).
    • You need to specify paths to your index and log files. Make sure they point to locations in your home directory.
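
The port scheme and the identifier requirement from the cautions above can be checked before the first indexing run. The following is a minimal Ruby sketch, not part of Sphinx or acts_as_sphinx. The helper names suggested_port and valid_document_ids? are made up for illustration, and the 32-bit limit is taken from the comments in the sample configuration, assuming ids are treated as unsigned.

```ruby
# Hypothetical helpers, not part of Sphinx or acts_as_sphinx.

# Suggested searchd port for the shared host: 20000 + numeric user id.
def suggested_port(uid = Process.uid)
  port = 20_000 + uid
  # TCP port numbers end at 65535, so very large user ids need another scheme
  raise ArgumentError, "port #{port} is out of range" if port > 65_535
  port
end

# Largest value that fits into 32 bits (assuming unsigned document ids).
MAX_DOCUMENT_ID = 2**32 - 1

# True when every id is a positive integer that fits into 32 bits
# and no id occurs more than once.
def valid_document_ids?(ids)
  in_range = ids.all? { |id| id.is_a?(Integer) && id > 0 && id <= MAX_DOCUMENT_ID }
  in_range && ids.uniq.size == ids.size
end

puts suggested_port(480)              # a user with id 480 would use port 20480
puts valid_document_ids?([1, 2, 3])   # unique positive ids pass
puts valid_document_ids?([1, 2, 2])   # a duplicate id fails
```

In practice the ids would come from the first column of the sql_query result, and the port printed for your own user id would go into the searchd section.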
    
    #
    # Sphinx configuration file sample
    #
    
    #############################################################################
    ## data source definition
    #############################################################################
    
    source coffee_houses
    {
    	# data source type
    	# for now, known types are 'mysql', 'pgsql' and 'xmlpipe'
    	# MUST be defined
    	type				= mysql
    
    	#####################################################################
    
    	# some straightforward parameters for 'mysql' source type
    	sql_host			= wierzba.wzks.uj.edu.pl
    	sql_user			= agnessa
    	sql_pass			= nndlhwra
    	sql_db				= baza_agnessa
    	sql_port			= 3306	# optional, default is 3306
    
    	# pre-query, executed before the main fetch query
    	# useful eg. to setup encoding or mark records
    	# optional, default is empty
    	#
    	# sql_query_pre		= SET CHARACTER_SET_RESULTS=cp1251
    	sql_query_pre		= SET CHARACTER_SET_RESULTS=latin2
    
    	# main document fetch query
    	#
    	# you can specify up to 32 (formally SPH_MAX_FIELDS in sphinx.h) fields;
    	# all of the fields which are not document_id or attributes (see below)
    	# will be full-text indexed
    	#
    	# document_id MUST be the very first field
    	# document_id MUST be positive (non-zero, non-negative)
    	# document_id MUST fit into 32 bits
    	# document_id MUST be unique
    	#
    	# mandatory
    	sql_query			= \
    		SELECT coffee_houses.id, coffee_houses.name,  companies.name \
    		FROM coffee_houses LEFT JOIN companies ON (coffee_houses.company_id=companies.id)
    }
    
    #############################################################################
    ## index definition
    #############################################################################
    
    index coffee_houses
    {
    	# which document source to index
    	# at least one MUST be defined
    	#
    	# multiple sources MAY be specified; to do so, just add more
    	# "source = NAME" lines. in this case, ALL the document IDs
    	# in ALL the specified sources MUST be unique
    	source			= coffee_houses
    
    	# this is path and index file name without extension
    	#
    	# indexer will append different extensions to this path to
    	# generate names for both permanent and temporary index files
    	#
    	# .tmp* files are temporary and can be safely removed
    	# if indexer fails to remove them automatically
    	#
    	# .sp* files are fulltext index data files. specifically,
    	# .spa contains attribute values attached to each document id
    	# .spd contains doclists and hitlists
    	# .sph contains index header (schema and other settings)
    	# .spi contains wordlists
    	#
    	# MUST be defined
    	path			= /home/epi/login/sphinx/var/data/coffee_houses
    }
    
    #############################################################################
    ## indexer settings
    #############################################################################
    
    indexer
    {
    	# memory limit
    	#
    	# may be specified in bytes (no postfix), kilobytes (mem_limit=1000K)
    	# or megabytes (mem_limit=10M)
    	#
    	# will grow if set unacceptably low
    	# will warn if set too low and potentially hurting the performance
    	#
    	# optional, default is 32M
    	mem_limit			= 16M
    }
    
    #############################################################################
    ## searchd settings
    #############################################################################
    
    searchd
    {
    	# port on which search daemon will listen
    	port				= 10480
    
    
    	# log file
    	# searchd run info is logged here
    	log					= /home/epi/login/sphinx/var/log/searchd.log
    
    
    	# query log file
    	# all the search queries are logged here
    	query_log			= /home/epi/login/sphinx/var/log/query.log
    
    
    	# client read timeout, seconds
    	read_timeout		= 5
    
    
    	# maximum amount of children to fork
    	# useful to control server load
    	max_children		= 3
    
    
    	# a file which will contain searchd process ID
    	# used for different external automation scripts
    	# MUST be present
    	pid_file			= /home/epi/login/sphinx/var/log/searchd.pid
    
    
    	# maximum amount of matches this daemon would ever retrieve
    	# from each index and serve to client
    	#
    	# this parameter affects per-client memory and CPU usage
    	# (16+ bytes per match) in match sorting phase; so blindly raising
    	# it to 1 million is definitely NOT recommended
    	#
    	# starting from 0.9.7, it can be decreased on the fly through
    	# the corresponding API call; increasing is prohibited to protect
    	# against malicious and/or malformed requests
    	#
    	# default is 1000 (just like with Google)
    	max_matches			= 1000
    }
    
    # --eof--
    
    The listing above is a sample sphinx.conf file.

  3. Install acts_as_sphinx

      ./script/plugin install http://svn.datanoise.com/acts_as_sphinx
      

  4. Run the indexer & search daemon

    How to index your data:

    
    rake sphinx:index
    
    This will index the data as specified in sphinx.conf.

    How to start the search daemon:

    rake sphinx:start
    This will start a search daemon on the port specified in sphinx.conf.

    How to stop the search daemon:

     rake sphinx:stop
    Alternatively, stop it with a plain kill (not kill -9), so that the daemon can shut down cleanly.

    How to reindex data:

     rake sphinx:rotate
    Note: use this task while the daemon is running; otherwise use sphinx:index.
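
The choice between sphinx:rotate and sphinx:index can be automated by looking at the pid file that searchd writes. This is a minimal sketch under that assumption. The helpers searchd_running? and reindex_task are hypothetical and not provided by acts_as_sphinx, and the pid file path is the one from the sample configuration.

```ruby
# Hypothetical helpers, not part of acts_as_sphinx: pick the rake task
# to run depending on whether searchd appears to be running.

# True when pid_file exists and names a process that is still alive.
def searchd_running?(pid_file)
  return false unless File.exist?(pid_file)
  pid = File.read(pid_file).to_i
  return false if pid <= 0
  Process.kill(0, pid)   # signal 0 only checks that the process exists
  true
rescue Errno::ESRCH      # no such process: the pid file is stale
  false
rescue Errno::EPERM      # the process exists but belongs to another user
  true
end

# Name of the rake task that matches the current daemon state.
def reindex_task(pid_file)
  searchd_running?(pid_file) ? 'sphinx:rotate' : 'sphinx:index'
end

# Path taken from the sample sphinx.conf; use the pid_file value from your own file.
pid_file = '/home/epi/login/sphinx/var/log/searchd.pid'
puts "rake #{reindex_task(pid_file)}"
```

A stale pid file left behind by a crashed daemon is treated as "not running", which is why the sketch probes the process instead of only testing that the file exists.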

  5. Use sphinx in your application
    1. Edit and test the model

      Add an acts_as_sphinx call to the CoffeeHouse model (app/models/coffee_house.rb):

      
      class CoffeeHouse < ActiveRecord::Base
        acts_as_sphinx :host => '127.0.0.1', :port => the_port_your_searchd_is_running_on, :index => 'coffee_houses'
        [...]
      end
      
      Now try out the search from the command line (use a query that is likely to return results from your data set):
      
      script/console
      >CoffeeHouse.find_with_sphinx('starbucks')
      
      Note: in case script/console is not working, the same effect can be obtained with:
      
      irb
      >load 'config/environment.rb'
      
      If irb does not start, it may be installed under a different name. A common case is that there is no executable called irb but there is one called irb1.8. You can define an alias in .bash_profile to work around this:
      alias irb='irb1.8'

    2. Full text search action in the controller

      Add a new action in app/controllers/coffee_house_controller.rb:

      
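        def full_text_search
          # Request parameters arrive as strings; default to the first page
          @page = (params[:page] || 1).to_i
          # An absent query parameter becomes an empty search string
          @query = params[:query] || ''
          # Ask Sphinx for one page of matches
          @coffee_houses = CoffeeHouse.find_with_sphinx(@query,
            :sphinx => {:limit => PER_PAGE, :page => @page})
          # Build the paginator from the total reported with the search results
          @coffee_house_pages = pages_for @coffee_houses.total, :page => @page
          # Render only the list partial; the Ajax form swaps it into the page
          render :partial => 'list'
        end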
        

      Note: make sure the search action (not full_text_search) looks the same as in "Ajax in Practice".

      Add pagination helpers to app/controllers/application.rb:

      
        PER_PAGE = 10 unless defined? PER_PAGE
      
        def pages_for(size, options = {})
          default_options = {:per_page => PER_PAGE}
          options = default_options.merge(options)
          Paginator.new self, size, options[:per_page], (options[:page] || 1)
        end
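
The arithmetic behind paging through PER_PAGE results at a time can be sketched in plain Ruby. This only illustrates the calculation and is not the Paginator class itself. The method names page_count and first_item_index are invented for this example.

```ruby
# Illustration of the page arithmetic only; this is not the Paginator class.
PER_PAGE = 10 unless defined? PER_PAGE

# Number of pages needed to show total_items (integer ceiling division).
def page_count(total_items, per_page = PER_PAGE)
  (total_items + per_page - 1) / per_page
end

# Zero-based index of the first item shown on the given 1-based page.
def first_item_index(page, per_page = PER_PAGE)
  (page - 1) * per_page
end

puts page_count(25)        # 25 matches at 10 per page need 3 pages
puts first_item_index(3)   # page 3 starts at item index 20
```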
        

    3. Search box in the view

      Add a new search box in app/views/coffee_houses/list.rhtml:

      
        <div class="search-box">
        <% form_remote_tag :url => {:action => 'full_text_search'}, :update => 'coffee_houses',
        :loading => "Element.show('fts-loader')", :complete => "Element.hide('fts-loader')" do %>
          <%= text_field_tag 'query' %>
          <%= submit_tag 'Full text search' %>
        <% end %>
        <%= image_tag 'ajax-loader.gif', :id => 'fts-loader', :style => 'display:none' %>
        </div>
        
      Note: image_tag looks for ajax-loader.gif in public/images, so place the image there.

      Adjust the stylesheet:

      
        .search-box{
          width: 300px;
          padding: 10px;
        }
        

      Note: make sure _list.rhtml looks the same as in "Ajax in Practice".

    4. Index updates

      Schedule a cron job to run the rotate task (rake sphinx:rotate) at a frequency suitable for your site.

  6. References