A partial archive of https://discourse-mediawiki.wmflabs.org as of Saturday May 21, 2022.

Cirrus Search not working

amitkr

Hi,

I have installed CirrusSearch in MediaWiki, but it does not find words that appear inside uploaded documents.

Could someone review my setup and advise whether I missed any steps?

Please note that the uploaded documents include MS Word, PowerPoint, PDF, Excel, MSG (Outlook email), and TXT files.

I followed the steps below:

Installed MediaWiki successfully.

Installed Elastica inside the extensions folder.

Installed CirrusSearch inside the extensions folder.

After that, I performed the steps described in the README.txt file.

--------------------------------------------------- Instructions in README.TXT file ----------------------------

All elastic versions prior to 5.3.1 have bugs that affect CirrusSearch:

  • elastic versions before 5.3.x requires the following config in your LocalSettings.php:

    $wgCirrusSearchElasticQuirks = [ 'query_string_max_determinized_states' => true ];

  • elastic versions before 5.3.1 suffer from a bug that prevent an index to be reindexed

    properly without missing docs when using multiple elasticsearch machines

  • when using elastic prior to 5.5.2 with the extra plugin and the super_detect_noop script

    you must activate the “super_detect_noop_enable_native” option (see docs/settings.txt)

Place the CirrusSearch extension in your extensions directory.

Make sure you have the curl php library installed (sudo apt-get install php5-curl in Debian.)

You also need to install the Elastica MediaWiki extension.

Add this to LocalSettings.php:

wfLoadExtension( 'Elastica' );

require_once( "$IP/extensions/CirrusSearch/CirrusSearch.php" );

$wgDisableSearchUpdate = true;

Configure your search servers in LocalSettings.php if you aren’t running Elasticsearch on localhost:

$wgCirrusSearchServers = [ 'elasticsearch0', 'elasticsearch1', 'elasticsearch2', 'elasticsearch3' ];

There are other $wgCirrusSearch variables that you might want to change from their defaults.

Now run this script to generate your elasticsearch index:

php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php

Now remove $wgDisableSearchUpdate = true from LocalSettings.php. Updates should start heading to Elasticsearch.

Next bootstrap the search index by running:

php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip

php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse

Note that this can take some time. For large wikis read “Bootstrapping large wikis” below.

Once that is complete add this to LocalSettings.php to funnel queries to ElasticSearch:

$wgSearchType = 'CirrusSearch';


Command-line output:

Every step completed successfully.

E:\dnd\iriwikipedia\extensions\CirrusSearch\maintenance>php updateSearchIndexConfig.php
content index…
Fetching Elasticsearch version…5.6.6…ok
Scanning available plugins…none
Inferring index identifier…my_wikipedia_content_first
Picking analyzer…english
Validating number of shards…ok
Validating replica range…ok
Validating shard allocation settings…done
Validating max shards per node…ok
Validating analyzers…ok
Validating mappings…
Validating mapping…ok
Validating aliases…
Validating my_wikipedia_content alias…ok
Validating my_wikipedia alias…ok
Updating tracking indexes…done
general index…
Fetching Elasticsearch version…5.6.6…ok
Scanning available plugins…none
Inferring index identifier…my_wikipedia_general_first
Picking analyzer…english
Validating number of shards…ok
Validating replica range…ok
Validating shard allocation settings…done
Validating max shards per node…ok
Validating analyzers…ok
Validating mappings…
Validating mapping…ok
Validating aliases…
Validating my_wikipedia_general alias…ok
Validating my_wikipedia alias…ok
Updating tracking indexes…done
Deleting namespaces…done
Indexing namespaces…done

E:\dnd\iriwikipedia\extensions\CirrusSearch\maintenance>php forceSearchIndex.php --skipLinks --indexOnSkip
[ my_wikipedia] Indexed 9 pages ending at 12 at 11/second
[ my_wikipedia] Indexed 8 pages ending at 22 at 11/second
[ my_wikipedia] Indexed 10 pages ending at 32 at 12/second
[ my_wikipedia] Indexed 9 pages ending at 42 at 12/second
[ my_wikipedia] Indexed 9 pages ending at 52 at 12/second
[ my_wikipedia] Indexed 10 pages ending at 62 at 12/second
[ my_wikipedia] Indexed 10 pages ending at 72 at 13/second
[ my_wikipedia] Indexed 7 pages ending at 79 at 13/second
Indexed a total of 72 pages at 13/second

E:\dnd\iriwikipedia\extensions\CirrusSearch\maintenance>php forceSearchIndex.php --skipParse
[ my_wikipedia] Indexed 45 pages ending at 52 at 137/second
[ my_wikipedia] Indexed 27 pages ending at 79 at 154/second
Indexed a total of 72 pages at 154/second

Thanks - Amit

nomoa

CirrusSearch will index the content of uploaded files only if they are properly handled by a media handler extension (https://www.mediawiki.org/wiki/Category:Media_handling_extensions).
As far as I know, only PDF files are properly handled this way, thanks to the PdfHandler extension.

With CirrusSearch you should only be able to search wiki page text (and PDFs, if you installed the PdfHandler extension).
If you are not finding text from uploaded documents, that is expected, since you don't have a media handler extension that supports these file formats.
If you are not finding normal wiki pages (with wikitext), that is unexpected and there is probably a problem in your setup. The best way to help us help you would be to give us more context:

  • any logs you could find (mediawiki logs, elasticsearch logs)
  • the output of curl -s localhost:9200/_cat/indices

Or anything else that you think might help us.
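
If it helps with the second item, here is a minimal Python sketch that collects the per-index document counts. It assumes Elasticsearch is reachable on localhost:9200, as in the curl command above, and it uses the `h=index,docs.count` column selector of the `_cat/indices` API. The function names are hypothetical helpers and are not part of CirrusSearch.

```python
import urllib.request


def parse_cat_indices(text):
    """Parse `_cat/indices?h=index,docs.count` output into {index: doc_count}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, docs = parts
        # Skip rows whose document count is missing or not a plain number
        if docs.isdigit():
            counts[name] = int(docs)
    return counts


def fetch_index_counts(base_url="http://localhost:9200"):
    """Fetch document counts from an Elasticsearch node (assumed to be running)."""
    url = base_url + "/_cat/indices?h=index,docs.count"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_cat_indices(resp.read().decode("utf-8"))


# Example (requires a running Elasticsearch node):
#   for name, docs in sorted(fetch_index_counts().items()):
#       print(name, docs)
```

An index that still reports zero documents after forceSearchIndex.php has run would point at an indexing problem rather than a query problem.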

amitkr

nomoa

Thanks Amit,

the screenshot you uploaded does not seem to contain any errors.

Could you clarify if your problem is:
A: I’m not able to find any files I uploaded to this wiki

or:

B: I’m not able to find anything on this wiki; the search results page is always empty, even if I search for a word that appears in a page (not an uploaded file)

Thanks!

amitkr

If I search for a word in MediaWiki and that word is part of a file name, the search finds it. But I want to search for words that are used inside the uploaded documents, such as PDF or Excel files.

amitkr

Searching for a word that is used inside the uploaded documents is not working.

nomoa

Thanks for the clarification Amit,

CirrusSearch is working as expected then. The content of uploaded files is only indexed if the file type is properly supported by a media handler extension (see my original response).

Sadly, the best you can do is install the PdfHandler extension (be sure to read its debugging section, as it contains important information about a maintenance script to run if the PDF files are already uploaded). This would only cover PDF files; other files, such as Word and Excel documents, are not supported by any MediaWiki media handler extension as far as I know.

amitkr

Thanks for helping me

amitkr

Searching for a word inside an uploaded document is still not working. Can anyone help me with this?

TheDJ

There are no ready-made solutions for this, as far as we know. You’ll likely have to build your own solution.

TheDJ

Specifically, you will need to create a MediaHandler plugin/extension for MediaWiki that can read .docx and .doc files, then implement the getPageText() function in your subclass to return the full text of the document. That will allow CirrusSearch to index the document format.

This is what PdfHandler does for PDF documents.
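
To give an idea of the text-extraction part only, here is a rough sketch in Python rather than PHP, so it is not the handler itself. It assumes a .docx file is a ZIP archive whose main body is stored in word/document.xml, and it does not cover the older binary .doc format. The function name `docx_to_text` is hypothetical. A real handler would do the equivalent in PHP and return the result from getPageText().

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data):
    """Return the plain text of a .docx file given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (w:t) that belong to one paragraph (w:p)
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Example (hypothetical file name):
#   with open("report.docx", "rb") as fh:
#       print(docx_to_text(fh.read()))
```

Headers, footers, and embedded objects live in other parts of the archive and are ignored here, so a production handler would need more than this.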

Konradmd

I have the same problem.
But I installed PdfHandler first, and it seems to work properly, since I get PDF preview images and thumbnails for linked and uploaded PDFs.

So: How can I test or see if the PDF files were indexed correctly - or if it is a problem of the search itself?
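
One possible way to check this, sketched in Python under a few assumptions: that the wiki exposes the CirrusSearch `action=cirrusdump` page action, that it returns a JSON list of hits each carrying a `_source` object, and that extracted file text is stored in a field named `file_text`. The wiki URL, the file title, and both function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def has_file_text(dump):
    """Return True if a cirrusdump result contains non-empty extracted file text."""
    for hit in dump:
        source = hit.get("_source", {})
        # Assumption: `file_text` is the field that holds text extracted from the file
        if (source.get("file_text") or "").strip():
            return True
    return False


def fetch_cirrus_dump(index_php_url, title):
    """Fetch the indexed document for `title` through action=cirrusdump."""
    query = urllib.parse.urlencode({"title": title, "action": "cirrusdump"})
    with urllib.request.urlopen(index_php_url + "?" + query, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a running wiki; URL and title are placeholders):
#   dump = fetch_cirrus_dump("http://localhost/w/index.php", "File:Example.pdf")
#   print(has_file_text(dump))
```

If the indexed document has no file text, the problem is on the indexing side (the handler or the reindex), not in the search query itself.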

Tgr

Probably better to start a new topic than asking in one that has already been marked as resolved.