A partial archive of https://discourse-mediawiki.wmflabs.org as of Saturday May 21, 2022.

Cirrus Search not searching documents (pdf)

Konradmd

I installed PdfHandler and CirrusSearch, but the wiki search only shows results from normal wiki pages; documents (PDF) are not searched (or indexed?).
PdfHandler seems to work properly, since I get PDF preview images and thumbnails for linked and uploaded PDFs.

How can I test or see if the PDF files were indexed correctly - or if it is a problem of the search itself?

nomoa

One way to see what was indexed is to add action=cirrusDump to the page URL. It’ll show you the content that was indexed. It will display a JSON file where the pdf content should appear in the field file_text (note the output may be hard to read if you don’t have a browser extension that pretty prints JSON).
For example, this PDF file on Commons has the PDF text content inside file_text.

If file_text is empty for you, it could mean either that the PDF has no text content available (image or scan?) or that you need to run additional maintenance scripts (see the debugging section of the PdfHandler documentation).

If you see proper text in the file_text field, the problem is most likely at query time: the search is not hitting these pages.
If you search for a word that appears in the pdf do you see results when you select all namespaces?
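
As a rough illustration of the check described above, here is a minimal Python sketch that reads file_text out of a saved cirrusDump response. The function name indexed_file_text is hypothetical, and the assumption that the fields may be nested under a "_source" key (with a fallback to a flat layout) should be verified against the output of your own wiki.

```python
import json


def indexed_file_text(dump_json):
    """Return the indexed file_text of the first document, or None."""
    docs = json.loads(dump_json)
    if isinstance(docs, dict):
        docs = [docs]
    if not docs:
        return None  # "[]" means the page was not indexed at all
    doc = docs[0]
    # Assumption: fields may be nested under "_source"; otherwise read them flat.
    fields = doc.get("_source", doc)
    text = fields.get("file_text")
    # file_text can be false, null or empty when no PDF text was extracted.
    return text if isinstance(text, str) and text.strip() else None
```

Paste the JSON shown by the browser into a string and pass it in; a return value of None means there is no PDF text to search in.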

Konradmd

Adding the action parameter to the URL has no effect (it shows the search results page with no results).
The URL looks like:

index.php?search=Test&title=Spezial:Suche&action=cirrusDump

Tgr

You should add it to the URL of the file page, not the search page.
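
To make the difference concrete, here is a small Python sketch that builds the cirrusDump URL for a file page rather than for the search page. The host wiki.example.org, the file title Datei:Test.pdf and the helper name cirrus_dump_url are placeholders, not values from this thread.

```python
from urllib.parse import urlencode


def cirrus_dump_url(index_php, file_title):
    """Build the URL of a file page with action=cirrusDump appended."""
    # The title is the file description page (e.g. "Datei:Test.pdf"),
    # not Spezial:Suche, and there is no "search" parameter.
    query = urlencode({"title": file_title, "action": "cirrusDump"})
    return f"{index_php}?{query}"


# Hypothetical wiki and file name, for illustration only.
print(cirrus_dump_url("https://wiki.example.org/index.php", "Datei:Test.pdf"))
```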

Konradmd

The JSON result after appending this parameter to a file’s URL is completely empty.

nomoa

If the output is [] after appending ?action=cirrusDump to a file URL, it means the page has not been indexed at all.
Does this happen for newly uploaded files as well, or only for files that were uploaded prior to installing CirrusSearch and/or PdfHandler?

If it happens only for files that were uploaded prior to installing these extensions I’d recommend running PDFHandler maintenance scripts:

  • refreshImageMetadata.php -f
  • rebuildImages.php -f

And CirrusSearch script:

  • extensions/CirrusSearch/maintenance/forceSearchIndex.php
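
The scripts listed above could be chained in that order with a small Python wrapper. This is only a sketch: the wiki root /var/www/wiki, the assumption that the first two scripts live in the wiki's maintenance/ directory, and the helper names are illustrative; the flags are copied from the list as written.

```python
import subprocess
from pathlib import Path


def reindex_commands(wiki_root, php="php"):
    """Return the maintenance commands in the order they should run."""
    root = Path(wiki_root)
    return [
        # Assumption: these two scripts are in <wiki root>/maintenance/.
        [php, str(root / "maintenance" / "refreshImageMetadata.php"), "-f"],
        [php, str(root / "maintenance" / "rebuildImages.php"), "-f"],
        # Path as given in the post, relative to the wiki root.
        [php, str(root / "extensions" / "CirrusSearch" / "maintenance" / "forceSearchIndex.php")],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        runner(cmd, check=True)


# Dry run: print the commands instead of executing them.
run_all(reindex_commands("/var/www/wiki"), runner=lambda cmd, check: print(" ".join(cmd)))
```

Passing the arguments as a list (no shell string) means paths with spaces or backslashes need no extra quoting.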

If even newly created PDF files do not appear in search results and also return [] when appending ?action=cirrusDump then it’s likely a problem in your setup or a bug in the code.

The only thing I can suggest at this point is trying to get more information out of the MediaWiki logs (someone on this channel may have better pointers on how to inspect MW logs) to help narrow down the issue.

One problem that may explain why PDF files are not indexed by CirrusSearch is a known limitation of the JobQueue if your setup uses the database backend to store jobs (see https://phabricator.wikimedia.org/T124196 for more details).

Konradmd

After restarting the Elasticsearch service in the background, ?action=cirrusDump now shows a result.

But even though there is a JSON result now, text="" and file_text=false.

The PDFs are pure text files generated with MS Word … no scans or something similar.

I have run the maintenance scripts more times than I can count …

Here is a dump of the JSON generated with cirrusDump option:

text ""
source_text ""
text_bytes 0
content_model "wikitext"
language "de"
heading []
opening_text null
auxiliary_text []
defaultsort false
file_text false
file_media_type "OFFICE"
file_mime "application/pdf"
file_size 957241
file_width 1239
file_height 1754
file_bits 0
file_resolution 1474
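
A dump like this can be read mechanically. The Python sketch below classifies a document by the fields shown here; the function name diagnose and the wording of the returned labels are made up for illustration.

```python
def diagnose(fields):
    """Classify a cirrusDump document by its file-related fields."""
    if not fields:
        return "not indexed"
    if fields.get("file_mime") != "application/pdf":
        return "not a PDF"
    file_text = fields.get("file_text")
    if isinstance(file_text, str) and file_text.strip():
        return "PDF text indexed"
    # The file is known (mime type, size, dimensions) but has no extracted text.
    return "PDF indexed without text"


# Subset of the fields from the dump in this post.
dump = {
    "text": "",
    "file_text": False,
    "file_media_type": "OFFICE",
    "file_mime": "application/pdf",
    "file_size": 957241,
}
print(diagnose(dump))
```

For the values in this post the file itself is indexed (mime type and size are present), so what is missing is the extracted text, not the search document.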

The PDFHandler seems to work at least a little bit … since it detects the correct file parameter and generates thumbnails.

I’ll try to find something in the MediaWiki logs.

Are there special requirements for the MW version or PHP version? I have a sneaking suspicion that something is wrong deep in LocalFile.php, since I get errors when running refreshImageMetadata.php, such as “The filename, directory name, or volume label syntax is incorrect.” (original German message: “Die Syntax für den Dateinamen, Verzeichnisnamen oder die Datenträgerbezeichnung ist falsch.”)

nomoa

Thanks for continuing to investigate, please let us know your findings.

Also, it might be useful to post the output of refreshImageMetadata.php, as others may be able to help.

Konradmd

In PdfHandler.image.php I changed the command calls like this:

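# Replace Windows backslashes with forward slashes before building the command.
$arg_str = str_replace( "\\", "/", $this->mFilename );
if ( $wgPdfInfo ) {
      # Note: $arg_str is appended as-is, without wfEscapeShellArg().
      $cmd = wfEscapeShellArg( $wgPdfInfo ) .
      " -enc UTF-8 " . # Report metadata as UTF-8 text...
      " -meta " .      # Report XMP metadata
      $arg_str;
      ...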

It turned out that the \ in the filename wasn’t escaped correctly for the pdfinfo and pdftotext calls.
Now the error about the wrong filename is gone and I get the complete metadata, including the PDF’s text.
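
The same idea can be shown outside PHP. The Python sketch below is an analogue of the workaround, not PdfHandler’s actual code: it converts a Windows path to forward slashes and builds the pdfinfo call as an argument list, so no shell escaping is involved. The helper names and the example path are hypothetical; the -enc and -meta options are the ones from the snippet above.

```python
from pathlib import PureWindowsPath


def normalize_path(filename):
    """Convert Windows backslashes to forward slashes."""
    return PureWindowsPath(filename).as_posix()


def pdfinfo_args(pdfinfo, filename):
    """Build the pdfinfo call as a list, so no shell quoting is needed."""
    return [pdfinfo, "-enc", "UTF-8", "-meta", normalize_path(filename)]


# Hypothetical upload path, for illustration only.
print(pdfinfo_args("pdfinfo", r"C:\wiki\images\a\ab\Test.pdf"))
```

A list like this can be handed to subprocess.run directly, which avoids the escaping problem altogether.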

The data seems to be stored correctly in the wiki database: in the table filearchive, in the blob field fa_metadata, the JSON-encoded metadata looks good (including the PDF text).

Question 1: Is this the correct location in the database for the text information that CirrusSearch uses for its search index?

The JSON result for the document with action=cirrusDump is still returned, but with text="".

nomoa

I’m not an expert on MediaHandler, but reading the docs, filearchive seems to store information about media that have been deleted, so I don’t think it’s the table you are looking for.

With action=cirrusDump the PDF text content should appear in file_text; the text field only contains a text version of the page wikitext.

If you see that file_text is null or empty, I’d suggest rerunning forceSearchIndex.php to see if it fixes the issue.

Glad you found the cause of the error in refreshImageMetadata.php.

Tgr

Yeah, for non-deleted images it’s image.img_metadata.
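
For anyone who wants to check that column directly, here is a rough Python sketch of the lookup. It uses an in-memory SQLite table as a stand-in so the example is self-contained; on a real wiki you would connect to the actual database instead (the placeholder syntax differs between drivers, and tables may carry a prefix). It assumes img_name holds the file name without the namespace and with underscores instead of spaces; the helper name metadata_length is hypothetical.

```python
import sqlite3


def metadata_length(conn, file_name):
    """Return the size of img_metadata for a file, or None if there is no row."""
    row = conn.execute(
        "SELECT LENGTH(img_metadata) FROM image WHERE img_name = ?",
        (file_name.replace(" ", "_"),),
    ).fetchone()
    return None if row is None else row[0]


# Stand-in table with one fake row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY, img_metadata BLOB)")
conn.execute("INSERT INTO image VALUES (?, ?)", ("Test_file.pdf", b"placeholder"))
print(metadata_length(conn, "Test file.pdf"))
```

A very small value for a text-heavy PDF would suggest that the text extraction did not run or failed.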

Konradmd

After running the maintenance scripts:
updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier=now
and
forceSearchIndex.php
it works!!!

file_text in the JSON result with action=cirrusDump is filled, and the search results finally contain the PDF files.

Many, many thanks for your help!

In the end, it was the error in PdfHandler.image.php that caused pdfinfo.exe and pdftotext.exe to fail.

Tgr

FWIW it is fixed in current versions of PdfHandler.

Konradmd

After downloading version 1.31 from https://www.mediawiki.org/wiki/Special:ExtensionDistributor/PdfHandler, the error appears again :(

Or which version do you mean by “current version”?

Tgr

Weird, 1.31 has proper escaping. Do you have reproduction steps?