Administration Guides

How to OCR Image Data with Search & Recover


Overview:

This guide walks through bulk OCR of image data using the Search & Recover command builder, and then indexing the results for searching.  The solution depends on an open source OCR library that can be installed on the Search & Recover appliance.


Requirements:

Install the OCR libraries on ECA node 1 as follows:

  1. ssh to node 1 as ecaadmin.
  2. sudo -s (enter the ecaadmin password).
  3. zypper install tesseract-ocr  (requires an Internet connection).
  4. Alternatively, download the package manually from https://software.opensuse.org/download.html?project=Publishing&package=tesseract-ocr  .
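
After the install, it can help to confirm that the tesseract binary is on the PATH before building any scripts. The following is a minimal sketch; the helper name have_cmd is hypothetical and not part of Search & Recover.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd tesseract; then
    # Print the installed version and the language data that is available.
    tesseract --version 2>&1 | head -n 1
    tesseract --list-langs
else
    echo "tesseract not found; install the tesseract-ocr package first"
fi
```

If no languages are listed, the language data package may also need to be installed; the package name depends on the repository in use.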

Summary:

  • This example shows how the command builder can generate a script file that automates OCR of images using the open source Tesseract OCR library installed on the Search & Recover appliance.
  • For larger quantities of images, professional services should be purchased to assist with scripting a parallel, multi-threaded solution for processing image data. The example in this guide is a quick start showing how easily OCR solutions can be built with Search & Recover.
  • Once the text files have been content ingested, search results return the image file name with a .txt extension, which allows navigation to the folder containing the image.

Search for OCR Input Data for Processing with Command Builder:

  1. Using the Search & Recover GUI, locate the OCR data by using any type of search to list the files.  It is common to store all OCR scanned data under a single path.  This example assumes this is the case.
  2. Using the File Path option, enter the path to the OCR data (e.g. /ifs/data/dfsdata/search/ocr).
  3. In the Extension input box, add the image file extensions as a space-separated list (e.g. tif jpg), and select the check box for files only.
  4. This search locates all image files under the path entered and lists only those files in the results.
  5. Using the command builder icon, generate the file list and enter the OCR command "tesseract" into the first dialog box.
  6. Excel is an easy tool for modifying the script file to specify the output file name and path.  Open the file in Excel and import it as CSV using space as the separator.  NOTE: you may need to fix file names that contain spaces in the path or file name.
    1. Copy the column B file paths to column C, so that each row contains the command followed by the same file path twice (input file, then output file).  Save the file as a .sh text file.
    2. You may need to save as CSV and then use a text editor to search and replace the commas with spaces.  You can also remove the comments at the top of the file.
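
As an alternative to editing the file in a spreadsheet, the same script lines could be generated in the shell. This is a sketch only: it assumes a plain list of image paths, one per line, and the names make_ocr_script and filelist.txt are hypothetical. The exact format of the command builder output is not shown in this guide, so the list may need to be extracted from it first.

```shell
#!/bin/sh
# make_ocr_script (hypothetical helper): read image paths on stdin, one per
# line, and print one tesseract command per path.
# tesseract appends ".txt" to the output base name, so passing the image
# path as the output base produces <image>.txt next to the image.
make_ocr_script() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue    # skip blank lines
        # Double quotes keep paths with spaces intact. Paths containing
        # ", $ or backticks would need additional escaping.
        printf 'tesseract "%s" "%s"\n' "$path" "$path"
    done
}

# Example usage (filelist.txt is an assumed file name):
#   make_ocr_script < filelist.txt > ocr.sh
```

Quoting each path avoids the manual fix-up for file names that contain spaces.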
  7. Create NFS mount on ECA node 1 for image processing

  8. An NFS mount of the cluster is needed to allow Search & Recover to OCR the images and create a .txt version of each file.  This mount requires root mount options, so an NFS export must be created on /ifs.
  9. Configure the NFS export on path /ifs/ with the Search & Recover node as a root client.
  10. Now create the mount point on Node 1 of the Search & Recover appliance.
  11. ssh to node 1 as ecaadmin.
  12. sudo -s (enter the ecaadmin password to become the root user).
  13. mkdir -p /ifs .
  14. Mount the cluster with this command (NOTE: add an entry to /etc/fstab for a persistent mount that survives reboots):
    1. mount 172.31.1.104:/ifs /ifs  (NOTE: use the SmartConnect name rather than the IP address used in this example)
  15. Verify by typing "mount".
  16. Verify with ls /ifs to make sure you see files and directories returned.
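
The mount check above can also be scripted. The sketch below assumes a Linux appliance where /proc/mounts is available; the helper name is_mounted and the host name cluster.example.com are hypothetical, the latter standing in for the SmartConnect name. The commented fstab line is an illustration of a persistent entry, not a tested configuration.

```shell
#!/bin/sh
# Illustrative /etc/fstab entry for a persistent mount (host name is assumed):
#   cluster.example.com:/ifs  /ifs  nfs  defaults  0 0

# is_mounted (hypothetical helper): succeeds if the directory in $1 appears
# as a mount point in the mounts table ($2, defaulting to /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}" 2>/dev/null
}

if is_mounted /ifs; then
    echo "/ifs is mounted"
else
    echo "/ifs is not mounted"
fi
```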
  17. Copy Batch Script to ECA Node for processing Image Data

  18. Copy the script file to Search & Recover node 1 with scp or the WinSCP tool, using the ecaadmin user.  The file will be copied to /home/ecaadmin.
  19. Example with scp from the command line:    scp ocr.sh ecaadmin@172.31.1.125:/home/ecaadmin/ocr.sh  .
  20. Change permissions:
    1. chmod 777 /home/ecaadmin/ocr.sh 
  21. Execute the OCR conversion (NOTE: This can take a long time to complete, potentially hours).
  22. cd /home/ecaadmin/ .
  23. ./ocr.sh &> results.txt .
  24. Monitor progress with this command:
    1. tail -f  /home/ecaadmin/results.txt .
  25. Once your script finishes, each image will have a matching file with the same file name and .txt appended.  The .txt file contains the text extracted from the image file.
  26. Search & Recover incremental indexing will detect the new .txt files and index their content, if content ingestion is enabled on the OCR path.
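
Before relying on the index, it may be worth checking that every image actually produced a .txt file, since a failed conversion is easy to miss in a long results.txt. The following is a minimal sketch; the helper name list_missing_ocr is hypothetical, and the find command in the usage comment reuses the example path and extensions from this guide.

```shell
#!/bin/sh
# list_missing_ocr (hypothetical helper): read image paths on stdin, one per
# line, and print each path that has no matching <image>.txt file.
list_missing_ocr() {
    while IFS= read -r img; do
        [ -n "$img" ] || continue              # skip blank lines
        [ -f "$img.txt" ] || printf '%s\n' "$img"
    done
}

# Example usage with the example path from this guide:
#   find /ifs/data/dfsdata/search/ocr -type f \
#       \( -name '*.tif' -o -name '*.jpg' \) | list_missing_ocr
```

An empty result means every listed image has a text file. Any paths printed can be re-run through tesseract individually.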




© Superna Inc