
Way to identify Scanned PDFs (Images) vs. PDFs with content

Question asked by TerriWardEDJ on Aug 25, 2015
Latest reply on Aug 26, 2015 by Beeks

I have noticed two different types of PDFs:

 

Type 1:

PDFs that were scanned by a scanner that does not use OCR, which means that they are basically image files.

Some of them do have some metadata, such as a scanner brand tag which is viewable on the back end of the document using the multiview tool.

The policy engines can't scan the content in these since it is image-based, not text-based.

 

Type 2:

PDFs that are not image files: they are composed as text documents and saved as .pdf.  These files contain text content, and the metadata is fully searchable by the CA DLP policy engines.
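
To make the difference between the two types concrete, here is a rough Python sketch of a check that could be run outside the policy engines. It is not a CA Data Protection feature, and the function names `searchable_bytes` and `looks_image_only` are hypothetical. It rests on an assumption: a Type 1 file usually contains image objects but no font resources, while a Type 2 file (or a scan that did go through OCR) references at least one font. It only looks at raw bytes and Flate-compressed streams, so encrypted or unusually encoded files may be misclassified.

```python
import re
import zlib

# Stream bodies sit between "stream" and "endstream"; skip the "stream" in "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
FONT_RE = re.compile(rb"/Font\b")              # font resource or /Type /Font
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")  # image XObject


def searchable_bytes(pdf_bytes: bytes) -> bytes:
    """Return the raw file bytes plus every stream that inflates with zlib.

    Newer PDFs can keep their dictionaries inside compressed object
    streams, so the raw bytes alone may hide font references.
    """
    parts = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            parts.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (for example JPEG image bytes), skip it
    return b"\n".join(parts)


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic: True when the file has image objects but no font references."""
    data = searchable_bytes(pdf_bytes)
    has_font = FONT_RE.search(data) is not None
    has_image = IMAGE_RE.search(data) is not None
    return has_image and not has_font
```

A possible use is `looks_image_only(open(path, "rb").read())` over a folder of captured files to build a list of candidates for review. Because large scans are inflated in memory, this is better suited to spot checks than to bulk processing.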

 

My understanding is that CA Data Protection's policy engines have no OCR (optical character recognition) whatsoever.  So any files that are images or scanned do not get analyzed by policy.

For instance, a scanned document with Social Security numbers in it would not be recognized if the scanner did not use OCR to write the metadata to the back end of the document.

 

We would like some way to identify the scanned .pdf files that do not have much metadata embedded in them (because they are scanned files) and silently monitor them.

I'm pretty skilled with policy and have read the policy guide ad nauseam, but I don't know of a way to do this.  I see that I can make a policy that would trigger on the file size, but that doesn't really help since some of these are image files that can be quite large.

 

I am sure someone else has run into this before, so any ideas would be appreciated.
