Explanation of the file content that gets scanned by the data control scanner
This article refers to the Sophos data leakage prevention (DLP) functionality. It describes what content is scanned when the data control scanner detects each of the following file types.
| File type | Content that is scanned | Metadata that is scanned |
|---|---|---|
| Adobe PDF | All strings in the page content streams, in page order. | Document title, author, subject, keywords, creator |
| RTF | Text destined for output, including text for the "footnote", "endnote", "header" and "footer" destinations. | Document title, author and company |
| Microsoft PowerPoint 2000/2003/2007/2010 | Slide titles, slide body text, text within slide text boxes | Document title, author, company, custom metadata (PowerPoint 2003 and 2007) |
| Microsoft Excel 2000/2003/2007/2010 | Names of worksheets, contents of cells containing text, contents of cells containing numeric data. | Document title, author, company, custom metadata (Excel 2003 and 2007) |
| OpenOffice Calc / LibreOffice Calc | Names of worksheets, contents of cells containing text, contents of cells containing numeric data. | Document title, author, company, custom metadata |
| Microsoft Word 2000/2003/2007/2010 | Body text, text in headers & footers, text within text boxes, text within table cells, non-visible comments. | Document title, author, company, custom metadata (Word 2003 and 2007) |
| OpenOffice Impress / LibreOffice Impress | Body text, text in headers & footers, text within text boxes, non-visible comments. | Document title, author, company, custom metadata |
| OpenOffice Writer / LibreOffice Writer | Body text, text in headers & footers, text within text boxes, text within table cells, non-visible comments. | Document title, author, company, custom metadata |
| Microsoft Project 2000/2003/2007/2010 | Task names, notes, resources | Document title, original author, last author, company, custom metadata |
| Microsoft Visio 2000/2003/2007/2010 | User-entered text associated with all diagram "shapes", comments applied to the document, page names. | Document title, author and company |
| Microsoft Outlook .MSG | Body text | ‘Subject’ line |
| Microsoft Outlook Express .EML | Body text | ‘Subject’ line |
| HTML | Text content of the HTML file with all the tags removed | Title, author, company, custom metadata |
| Other file formats that appear to contain text | Any plain text string longer than five text characters. | Document title |
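The last two rows of the table can be illustrated with a short Python sketch. It is not the data control scanner's actual implementation. The names `strip_tags`, `text_runs` and `MIN_RUN` are hypothetical, "plain text" is assumed to mean printable ASCII, and "longer than five" is read as six or more characters.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumption: "longer than five characters" means six or more.
MIN_RUN = 6


class _TextOnly(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_tags(html: str) -> str:
    """Return the text of an HTML document with all tags removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def text_runs(data: bytes, min_run: int = MIN_RUN) -> List[str]:
    """Return runs of printable ASCII that are at least min_run long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_run)
    return [run.decode("ascii") for run in pattern.findall(data)]
```

For example, `strip_tags("<p>Card: <b>4111</b></p>")` yields `Card: 4111`, and `text_runs(b"\x00secret-data\x00hello\x00")` keeps `secret-data` but drops the five-character `hello`.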
If you need more information or guidance, then please contact technical support.
- Article ID: 64383
- Created: 9 Oct 2009
- Last updated: 14 Feb 2012


