[Archiver-dev] UpLib and archiving

Tue Oct 19 03:56:46 CEST 2010

On Mon, Oct 18, 2010 at 8:13 PM, Bill Janssen <janssen at parc.com> wrote:

>> And that is *very* cool.  How do you handle security issues, i.e. html parts
>> with evil content (javascript) or Content-Disposition filenames that lie about
>> their type?
>
> I don't run Javascript, so evil parts get to stick around and
> potentially do damage in the future.  UpLib has an extensible system of
> data analysis engines called "rippers" which are automatically run on
> each document; if you were concerned about the possibility of lingering
> malware a malware-detector ripper could be added to flag and/or remove
> such content.

Sanitizing JavaScript should be the default behavior.  Unsanitized
script in HTML parts is a major XSS vector, and anyone who uses your
software to serve archives on their own site will open that site
to XSS if this is not done.
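
To make that concrete, here is a rough sketch of a default-on
sanitizing pass, using only the Python standard library.  The names
(ScriptStripper, sanitize_html, DROP_ELEMENTS, URL_ATTRS) are made up
for illustration and are not part of UpLib.  It is a small deny-list,
so treat it as a starting point only; a vetted allow-list sanitizer
would be the safer choice for a public site.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical policy, for illustration only (not UpLib's behavior).
DROP_ELEMENTS = {"script", "iframe", "object"}   # dropped with their content
URL_ATTRS = {"href", "src", "action", "formaction"}


def _is_script_url(value):
    # Browsers ignore whitespace/control characters inside a URL scheme,
    # so remove them before checking for "javascript:".
    compact = re.sub(r"[\x00-\x20]", "", value).lower()
    return compact.startswith("javascript:")


class ScriptStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _render(self, tag, attrs, close=">"):
        kept = [tag]
        for name, value in attrs:
            if name.startswith("on"):          # onclick, onerror, ...
                continue
            if value is None:                  # bare attribute
                kept.append(name)
            elif name in URL_ATTRS and _is_script_url(value):
                continue
            else:
                kept.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(kept) + close

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_ELEMENTS and not self.skip_depth:
            self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in DROP_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on output.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_html(markup):
    parser = ScriptStripper()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

An unclosed dropped element swallows the rest of the part, which
fails closed.  Comments and doctypes are dropped as a side effect,
and style attributes and data: URLs are not handled at all.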

> As for the Content-Disposition filenames: UpLib runs its own
> content-type determiner over the content to try to see what it is rather
> than just relying on the filename, though it will fall back to the
> filename if it can't figure it out.  And I've hardcoded some typical
> situations.

Falling back to filename should be a configurable option, and
it should be disabled by default.

If you are really paranoid about security, you should have
a whitelist of filename extensions that you allow.
At a minimum, have a list of extensions that are
forbidden (e.g. .shtml, .cgi).
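
A minimal sketch of both checks, assuming nothing about UpLib's
internals.  The function name and the contents of both lists are
placeholders I made up; only .shtml and .cgi come from the paragraph
above.  It tests every dot-separated component, not just the last
one, on the assumption that some web server configurations act on
inner extensions too (e.g. "foo.cgi.txt").

```python
# Hypothetical lists for illustration; a real deployment would tune both.
ALLOWED_EXTENSIONS = {".txt", ".pdf", ".png", ".jpg", ".jpeg", ".gif"}
FORBIDDEN_EXTENSIONS = {".shtml", ".cgi", ".php", ".htaccess"}


def extension_permitted(filename, strict=True):
    """Return True if an attachment filename may be stored as-is.

    strict=True  -> allow-list mode: final extension must be allowed.
    strict=False -> deny-list mode: only forbidden extensions are rejected.
    """
    # Drop any directory part, whichever path separator the sender used.
    base = filename.replace("\\", "/").rsplit("/", 1)[-1].lower()
    extensions = ["." + part for part in base.split(".")[1:]]
    if any(ext in FORBIDDEN_EXTENSIONS for ext in extensions):
        return False
    if strict:
        return bool(extensions) and extensions[-1] in ALLOWED_EXTENSIONS
    return True
```

In strict mode a name with no extension at all is rejected, which
seems the safer default.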

IMO, the content-type should be the authoritative source
for the type of a file, but scanning the data is
reasonable depending on how robust the scanner is.  Attackers are known to
give incorrect values in an attempt to fool email processors,
but such attempts usually target the content-disposition
filename parameter, since some popular MUAs display it to
the user, which can mislead the user about the true contents.
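
Roughly what I have in mind, as a sketch and not a description of
UpLib's determiner: the declared content-type wins, a small
magic-number scan confirms or contradicts it, and the filename is
consulted only if the caller opts in.  All names here (resolve_type,
sniff_type, allow_filename_fallback) and the signature table are
assumptions for illustration; a robust scanner would know far more
formats.

```python
import mimetypes

# A few well-known file signatures; deliberately tiny for illustration.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]
GENERIC_TYPES = {"", "application/octet-stream"}


def sniff_type(data):
    """Guess a MIME type from the leading bytes, or return None."""
    for magic, mime in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


def resolve_type(declared, data, filename=None, allow_filename_fallback=False):
    """Return (mime_type, status); status is 'confirmed', 'mismatch'
    or 'unverified'."""
    # Strip parameters such as '; name="a.pdf"' from the declared value.
    declared = (declared or "").split(";", 1)[0].strip().lower()
    sniffed = sniff_type(data)
    if declared not in GENERIC_TYPES:
        if sniffed is None:
            return declared, "unverified"
        return declared, "confirmed" if sniffed == declared else "mismatch"
    if sniffed is not None:
        return sniffed, "confirmed"
    # The filename is attacker-controlled, so it is off unless enabled.
    if allow_filename_fallback and filename:
        guessed, _ = mimetypes.guess_type(filename)
        if guessed:
            return guessed, "unverified"
    return "application/octet-stream", "unverified"
```

A "mismatch" result is the interesting one: the archiver can flag or
quarantine the part instead of guessing which side is lying.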

I recommend that all attachments be saved into a dedicated attachments
area so you can place restrictive web server configuration
settings on it.  This approach assumes you serve attachment
data directly from the file system through the HTTP server's
standard file handling.  If you serve attachments through a custom
web service (e.g. servlet, CGI), then attachment filenames
are not as critical a concern.

>> * Email address obfuscation.  Obviously we'd want to support that, but using
>>   what algorithm?  xxx'ing out the domain?  Using a central forwarding
>>   service?  How do we recognize email addresses?
>
> I don't obfuscate anything, really.  But this is an issue for a public
> Web UI design, I think.

If your plugin architecture supports a data filtering
step before archiving, this could be done with plugins.

Or, if you have plugins that can filter content on
retrieval, that may be better.  This way the stored data
still has the email addresses intact, but they get obfuscated
on rendering.
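
A render-time filter along those lines could be as small as the
sketch below.  The function name, the two styles, and the regular
expression are my own placeholders; the pattern is a loose
approximation and will not recognize every valid address.

```python
import re

# Loose approximation of an address: local part, "@", dotted domain.
EMAIL_RE = re.compile(
    r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)\b"
)


def obfuscate_addresses(text, style="at"):
    """Obfuscate addresses in text that is about to be rendered.

    style="at"   -> "user at example.com"
    style="mask" -> "user@xxx" (domain x'ed out)
    """
    if style == "at":
        return EMAIL_RE.sub(r"\1 at \2", text)
    if style == "mask":
        return EMAIL_RE.sub(r"\1@xxx", text)
    raise ValueError(f"unknown obfuscation style: {style!r}")
```

Since it only touches the rendered copy, the algorithm can be
changed later without rewriting the stored messages.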

--ewh