Metadata: What is it and why should you care?


Metadata. Even if you know what metadata is (data that describes other data), you probably still don't see why you should care about it. But there are some good reasons why you should.

Most modern word processors store metadata about the documents they create. This metadata typically contains the author, subject, keywords, and similar information. It may also include previously deleted or modified content from your document so that changes can be 'undone'. Photos also store metadata. Information like the camera settings and the date and time the picture was taken is embedded in the image. And most smartphones with a GPS will include the location where the picture was taken. Audio recordings may store metadata relating to the album, artist, track number and genre.

Why should you care if it is there or what it contains?

Let’s look at the case where you are working on a research paper and you have some data that must be anonymized. If you copy and paste this data into your document and then anonymize it, the original information may still be stored in the metadata for the 'Undo' command. The source of the copied information may also be recorded in your metadata. If you include a photo, the name of the owner of the camera may be included, and the location, date and time that the picture was taken may be present. The anonymization of your data may be compromised.

Another case is privacy.  If you post a picture of your house to a web site, you may be including its location.  If you submit a confidential document, the name of the person who wrote it (or rather the name of the person who uses the software) may be available.

A final case is professionalism.  If you have spent significant time polishing your document, do you want all of your earlier revisions to be available to the people reading your document?

Regardless of the reason, you should be able to review the metadata in your documents and remove it.

There are lots of different ways that this can be done, and I won't be discussing all of them.  However, if you have your own favourite method, which I haven't covered, I would love to hear about it.  There will be a discussion thread going in the AUGSA group on The Landing (https://landing.athabascau.ca/pg/groups/15119/augsa/).

Now, let’s go over a few metadata pointers.

The first thing you need to do is check what metadata is attached to a document you are publishing. My preferred method is to use an application called Tika. It is an Open Source application published by Apache, and can be found at http://tika.apache.org/download.html. It is a Java application, so to use it you will also need to have Java installed on your computer. (For information on installing Java, you can check out http://java.com.) As it is Java, it will work on most computers. I have personally tested it on Linux and Windows 7, and it worked on both of them. Feel free to contact me if you have any questions about how to run this program.

Tika does not modify the files it examines in any way, so it is safe to use. It is Open Source, with no licensing fee, and there is no spyware or adware associated with it. Tika also displays all the metadata it finds, while some tools will only display a subset of the metadata.
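If you are curious what this kind of check looks like under the hood, here is a minimal Python sketch. It is not how Tika works and it is not a replacement for it. It assumes the document is a .docx file (the newer Word format, which is a ZIP archive that keeps properties such as the author and keywords in a part named docProps/core.xml). The function name read_docx_core_properties is made up for this illustration, and older .doc files will not work with it.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_docx_core_properties(path_or_file) -> dict:
    """Return the core properties of a .docx file as a {name: text} dictionary.

    Assumption: the file is an Office Open XML package that contains a
    docProps/core.xml part. A KeyError is raised if that part is missing.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("docProps/core.xml")

    properties = {}
    for element in ET.fromstring(xml_bytes):
        # Tags look like "{namespace}creator"; keep only the local name.
        name = element.tag.split("}", 1)[-1]
        properties[name] = (element.text or "").strip()
    return properties
```

Calling read_docx_core_properties on the path of one of your own documents returns a dictionary whose keys typically include names such as creator and lastModifiedBy. This only covers one small part of what a full tool reports, so treat an empty result as "nothing found here", not as "no metadata anywhere".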

Now that you have checked the metadata you can decide if you are fine with publishing it, or if you would like to remove it from the file. One way to do this is to save your document in a format that does not support metadata.

For instance, if you take a picture with your phone, it is normally saved as a JPEG file (.jpg). When you open it in your photo editor, you can choose to save it in a different format. If you save it as a bitmap (.bmp), all the metadata associated with the photo, such as the camera information or the GPS information, will be lost, as the bitmap format does not support it. Once you have saved it as a bitmap, you can then save it as a JPEG image again and publish or use the image.
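The bitmap round trip works, but saving as JPEG a second time recompresses the picture. As an alternative, here is a hedged Python sketch of a different technique: copying the JPEG file while leaving out the segments that usually hold metadata. It assumes a standard JPEG layout in which the metadata lives in APP1 (EXIF and XMP), APP13 (IPTC) and COM (comment) segments before the image data. The function name strip_jpeg_metadata is invented for this example, and unusual files may need a proper tool instead.

```python
import struct

# Segment types this sketch drops: APP1 (EXIF/XMP), APP13 (IPTC), COM (comments).
METADATA_MARKERS = {0xE1, 0xED, 0xFE}
SOS = 0xDA  # "start of scan": the compressed image data follows this marker


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of the JPEG bytes without the metadata segments above."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file (missing start-of-image marker)")

    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a segment marker at offset {pos}")
        marker = data[pos + 1]
        if marker == SOS:
            # Copy the image data and everything after it unchanged.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no image data found (missing start-of-scan marker)")
```

To use it, read your photo in binary mode, pass the bytes to strip_jpeg_metadata, and write the result to a new file so the original stays untouched. Whichever method you use, check the result afterwards with Tika or the tool of your choice.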

If you are publishing a Microsoft document on Windows, you can use the plugin found here: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=8446. This is supposed to remove all hidden data, including tracked changes, from a Microsoft Office document. This would be the 'easiest' way to do it. Microsoft has this, or something similar, available for the different versions of Office, so you should be able to find one appropriate for your use.
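After removing hidden data, it is worth confirming that tracked changes are really gone. The following Python sketch is one rough way to do that, assuming the document is saved as .docx, where tracked insertions and deletions appear as ins and del elements in the word/document.xml part. The function name count_tracked_changes is hypothetical, and the sketch only looks at the main body, not at headers, footers or comments.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the main WordprocessingML document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tracked_changes(path_or_file) -> dict:
    """Count tracked insertions and deletions in the body of a .docx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return {
        "insertions": len(list(root.iter(f"{{{W_NS}}}ins"))),
        "deletions": len(list(root.iter(f"{{{W_NS}}}del"))),
    }
```

If both counts are zero, the main body has no tracked revisions left. A non-zero count means earlier wording may still be readable by anyone who opens the file.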

Another way would be to save the file in a format that supports little or no metadata. For instance, the RTF format supports much less metadata than an MS Word document. After saving the file in a different format, you can check it with Tika to see what metadata has been saved. A plain text document does not support any metadata. Of course, if you save your document as text, you will lose formatting information, such as fonts, font sizes, and other layout information, so this would probably not be your first option.

Removing metadata from other document types would be done in a similar manner.  First check the metadata in the document using Tika or the tool of your choice.  Then, you can do a Google search on removing the metadata.

Metadata is a valuable tool for classifying documents and providing additional information about a document.  Be aware of your metadata, so you can control what you publish for the world to see!