Metadata: Your File’s Hidden DNA and You

Post Syndicated from Skip Levens original https://www.backblaze.com/blog/metadata-your-files-hidden-dna-and-you/

A Photo Overlaid with Metadata Information

The files you use every day on your Mac or PC, whether at home or at work, carry around a slew of hidden data that can be incredibly useful to you… or problematically revealing to others. For example, the image in the header reveals latitude and longitude details in an iPhone photo that you could use to organize the photo along with others taken in the same place. But anyone else can access the same data and enter it directly into Google Maps to discover exactly where that picture was taken! Not quite as useful.

But if you know what this hidden information is—and how to use it—it can be incredibly helpful in diagnosing problems with files, organizing or protecting data, and even removing information you don’t want revealed! If you don’t, it can be a huge annoyance, and potentially even dangerous.

“It” is “metadata” and it’s something everyone works with, even if they don’t know it. Whenever you move a file—through email, into or out of a sync or cloud storage service, or to another device—you’re likely altering its metadata. It’s something we work with at Backblaze every day. And because moving files into and out of computer backup and cloud storage services can affect metadata, we thought we’d take a high-level look at how this information works in common file types to help you understand how to optimize its use in your own file management.

You can follow along as we walk through several examples, then tackle some real-world file mysteries with the power of metadata. At the end of the post, you will find a list of tools for Macs, PCs, and the command line to test out and add to your own ‘metadata toolbox.’

What is file metadata?

A great way to think of file metadata is as extra information about a file, carried along with that file, that makes it easier to use and find. So it’s not the actual document or photo itself, it’s information about it—like the file’s name, thumbnail image, or creation date. This information is embedded in or associated with the file, and helps make it easier for you, your applications, and your computer to actually use those files.

Information about a File for Humans

The most obvious kind of metadata is a file’s name, extension, icon, and creation timestamp. This simple metadata alone makes searching across an entire hard drive of files and folders as easy as typing part of the name into the Finder or search bar, sorting the results by date, then singling out the file you want by its thumbnail or filename.

Information about a File for Computers

A less well-known kind of file metadata is meant to make working with files easier or safer for your operating system. Your files might carry notes telling the operating system that they should be opened with a specific application. Or a flag might be set on a file you’ve downloaded from the internet or received as a mail attachment, warning your OS that it may not be safe to use.

Examples of Different File Previews
An example of basic file information on macOS and Windows.

Other critical information about a file is the permissions, or privilege levels, extended to users on that computer:

An Example of macOS User Permissions Metadata
An example of permissions settings on a file in macOS.

For example, files on UNIX-like systems, like Linux and macOS, are marked with the name of the user account that created them (the ‘owner’), the group they belong to, and the permissions that let the owner and other users open and view the file, or make changes to it.

When permissions on files are set correctly, you rarely need to think about them as a user. But if this permissions information changes, users could lose access to files, or files could be opened by users that shouldn’t have access.
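If you would rather read this information from a script than from a dialog box, here is a minimal Python sketch for UNIX-like systems. The function name `describe_permissions` is made up for illustration; it simply wraps the standard `os.stat` call.

```python
import os
import stat

def describe_permissions(path):
    """Return the owner id, group id, and an ls-style mode string for a file."""
    info = os.stat(path)
    return {
        "owner_uid": info.st_uid,                 # numeric id of the owning user
        "group_gid": info.st_gid,                 # numeric id of the owning group
        "mode": stat.filemode(info.st_mode),      # e.g. "-rw-r--r--"
    }
```

Calling `describe_permissions("report.pdf")` on a file of your own (the file name here is only a placeholder) returns a small dictionary you can print or compare between two copies of the same document.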

Information about a File for Applications

Another category of information is human-readable, but really intended for your applications to use. Some of this information can be incredibly detailed. The best-known example of ‘application metadata’ is the data that cameras embed in images when you take pictures, such as the camera model, the lens and shutter settings, and the location where the picture was taken.

Application Specific Metadata
Application metadata in an iPhone picture reveals the camera model and settings, and even GPS coordinates.

All this information is read by your image editing software to enable new features. For example, in iPhoto you can search for all images taken in the same location, or find all images shot with the same camera. That also means these files are a trove of information for anyone who inspects them: the camera type, the shutter speed, and even the GPS coordinates where the picture was taken.
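Photo GPS metadata is commonly stored as degrees, minutes, and seconds plus a hemisphere letter, while map search boxes usually expect decimal degrees. The sketch below shows the conversion. The function name and the sample coordinates are invented for illustration and are not taken from the photo in this post.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    return -value if ref in ("S", "W") else value

# Made-up sample values, as they might appear in a photo's GPS fields.
latitude = dms_to_decimal(37, 46, 30.0, "N")
longitude = dms_to_decimal(122, 25, 12.0, "W")
print(f"{latitude:.6f},{longitude:.6f}")  # 37.775000,-122.420000
```

The printed pair is in the form you can paste into a map search, which is exactly why location data in a shared photo deserves a second look.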

Information You Won’t Want to Share

You may already know that you do not want to broadcast the location of photos you share, but even plain old documents can have information embedded in them that you’d rather keep to yourself.

A file unknowingly containing personally identifiable data
Inspecting a file’s metadata that contains personal identification information.

In the image above, you’ll see the file metadata of an old word processing document that happily includes names and email addresses for anyone to see! It’s common for files to include information like usernames, email addresses, GPS coordinates, or server mount paths. This is the kind of information you might want to delete before making a file public.
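As a rough first check before sharing a file, you could scan its raw bytes for anything that looks like an email address. This is only a sketch: `find_embedded_emails` is a hypothetical helper, the pattern is deliberately simple, and it only catches plain ASCII text, so it will miss strings stored in other encodings (some older document formats store text as UTF-16). A dedicated metadata tool will give a far more complete picture.

```python
import re

# Simple pattern for ASCII email-like strings; deliberately not a full validator.
EMAIL_PATTERN = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_embedded_emails(path):
    """Return the sorted, de-duplicated email-like strings found in a file's bytes."""
    with open(path, "rb") as handle:
        data = handle.read()
    return sorted({match.decode("ascii") for match in EMAIL_PATTERN.findall(data)})
```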

How Metadata Changes as You Move Files from Place to Place

As your files move around—copied from user to user and system to system—all of this useful metadata is vulnerable to being changed or lost. This has implications for your workflow, especially when you inevitably need to reconcile different versions and copies of files.

Unfortunately, the operating-system-specific tags or comments you place on files are the first to be lost when they move from location to location, and system to system.

For example, if I carefully color tag a folder of images on my Mac, then send them to be reviewed by a colleague who works on a PC, all those tags are gone when I get the files back. For this reason, true workflow-specific tags are usually applied in an external system that is dedicated to managing this kind of metadata for files—like a photo manager or a digital asset manager.

File Permissions Can Change Between Macs, Windows, and Linux

It’s also common for files that originate on another OS to arrive with non-standard permissions set. For example, documents saved on a PC often end up with the executable bit set when they are moved to a Mac. The files will still open, but there’s no reason for them to be marked like an application.

File Creation and Modification Dates Can Change, Too

When you create or change a file on your computer, the time is recorded as part of the file’s metadata. But what happens when the time on one computer differs from another? Most modern operating systems do a good job of syncing to time servers and of converting between universal time and local time based on location, but discrepancies still creep in that make sorting files by time a challenge.

Permissions and Timestamps Can Change on Network and Cloud Storage

When files are copied to network servers, or the cloud, things can get completely changed. Depending on how the file is moved, and how the storage provider handles files, your modification dates could get completely blown away, and since the ‘old’ file you’re uploading is new to the storage system, it becomes a new file with an entirely new creation date.
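You can see a small-scale version of this effect without leaving your own machine. In Python’s standard library, `shutil.copy` copies a file’s contents and permission bits but gives the copy a fresh modification time, while `shutil.copy2` also tries to preserve timestamps. The sketch below copies one file both ways and reports the resulting modification times. The function name is illustrative, and this says nothing about how any particular storage provider handles timestamps.

```python
import os
import shutil

def copy_both_ways(source, plain_target, preserving_target):
    """Copy a file with and without timestamp preservation; return both mtimes."""
    shutil.copy(source, plain_target)        # contents + permission bits only
    shutil.copy2(source, preserving_target)  # also attempts to keep timestamps
    plain_mtime = os.stat(plain_target).st_mtime
    preserved_mtime = os.stat(preserving_target).st_mtime
    return plain_mtime, preserved_mtime
```

If the source file is a few years old, the first value will be roughly “now” and the second will match the original, which is the difference between a copy that sorts correctly by date and one that doesn’t.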

Individually, these changes are annoying, but collectively they threaten to kill with a thousand cuts. As time stamps, tags, and permissions are changed, your carefully organized file hierarchy or valuable archival information could be in tatters.

A Real World Example of Changing File Metadata

To see how metadata changes, let’s follow a single file that is downloaded to a Mac and to a PC, then uploaded to and downloaded from different cloud storage services, and see what changes get introduced.

First: A Computer-to-Computer Test

In this test I downloaded a PDF from Backblaze’s website to a Mac. On the Mac, I added color tags, and even comments using the Finder’s preview pane. Next, I downloaded that same file on a Windows system, then copied it over to the Mac.

The two copies appear to be the exact same PDF file, but let’s fire up a terminal window on the Mac to inspect them further and make sure.

To follow along, navigate to the folder of files you want to inspect so that it’s handy. Then open another Finder window and double-click the ‘Terminal’ application, which is found in the Utilities folder inside your Applications folder. The Terminal application will launch and place you at the ‘prompt,’ ready for your command.

To navigate to the folder you want to work with, type in ‘cd’ at the terminal prompt to change directory, enter a space, then drag the folder of files you want to work with into the terminal window and drop it. You’ll see that the path to the folder is automatically resolved to that folder’s location, saving you a lot of typing.

Now that I’m in the proper folder, the tool I want to use is the humble ‘ls’ command, which lists a folder’s files. To do so, type “ls”, then a space, then a dash immediately followed by “l@”. The ‘l’ flag retrieves the long form of results, and the ‘@’ flag explicitly shows extended attributes on the Mac.

Comparing Two Files' Metadata
Detailed ls results comparison of the two files reveals extended attributes metadata and file permissions mismatches.

As you can already see, the following changes have been introduced:

  1. The Windows file has non-standard permissions. The PDF is marked as executable: there is an asterisk at the end of the file name, and every permission set includes an ‘x,’ which means the file is treated like an application or command instead of a document.
  2. The file color tag and comments that I entered in the Mac’s Finder are missing from the Windows version.
  3. The Mac has flagged the file downloaded on the Mac with its Quarantine attribute, part of the Gatekeeper security feature in macOS that marks potential malware or security risks to protect your system. This step was completely bypassed when the file was copied over from Windows, so no Quarantine flag was set on that copy.
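The first item in that list is easy to check for in bulk. Here is a minimal Python sketch that walks a folder and reports documents carrying an executable bit. The function name and the set of “document” extensions are assumptions chosen for illustration, so adjust them to your own files.

```python
import stat
from pathlib import Path

# Illustrative list of extensions that should never need an executable bit.
DOCUMENT_SUFFIXES = {".pdf", ".doc", ".docx", ".txt", ".jpg", ".png"}

def find_executable_documents(folder):
    """Return (file name, mode string) for documents marked executable."""
    any_execute = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    flagged = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in DOCUMENT_SUFFIXES:
            continue
        mode = path.stat().st_mode
        if mode & any_execute:
            flagged.append((path.name, stat.filemode(mode)))
    return flagged
```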

Next Stop, the Cloud

Now, I’ll move these files to and from three different types of cloud storage—Backblaze B2 Cloud Storage, Google Drive, and Dropbox—and see how they change.

To move the files to Backblaze B2, I used rclone, which is an extremely popular tool to copy and sync files from any mix of storage and cloud systems. For Google Drive, I used their web interface, and for Dropbox I uploaded via the web, then retrieved the files as a compressed file.

Now, when I compare all the files side by side I can see how different all of the file metadata is.

Comparing the Files Post-Cloud Download
Slight metadata differences emerge post download from different cloud storage services.
Command Line File Comparison
The downloaded test files’ differing metadata information as returned by the ls command.

First, as expected, none of my user-entered metadata, like tags and comments, was picked up by cloud storage. Second, the Mac’s Gatekeeper security feature promptly labeled every downloaded file with the ‘Quarantine’ flag. Backblaze B2 returned files with proper file permissions (644, or read/write for the user, read for the group, and read for all others) and preserved the creation date of the original file.

Both GDrive and Dropbox applied new file creation and file modification timestamps—and bizarrely, the files returned by Dropbox have a “modified date” 8 hours in the future! Does Dropbox know something we don’t?

You can see how searching and sifting through all of these copies on my Mac has become tremendously complicated now.

Solving Metadata Workflow Mysteries and Challenges

Hopefully it’s clear that unless your files only live on your local system, as they move from system to system, the metadata they carry around will change.

Workflow Example 1: Using Metadata Tools to Learn About a ‘Mystery’ File

Let’s apply what we’ve learned to some common examples of how file metadata changes, how to inspect it, and how to correct it.

Inspecting a file’s metadata can help diagnose misnamed files, or files that have lost their file extension. The operating system usually trusts the file extension blindly. For example, it will try to open any file with a .pdf extension as a PDF, even if the file is really something else!

MacOS File Information for a “Mystery File.”

Above, I have a file from a very old backup that is missing an extension. The Mac is having trouble interpreting the way the original Windows OS file system encoded the date, so my Mac thinks the file was created December 31, 1969! (I’m pretty sure I wasn’t using MS Office in 1969.)
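That particular date is a familiar one. UNIX-like systems count time in seconds from midnight UTC on January 1, 1970, so a timestamp of zero shown in a US time zone lands on the last day of 1969. I can’t confirm that this is exactly what happened to this file, but the sketch below shows the arithmetic. The fixed UTC-8 offset is an assumption standing in for a Pacific time zone.

```python
from datetime import datetime, timedelta, timezone

# Assumed viewer time zone for illustration: a fixed UTC-8 offset.
PACIFIC_STANDARD = timezone(timedelta(hours=-8))

def format_epoch(seconds, tz):
    """Render a UNIX timestamp as an ISO 8601 string in the given time zone."""
    return datetime.fromtimestamp(seconds, tz).isoformat()

print(format_epoch(0, timezone.utc))      # 1970-01-01T00:00:00+00:00
print(format_epoch(0, PACIFIC_STANDARD))  # 1969-12-31T16:00:00-08:00
```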

Without an extension, my Mac assumes this file must be a text file, and offers to open it in TextEdit, the default app for opening text files. When I double click on the file, the OS tries to open it but throws an error.

Solving the Mystery Using Exiftool
Mystery solved by inspecting the hidden file metadata: It’s an old Word doc backup!

Reaching into the toolbox, I use a command-line program called exiftool, a powerful tool for revealing a file’s embedded metadata. (See the bottom of the post to read more about exiftool and where you can learn how to use it.) By calling exiftool from the Terminal application and passing in the name of the file I want to inspect, all is revealed! This is, in fact, a Microsoft Word file.

Looking closer, I can even see that this isn’t the original file. It was autosaved from the original, which has an entirely different name. Mystery solved! I can now safely add the ‘.doc’ extension to the file, and it will open properly in a word processor that can still import this version of Microsoft Word.
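Part of what makes this kind of identification possible is that many file formats begin with a recognizable byte signature, regardless of what the file is named. The Python sketch below checks a handful of well-known signatures. It is a simplified illustration of the idea, not a description of how exiftool works internally, and the function name and the short signature table are my own choices.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy Office formats such as .doc)"),
    (b"PK\x03\x04", "ZIP container (also used by .docx and .xlsx)"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]

def sniff_file_type(path):
    """Guess a file's type from its first bytes, ignoring its name and extension."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

A file with no extension that starts with the OLE2 signature is a strong hint that you are looking at an older Office document, though it takes a tool like exiftool to tell you which application wrote it.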

Workflow Example 2: Uncovering Duplicate Files

Next, let’s take this entire folder of PDF copies that I used for upload tests. After all that uploading and downloading, my single original file has 8 copies. I ‘know’ that I only need one of these, so let’s try de-duping them!

De-Dupe Confusion in Gemini
Due to the file permissions and extended attributes differences, I might accidentally delete file versions I want to keep.

When I try to dedupe this folder using Gemini 2, a duplicate file finder, I’m presented with several sets of duplicates to remove. In other words, Gemini 2 was able to determine that there are duplicates, but isn’t sure which set of files it should keep.

If I select by ‘oldest’ duplicates, it leaves me with the Dropbox versions, by ‘newest’ it leaves me with the GDrive versions, etc. In this particular case, the ‘automatic’ selection tool lets me mark the GDrive and Dropbox versions as the duplicates I will delete. However, the differences in file permissions and extended attributes in Mac’s Finder are preventing these files from being de-duped any further.

I still have two files—the ‘original’ files downloaded to my Mac and PC. Gemini insists they are different files, but we know they are not, so let’s meet some new tools.
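One way to confirm that two files really are the same document, whatever their permissions or extended attributes say, is to compare a hash of their contents. The sketch below groups files by SHA-256 digest. It illustrates the general technique only; I’m not claiming this is how Gemini 2 decides what counts as a duplicate, and the function names are invented for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_duplicates(folder):
    """Return lists of file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Because the hash covers only the file’s contents, two copies with different timestamps, permissions, or Finder tags still land in the same group.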

Setting Proper Permissions

I could, of course, use Mac’s Finder to reset the permissions of this single file downloaded from Windows. But what if I’m faced with having to reset permissions on thousands of files at once?

Chaining Two Commands Together
In this more advanced example, I’m chaining two commands together at once to first find, then reset permissions on all documents at once.

To show how you can combine several tools, I chain the ‘find’ and ‘chmod’ commands together to first find all documents in my current folder, then change the permissions on all of them at once.
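The exact command from the screenshot isn’t reproduced in the text, so here is the same find-then-fix idea as a minimal Python sketch. The function name, the `*.pdf` pattern, and the 644 target mode are assumptions for illustration; pick the pattern and mode that suit your own files.

```python
import os
from pathlib import Path

def reset_document_permissions(folder, pattern="*.pdf", mode=0o644):
    """Set matching files to the given mode; return the names that were changed."""
    changed = []
    for path in sorted(Path(folder).rglob(pattern)):
        # Compare only the permission bits, not the file-type bits.
        if path.is_file() and (path.stat().st_mode & 0o777) != mode:
            os.chmod(path, mode)
            changed.append(path.name)
    return changed
```

Returning the list of changed files makes it easy to review what the script touched, which is worth doing before running anything like this across thousands of files.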

Cleaning Mac Extended Attributes

Next, I’ve decided that I want to clear all of the extended attributes that the Mac has set on these files. For this task, I’ll use Apple’s xattr tool.

xattr Code Snippet
Here, I’m using Apple’s xattr tool to remove all Finder extended attributes like comments, color tags, and Quarantine flags, etc.
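The snippet in the screenshot isn’t reproduced in the text, so the following is a small Python wrapper around the same macOS tool rather than a copy of my command. It relies on `xattr -c`, which clears all extended attributes from the files it is given. The helper names are hypothetical, and the sketch only makes sense on a Mac, where the `xattr` command is available.

```python
import shutil
import subprocess

def xattr_clear_command(paths):
    """Build the macOS command that clears all extended attributes from files."""
    return ["xattr", "-c", *[str(path) for path in paths]]

def clear_extended_attributes(paths):
    """Run the command; raises if the xattr tool is missing or reports an error."""
    if shutil.which("xattr") is None:
        raise RuntimeError("xattr command not found; this sketch targets macOS")
    subprocess.run(xattr_clear_command(paths), check=True)
```

Keep in mind that this removes the Quarantine flag along with tags and comments, so only run it on files you already trust.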

Now, when I rerun Gemini 2 on this folder, I identify the last duplicate, delete it and I’m back to one file again.

The Final Results of the Gemini Test
With permissions fixed and macOS extended attributes removed, I can now fully de-dupe these files.

File Metadata Takeaways

As we’ve seen, the metadata carried by the files you use every day changes over the life of the file as it moves from system to system, and server to server. And those changes can be problematic when it comes to the usefulness and security of your data.

You now have the power to see that information, inspect it, and, with the tools listed below, change it. You can solve the mysteries that crop up when reconciling those changes, and clean up metadata you don’t want made widely known when you share files.

Do you have more questions about file metadata and how it affects how you use and save your files? Let us know! Meanwhile, the tools listed below are excellent starting points to aid in further exploration.

Addendum: Tools Reference

Here is a list of tools referenced in the article, and other interesting command-line and GUI tools to move, dedupe, and rename files:

exiftool—Hands-down the most widely used metadata exploration tool, which lets you inspect and manipulate standard EXIF and other associated metadata. The latest Windows and macOS downloads are available on the exiftool.org website; it is also available through Linux package systems, or on a Mac with ‘brew install exiftool.’ There are many GUI ports available from the website as well.

rclone—Uses rsync style syntax to copy and sync file locations to and from the widest variety of destinations including almost every known cloud storage choice.

xattr—A macOS system tool to inspect, create, or remove file extended attributes.

ranger—An old school ‘file commander’ that includes an embedded metadata pane. Binaries available, build from source, or on a Mac install with ‘brew install ranger.’

MacPaw Gemini2—Still one of the most widely-used GUI de-dupe tools on the Mac.

fdupes—One of several available command-line de-duping tools.

A Better Finder Rename—A GUI tool to rename batches of files, and even rename according to parent folder structure and EXIF information.

Bulk Rename Utility—A Windows analogue of ‘A Better Finder Rename’ on the Mac.

rename—(or ‘brew install rename’) A truly impressive tool to rename entire batches of files with regex, or simple text replacement or addition. Be sure to use the “--dry-run” flag first to test what changes it will make!

The post Metadata: Your File’s Hidden DNA and You appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.