Structured vs. Unstructured Data

Video

Here’s a link to a video version of this blog entry:  Structured Data Video

Why do we care?

In storage lately there’s been a lot of discussion about unstructured data. The first question I ask whenever I hear a new term put out there is: why do I care?

The answer turns out to be: because the growth model for the two data types is so dramatically different.

The graph below, from IDC, depicts storage usage over time, breaking out structured and unstructured data. In this case structured is labeled block and unstructured is labeled file.

What’s striking is that the unstructured data has a growth curve substantially greater than that of structured data. If we look out to the 2014 projections we see that, combined, the forecast is to ship 80EB of storage. An exabyte (EB) is 10^18 bytes, or a billion billion bytes. So it’s a big number.

It also shows that the lion’s share, almost 70 of this 80EB, is expected to come from unstructured data. That is roughly 87% of the total, so you’re getting close to saying 90% of all storage shipped will be to store unstructured data. That’s a big enough difference to qualify as a substantial shift in the storage industry. And it doesn’t really matter which side of the equation you’re on, whether a producer or consumer of storage: change of this magnitude warrants a look under the covers to see what it is all about. And that’s what we’ll be doing in this session.
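To make the arithmetic behind those figures concrete, here is a minimal sketch. It assumes the "almost 70" is exactly 70EB and uses the decimal definition of an exabyte; the variable names are mine, not IDC's.

```python
# Figures quoted in the text; the forecast values are approximate.
EXABYTE = 10**18  # bytes in one exabyte (decimal definition)

total_eb = 80          # combined 2014 forecast, in EB
unstructured_eb = 70   # "almost 70" EB, assumed here to be exactly 70

# Fraction of the forecast attributed to unstructured data.
unstructured_share = unstructured_eb / total_eb

# The combined forecast expressed in plain bytes.
total_bytes = total_eb * EXABYTE

print(f"Unstructured share: {unstructured_share:.1%}")
print(f"Total forecast: {total_bytes:.1e} bytes")
```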

Unstructured Data Definition

While there isn’t a canonical definition of the two terms, generally the term structured data (SD) is applied to databases (DB) and unstructured data (UD) applies to everything else. 

The terms themselves aren’t terribly meaningful as all computer data is structured. Anyone who has written computer programs realizes early on that digital computers don’t do well with anything other than perfect structure. Misplace so much as a single semicolon in a program and you’ll be spending time in the debugger. 

Writing out data is similar. The data you write doesn’t have to make any sense – it can be random bytes, but the way the data is written out to the storage medium is very orderly. Otherwise, it would be impossible to read it back. 

Structured Data

I suspect the term SD came from the name of a common language used to access DBs, called Structured Query Language, or SQL.

SQL provides a well-defined way for applications to manage data in a DB. 

The little snippet of SQL code here illustrates how an application could retrieve all rows from a table called Book where the Price is greater than 100 and request that the result be sorted in ascending order by title. 
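A query of the kind described above can be sketched as follows. I'm using Python's built-in sqlite3 module and an in-memory DB purely for illustration; the Author column and the sample rows are hypothetical, and only the table name Book and the Price and title fields come from the description.

```python
import sqlite3


def expensive_books(conn):
    """Return all rows from Book with Price > 100, sorted by Title ascending."""
    query = "SELECT * FROM Book WHERE Price > 100 ORDER BY Title ASC;"
    return conn.execute(query).fetchall()


# Hypothetical sample data, only so the query has something to return.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Book (Title TEXT, Author TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO Book VALUES (?, ?, ?)",
    [
        ("SQL Examples and Guide", "A. Author", 145.55),
        ("An Introduction to SQL", "B. Author", 88.00),
        ("Database Design", "C. Author", 120.00),
    ],
)

for row in expensive_books(conn):
    print(row)
```

The application never says how to find the rows; it only states which rows it wants and in what order, and the DB does the rest.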

Most people haven’t written DB code themselves, but they have used an office spreadsheet tool like Excel on Windows or Numbers on the Mac. These tools allow you to create simple DB tables. 

The example here shows customer records with the typical information — name, address, and so forth. And that’s it — you’ve created a DB table! While sophisticated database management systems (DBMS) allow you to do much more than you can do with a spreadsheet, the base concept is essentially the same: tables with rows and columns, often containing straight ASCII text as this example shows. 
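The rows-and-columns idea can be sketched in a few lines of Python. The customer records below are made up for illustration, and I'm assuming the sheet was exported as comma-separated text, which is one common way to get a table out of a spreadsheet tool.

```python
import csv
import io

# Hypothetical customer records, standing in for the spreadsheet described above.
SHEET = """Name,Address,City
Ann Example,1 Main St,Boston
Bob Sample,22 Oak Ave,Denver
"""

# The first line supplies the column names; every later line is one row.
rows = list(csv.DictReader(io.StringIO(SHEET)))
columns = list(rows[0].keys())

print(columns)
for row in rows:
    print(row["Name"], "|", row["City"])
```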

Unstructured Data 

Earlier, we said that UD is simply the complement of SD; that is, it’s everything other than a DB. That’s a pretty broad class. To break that down further, UD can be divided into two subclasses: file and object. 

Files

File data is the more familiar, as computer users are accustomed to seeing this data in the Windows Explorer or Mac Finder screens, as in the image below.

This is an image of a file system on a disk that contains the files created by a digital camera. Each file is an image of type JPG and has associated system metadata (SMD). In this case, the SMD shown is:

  • The filename;
  • The date the file was last modified; 
  • The file size; and
  • The file type (JPEG image).

This view is quite familiar to computer users and is the principal mechanism for finding and using files on your computer.
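The same system metadata is available to programs. Here is a minimal sketch using Python's standard library; the file name IMG_0001.jpg and its placeholder contents are hypothetical, and I report the file extension rather than a full type description like the one a file browser shows.

```python
import os
import tempfile
from datetime import datetime


def system_metadata(path):
    """Collect the four fields a file browser listing typically shows."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "modified": datetime.fromtimestamp(info.st_mtime),
        "size_bytes": info.st_size,
        "extension": os.path.splitext(path)[1].lower(),
    }


# Hypothetical stand-in for a camera file: a small temp file with a .jpg name.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "IMG_0001.jpg")
    with open(path, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 96)  # 100 placeholder bytes
    meta = system_metadata(path)

print(meta)
```

Note that every field here is supplied by the file system itself; the user has no say in which fields exist.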

Objects

Objects are the evolution of the basic file types that we’ve had for over 50 years now. The next article and video “What is an Object?” talks about the difference between objects and files in detail. Here, it is sufficient to say that objects are files with an additional type of metadata, called custom metadata (CMD). 

While SMD allows only a fairly pedestrian set of characteristics to be expressed (file name, size, and so forth), CMD allows for much richer data expression. It’s therefore no surprise that CMD was introduced around the time that rich data types, for example videos, pictures, and music files, were introduced to the general computer user. 

A simple example of CMD is when you use an application to import the pictures from your camera and you are able to add any text that suits you, for example the names of the people in the photograph or where the picture was taken. The point of CMD is that, unlike SMD, CMD is not limited to a fixed set of fields. A good software system will let you add any arbitrary text you want and associate it directly with the object (in this case, a photograph).
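One way to picture the difference is a small data model: the SMD fields are fixed, while the CMD is an open-ended set of key/value pairs. This is only a sketch under my own assumptions; the StoredObject class and its field names are hypothetical and do not reflect any particular object store's API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical model: file content, fixed SMD fields, and free-form CMD."""
    name: str                 # SMD: fixed field
    size_bytes: int           # SMD: fixed field
    data: bytes = b""         # the file content itself
    custom_metadata: dict = field(default_factory=dict)  # CMD: anything goes

    def annotate(self, key, value):
        """Attach an arbitrary piece of custom metadata to the object."""
        self.custom_metadata[key] = value


photo = StoredObject(name="IMG_0001.jpg", size_bytes=4, data=b"\xff\xd8\xff\xe0")
photo.annotate("people", ["Ann", "Bob"])
photo.annotate("location", "Boston Common")

print(photo.custom_metadata)
```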

Why is Unstructured Data Growing So Rapidly?

A good question is why one form of data, UD, would grow so much more rapidly than the other. After all, both data types are required, as DBs perform essential functions.

One reason is that the actual user content of a DB is typically text, as shown in the earlier spreadsheet example. For about 40 years, files were likewise most often composed of just text. But the world has changed. Now users want rich content, not just plain text.

Rich data types include things such as pictures, music, movies, and x-rays. Even basic office document types such as Word and PowerPoint are becoming increasingly rich media containers, where it’s now easy for a user to embed much more than just text.

While rich data types provide a far superior user experience over text alone, they do so at the expense of storage space. Rich media types are not just slightly larger than basic text; they can be orders of magnitude larger.

To get a sense of both the difference in user experience and the different storage capacity usage of these two data types, consider this simple example. 

Rich Data vs. Text

Let’s say you’re trying to decide which movie to go see. A traditional method of doing so would be to read a movie review. Here is a link to a movie review by the Boston Globe for the movie “Real Steel”. When I downloaded a copy of this review, it took up ~10KB of capacity.

The movie snippet below is from the full movie trailer available on the Internet Movie Database (IMDB) web site.

Of the two, which gave you a better sense of the movie? Clearly the trailer serves this purpose far better, which is why movie theaters show coming attractions in the form of trailers rather than written text on the screen. However, when I downloaded the trailer, rather than using up just 10KB of capacity, it used ~200MB. That’s an incredible difference, with 20,000 times more capacity required for the trailer than for the movie review!
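The 20,000x figure follows directly from the two sizes. A quick check, assuming decimal units (1KB = 1,000 bytes, 1MB = 1,000,000 bytes) and treating the approximate sizes as exact:

```python
KB = 10**3  # bytes per kilobyte (decimal)
MB = 10**6  # bytes per megabyte (decimal)

review_bytes = 10 * KB     # text movie review, ~10KB
trailer_bytes = 200 * MB   # movie trailer, ~200MB

# How many times more capacity the trailer needs than the review.
ratio = trailer_bytes // review_bytes

print(f"{ratio:,}x")
```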

Of course, we can’t extrapolate too much from a single example; that’s the job of industry analysts. But this does give a sense of just how great a difference there is between the storage required for rich data and that required for plain text, and of why analysts forecast so much more storage dedicated to unstructured versus structured data going forward.

Takeaways

Here are three takeaways.

  1. Most people prefer rich data over basic text;
  2. Rich data takes up WAY more space:
    • Text movie review: ~10KB
    • Full HD trailer: ~200MB
    • 20,000x greater storage capacity!
  3. Use of rich data is increasing at an increasing rate.
© Robert Primmer 2013