Wednesday, June 16, 2010

Is XML "Better" than TXT?

I'm sure there are other articles that discuss and explain this topic, but I searched Google and couldn't find any that hit the nail on the head the way I wanted it hit.

So if this is old hat for you, I apologize, but you can always move on and read something else (http://www.peopleofwalmart.com) if you prefer.

 

So, the question is: "Is XML better than TXT?"

The short answer is: NO.

The long answer is: IT DEPENDS.

XML, or Extensible Markup Language, is a tag-based format for presenting textual information.  XML files can be created on one system and used or "consumed" by a completely different system, since the syntax and structure are fairly well defined and understood.  This is not to say there isn't any ambiguity about HOW you employ XML; there most certainly is.  The science of XML is pretty straightforward.  The art of XML is not.  There is a huge amount of potential variation in how you can put XML to work just within the context of the file or "stream" structure itself: whether to use elements or attributes, for example, and in what proportion to mix them.

But what about the 10,000 foot view?  When does it make more sense to use traditional TXT formatting instead of the newer XML tag formatting?  Consider the following example:

Scenario 1:

You need to export a list of user accounts from a computer system and import it into a database which resides on a different network entirely.  The only information you need to collect is the user ID (aka "username") and the "full name" of the user.  Due to various environmental and procedural factors, you must transport the information by file; no direct stream or socket connection is permitted.  You must export the information, save it to a file, copy the file to storage media, and transport the media to the remote system for import.

Digression 1:

You could do this with XML in several ways.  You could do this with TXT in a few ways also, but the most common ways are either comma-separated value format (CSV) or tab-delimited format.  You could also employ a standard INI format, but I've left that out of this discussion (for now; maybe a future discussion).  Here are a few examples...

Option A - XML Elements

<useraccounts>

  <useraccount>

    <userid>412553</userid>

    <fullname>John Doe</fullname>

  </useraccount>

  <useraccount>

    <userid>555010</userid>

    <fullname>Susan Jones</fullname>

  </useraccount>

</useraccounts>



Option B - XML Elements with Attributes



<useraccounts>

  <useraccount userid="412553" fullname="John Doe"/>

  <useraccount userid="555010" fullname="Susan Jones"/>

</useraccounts>



Option C - Plain Old TXT, CSV format



412553,John Doe

555010,Susan Jones



It should be pretty obvious in this scenario that option C is the most compact, and therefore requires the least overhead (storage, transport, time, etc.) to use.  But this is only one scenario.
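If you'd rather measure than eyeball it, here is a minimal Python sketch that counts the bytes in each of the three options above.  Python is my choice for illustration only, and the function name byte_sizes is made up for this example.  The payloads are the samples from this post with the blank lines removed, so the exact numbers depend on that whitespace choice.

```python
# Compare the UTF-8 size of the three Scenario 1 encodings.
# The payloads are the samples from the post, blank lines removed.

XML_ELEMENTS = """<useraccounts>
  <useraccount>
    <userid>412553</userid>
    <fullname>John Doe</fullname>
  </useraccount>
  <useraccount>
    <userid>555010</userid>
    <fullname>Susan Jones</fullname>
  </useraccount>
</useraccounts>
"""

XML_ATTRIBUTES = """<useraccounts>
  <useraccount userid="412553" fullname="John Doe"/>
  <useraccount userid="555010" fullname="Susan Jones"/>
</useraccounts>
"""

CSV_TEXT = """412553,John Doe
555010,Susan Jones
"""


def byte_sizes():
    """Return a dict mapping each option label to its encoded size in bytes."""
    payloads = {
        "A (XML elements)": XML_ELEMENTS,
        "B (XML attributes)": XML_ATTRIBUTES,
        "C (CSV)": CSV_TEXT,
    }
    return {label: len(text.encode("utf-8")) for label, text in payloads.items()}


if __name__ == "__main__":
    for label, size in byte_sizes().items():
        print(f"{label}: {size} bytes")
```

With only two accounts the difference is trivial, but the per-record tag overhead in options A and B repeats for every account, so the gap grows with the size of the list.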



But something else to consider is the overhead of "consuming" this data on the receiving end.  In most cases, you must rely upon a special API for validating, parsing and extracting information from within XML streams and files.  On a Windows computer, this usually involves something like .NET, the MSXML API, or XMLHTTPRequest or something like that.  A TXT file, however, can be read with much more meager resources, such as the FileSystemObject object within the WSH API.  The net effect on resource consumption is insignificant for small files or small numbers of files, but when you're parsing millions or billions of them at a time, for long periods, the overhead can add up to something worth keeping an eye on.
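To make the "consuming" side concrete, here is a rough sketch of reading the same two accounts from option C and from option B.  I'm using Python's standard csv and xml.etree.ElementTree modules as stand-ins for the Windows APIs mentioned above, so treat it as an illustration rather than the way you'd do it with MSXML or WSH.  The function names read_csv_accounts and read_xml_accounts are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

CSV_TEXT = "412553,John Doe\n555010,Susan Jones\n"

XML_TEXT = """<useraccounts>
  <useraccount userid="412553" fullname="John Doe"/>
  <useraccount userid="555010" fullname="Susan Jones"/>
</useraccounts>"""


def read_csv_accounts(text):
    """Parse 'userid,fullname' lines into a list of (userid, fullname) tuples."""
    reader = csv.reader(io.StringIO(text))
    # Skip empty rows so a trailing blank line does not raise an IndexError.
    return [(row[0], row[1]) for row in reader if row]


def read_xml_accounts(text):
    """Parse the attribute-style XML (option B) into the same list of tuples."""
    root = ET.fromstring(text)  # builds the whole element tree in memory
    return [
        (node.get("userid"), node.get("fullname"))
        for node in root.findall("useraccount")
    ]


if __name__ == "__main__":
    print(read_csv_accounts(CSV_TEXT))
    print(read_xml_accounts(XML_TEXT))
```

Both functions return the same records.  The difference is in what happens underneath: the CSV reader hands back one row at a time, while ET.fromstring parses the entire document into a tree before you can ask it anything.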



Scenario 2:



Same basic operation, but now you need to gather quite a few more pieces of information for each user account. Now you've been asked to get information such as company, department, phone numbers, email addresses, job title, manager name, group memberships, and so on.



Digression 2:



I will go ahead and say that in this scenario, it might make more sense to use an XML format, with a mix of elements and attributes.  The reason is that you can tie collection-based information together much more "logically" with XML tag nesting and attributes than you can with "flat" text structures.  Consider the following:



Option A - XML Elements and Attributes



<useraccounts>

  <useraccount id="412553">

    <fullname>John Doe</fullname>

    <department>Administration</department>

    ... (more elements)...

    <groups>

      <group name="Domain Users"/>

      <group name="Corp Admin"/>

      <group name="Project Managers"/>

    </groups>

  </useraccount>

  <useraccount id="555010">

    <fullname>Susan Jones</fullname>

    <department>Sales</department>

    ... (more elements) ...

    <groups>

      <group name="Domain Users"/>
      <group name="Corp Sales"/>

    </groups>

  </useraccount>

</useraccounts>



Option B - TXT / CSV format



412553,John Doe,Administration,GROUPS=Domain Users+Corp Admin+Project Managers
555010,Susan Jones,Sales,GROUPS=Domain Users+Corp Sales


Breaking it Down - This is a bit deceptive.  While the TXT option looks more compact, it will become a friggin mess once the list of group memberships gets large or the group names get long.  In addition, you will lose ground in the resource overhead battle once you have to sub-parse each line, break out the groups, and check for special/odd characters in the group names.  I chose an arbitrary syntax, obviously.  You could devise any one of an infinite number of possible ways to format your user accounts within the TXT realm.  However, the payoff comes when you can leverage the XML tools at hand to quickly extract the groups collection from each user.
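Here is a small sketch of what that sub-parsing looks like next to the XML route, again in Python with the standard xml.etree.ElementTree module.  The function names groups_from_xml and groups_from_txt are made up for this example, the XML sample omits the "(more elements)" placeholders, and the TXT parser assumes exactly the arbitrary GROUPS=a+b+c syntax shown above.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<useraccounts>
  <useraccount id="412553">
    <fullname>John Doe</fullname>
    <department>Administration</department>
    <groups>
      <group name="Domain Users"/>
      <group name="Corp Admin"/>
      <group name="Project Managers"/>
    </groups>
  </useraccount>
  <useraccount id="555010">
    <fullname>Susan Jones</fullname>
    <department>Sales</department>
    <groups>
      <group name="Domain Users"/>
      <group name="Corp Sales"/>
    </groups>
  </useraccount>
</useraccounts>"""

TXT_TEXT = """412553,John Doe,Administration,GROUPS=Domain Users+Corp Admin+Project Managers
555010,Susan Jones,Sales,GROUPS=Domain Users+Corp Sales
"""


def groups_from_xml(text):
    """Map each account id to its list of group names using a path query."""
    root = ET.fromstring(text)
    return {
        account.get("id"): [g.get("name") for g in account.findall("groups/group")]
        for account in root.findall("useraccount")
    }


def groups_from_txt(text):
    """Map each account id to its group names by sub-parsing the custom syntax."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first three commas only, leaving the GROUPS field whole.
        userid, _fullname, _department, groups_field = line.split(",", 3)
        label, _, joined = groups_field.partition("=")
        if label != "GROUPS":
            raise ValueError(f"unexpected field in line: {line!r}")
        # Assumes no group name contains '+', which the format cannot guarantee.
        result[userid] = joined.split("+") if joined else []
    return result


if __name__ == "__main__":
    print(groups_from_xml(XML_TEXT))
    print(groups_from_txt(TXT_TEXT))
```

Both functions return the same result for the sample data, but the TXT version only works because nobody has a comma in their name or a plus sign in a group name.  A group called "R+D" would be silently split into "R" and "D", whereas the XML version has no such delimiter problem because each group name sits in its own attribute.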



What if the platform on which you export this data is Windows, but the platform on which it will be imported is UNIX or AIX or Linux or whatever?  You can't count on being able to quickly set up a process for understanding a custom one-off TXT structure/syntax when you don't know it in advance, or when you have to accept it from hundreds of remote and inconsistent platform environments.  As the consumer, you tell the folks providing the data that it must be in a particular format, correct?  I hope so.  After decades of struggling to establish standards, not only within a common organization (employer, government agency, etc.) but between organizations (customer vs supplier, government vs private sector, etc.), the appeal and affordability of XML should be obvious.



Conclusion



So, again, I repeat myself: XML is better in some cases.  TXT is better in some cases.  If the data is not very verbose or complicated, and can be presented well using a "flat" structure, then a TXT format may work very well.  If the data is verbose and interrelated, with nesting or sub-groups, it may work best within an XML structure.  There are a million variables to factor into almost any project or task which involves moving information between disparate environments.  The point is to never knee-jerk and just do everything one way.  Look at each situation, size it up, and make your decision based on the facts and your best judgement.
