Skip to main content

Take It to DE-LIMIT: Winning with Epic Delimiters

In Python you must use either \u0001 (unicode ) or \001 (octal) to get the same SOH character used by SnapLogic, \u001. This is good news for the migration since the output format of the files can stay the same. While this a functional solution, it is not a satisfying solution. The \u001 character is not a character in unicode ( \u0001 or U+0001), UTF-8 (01), octal (\001), or hexidecimal (\x01). This non-character caused unexpected time debugging unicode in unit tests, and eventually some deep reflection. That reflection uncovered a key principle: A non-standard, let alone invalid, approach should have clear benefits to justify issues with integrations or migrations. The reflection also revealed that we rarely think about file delimiters until it's too late.

Is \u001, or another non-standard character, a better field delimiter than the humble comma or tab? Why would \u001 be used as a delimiter? The simplest answer as to why it was used is that the \u001 character is referenced in the SnapLogic documentation. From the docs:

The Delimiter property specifies the character to be used as a delimiter in parsing the CSV data. In case that tab is used as a delimiter, enter "\t" instead of pressing the Tab key. Any Unicode character is also supported.  As of 4.3.2, this property can be an expression, which is evaluated with the values from the pipeline parameters.
Example: \t, \u001
Default value: ,
Format: Character or Unicode

Perhaps it was chosen to save space? All UTF-8 characters will occupy between 1 and 4 bytes on disc. In the case of the SOH character, this is a single byte. Comma and Tab are also single byte characters in UTF-8. At first glance, there is no space benefit to using SOH a delimiter over a comma or tab. However, if \u0001 is used as a the separator and it is not enclosed in quotes, the file will be smaller by the number of quote and escape characters needed. If disc space can be saved by avoiding quotes, then SOH is a clear winner.

Which leads us to the most unlikely of google searches 'best practices for quoting and escaping'. According to documentation from Snowflake and a very good, concise post on the subject best practice is for all fields to be enclosed in quotes if they could contain the delimiter, no matter which delimiter is used. And while unusual, the SOH ( \u0001 ) character could appear in a string field within the file. Therefore those fields should be enclosed by quotes and the quotes should be escaped. The cost of any disc space saved by not enclosing strings is probably offset by any support issues arising from stray characters in the file.

Since all fields should be quoted and escaped, the winning move is to use the humble comma or tab as delimiters. These choices also have the benefit of being easier to work with as standard delimiters are expected by more systems and software. While the standard delimiters should be the go to for any organization a key take away from this research, however, is that there are always exceptions. If you are in a crunch for disc space or network bandwidth using \u0001 is a solid trick you can use to trim bytes.

A key aspect to software engineering is weighing the costs and benefits of various decisions. In this case we need to decide if costs should be optimized by avoiding errors or by reducing disc consumption. Under ideal circumstances these decision points can be identified ahead of time and we make proactive plans. Sometimes we don't realize until after the fact that a decision was made. A great software engineering partner can help bring value based decisioning into the software development process by identifying trade offs rather than blindly following established patterns.

Post by gkontos
April 7, 2022

Comments