Monday, March 26, 2012

Frustrations With Data Conversion Task

It seems to me, after some frustrations, that Destination components (and perhaps the Source components as well) should have included data conversion facilities as part of their designer/UI, since they are the experts on their own data requirements. In particular, for the Destination components, data conversion could have simply been an extension of their record mapping activity.

Presently, I am forced to separately (1) note the datatypes, then (2) convert them in the Data Conversion component, then (3) map them in the Destination component, then (4) cuss... (think of a lot of fields, now and going forward)...

A separate Data Conversion task might still be useful for completeness, but generally data conversion is only meaningful when you are directly accessing a datastore.

I am curious as to why data conversion features were not embedded in the Destination component (e.g. in its superclass). Perhaps I am missing a subtle point.

|||

Please provide more details as to your issue.

Derived columns can convert data, as can the Data Conversion component.

|||

G1 wrote:

Clearly the Destination component and perhaps the source component should offer data conversation as part of their design (ie. GoF expert pattern)

I am curious as to why this was not done?

Can you explain a little more?

What is a data conversation?

What is "the destination component"? There are many destination components.

What is "the source component"? There are many source components.

What does GoF stand for?

-Jamie

|||

See re-edited version (during your reply) above...

|||

G1 wrote:

See re-edited version (during your reply) above...

Can I ask one simple question...are you using the word "conversation" correctly or is it a typo where you actually meant to say "conversion"?

-Jamie

|||

Typo.

I have since edited my post for clarity. Unfortunately, even as I was doing so you were responding to it.

Nonetheless, thanks for your quick responses.

|||

I simply don't understand. Data sources and destinations SHOULD NOT handle data conversion. If that were the case, the whole notion of ETL wouldn't hold water. The idea is to pull from a source, transform the data, and load it into the destination as fast as possible. You don't want to hold up processing into a destination just to convert data.

Use a derived column or the data conversion components to suit your needs.

I don't understand the part about "noting" the datatypes. You already know them, otherwise you wouldn't know that you need to convert them. Personally, I never use the data conversion component. I always cast my fields in a derived column component.
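
For what it's worth, a cast in a Derived Column is just an expression with a type prefix; the column names below are invented for illustration:

    (DT_I4)OrderQty             casts to a four-byte signed integer
    (DT_WSTR,50)CustomerName    casts to a Unicode string of length 50
    (DT_NUMERIC,18,2)UnitPrice  casts to numeric with precision 18, scale 2

Each expression can replace the original column or be added as a new one, much like the converted copies the Data Conversion component produces.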

|||

Phil Brammer wrote:

I simply don't understand. Data sources and destinations SHOULD NOT handle data conversion. If that were the case, the whole notion of ETL wouldn't hold water. The idea is to pull from a source, transform the data, and load it into the destination as fast as possible. You don't want to hold up processing into a destination just to convert data.

Use a derived column or the data conversion components to suit your needs.

I don't understand the part about "noting" the datatypes. You already know them, otherwise you wouldn't know that you need to convert them. Personally, I never use the data conversion component. I always cast my fields in a derived column component.

I agree 100%. The purpose of the source adapter(s) is to make data available to the pipeline for transformation. Nothing else. Each component does something very specific and very atomic; that is what makes it possible to build complicated business logic.

-Jamie

|||

Originally, my angst was with the fact that the Data Conversion component does not provide the destination's field types during its type selection. Without this nice bit of information, I had to flip back and forth to ensure the exact matching required by the Destination component (e.g. Unsigned 1-Byte to Unsigned 2-Byte).
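
(In SSIS type terms, if I have this right, that is a DT_UI1 column that the destination insists on receiving as DT_UI2. The fix is a converted copy, e.g. a Derived Column expression such as (DT_UI2)MyByteColumn, where MyByteColumn is a name invented for illustration, and the destination mapping then has to be pointed at the copy rather than the original.)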

Downstream, another annoyance is that the pre-converted fields remain alphabetically mixed in with the newly converted fields, cluttering the drop-down lists. While having the pre-converted versions available can be useful at times, their presence in this way is a source of mistakes. There should be some other way of managing it all; it is the Data Conversion component, not Copy Column (which I've not tried), after all.

Nevertheless, I am more convinced now that data conversion belongs at the datastore interface. This architecture brings to mind Excel, Word, DataAdapter, Serialization, Paint, etc., all of which read and write a growing list of source/destination formats while using a single internal data format to apply their growing list of functionality.

Incidentally, with respect to Phil Brammer's question, I had not already memorized the data types of an average of 120 fields across 80 tables. :)

I hope this is clearer. Thanks for your replies.

-- George

PS. Don't get me wrong, ETL is a better conceptual model than DTS...by far

|||

G1 wrote:

Originally, my angst was with the fact that the Data Conversion component does not provide the destination's field types during its type selection. Without this nice bit of information, I had to flip back and forth to ensure the exact matching required by the Destination component (e.g. Unsigned 1-Byte to Unsigned 2-Byte).

So you want to be able to see the metadata of the destination when you are defining your data conversions? Yes, I can see why this would be useful if you have a lot of columns. Though I can't see how your original post relates to it. But that's a digression...

Good point well made.

G1 wrote:

Downstream, another annoyance is that the pre-converted fields remain alphabetically mixed in with the newly converted fields, cluttering the drop-down lists. While having the pre-converted versions available can be useful at times, their presence in this way is a source of mistakes. There should be some other way of managing it all; it is the Data Conversion component, not Copy Column (which I've not tried), after all.

Yes, many people find the ordering of columns in the UIs annoying. Answer me this though. What is the correct ordering? This is the sort of information that Microsoft would want, and if you ask ten people you might get ten different answers. See the problem here?

G1 wrote:

Nevertheless, I am more convinced now that data conversion belongs at the datastore interface. This architecture brings to mind Excel, Word, DataAdapter, Serialization, Paint, etc., all of which read and write a growing list of source/destination formats while using a single internal data format to apply their growing list of functionality.

Comparing SSIS with those tools is not comparing apples with apples. It's an irrelevant comparison, in my opinion.

And Paint? What has Paint got to do with this?

G1 wrote:

Incidentally, with respect to Phil Brammer's question, I had not already memorized the data types of an average of 120 fields across 80 tables

I hope this is clearer. Thanks for your replies.

-- George

PS. Don't get me wrong, ETL is a better conceptual model than DTS...by far

DTS is an ETL tool. SSIS is an ETL tool. I think what you meant to say was "SSIS is a better conceptual model than DTS"

-Jamie

|||

"Yes, many people find the ordering of columns in the UIs annoying. Answer me this though. What is the correct ordering?"

In this case, I don't think there is such a thing as "correct" ordering. However, I would like a richer field list. For example: (1) multiple columns in the field list showing attributes such as data type, size, and record sequence; (2) automatic column sizing; (3) user-selectable column sorting; and (4) automatic hiding of pre-converted fields, possibly as expandable entries beneath their converted selves, similar to tags in an XML viewer. In any event, the current field listing ignores the need for such flexibility.

"Comparing SSIS with those tools (Excel, Word, DataAdapter, Serialization, Paint) is not comparing apples with apples. its an irrelevant comparison in my opinion."

The similarity of SSIS to these tools is that they all import, manipulate, and export their data. Like them, SSIS extracts/imports and loads/exports a variety of datatypes and formats, and once the data is imported it permits a variety of manipulations on it, independent of the source datatypes. For example, Char(N) and NText are both treated as strings, and 1-Byte Signed and 4-Byte Signed are both treated as integers. In fact, in most cases the underlying datatype is only relevant at the interface to the datastore.
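
To make the single-internal-format point concrete, the default SQL Server to pipeline type mappings, as best I recall them, run along these lines:

    tinyint      -> DT_UI1
    smallint     -> DT_I2
    int          -> DT_I4
    char(n)      -> DT_STR
    nvarchar(n)  -> DT_WSTR
    ntext        -> DT_NTEXT

Once inside the pipeline, the narrow and wide integer types accept the same arithmetic and comparison expressions; the exact width only matters again when mapping to the destination.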

There may be a benefit to having a distinct Data Conversion component; however, the current one could be refined. As well, the Data Destination component (and possibly the Data Source component) should be enhanced to include an embedded instance of their own version of this improved Data Conversion component.

George.

|||

G1 wrote:

There may be a benefit to having a distinct Data Conversion component; however, the current one could be refined. As well, the Data Destination component (and possibly the Data Source component) should be enhanced to include an embedded instance of their own version of this improved Data Conversion component.

George.

No, I'll go down to my grave disagreeing with you on this point. See my posts above for why. Jamie's too.

|||

Phil Brammer wrote: "I simply don't understand. Data sources and destinations SHOULD NOT handle data conversion. If that were the case, the whole notion of ETL wouldn't hold water. The idea is to pull from a source, transform the data, and load it into the destination as fast as possible. You don't want to hold up processing into a destination just to convert data."

I agree with you up to a point.

The thing is, conceptual separation and physical separation are two different things. Physical proximity is often convenient, which is the point of the proposed embedded data conversion component. Performance, however, is not a concern, since the conversion component would be created only as needed.

On the other hand, the availability of conversion within the datastore access components could increase ease of use, as mentioned previously, and even efficiency by, among other things, making it easier to limit the collection of data types that participate in the transformation process proper. In particular, it would allow a pattern of use wherein exported and/or pre-loaded data is converted at the interface to the datastore, instead of arbitrarily throughout the transformation process. In any event, I expect most existing conversions already occur just before or after such datastore access, or would if it weren't for the Data Conversion component's field-clutter issue mentioned previously.
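
As a rough sketch of the pattern I have in mind (the component arrangement is generic, not a prescription):

    OLE DB Source
      -> Data Conversion    (normalize incoming types once, up front)
      -> Derived Column / Lookup / business transforms
      -> Data Conversion    (match the destination's exact types)
      -> OLE DB Destination

as opposed to sprinkling casts inside every transform along the way.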

With respect to the pursuit of performance, one of the things I've come to appreciate is that "ease of use" SHOULD be part of the performance equation. Often, performance as the naked goal results in tools that are conceptual nightmares (not the case with SSIS); like a sharp knife without a handle, such components/systems become a pain to use, apply, and refine.

As for going down to your grave disagreeing, go ahead. :) Your way may be equally right. Yes, you heard me. We both may have equally viable, self-consistent frameworks or conceptual models justifying our tastes, expectations, and so forth.

George.
