An EncodingTranscoder Custom Pipeline Component

In a previous installment, I wrote about one of the simplest types of custom pipeline components, one that only dealt with the context of the message. Sometimes, however, it is necessary to process the contents of the message itself as it flows through the pipeline. You may want to inspect the contents of the message in order to make some decisions, for instance. Or you may need to alter the contents of the message, say to convert it from one format to another.

In this post, I’ll present one such component, whose purpose is to convert the incoming message from one encoding to another. This might be useful, for instance, if you receive documents from multiple trading partners in several countries, each of which sends a message using its default Windows ANSI codepage. It can also be useful to process messages coming from legacy systems that send messages in specific character sets such as EBCDIC, for instance.

The component we’ll build will accept two parameters of type System.Text.Encoding to represent the formats the source and target messages are encoded with.

One important thing to keep in mind is that a custom component must be designed to deal with arbitrarily large messages. This means that we must make sure that the memory requirements for using a component in a production environment are kept flat and do not depend upon the size of the incoming messages.

Also, for efficiency reasons, we must absolutely avoid designs that would require the incoming message to be read more than necessary. Ideally, the message should be read in its entirety only once as the bytes are shoved from the adapter to each component in the pipeline and eventually to the Message Box.

Therefore, we must incorporate in our design a method for converting the bytes of the incoming message to the target Encoding on the fly, and only when actually required by downstream components.

A Custom TranscodingStream

As I mentioned in a previous article, BizTalk is built from scratch with support for streaming arbitrarily large messages. Therefore our component will need to deal extensively with the System.IO.Stream class, one of the most important classes to be familiar with when building custom components.

In fact, it is customary for pipeline components to wrap the original data stream of incoming messages inside a custom class that derives from System.IO.Stream, so that custom processing can be done upon read requests from downstream components.

So we’ll build a custom Stream class whose job is to perform the conversion on the fly, as necessary. For this purpose, we’ll use one instance each of the System.Text.Decoder and System.Text.Encoder classes from the .NET Framework. These classes are ideally suited to our scenario: a Decoder converts chunks of bytes in a given Encoding to characters, an Encoder converts characters back to bytes, and both keep track of any partial sequence left over between calls.
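To see why a stateful Decoder matters when data arrives in chunks, here is a minimal sketch that is not part of the component itself. The DecoderStateDemo class and its DecodeInTwoChunks helper are hypothetical names introduced for illustration; only the System.Text calls come from the framework. It decodes a UTF-8 buffer in two calls, deliberately split in the middle of a two-byte character.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only
public static class DecoderStateDemo
{
    // Decodes the input in two separate calls, split at the given index,
    // reusing the same Decoder so that a partial sequence carries over.
    public static string DecodeInTwoChunks(byte[] bytes, int splitAt, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        StringBuilder result = new StringBuilder();
        char[] chars = new char[encoding.GetMaxCharCount(bytes.Length)];

        // first chunk: flush is false, so an incomplete sequence is kept for later
        int n = decoder.GetChars(bytes, 0, splitAt, chars, 0, false);
        result.Append(chars, 0, n);

        // second chunk: flush is true because this is the end of the data
        n = decoder.GetChars(bytes, splitAt, bytes.Length - splitAt, chars, 0, true);
        result.Append(chars, 0, n);

        return result.ToString();
    }

    public static void Main()
    {
        // "café": the last character takes the two bytes C3 A9 in UTF-8
        byte[] bytes = Encoding.UTF8.GetBytes("caf\u00e9");

        // split between C3 and A9
        string text = DecodeInTwoChunks(bytes, 4, Encoding.UTF8);
        Console.WriteLine(text == "caf\u00e9");
    }
}
```

Calling Encoding.GetString on each chunk separately would not give the same result, because the two halves of the character would be decoded independently.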

Without further ado, let’s dive into the code!

    using System;
    using System.Collections.Generic;
    using System.IO;

    public class TranscodingStream : System.IO.Stream
    {
        private System.Text.Encoding sourceEncoding_;
        private System.Text.Encoding targetEncoding_;

        private System.Text.Decoder decoder_;
        private System.Text.Encoder encoder_;

        private System.IO.Stream stream_;

First, the canonical declaration of a class that derives from System.IO.Stream. Nothing fancy here. We set up a few members to hold on to the source and target Encodings, as well as one instance of each of the key Decoder and Encoder classes we just talked about. Obviously, we also need to keep track of the original wrapped Stream, since this is the one we will read the data from.

        #region System.IO.Stream Overrides

        public override bool CanRead
        {
            get { return stream_.CanRead; }
        }

        public override bool CanSeek
        {
            // the transcoded stream is forward-only, whatever the wrapped stream supports
            get { return false; }
        }

        public override bool CanWrite
        {
            get { return false; }
        }

        public override long Length
        {
            // length of the wrapped stream, in source-encoded bytes
            get { return stream_.Length; }
        }

        public override long Position
        {
            get
            {
                // position within the wrapped stream
                return stream_.Position;
            }
            set
            {
                throw new NotSupportedException();
            }
        }

        public override void SetLength(long value)
        {
            throw new NotSupportedException();
        }

        public override long Seek(long offset, SeekOrigin origin)
        {
            throw new NotSupportedException();
        }

        public override void Flush()
        {
            // nothing to flush on a read-only stream
        }

        public override void Write(byte[] buffer, int offset, int count)
        {
            throw new NotSupportedException();
        }

	...

        #endregion

These are basic methods and stubs whose sole purpose in life is to make our Stream class well behaved. All of them are straightforward, since we are creating a read-only, forward-only, non-seekable stream class, which is sufficient for our purposes.

Next, comes the constructor.

        #region Construction

        public TranscodingStream(System.IO.Stream stream, System.Text.Encoding sourceEncoding, System.Text.Encoding targetEncoding)
        {
            stream_ = stream;

            sourceEncoding_ = sourceEncoding;
            targetEncoding_ = targetEncoding;

            decoder_ = sourceEncoding_.GetDecoder();
            encoder_ = targetEncoding_.GetEncoder();
        }

        #endregion

You’ll notice that among the obvious bookkeeping, the constructor extracts an instance of the Decoder class from the source Encoding, as well as an instance of the Encoder class from the target Encoding, both of which have been passed as arguments.

Up to this point, we’ve mostly seen boring code. The most interesting part, however, is the Read method, which holds the real meat of the code and performs the actual conversion as requested.

        private const int BUFFER_SIZE = 4096;
        private List<byte> remaining_bytes = new List<byte>();

        public override int Read(byte[] buffer, int offset, int count)
        {
            // prepends any remaining bytes to the encoded buffer

            List<byte> encoded_bytes = new List<byte>(remaining_bytes);
            remaining_bytes.Clear();

            // the chunk buffer is allocated once per call rather than once per chunk

            byte[] raw_bytes = new byte[BUFFER_SIZE];

            while (encoded_bytes.Count < count)
            {
                // read one chunk from the underlying stream

                int read_count = stream_.Read(raw_bytes, 0, raw_bytes.Length);

                // at the end of the stream, flush any state still held by
                // the decoder and the encoder (e.g. a truncated sequence)

                bool flush = (read_count == 0);

                // decode the chunk based on the source encoding

                int char_count = decoder_.GetCharCount(raw_bytes, 0, read_count, flush);
                char[] chars = new char[char_count];
                decoder_.GetChars(raw_bytes, 0, read_count, chars, 0, flush);

                // encode characters into an encoding buffer based on the target encoding

                int encoded_count = encoder_.GetByteCount(chars, 0, char_count, flush);
                byte[] encoding_bytes = new byte[encoded_count];
                encoder_.GetBytes(chars, 0, char_count, encoding_bytes, 0, flush);

                if (encoded_count > 0)
                    encoded_bytes.AddRange(encoding_bytes);

                if (flush)
                    break;
            }

            // copy the encoded buffer to the requested output buffer

            int output_count = Math.Min(encoded_bytes.Count, count);

            encoded_bytes.CopyTo(0, buffer, offset, output_count);
            encoded_bytes.RemoveRange(0, output_count);

            // keep whatever was not returned for the next call

            remaining_bytes.AddRange(encoded_bytes);
            return output_count;
        }

One of the gotchas of transcoding text is that the number of bytes required to encode a given chunk of text depends on the encoding used. In UTF-8, for instance, a character may be encoded as a sequence of up to 4 bytes, so each character in the Stream can take up a different number of bytes. Some Encodings, however, use a fixed number of bytes for every character.
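As a quick illustration of that point, and not as part of the component, the following sketch counts the bytes needed for the same five-character string in three encodings. ByteCountDemo and CountBytes are hypothetical names; the expected values in the comments follow from the definitions of the three encodings.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only
public static class ByteCountDemo
{
    // Number of bytes needed to encode the text in the given encoding
    public static int CountBytes(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string text = "h\u00e9llo";   // "héllo", 5 characters

        // ISO-8859-1: one byte per character
        Console.WriteLine(CountBytes(text, Encoding.GetEncoding("iso-8859-1"))); // 5

        // UTF-8: one byte per ASCII character, two for "é"
        Console.WriteLine(CountBytes(text, Encoding.UTF8));                      // 6

        // UTF-16: two bytes for each of these characters
        Console.WriteLine(CountBytes(text, Encoding.Unicode));                   // 10
    }
}
```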

So in the general case, it might be necessary to read more bytes from the original stream than are actually requested by the code calling the Read method. Therefore, the class holds in the remaining_bytes member any bytes that have already been transcoded from the original Stream but have not yet been requested by the calling code.

Each time an attempt to read from our Stream is made, the Read method performs the following steps:

  • The method prepares a local buffer, whose purpose is to hold as many encoded bytes as requested by the calling code. This buffer is first filled with any remaining bytes that have been encoded as part of a previous run, but not actually returned to the caller.

  • If the local buffer of encoded bytes is not sufficiently filled up to satisfy the request, more bytes will need to be read from the original stream.

  • In that case, a whole chunk of raw_bytes is read from the original stream.

  • Those bytes are decoded using the specified source Encoding. First, we ask how many characters the Decoder will be able to return, and then actually decode that many characters.

  • The method then tries to convert those characters back to a sequence of encoded bytes, using the specified target Encoding. First, we ask how many bytes the Encoder will be able to produce out of those characters and then actually encode that many bytes.

  • The current chunk of encoded bytes is appended to the local encoded buffer.

  • This is repeated until either the original stream has been read fully, or the local encoded buffer is sufficiently filled up to be returned to the calling code.

  • In that case, we return at most the number of bytes requested to the caller. As we said, it is possible that more bytes than necessary have actually been transcoded. We save those bytes in the remaining_bytes buffer, to be able to use them in a subsequent call.
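The decode-then-encode steps listed above can also be tried in isolation. The sketch below is a simplified, standalone variant rather than the TranscodingStream class itself: ChunkedTranscodeDemo and its Transcode method are hypothetical names, it pushes data to an output stream instead of serving Read requests, and it uses a deliberately tiny chunk size so that multi-byte characters straddle chunk boundaries.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, for illustration only
public static class ChunkedTranscodeDemo
{
    // Transcodes input to output one chunk at a time, with buffers
    // whose size depends only on chunkSize, not on the message size.
    public static void Transcode(Stream input, Stream output,
        Encoding source, Encoding target, int chunkSize)
    {
        Decoder decoder = source.GetDecoder();
        Encoder encoder = target.GetEncoder();

        byte[] raw = new byte[chunkSize];
        char[] chars = new char[source.GetMaxCharCount(chunkSize)];
        byte[] encoded = new byte[target.GetMaxByteCount(chars.Length)];

        bool done = false;
        while (!done)
        {
            int read = input.Read(raw, 0, raw.Length);
            done = (read == 0);   // flush both converters on the final pass

            int charCount = decoder.GetChars(raw, 0, read, chars, 0, done);
            int byteCount = encoder.GetBytes(chars, 0, charCount, encoded, 0, done);
            output.Write(encoded, 0, byteCount);
        }
    }

    public static void Main()
    {
        string text = "na\u00efve \u20ac 42";
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        // 3-byte chunks split the multi-byte characters across reads
        MemoryStream output = new MemoryStream();
        Transcode(new MemoryStream(utf8), output, Encoding.UTF8, Encoding.Unicode, 3);

        string roundTrip = Encoding.Unicode.GetString(output.ToArray());
        Console.WriteLine(roundTrip == text);
    }
}
```

The pull-based Read method needs the extra remaining_bytes bookkeeping precisely because, unlike this push-based loop, it cannot hand the caller more bytes than were asked for.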

For completeness, we also need to override the ReadByte method, so that it has a chance to return any remaining bytes before performing an actual read on the original Stream.

        public override int ReadByte()
        {
            if (remaining_bytes.Count == 0)
                return base.ReadByte();

            int next_byte = remaining_bytes[0];
            remaining_bytes.RemoveAt(0);
            return next_byte;
        }

That’s it. This useful class provides an efficient yet very simple way to transcode text on the fly from one encoding to another. Let’s now put this class to real use.

On to the Pipeline Component Itself

By now, you’re probably familiar with our base class that abstracts away the tedium of writing the boilerplate code necessary to create a custom pipeline component. So let’s create a new C# Class Library project and add a reference to the PipelineComponentBase assembly.

Then, let’s write the basic class declaration.

[ComponentCategory(CategoryTypes.CATID_PipelineComponent)]
[ComponentCategory(CategoryTypes.CATID_Encoder)]
[ComponentCategory(CategoryTypes.CATID_Decoder)]
[Guid("00000000-0000-0000-0000-000000000000")]
public class EncodingTranscoder : PipelineComponentBase, Microsoft.BizTalk.Component.Interop.IComponent
{
    private System.Text.Encoding sourceEncoding_ = System.Text.Encoding.GetEncoding(1252);
    private System.Text.Encoding targetEncoding_ = new System.Text.UTF8Encoding(false);
    ...
}

Obviously, our component accepts two parameters, of type System.Text.Encoding. These will be passed on to the custom TranscodingStream class as part of the pipeline processing.

Do not forget to replace the empty Guid attribute with a real one, using such tools as Guidgen, for instance.

Specifying Design-Time Properties

I chose to expose a textual representation of the Encoding classes at this stage. I could have opted to perform the conversion upon saving or retrieving the properties from the PropertyBag. It would have worked equally well either way.

[System.ComponentModel.Editor(typeof(UI.EncodingDropDownEditor), typeof(UITypeEditor))]
public string SourceEncoding
{
    get { return FormatEncoding(sourceEncoding_); }
    set { sourceEncoding_ = ParseEncoding(value); }
}

[System.ComponentModel.Editor(typeof(UI.EncodingDropDownEditor), typeof(UITypeEditor))]
public string TargetEncoding
{
    get { return FormatEncoding(targetEncoding_); }
    set { targetEncoding_ = ParseEncoding(value); }
}

Notice that both properties have been decorated with an Editor attribute, that instructs the Visual Studio property editor as to how we want those properties to be displayed to and manipulated by the user. We’ll come to this a bit further down in this post.

The following code snippets show how the component parses or formats the textual representation of an Encoding:

#region Implementation

private static System.Text.Encoding ParseEncoding(string encoding)
{
    string pattern =
            @"^(?:(?<codepage>[0-9]+)|(?<body>[A-Za-z\-_0-9]+))$|" +
            @"^(?<name>.*)\ *" +
            @"(?:(?:\((?<codepage>[0-9]+)\))|" +
            @"(?:\((?<body>[A-Za-z\-_0-9]+)\)\ *))$";

    Match match = Regex.Match(encoding
                        , pattern
                        , RegexOptions.Singleline
                        | RegexOptions.IgnoreCase
                        | RegexOptions.IgnorePatternWhitespace);

    if (match.Success)
    {

        string codepage = match.Groups["codepage"].Value;
        string bodyname = match.Groups["body"].Value;

        if (String.IsNullOrEmpty(codepage))
            return System.Text.Encoding.GetEncoding(bodyname);

        else
        {
            int cp = 0;
            if (Int32.TryParse(codepage, out cp))
                return System.Text.Encoding.GetEncoding(cp);
        }
    }

    // the second argument is the name of the offending parameter, not its value
    throw new ArgumentException(String.Format("Unable to parse the specified encoding string. \"{0}\" is not a valid encoding name.", encoding), "encoding");
}

private static string FormatEncoding(System.Text.Encoding encoding)
{
    return String.Format("{0} ({1})", encoding.EncodingName, encoding.CodePage);
}

#endregion
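
Both helpers rest on a handful of System.Text.Encoding members. The following standalone sketch is not part of the component; EncodingLookupDemo, Describe and IsKnownName are hypothetical names. It shows that an encoding can be looked up either by code page number or by name, and that an unknown name is reported with an ArgumentException.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only
public static class EncodingLookupDemo
{
    // Builds the same "<display name> (<code page>)" form as FormatEncoding
    public static string Describe(Encoding encoding)
    {
        return String.Format("{0} ({1})", encoding.EncodingName, encoding.CodePage);
    }

    // Returns true when the framework recognises the given encoding name
    public static bool IsKnownName(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Encoding byCodePage = Encoding.GetEncoding(65001);
        Encoding byName = Encoding.GetEncoding("utf-8");

        // both lookups resolve to the same code page
        Console.WriteLine(byCodePage.CodePage == byName.CodePage);

        // the display name part may vary with the framework and its language
        Console.WriteLine(Describe(byName));

        Console.WriteLine(IsKnownName("utf-8"));
        Console.WriteLine(IsKnownName("no-such-encoding"));
    }
}
```

Because the display name is not guaranteed to be identical everywhere, the trailing code page in parentheses is the reliable part of the textual representation.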

Serializing Design-Time Properties

Serializing the properties is a simple matter of storing the design-time properties in the PropertyBag supplied by Visual Studio (at design time) or the BizTalk engine (at runtime).

#region IPersistPropertyBag Overrides

public override void Load(IPropertyBag propertyBag, int errorLog)
{
    string name;

    // only per-instance configuration data overwrite default values

    name = (string)ReadProperty(propertyBag, "SourceEncoding", errorLog, "Western European (Windows) (1252)");
    if (!String.IsNullOrEmpty(name))
        SourceEncoding = name;

    name = (string)ReadProperty(propertyBag, "TargetEncoding", errorLog, "Unicode (UTF-8) (65001)");
    if (!String.IsNullOrEmpty(name))
        TargetEncoding = name;

    // load other default properties

    base.Load(propertyBag, errorLog);
}

public override void Save(IPropertyBag propertyBag, bool clearDirty, bool saveAllProperties)
{
    if (saveAllProperties || SourceEncoding != "Western European (Windows) (1252)")
        WriteProperty(propertyBag, "SourceEncoding", SourceEncoding);
    if (saveAllProperties || TargetEncoding != "Unicode (UTF-8) (65001)")
        WriteProperty(propertyBag, "TargetEncoding", TargetEncoding);

    base.Save(propertyBag, clearDirty, saveAllProperties);
}

#endregion

In the code above, I have chosen specific default values for the source and target encodings. These values are often encountered in the scenarios in which I would want to use the custom component. Any other sensible values will do.

Providing Enhanced Experience in the Visual Studio property grid

As we’ve said earlier, the Visual Studio property grid only knows how to display and manipulate simple scalar types, such as strings, integers and booleans. It also includes basic support for editing collections of simple types.

For more complex types, explicit support has to be added to the component.

In our case, the component will present the source and target encoding parameters as drop-down lists to the Visual Studio property grid. In order to do that, we must first create the appropriate list:

public class EncodingDropDownList : ListBox
{
    public EncodingDropDownList()
    {
        List<String> encodings = new List<String>();

        foreach (EncodingInfo info in Encoding.GetEncodings())
            encodings.Add(String.Format("{0} ({1})", info.DisplayName, info.CodePage));

        encodings.Sort();
        Items.AddRange(encodings.ToArray());
    }
}

This is a simple ListBox-derived class that contains the sorted list of all available encodings as their textual representations.

Next, we need to include a Custom Editor:

public class EncodingDropDownEditor : UITypeEditor
{
    public override UITypeEditorEditStyle GetEditStyle(ITypeDescriptorContext context)
    {
        if (context == null || context.Instance == null)
            return base.GetEditStyle(context);

        return UITypeEditorEditStyle.DropDown;
    }

    public override object EditValue(ITypeDescriptorContext context, IServiceProvider provider, object value)
    {
        IWindowsFormsEditorService editorService;

        if (context == null || context.Instance == null || provider == null)
            return value;

        editorService = (IWindowsFormsEditorService)
               provider.GetService(typeof(IWindowsFormsEditorService));

        if (editorService == null)
            return value;

        EncodingDropDownList control = new EncodingDropDownList();
        string encoding = (string)value;

        // preselect the current value; SelectedValue only applies to data-bound lists
        control.SelectedItem = encoding;

        control.Click += delegate(object sender, EventArgs e)
        {
            // ignore clicks that do not land on an item
            if (control.SelectedIndex >= 0)
                encoding = (string)control.Items[control.SelectedIndex];
            editorService.CloseDropDown();
        };

        // drop the list down and wait for user selection

        editorService.DropDownControl(control);
        return encoding;
    }
}

The GetEditStyle method instructs the property grid that we want to use a drop-down list to edit the properties. This will make Visual Studio display a small dropdown icon when the corresponding property is clicked inside the property grid.

The EditValue method is triggered when a user actually clicks the small dropdown icon in the property grid. In that case, we create a new ListBox in the form of our custom EncodingDropDownList class. Then, we wire a new anonymous event handler to the Click event, so that the selected value can be retrieved.

Then, we display the drop-down list to the user so that a new selection can be made. This is a blocking call, and it will wait until the user either selects a new value or dismisses the drop-down list by pressing the ESC key. When the user selects a new value from the list, the anonymous delegate will capture its value and store it inside a local variable.

At this stage, the potentially new selected value is returned to Visual Studio, which will then update the value of the corresponding design-time property.

Please note that the preceding code requires that references to the System.Drawing and System.Windows.Forms assemblies be added to the Visual Studio project.

Microsoft publishes a document titled Understanding Design-Time Properties for Custom Pipeline Components in BizTalk Server. Please refer to this excellent whitepaper for more information on how to create custom editors and type converters for your properties.

Wiring-up the TranscodingStream to the Incoming Message

With all this code out of the way, we can now actually code the behaviour of our pipeline component.

As part of the pipeline processing, the following method will be called by the Messaging Agent.

IBaseMessage IComponent.Execute(IPipelineContext pContext, IBaseMessage pInMsg)
{
    if (!Enabled)
        return pInMsg;

    WriteEventLog(Resources.EncodingTranscoder_StartExecute
        , pContext.PipelineName
        , CategoryName(pContext.StageID)
        );

    if (pInMsg == null)
    {
        WriteErrorLog(Resources.EncodingTranscoder_InvalidInMsg);
        return pInMsg;
    }

    // assign a new TranscodingStream to the incoming message

    System.IO.Stream stream = pInMsg.BodyPart.GetOriginalDataStream();
    System.IO.Stream transcodeStream = new TranscodingStream(stream, sourceEncoding_, targetEncoding_);

    // return the message for downstream pipeline components (further down in the pipeline)

    pInMsg.BodyPart.Data = transcodeStream;
    pContext.ResourceTracker.AddResource(transcodeStream);

    return pInMsg;
}

Notice that the Execute method is extremely simple, and does not actually manipulate the contents of the message at all. As far as this method is concerned, its only job when invoked as part of the pipeline processing is to substitute the original stream from the incoming message with a TranscodingStream, that wraps the original stream.

It’s only later on, when all components in the pipeline have executed, that a request from the Messaging Agent to read from the stream will trigger the conversion of the message from the source Encoding to the target Encoding.

That is why the Execute method registers the substituted stream with the BizTalk Resource Tracker, so that it is not garbage collected too soon and will still be able to perform its job when required.

The code shown above is a canonical example of an Execute method, albeit with a specific System.IO.Stream-derived class this time.
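
The deferred nature of this pattern is easy to observe outside BizTalk. The sketch below is purely illustrative: CountingStream is a hypothetical pass-through wrapper, not the TranscodingStream class, and no BizTalk API is involved. It shows that wrapping a stream triggers no read at all, and that the work only happens once a consumer pulls data.

```csharp
using System;
using System.IO;

// Hypothetical pass-through wrapper that counts Read calls
public class CountingStream : Stream
{
    private readonly Stream inner_;
    public int ReadCalls;

    public CountingStream(Stream inner) { inner_ = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        ReadCalls++;
        return inner_.Read(buffer, offset, count);
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

public static class LazyWrapDemo
{
    public static void Main()
    {
        // the equivalent of what Execute does: wrap, but do not read
        CountingStream wrapped = new CountingStream(new MemoryStream(new byte[] { 1, 2, 3 }));
        Console.WriteLine(wrapped.ReadCalls);   // 0: wrapping alone reads nothing

        // the equivalent of a downstream consumer draining the stream
        byte[] buffer = new byte[16];
        while (wrapped.Read(buffer, 0, buffer.Length) > 0) { }
        Console.WriteLine(wrapped.ReadCalls);   // 2: one read with data, one at end of stream
    }
}
```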

Wrapping It Up Together

That’s it for the code.

You can now add a nice custom icon named MyIcon.ico in the project’s Resources subfolder and compile the project. Once registered as a pipeline component, you will be able to use it inside a new pipeline project.


2 Responses to An EncodingTranscoder Custom Pipeline Component

  1. Patrick says:

    Maxime, this is another great post explaining a lot in great detail. Would that custom editor also pop up when using "per instance" pipeline configuration in the BizTalk Management console?

  2. Maxime says:

    Unfortunately no, Patrick, it would not. However, stay tuned, because that is precisely the subject for next week’s post.
