The long journey through Azure Services for XML transformation and its lessons

The long journey through Azure Services for XML transformation and its lessons

As foryouand-
yourcustomers, we help you implement, maintain and enhance the required data logistics.

As Data Engineers at for​you​and​your​cus​tom​ers in Munich we worked with a customer on his project to transform a huge set of different XMLs into one giant combined XML and replace the current Talend-based solution. It was a journey full of suprises and lessons we like to share, because you might experience them with XMLs too.

Product data consolidation helps ensure the consistency and accuracy of product information across all digital channels. This can increase customer trust and satisfaction, as well as improve search engine optimization (SEO) and conversion rates. By having a centralized and up-to-date product data repository, companies can streamline their operations, reduce manual errors and save time, leading to cost savings and increased efficiency. Moreover, having a single source of truth for product information can enable data-driven decision making and enable companies to better understand customer needs and preferences. At foryouandyourcustomers, we help you implement, maintain and enhance the required data logistics.

I want to share my experience with the struggle and fun to explore a way to replace the existing Talend solution for one of our customers  from the industrial manufacturing industry. I will share technical details for interested audiences to avoid the same issues we faced and see, how we did not give up.  

The increasing number of transformations and resource demands of multiple layers of SQL tables, procedures, queries, and tMaps  had destabilized the system and significantly slowed down the processes. Azure and its data factory emerged as a saviour, offering a scalable and cost-effective cloud solution. Adoption of a more appropriate NoSQL structure instead of a flat SQL table design helped to streamline in-process data field mapping and simplify the development process by leveraging the intuitive visual components of the Azure Data Factory, reducing the learning curve and effort required of developers.

With a clear objective in mind, we worked tirelessly, pushing through the difficulties and celebrating their achievements along the way. Our hard work paid off, and the end result was a system that was stable, efficient, and able to meet the growing demands of our customer’s business.

A quick project initialisation

The initial phase of the project was characterized by its relative simplicity, as demonstrated by the following key steps:

  1. Establishment of a blob storage repository for the transfer of XML files.
  2. Configuration of a CosmosDB to store the imported data. Given the size limitations of CosmosDB, we decided to divide each XML document into its distinct elements for storage in the database.
  3. Development of two critical procedures: a) Import of XML into CosmosDB, and b) extraction and formatting of data from CosmosDB for output in the desired XML format. These procedures formed the foundation for the successful execution of the project and paved the way for further progress.

The first issue appeared when we tried to create CosmosDB containers. The solution was to just retry later. There seemed to be some synchronisation issues on Azure’s side and it took some time until the restrictions were updated.

The initial journey was nothing short of remarkable. With the power of Azure Data Factory (ADF) at our fingertips, we were able to make  big strides with ease. The Copy-Step feature provided a simple yet effective way to configure our Blob storage as a source. The conversion of XML files into JSON was seamless, thanks to the standard transformation of converting between XML and JSON. The definition of the hierarchy level, to be used as a basis for element iteration from XML, and a mapping of the first level took us only a few hours to accomplish.

However, it was not all sunshine and rainbows, as we faced challenges along the way. The ADF Editor proved to be a hindrance, requiring us to set the advanced mode active to access all the required functions. This led to a small setback, as it took some time to figure this out. Nonetheless, our determination and expertise saw us through and we were able to overcome this obstacle.

For those interested in the details, the following file lists the individual steps:

Filling the CosmosDB was a smooth experience and it was done in no time with the final pipeline. This marked the beginning of our next phase of the project.

The mighty NoSQL querys of CosmosDB

As we progressed with mapping the source to the target structure, however, we encountered a new challenge . Our data structure consisted of at least two iterable layers, but the data manipulation editor could only access the first layer. A simple flattening step from Azure Data Factory wasn’t enough to access all the necessary data.

Objects can only contain one more complex layer.

We decided to tackle this issue by using self-joining. Instead of simply reading the data from CosmosDB as a container request, we ran a query that could flatten the data to any desired level. This allowed us to extract different hierarchies and separate streams and join them by their keys, effectively resolving the data hierarchy challenge.

After the first setup was complete, we adapted two more XML file formats, but during debugging sessions, we encountered another problem. The Azure Data Factory’s conversion of XML to JSON determines the JSON format for each file.

This can  lead to an inconsistent data structure if XML does not provide information about its structure, especially whether a level is iterable or not. The solution was to generate specific queries, one with a self-join and the other with a simple mapping, and then joining both streams.

Azure and its bugs

We prepared the final outcome of the joined streams for aggregation, with the output being in JSON format. However, yet another challenge arose, when we discovered that Azure Data Flows cannot export XMLs. Nevertheless, this roadblock was not the end, as a more pressing issue presented itself during debugging. Despite appearing to be fine, the data flow did not generate any results during the debugging process. We attempted to resolve this by embedding the data flow into a pipeline, yet its execution failed unexpectedly. It took several days of analysis and interpretation of the error message to finally uncover the root of the problem:

You can easily spot the issue is ‘@pos’, but we couldn’t get rid of this.

At this stage in the process, we had encountered a significant obstacle that threatened to derail all of our progress thus far. The issue arose as a result of Azure Data Flows unable to interpret structures that contained the “@”character. The problem was rooted in the auto-conversion of XMLs, which resulted in the addition of “@” characters to all tag attributes, including deeper hierarchy levels. The copy-step mapping in ADF could not set keys for these deeper levels and there was no process in place to replace the “@” character after the fact.

Summary of reproduction:

  1. Azures processes add for repeating XML-Elements “@” character to all their attributes when converting to JSON.
  2. Azure does not provide a way to change this behaviour.
  3. This JSON is stored in CosmosDB.
  4. Azure requires the data structure of the CosmosDB container to create a DataFlow.
  5. Azure generates a DataFlow description containing “@” characters in the data structure definition.
  6. Azure has no way to escape the “@” character.
  7. Azure can execute such DataFlow in debug mode, but not in a pipeline.

This wasn’t a trivial problem, and it took us several days of interpreting error messages and debugging before we realized what was happening. We created a support ticket with Azure in the hopes that they would be able to resolve the issue. After waiting several weeks for a solution, it realised that we would have to find a workaround.

The situation was further complicated by the fact that CosmosDB, where we stored our data, was also affected by the “@” symbol issue. Azure requires the data structure of the CosmosDB container in order to create a DataFlow, and the ADF generates a DataFlow description that contains “@” characters in the structure definition. Furthermore, Azure has no way of escaping these characters, meaning that the DataFlow can be executed in debug mode, but not in a pipeline.

In light of these challenges, we were faced with a difficult decision. We could not use queries to resolve the issue, as they would also contain the “@” symbol. Our options were limited, but we ultimately decided to try changing the import procedure to bypass the CosmosDB, opting for a custom procedure in Java or Python. While this solution would make the process more complicated, it would allow us to keep things as simple and generic as possible.

Azure Logic Apps for data transformation

As we delved deeper into the solution, we discovered a fantastic solution in Azure Logic Apps. This tool proved versatile in addressing both problems 4 and 5 from the list above, as it allowed us to convert between XML and JSON structures using the Liquid template language. We leveraged its functionality by loading a mapping into an integration account and creating and linking a logic app to it as a REST service. The result was a streamlined process where the XML file content could be passed through the logic app, have the transformation mapping applied , and receive the resulting JSON output. It was an elegant solution that felt like a light bulb moment, and we were thrilled to have found it.

However, we encountered a roadblock when trying to pass the XML file from an ADF pipeline to the REST call. The available variables and their size limits prevented us from using this method. Instead, we used the Copy-Step to read and convert the XML to JSON, and send the content directly to the REST API. This solution required us to adapt the Liquid template, as the resulting JSON structure was different from the original XML.

Parsing the values was now quickly possible, but we again met a limitation: The Copy-Step cannot handle returned values or process them anyway. The logic app was able to receive the XML file content, apply the transformation, and store the resulting JSON directly into CosmosDB, bypassing the need for a return value.

We added additional logic to the app to handle all the different templates and mappings on the import side, keeping the system landscape streamlined and simple. Another logic app was also created to convert the resulting JSON back into the required XML structure. This process required converting the JSON to text, as Azure does not offer a direct JSON to XML transformation.

The optimized data structure in CosmosDB made the DataFlow transformation much easier, as the self-joins could be avoided. The structure was already flattened during the import, so the overall number of similar steps did not increase. It simply shifted from the DataFlow to a Logic App.

In the end, we required fewer steps than we initially estimated, and despite the challenges we faced, we were able to successfully transform the XML files.

The solution

After the successful implementation of Azure and its data factory, the team was elated and relieved. The solution was scalable, cost-effective and provided a centralized repository for product information. The simplicity of the initial phase, with the establishment of a blob storage repository and the configuration of CosmosDB, made the project initiation a breeze. The team faced some challenges along the way, such as issues with the ADF editor, but their determination and expertise saw them through. The final pipeline of filling CosmosDB was a smooth experience, and the team was ready for the next phase of the project. The ability to tackle the issue of multiple iterable layers by using self-joining made the team even happier, as it allowed for greater flexibility in data manipulation. The team was grateful for the solution that Azure and its data factory provided, which improved the consistency and accuracy of product information, reduced manual errors and enabled data-driven decision making.

Other Magazine Articles

Yearly Group Meeting 2022 ​ in Regensburg: Who, if not us? When, if not now?

Hannes Weikert on 22.09.2022