Posts about Architecture

Amazon Sagemaker: What, Why and How

Posted by Gaurav Mishra on Dec 27, 2019 4:59:00 PM

According to IDC, the Artificial Intelligence market will grow at a gigantic 37% compound annual growth rate through 2022. Owing to this popularity, several tools and software have emerged in the market to make AI adoption easier. However, one tool that clearly stands out in all respects is Amazon SageMaker. In this blog, we take an in-depth look at what it is, why to use it, and how to go about using it.

What is Amazon Sagemaker?

Amazon SageMaker is a fully managed AWS solution that empowers data scientists and developers to quickly build, train, and deploy machine learning models. At its center is Amazon SageMaker Studio, an integrated development environment for machine learning that acts as a base for a collection of other SageMaker tools.

You can build and train ML models from scratch or purchase pre-built algorithms that suit your project requirements. Similar tools are available for debugging models or adding manual review processes atop model predictions.

Why Should You Use It?

The complexity of a machine learning project in any enterprise increases as it scales. This is because machine learning projects comprise three key stages - build, train and deploy - each of which can continuously loop back into the others as the project progresses. And as the amount of data being dealt with grows, so does the complexity. If you are planning to build an ML model that truly works, your training datasets will tend to be on the larger side.

Typically, different skill sets are required at different stages of a machine learning project. Data scientists are involved in researching and formulating the machine learning model, while developers are the ones taking the model and transforming it into a useful, scalable product or web-service API. But not every enterprise can put together a skilled team like that, or achieve the necessary coordination between data scientists and developers to roll out workable ML models at scale.

This is exactly where Amazon SageMaker steps in. As a fully managed machine learning platform, SageMaker abstracts away the software engineering, enabling data scientists to build and train the machine learning models they want with an intuitive, easy-to-use set of tools. While they play to their core strengths - working with the data and crafting the ML models - the heavy lifting of developing these into a ready-to-roll web-service API is handled by Amazon SageMaker.

Amazon SageMaker packs all the components used for machine learning in a single shell, allowing data scientists to deliver end-to-end ML projects with reduced effort and at lower cost.

How Does It Work?

With a 3-step model of Build-Train-Deploy, Amazon SageMaker simplifies and streamlines your machine learning modeling. Let’s take a quick look at how it works.


Amazon SageMaker offers you a completely integrated development environment for machine learning that lets you improve your productivity. With the help of its one-click Jupyter notebooks, you can build and collaborate with lightning speed. Sagemaker also offers you a one-click sharing facility for these notebooks. The entire coding structure is captured automatically, which allows you to collaborate with others without any hurdle.

Apart from this, Amazon SageMaker Autopilot is the industry's first automated machine learning capability that gives you complete control of, and visibility into, your machine learning models. Traditional approaches to automated machine learning do not let you peek into the data or logic used to create a model. Autopilot, however, integrates with SageMaker Studio and provides complete visibility into the raw data and the information used in the model's creation.

One of the highlights of Amazon SageMaker is its Ground Truth feature, which helps you build and manage accurate training datasets without hassle. Ground Truth gives you access to labelers via Amazon Mechanical Turk, along with pre-built workflows and interfaces for common labeling tasks. SageMaker also supports various deep learning frameworks, including PyTorch, TensorFlow, Apache MXNet, Chainer, Gluon, Keras, Scikit-learn, and the Deep Graph Library.

Leveraging Amazon SageMaker, Srijan built a video analytics solution that can scrape video feed data to log asset performance.


Using AWS Lambda, Amazon SageMaker and Amazon S3, Srijan developed a video analytics solution for the client. The solution utilized a machine learning model to scrape video feed data and log asset performance over a given period of time and assigned location.

As a result, it helped in:

  • Claims validation against machines that were failing to clean the given sites
  • Insight based behavior analysis of the assets, leading to improvement of the product
  • Enabling more proactive, instead of reactive, asset performance assessment and maintenance

View the Complete Case Study



Using Amazon SageMaker Experiments, you can easily organize, track, and evaluate every iteration of a machine learning model. Training a model involves many iterations to measure and isolate the impact of changing algorithm versions, model parameters, and datasets. SageMaker Experiments helps you manage these iterations by automatically capturing the configurations, parameters, and results, and storing them as ‘experiments’.

SageMaker comes with a Debugger capable of analyzing, debugging, and fixing problems in your machine learning model. Debugger makes the training process entirely transparent by capturing real-time metrics during training. It can also generate warnings and remediation advice if common problems are detected along the way.

Apart from this, AWS-optimized TensorFlow offers close to linear scaling, with up to 90% scaling efficiency across 256 GPUs, so you can train precise, sophisticated models in very little time. Furthermore, SageMaker's Managed Spot Training helps reduce training costs by up to 90%.


Amazon SageMaker offers one-click deployment so that you can easily generate predictions for batch or real-time data. You can deploy your model on auto-scaling Amazon ML instances across multiple availability zones for improved redundancy. You just need to specify the desired minimum and maximum number of instances and the instance type, and then leave the rest to SageMaker.
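To make that concrete, here is a sketch (using the AWS SDK for JavaScript) of the parameter objects behind such a deployment: the endpoint configuration with the instance type and initial count, plus the Application Auto Scaling target that sets the minimum and maximum instance counts. All names and values below are hypothetical:

```javascript
// Sketch only: the shapes below mirror AWS.SageMaker#createEndpointConfig and
// AWS.ApplicationAutoScaling#registerScalableTarget. All names are hypothetical.
const endpointConfigParams = {
  EndpointConfigName: 'churn-model-config',
  ProductionVariants: [{
    VariantName: 'AllTraffic',
    ModelName: 'churn-model',        // a model already created in SageMaker
    InstanceType: 'ml.m5.large',     // the instance type you choose
    InitialInstanceCount: 1
  }]
};

// Auto scaling keeps the instance count between these bounds.
const scalingTargetParams = {
  ServiceNamespace: 'sagemaker',
  ResourceId: 'endpoint/churn-endpoint/variant/AllTraffic',
  ScalableDimension: 'sagemaker:variant:DesiredInstanceCount',
  MinCapacity: 1,  // desired minimum
  MaxCapacity: 4   // desired maximum
};
```

In a real deployment you would pass these to `createEndpointConfig`, create the endpoint, and then register the scalable target; SageMaker handles adding and removing instances between the bounds.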

A major problem that can affect the accuracy of your entire operation is a difference between the data used to generate predictions and the data used to train the model. SageMaker Model Monitor helps you get out of this puzzle by detecting and remediating concept drift: it automatically detects drift in all of your deployed models and raises alerts that help identify the source of the problem.

Amazon SageMaker also packs an Augmented AI facility, which lets human reviewers step in when the model is unable to make a high-confidence prediction. Moreover, Amazon Elastic Inference can cut your machine learning inference costs by up to 75%. Lastly, Amazon also allows you to integrate SageMaker with Kubernetes, so you can automate the deployment, scaling, and management of your applications.

So there you have it, a look at how Amazon Sagemaker can help build, train and deploy machine learning models to suit your project requirements. 

Srijan is an advanced AWS Consulting Partner, and can help you utilize AWS solutions at your enterprise. To know more, drop us a line outlining your business requirements and our expert AWS team will get in touch.

Topics: AWS, Machine Learning & AI, Architecture

Customized Progressive Web Apps Solution With Drupal

Posted by Kimi Mahajan on Dec 9, 2019 5:40:47 PM

With users expecting reliable, fast and engaging experiences from the apps they use, even a short lapse in loading time is enough to make them move on to the next available alternative.

An app loses 20% of its users at every step between the user’s first interaction with the app and the moment they start using it.

Nowadays, Progressive Web Apps offer higher customer engagement and a better mobile experience, and thus better financial returns, than native apps. Drupal proves to be a great platform for creating PWAs. Let’s understand the what and why of PWAs, and how a native-app-like experience can be delivered on your Drupal site.

What are PWAs?

Progressive Web Apps combine the best of the mobile and web worlds. With an application-like interface, PWAs offer the same engaging experience as a native app, but with the convenience of a web browser.

PWAs are quick, don’t need to be installed and once added to a smartphone home screen, they continue to upgrade silently in the background. It would not be wrong to say that they are web pages which ‘live’ on a user’s home screen.


Unlike an app, a user doesn’t necessarily need to find, download and install a PWA. They can immediately start using a PWA upon finding it.

Here’s a quick round up of its amazing features:

  1. Device independent: They can work on any device, taking advantage of existing features available on the device and browser.
  2. Responsive: PWAs are responsive and fit the device screen size.
  3. Appearance: A PWA looks exactly like a native app, and is built on the application shell model, with minimal page refreshes.
  4. Instantly Available: A PWA can be installed on the home screen, making it readily available.
  5. Secure: Because PWAs can sit on a user’s home screen, they are required to be served over HTTPS, which prevents man-in-the-middle attacks.

You can get a feel for a PWA yourself by opening Twitter’s PWA, which is fast, responsive, and capable of working even offline.


How Do PWAs Benefit?

Native apps have been useful for eCommerce conversion as they are known to perform 20% better than websites. However, they are costly to create and are a risky investment as they need to be found in an app store, downloaded, installed and then used.

Here’s a typical consumer app funnel for native apps: there is a 10–30% drop-off at every step, from finding the app in the app store to sharing it within one’s network.


Gartner predicts that progressive web apps will replace 50% of general-purpose, consumer-facing mobile applications by 2020.

PWAs have an edge over native apps in boosting user retention, as users prefer an easy-to-use, less data-consuming app with improved performance and usability over the website. As per Google Developers, AliExpress saw its conversion rate for new users increase by 104% (and by 82% on iOS) after adopting a PWA. Another business opted for a PWA to cut its average page-load time and lift its conversion rate, since poor connectivity and the prevalence of low-end devices were hindering its growth.

Its development team then built a PWA called Housing Go, which drove a 38% increase in total conversions, with visitors spending 10% longer on the site and many of them returning often.

But should every web app be a Progressive Web App? And if a native app functions similarly, what’s so unique about PWAs?

To decide whether or not to go for a PWA, it is important to identify your customer base and their usage patterns. Analyse the requirements before opting for a PWA, such as cross-browser support and which frequently used functionality needs to work in offline mode.

How Do PWAs Work?

Developers can create this functionality with the help of Service Worker and Web App Manifest specification.



The 3 components of a PWA are:

App Shell: The app shell is stored in and served from the cache, and provides the HTML, JavaScript and CSS that power your application UI.

Service Worker: A piece of JavaScript code that stores resources in the browser cache when the page loads for the first time. On the user’s next visit, it checks the cache and returns the response from there. It also manages push notifications and is what enables a web application to work offline.
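As a rough sketch, a minimal cache-first service worker might look like the following. The cache name and asset list here are placeholders; in practice you would list your app's real built files:

```javascript
// sw.js - a minimal cache-first service worker sketch (placeholder asset names).
const CACHE_NAME = 'app-shell-v1';
const APP_SHELL = ['/', '/index.css', '/index_bundle.js'];

// Pure helper: is this URL part of the cached app shell?
function isAppShell(url) {
  return APP_SHELL.some((path) => url.endsWith(path));
}

// Browser-only wiring, guarded so the file can also be loaded outside a worker.
if (typeof self !== 'undefined' && typeof caches !== 'undefined') {
  // First load: store the app shell resources in the browser cache.
  self.addEventListener('install', (event) => {
    event.waitUntil(caches.open(CACHE_NAME).then((cache) => cache.addAll(APP_SHELL)));
  });

  // Subsequent visits: check the cache first, fall back to the network.
  self.addEventListener('fetch', (event) => {
    event.respondWith(
      caches.match(event.request).then((cached) => cached || fetch(event.request))
    );
  });
}
```

The page registers this file once with `navigator.serviceWorker.register('/sw.js')`; from then on the browser runs it in the background.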

Web Manifest: A JSON config file holding metadata for the installed app - its icon, background color, and the theme of the site when installed.
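A minimal web manifest might look like this; the name, colors, and icon paths below are placeholders:

```json
{
  "name": "My Progressive Web App",
  "short_name": "MyPWA",
  "start_url": "/",
  "display": "standalone",
  "background_color": "#ffffff",
  "theme_color": "#2196f3",
  "icons": [
    { "src": "icons/icon-192.png", "sizes": "192x192", "type": "image/png" },
    { "src": "icons/icon-512.png", "sizes": "512x512", "type": "image/png" }
  ]
}
```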

Adding Progressive Web App Functionality To Drupal Through React

For Drupal 7 websites, the functionality can be implemented by simply enabling the Progressive Web App module. For Drupal 8, one needs to customize the service worker and apply an easy patch (check the issue). Once done, go to admin/config/system/pwa and configure the settings. This will add the 'Add to homescreen' functionality to your website, as shown below:

pwa add to homescreen

Add to home screen option available on the website

Though PWAs can be developed with front-end frameworks like Angular, React, etc., we at Srijan do it by creating custom blocks using the Drupal 8 Block API, embedding the React JS application file in the block’s library file.

This way, we place our React block in the required space on the Drupal page, so the entire page doesn’t have to be React-only or Drupal-only.
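For instance, the block's library definition might register the compiled React bundle like this - a sketch of a Drupal 8 *.libraries.yml entry, with hypothetical module and file names:

```yaml
# mymodule.libraries.yml - attaches the compiled React bundle as a library
react-app:
  version: 1.x
  js:
    js/dist/app.bundle.js: { minified: true }
```

The custom block plugin then attaches it via `#attached['library'] = ['mymodule/react-app']`, and the React code mounts itself into the block's markup.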

The advantage of using this approach is that it doesn’t unnecessarily increase the load of the application, and makes it load quickly and smoothly. Also, it doesn’t hamper SEO rankings.

We help ambitious enterprises modernize their web experience and build digital solutions. Contact us today to explore the possibilities and get the conversation started.

Topics: Drupal, Architecture, User Experience and User Interface

Why Should Your Organization Opt for Infrastructure as a Service (IaaS)

Posted by Kimi Mahajan on Nov 29, 2019 1:29:00 PM

Businesses are moving away from keeping data in traditional data centers and on physical servers, and are migrating to innovative and reliable cloud technologies. With the several benefits of cloud computing - anytime data access, enhanced disaster recovery, improved flexibility, and a reduced infrastructure staff burden - enterprises are developing more cost-efficient applications with higher performance and easier scalability.

IaaS, one such cloud computing model, has made lives of both enterprises and developers simpler by reducing their burden of thinking about infrastructure.

But, how do enterprises know if they need to opt-in for IaaS?

Understanding Infrastructure as a Service (IaaS)

IaaS refers to the cloud services offered over a network allowing businesses to access their infrastructure remotely. A perfect fit for any size enterprise, it offers the advantage of not having to buy hardware or other equipment, and easily manage firewalls, IP addresses, servers, routers, load balancing, virtual desktop hosting, storage, and much more, cost-effectively through a scalable cloud model.

It gives organizations the flexibility to pay only for the services used, which gives IaaS cloud computing an edge over traditional on-premise resources. Businesses find it easier to scale by paying per usage from an unlimited pool of computing resources instead of spending on new hardware.


Why Opt For IaaS Cloud Model?

IaaS is beneficial for organizations for a number of reasons. Let’s discuss its benefits in detail-

Usage of Virtual Resources

Your organization might never have to think of investing in resources such as CPU cores, hard disk or storage space, RAM, virtual network switches, VLANs, IP addresses and more, giving you the feeling of owning a virtual datacenter.

It allows multiple users to access the same hardware anywhere and anytime over an internet connection, keeping users on the move. And even if a server goes down or hardware fails, services aren’t affected, offering greater reliability.

Cost Savings With Pay-As-You-Go Pricing Model

With metered usage, enterprises pay only for the time the services were used, avoiding fixed monthly or annual rental fees and upfront charges. This leads to lower infrastructure costs and keeps them from having to buy extra capacity as a buffer against a sudden business spike. IaaS providers also give users the option to purchase storage space; be careful here, as pricing differs between providers.

Highly Scalable, Flexible and Quicker

One of the greatest benefits of IaaS is the ability to scale up and down quickly in response to an enterprise’s requirements. IaaS providers generally have the latest, most powerful storage, server and networking technology to accommodate the needs of their customers. This on-demand scalability provides added flexibility and greater agility to respond to changing opportunities and requirements. Also, with IaaS, the time to market a product is much shorter.

High Availability

Business continuity and disaster recovery preparedness are the top drivers for adopting IaaS. It remains a highly available infrastructure, and unlike traditional hosting, even in case of a disaster it offers users the flexibility to access the infrastructure via an internet connection.

With a robust architecture and scalable infrastructure layer, organizations can consolidate their different disaster recovery systems into a virtualized environment for disaster recovery, for securing their data. This stands as the perfect use case for IaaS.

By outsourcing their infrastructure, organizations can focus their time and resources on innovation and developing new techniques in applications and solutions.

How Do You Choose Between IaaS, Containers or Serverless?

The next question you might have is how to make a choice between opting for IaaS cloud computing model, containers or serverless model?

Well, the one thing they all share in common is that they simplify the developer’s life by letting them focus only on generating code. Let’s look into the differences:






| | IaaS | Containers | Serverless |
| --- | --- | --- | --- |
| What it is | Instantly available virtualized computing resources over the internet, eliminating the need for hardware | Packages an application with all the associated elements and dependencies needed to run it properly | Application broken up into functions, hosted by a third-party vendor |
| Use Case | Consolidating disaster recovery systems into one virtualized environment for backup, securing data | Refactoring a bigger monolithic application into smaller independent parts, e.g. splitting a large application into separate services such as user management and media conversion | Applications which do not always need to be running |
| Vendor Operability | Cloud vendor manages infrastructure | No vendor lock-in | Vendor lock-in |
| Pricing Model | Pay only for the resources used | At least one VM instance with containers hosted is always running, hence costlier than serverless | Pay for what you use; cost-effective |
| Maintenance | User responsible for patching and security hardening | Not maintained by cloud providers; developers are responsible for maintenance | Nothing to manage |
| Web Technology Hosting | Can host any technology: Windows, Linux, any web server technology | Only Linux-based deployments | Not made for hosting web applications |
| Deployment Time | Instantly available | Takes longer to set up initially than serverless | Takes milliseconds to deploy |


IaaS is the most flexible model and best suits temporary, experimental and unexpected workloads. Srijan is an Advanced AWS Consulting Partner. Leveraging AWS’s vast repository of tools, we can help you choose the best option for outsourcing your infrastructure so you can achieve your business goals. Contact us to get started on your IaaS journey.


Topics: AWS, Cloud, Architecture

Setup React App From Square One Without Using Create-React-App

Posted by Pradeep Kumar Jha on Nov 1, 2019 6:17:18 PM

Breakthroughs in technology have brought a whole new suite of tools for developers to make the software development process more efficient. Create React App is one of them: a prominent tool recommended by the React community for creating single-page applications (SPAs) and for getting familiar with React.

Create React App ensures that the development process is refined enough to let developers leverage the latest JavaScript functionality for better experiences and optimization of their apps for production.

For one of our clients - a giant retail travel outlet - we built a budget planner that gives travelers a realistic travel budget, so they can plan ahead and avoid spending shocks along the way.

Built with React.js on top of Drupal, it is a dynamic feature that can be added anywhere on the website (under blogs, services, etc.) without coding.

Create React App doesn't require configuration of webpack (a module bundler) or Babel (a compiler); they come built in, and developers can start coding right away. The drawback, however, is that they won’t get an idea of what is happening in the background.

If we set up a React app without using Create React App, we get to know exactly which NPM packages/components are needed to make a React app work.

About React App

Create React App was built by Joe Haddad and Dan Abramov. The GitHub repository is well-maintained by the creators to fix errors and deliver updates frequently. 

It is a prominent toolchain for building apps quickly and efficiently. A toolchain is a set of software development tools optimized to perform specific functions. For example, the C++ development process requires a compiler to compile the code and a build system, say CMake, to manage all the dependencies. Similarly, Create React App is the toolchain used to spin up a working “hello-world” React application.

This blog will showcase how to create a React app from scratch. The prerequisite is to have the NPM package manager installed in the system.

Below mentioned are the steps for the same-

Step 1- Create an app directory

mkdir myApp

Step 2- Access myApp folder and run 

npm init

This, in turn, will create a package.json file for which you can provide the name and version.


Alternatively, run

npm init -y

This will create a package.json file with a default package name and version.

Step 3- Install react and react-dom packages

npm install react react-dom

This will create a node_modules folder with all the dependent libraries, and add the dependency entries to your package.json file.

Step 4- Create a .gitignore file, to avoid pushing unnecessary files to GitHub

vi .gitignore

Under the files section, add all the files which you don’t wish to be tracked by Git:

  1. node_modules
  2. dist
  3. ._

dist (distribution folder): This is an auto-generated build directory. We don’t need to track it, because the compiler regenerates it on every build.

Step 5- Create an app folder

mkdir app

Access the app directory and then create three files:

touch index.js index.css index.html

Step 6- Edit index.html and add below snippet

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>myApp</title>
</head>
<body>
    <div id="app"></div>
</body>
</html>

( html-webpack-plugin will inject the bundled script tag automatically at build time, so no <script> tag is needed here )
( No need to add any styling inside index.css file as of now )

Step 7- Edit index.js file and add below snippet

import React from 'react';

import ReactDOM from 'react-dom';

import './index.css';

class App extends React.Component {
    render() {
        return (
            <div>Hello World</div>
        );
    }
}

ReactDOM.render(<App />, document.getElementById('app'));

When we run this JSX code (an XML/HTML-like syntax used by React that extends ECMAScript) in the browser, it will throw an error, because browsers don’t understand JSX. That is where Babel and webpack come in.

npm install --save-dev @babel/core @babel/preset-env @babel/preset-react webpack webpack-cli webpack-dev-server babel-loader css-loader style-loader html-webpack-plugin

npm install takes three exclusive, optional flags that determine where the package version is recorded in your package.json:

1. -S, --save:

The package appears in your dependencies

2. -D, --save-dev:

The package appears in your devDependencies

3. -O, --save-optional:

The package appears in your optionalDependencies

We use the --save-dev flag to differentiate between build dependencies and app dependencies.

Once installed successfully, you can open the package.json file to see the difference.
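For reference, after a successful install the relevant sections of package.json should look something like this (version numbers are illustrative and will vary):

```json
{
  "dependencies": {
    "react": "^16.12.0",
    "react-dom": "^16.12.0"
  },
  "devDependencies": {
    "@babel/core": "^7.7.0",
    "@babel/preset-env": "^7.7.0",
    "@babel/preset-react": "^7.7.0",
    "babel-loader": "^8.0.0",
    "css-loader": "^3.2.0",
    "html-webpack-plugin": "^3.2.0",
    "style-loader": "^1.0.0",
    "webpack": "^4.41.0",
    "webpack-cli": "^3.3.0",
    "webpack-dev-server": "^3.9.0"
  }
}
```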

Webpack Configuration

Webpack, as stated, is a module bundler that primarily focuses on bundling JavaScript files for use in a browser, though it is also capable of transforming, bundling, or packaging just about any resource or asset.

Check the steps below for webpack configuration-

touch webpack.config.js

Step 1- Add below snippet in this file

var path = require('path');

var HtmlWebpackPlugin = require('html-webpack-plugin');

module.exports = {

    entry : './app/index.js',

    output : {

        path : path.resolve(__dirname , 'dist'),

        filename: 'index_bundle.js'

    },

    module : {

        rules : [

            {test : /\.(js)$/, use:'babel-loader'},

            {test : /\.css$/, use:['style-loader', 'css-loader']}

        ]

    },

    plugins : [

        new HtmlWebpackPlugin ({

            template : 'app/index.html'

        })

    ]

};
Step 3- To allow babel-loader to work well, we have to add the Babel preset config to package.json

"main": "index.js",


    "presets" : [





Step 3- To run build, we need to add webpack to script within the package, i.e.,  package.json

"main": "index.js",


    "presets" : [





  "scripts": {

    "create": "webpack"


Step 4- Run below command

npm run create

With this, webpack will run and create a dist folder containing our bundle file and index.html.



Step 5- Now to start webpack dev server, add below snippet inside package.json file

"scripts": {

    "start": "webpack-dev-server --open"


It will start building our code as soon as we run npm start.

Step 6- All setup is done. Now run the `npm run start` command to see the result in the browser.

The final directory structure will somewhat look like this as shown in the picture - 


Note: You might observe some storybook related extra files in the picture. However, these files won’t be visible in your setup. If you want to know about the storybook, stay tuned for our next blog.


I hope that you now better understand what Create React App does under the hood. If yes, then implement the given steps right away and start building your awesome ideas!

Stay tuned for my next blog in which I will discuss Storybook. 


Happy Coding!

Topics: Planet Drupal, JavaScript & UI/UX, Architecture, Framework and Libraries

AWS Glue: Simple, Flexible, and Cost-effective ETL For Your Enterprise

Posted by Gaurav Mishra on Oct 31, 2019 6:28:00 PM

An Amazon solution, AWS Glue is a fully managed extract, transform, and load (ETL) service that allows you to prepare your data for analytics. Using the AWS Glue Data Catalog gives a unified view of your data, so that you can clean, enrich and catalog it properly. This further ensures that your data is immediately searchable, queryable, and available for ETL.

It offers the following benefits:

  • Less Hassle: Since AWS Glue is integrated across a wide range of AWS services, it natively supports data stored in Amazon Aurora, Amazon RDS engines, Amazon Redshift, Amazon S3, as well as common database engines and Amazon VPC. This leads to reduced hassle while onboarding.
  • Cost Effectiveness: AWS Glue is serverless, so there are no compute resources to configure and manage. Additionally, it handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. This is quite cost effective as you pay only for the resources used while your jobs are running.
  • More Power: AWS Glue automates much of the effort spent in building, maintaining, and running ETL jobs. It crawls your data sources, identifies data formats, and suggests schemas and transformations. It even automatically generates the code to execute your data transformations and loading processes.

AWS Glue helps enterprises significantly reduce the cost, complexity, and time spent creating ETL jobs. Here’s a detailed look at why you should use AWS Glue:

Why Should You Use AWS Glue?

AWS Glue brings with it the following unmatched features that provide innumerable benefits to your enterprise:

Integrated Data Catalog

AWS Glue consists of an integrated Data Catalog which is a central metadata repository of all data assets, irrespective of where they are located. It contains table definitions, job definitions, and other control information that can help you manage your AWS Glue environment. 

Using the Data Catalog can help you automate much of the undifferentiated heavy lifting involved in cleaning, categorizing or enriching the data, so you can spend more time analyzing the data. It computes statistics and registers partitions automatically so as to make queries against your data both efficient and cost-effective.

Clean and Deduplicate Data

You can clean and prepare your data for analysis by using an AWS Glue Machine Learning Transform called FindMatches, which enables deduplication and finding matching records. And you don’t need to know machine learning to be able to do this. FindMatches will just ask you to label sets of records as either “matching” or “not matching”. Then the system will learn your criteria for calling a pair of records a “match” and will accordingly build an ML Transform. You can then use it to find duplicate records or matching records across databases.

Automatic Schema Discovery

AWS Glue crawlers connect to your source or target data store and progress through a prioritized list of classifiers to determine the schema for your data. They then create metadata and store it in tables in your AWS Glue Data Catalog. The metadata is used in the authoring process of your ETL jobs. To make sure that your metadata is up-to-date, you can run crawlers on a schedule, on-demand, or trigger them based on an event.

Code Generation

AWS Glue can automatically generate code to extract, transform, and load your data. You simply point AWS Glue to your data source and target, and it will create ETL scripts to transform, flatten, and enrich your data. The code is generated in Scala or Python and written for Apache Spark.

Developer Endpoints

AWS Glue development endpoints enable you to edit, debug, and test the code that it generates for you. You can use your favorite IDE (Integrated development environment) or notebook. Or write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries. You can also use and share code with other developers using the GitHub repository.

Flexible Job Scheduler

You can easily invoke AWS Glue jobs on schedule, on-demand, or based on an event. Or start multiple parallel jobs and specify dependencies among them in order to build complex ETL pipelines. AWS Glue can handle all inter-job dependencies, filter bad data, and retry jobs if they fail. Also, all logs and notifications are pushed to Amazon CloudWatch so you can monitor and get alerts from a central service.

How Does It Work?

You are now familiar with the features of AWS Glue and the benefits it brings to your enterprise. But how should you use it? Surprisingly, creating and running an ETL job takes just a few clicks in the AWS Management Console.

All you need to do is point AWS Glue to your data stored on AWS, and AWS Glue will discover your data and store the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.

Here’s how it works:

  • Define crawlers to scan data coming into S3 and populate the metadata catalog. You can schedule this scanning to run at a set frequency, or trigger it on an event
  • Define the ETL pipeline, and AWS Glue will generate the ETL code in Python or Scala
  • Once the ETL job is set up, AWS Glue manages running it on a Spark cluster infrastructure, and you are charged only while the job runs
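In boto3 terms, these steps map to the Glue client's create_crawler and start_job_run operations. The sketch below only builds the request payloads rather than calling AWS; the crawler name, role ARN, database, job name, and S3 path are all hypothetical:

```python
# Request payloads you would pass to boto3's Glue client, e.g.
#   glue.create_crawler(**crawler)  and  glue.start_job_run(**job_run)
# The names, role ARN, and S3 path below are hypothetical.
crawler = {
    "Name": "sales-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueServiceRole",
    "DatabaseName": "sales_db",
    "Targets": {"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    "Schedule": "cron(0 * * * ? *)",  # hourly; crawlers can also run on demand
}
job_run = {"JobName": "sales-etl"}

print(sorted(crawler))
# → ['DatabaseName', 'Name', 'Role', 'Schedule', 'Targets']
```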

The AWS Glue catalog lives outside your data processing engines and keeps the metadata decoupled, so different processing engines can simultaneously query it for their individual use cases. You can also expose the metadata through an API layer using Amazon API Gateway and route all catalog queries through it.

When to Use It?

Now that you have all this information about AWS Glue, where should you put it to use? Here's a look at some use case scenarios and how AWS Glue can make your work easier:

1. Queries Against an Amazon S3 Data Lake

Looking to build your own custom Amazon S3 data lake architecture? AWS Glue makes all your data available for analytics right away, without moving the data.

2. Analyze Log Data in Your Data Warehouse

Using AWS Glue, you can easily process all the semi-structured data in your data warehouse for analytics. It generates the schema for your data sets, creates ETL code to transform, flatten, and enrich your data, and loads your data warehouse on a recurring basis.

3. Unified View of Your Data Across Multiple Data Stores

AWS Glue Data Catalog allows you to quickly discover and search across multiple AWS data sets without moving the data. It gives a unified view of your data, and makes cataloged data easily available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

4. Event-driven ETL Pipelines

AWS Glue can run your ETL jobs based on an event, such as getting a new data set. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
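A sketch of such a Lambda handler (the boto3 start_job_run call is left commented out so the snippet stays self-contained; the bucket and job names are hypothetical):

```python
# Sketch of a Lambda handler that reacts to a new S3 object by starting
# a Glue job. The boto3 call is shown but commented out so the sketch
# runs anywhere; bucket and job names are hypothetical.
def handler(event, context=None):
    record = event["Records"][0]["s3"]       # standard S3 event shape
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    args = {"--source_path": f"s3://{bucket}/{key}"}
    # import boto3
    # boto3.client("glue").start_job_run(JobName="sales-etl", Arguments=args)
    return args

sample_event = {"Records": [{"s3": {
    "bucket": {"name": "my-bucket"},
    "object": {"key": "incoming/orders.json"},
}}]}
print(handler(sample_event))
# → {'--source_path': 's3://my-bucket/incoming/orders.json'}
```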

So there you have it: a look at how AWS Glue can help you manage data cataloging and automate your ETL pipeline.

Srijan is an advanced AWS Consulting Partner, and can help you utilize AWS solutions at your enterprise. To know more, drop us a line outlining your business requirements and our expert AWS team will get in touch.

Topics: AWS, Architecture

Should Your Enterprise Go For a Headless Commerce Architecture?

Posted by Nilanjana on Oct 25, 2019 10:03:25 PM

E-commerce enterprises often wonder how to deliver seamless user experiences. Is there a way that does not require them to invent their own IoT device or build back-end solutions from scratch?

Enter Headless Commerce.

An extension of the headless content management system, headless commerce offers you the capabilities to build customized user experiences across channels, paving the way for omnichannel retail. 

Here’s a deep dive into discovering all about headless commerce, and whether it is the right choice for your business.

What is a Headless Commerce Architecture?

A headless commerce architecture separates the frontend of your e-commerce experience from the backend. This allows for greater architectural flexibility: your frontend developers can focus solely on customer interactions, without worrying about the impact on critical backend systems.

At the same time, it leaves your backend developers free to use APIs to deliver things like products, blog posts or customer reviews to any screen or device.

The headless commerce architecture entirely separates the presentation layer of your store from business-critical processes like order and inventory management, payment processing, and shipping. It delivers a platform via a RESTful API that comprises a back-end data model and a cloud-based infrastructure.

The headless commerce system works very much like a headless CMS, by passing requests between the presentation and application layers through web services or application programming interface (API) calls.

For example, when a user taps the "Buy Now" button on their smartphone, the presentation layer sends an API call to the application layer to process the order. The application layer then responds with another API call so the presentation layer can show the customer the status of their order.
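That round trip can be mocked in a few lines (a toy in-process stand-in for the two API calls; the endpoint names in the comments are hypothetical):

```python
# Minimal simulation of the two API calls: the presentation layer posts
# an order and then fetches its status. Endpoint names are hypothetical.
orders = {}  # stands in for the application layer's data store

def post_order(payload):          # POST /api/orders
    order_id = len(orders) + 1
    orders[order_id] = {"sku": payload["sku"], "status": "processing"}
    return {"order_id": order_id}

def get_order_status(order_id):   # GET /api/orders/<id>
    return {"status": orders[order_id]["status"]}

resp = post_order({"sku": "SKU-123"})       # "Buy Now" tapped
print(get_order_status(resp["order_id"]))   # shown back to the customer
# → {'status': 'processing'}
```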

How is it Different From Traditional Commerce?

Here’s a look at the various features that differentiate headless commerce from traditional commerce, or make it a better choice:

[Image: feature comparison of headless vs. traditional commerce]

Advantages of Using Headless Commerce

Headless commerce architecture brings in a number of advantages for the e-commerce businesses. Here’s a look at some of the reasons why you should consider going headless:

Omnichannel Experience

Using a headless CMS gives you the flexibility to push your content anywhere and everywhere. For an e-commerce brand, that means the ability to deliver your products, demo videos, testimonials, blog posts, customer reviews, etc. to any channel as it emerges.

More Customization and Personalization

As explained earlier, headless commerce systems give you the control to manage the look and feel of your brand. You can design a customized experience for your admins as well as customers right from scratch, without finding yourself hitting restrictions.


Greater Flexibility

A decoupled architecture allows you to make rapid changes to the frontend without disturbing the backend, and vice versa. New functionalities and integrations can also be applied in much less time, thanks to the openness of the architecture.

Faster Time to Market

Using headless commerce to build an omnichannel retail experience facilitates a faster time to market. Brands can focus solely on building frontend experiences across different touchpoints, as the content and products are housed centrally and delivered via API. 

Agile Marketing

A headless commerce system can support new technologies as they arise, making it perfect for designing new customer experiences. This enables marketing teams to roll out multiple sites across different brands, divisions, and portfolios in a matter of days.

Seamless Integrations

A headless commerce system uses APIs, which makes it easier to integrate and communicate with other platforms. Adding your brand or e-commerce platform to a new device becomes a matter of hours.

Better Conversion Optimization

With headless commerce systems, you can easily deploy changes to your e-commerce platform. You can run multiple tests at once and quickly optimize based on the results, helping you constantly improve your e-commerce experiences.

When Not to Use Headless Commerce?

You are now aware of the reasons to use a headless commerce architecture. And while there are many, it is also prudent to note a couple of drawbacks that come with it:

Costs Involved

Headless commerce does not provide a front end; developers have to build their own from scratch, which can be both time-consuming and costly. Plus, developers will need to troubleshoot their own front-end creations, leading to ongoing costs beyond the initial build.

Marketer Isolation

Since the headless commerce system offers no frontend presentation layer, marketers can no longer:

  • Create content in a WYSIWYG (what you see is what you get) environment
  • Preview how content will look on the end user's device or screen
  • Quickly ideate, approve, create, and publish content without relying on another department

This makes marketers totally dependent on the IT team, not just to build the front-end presentation layer, but also to update it and populate it with content. 

So then what should you do? Well, focus on your business requirements. 

Do the drawbacks listed above not mean much to your business compared to the advantages headless brings? Then you should absolutely go ahead. But if they do, there is a third option: decoupled commerce.

A decoupled commerce system differs from headless commerce only in that it doesn't remove the frontend delivery layer from the backend entirely. It gives marketers back their power, while still giving the brand the headless freedom needed to deliver content to different devices, applications, and touchpoints through APIs. It is, in a nutshell, the best of both worlds. And your choice should depend entirely on your business needs.

At Srijan, we have expert Drupal teams to help e-commerce enterprises assess their current architecture, and identify if headless or decoupled is the way to go. Post assessment, we work to set up the respective Drupal architecture, based on your exact business requirements. 

Get ready to deliver immersive online experiences across customer touchpoints. Let’s get the discussion started on implementing decoupled Drupal for your enterprise.

Topics: Retail & Commerce, Architecture

Refactoring applications for cloud migration: What, when and how

Posted by Gaurav Mishra on Sep 27, 2019 3:54:00 PM

Enterprises migrating their applications to the cloud often struggle to finalize an approach that is in line with their migration goals. Here are a few questions that can help:

  • What are your business goals?
  • What are your application capacities?
  • What is the estimated cost for your cloud migration process?

Answering these questions, and then selecting the most suitable cloud migration path, will set your enterprise up for long-term success with the approach you choose.

In this post, we take a look at one of the most popular methods of cloud migration, refactoring: what is it, and when should you choose it?

What is refactoring migration?

Refactoring is the process of re-architecting your applications to run on your cloud provider's infrastructure and to better suit the new cloud environment. This approach involves modifying your existing applications, often a large chunk of the codebase, to take better advantage of cloud-native features and the extra flexibility that comes with them.

Refactoring migration is more complex than the other cloud migration approaches because, while changing application code, you must also ensure the changes do not affect the external behavior of the application.

For example, if your existing application is resource-intensive, say it involves big-data processing or image rendering, it may run up a large cloud bill. In that case, redesigning the application for better resource utilization is required before moving to the cloud.

This approach is the most time-consuming and resource-intensive of all approaches, yet it can offer the lowest monthly spend in comparison. We further take a look at the benefits, and limitations it has to offer:

Benefits of Refactoring

Most benefits of refactoring are delivered in the future. They include:

  • Long-term cost reduction: The refactoring approach ensures costs fall over time by matching resource consumption with demand and eliminating waste. This results in a better and more lasting ROI compared to less cloud-native applications.

  • Increased resilience: By decoupling the application components and wiring together highly-available and managed services, the application inherits the resilience of the cloud.

  • Responsive to business events: Using this approach enables the applications to leverage the auto-scaling features of cloud services that scale up and down according to demand.

Limitations of Refactoring

The disadvantages of this approach include:

  • Vendor lock-in: The more cloud-native your application is, the more tightly it is coupled to the cloud you are in.

  • Skills: Refactoring is not for beginners. It requires the highest level of application, automation and cloud skills and experience.

  • Time-consuming: Because refactoring is resource-intensive, and much more complicated in terms of changing a non-cloud application into a cloud-native one, it can take a long time to complete.

  • Getting it wrong: Refactoring requires changing everything about the application, so it has the maximum probability of things going sideways. Each mistake can cause delays, cost escalations and potential outages.


Refactoring is a complex process, but it is well worth the improvement you get in return. Some companies go as far as refactoring only parts of their business solutions to make the whole process more manageable, although this compartmentalization can also make the refactor longer and more resource-intensive.

When to choose refactoring?

Now that you are aware of the advantages and limitations associated with Refactoring approach, the next step is to identify when you should choose this approach. Take a look:

1. Enterprise wants to tap the cloud benefits

Does your business have a strong need to add features, scale, or performance? If so, refactoring is the best choice for you. Exploiting the cloud features will give you benefits that are otherwise difficult to achieve in an existing non-cloud environment. 

2. Scaling up or restructuring code

Is your organization looking to scale an existing application, or to restructure its code? You can take full advantage of cloud capabilities by migrating via the refactoring process.

3. Boost agility

If your organization is looking to boost agility or improve business continuity by moving to a service-oriented architecture, then this strategy may be worth pursuing. And that’s despite the fact that it is often the most expensive solution in the short-medium term.

4. Efficiency is a priority

Refactoring has the promise of being the most efficient cloud model because your application is cloud-native and will exploit continuous cloud innovation, benefiting from reduced costs and improvements in operations, resilience, responsiveness, and security.

How to refactor?

So you know when to choose refactoring; the next question is how. There are, in general, four ways to refactor your applications for the cloud.

1. Complete Refactoring

In this type, 50% of the code is changed and the database is updated to utilize as many cloud-native features as the application requires. This strategy can improve performance, operations costs, and your IT team's ability to meet the needs of the business. On the downside, the process can be costly or complex, and can introduce bugs.

2. Minimum Viable Refactoring

This requires only slight changes in the application, and is therefore, both quick and efficient. Users who take this approach often incorporate cloud-native security, management and perhaps a public cloud database into their migrated workload.

3. Containerization Refactoring

In this, applications are moved into containers with minimal modifications. The applications exist within the containers, which enables users to incorporate cloud-native features and improve portability. 

This approach is more complex because of the learning involved in adapting to new tools. But that is easily offset: with the popularity of containers and their growing ecosystems, the costs and refactoring times continue to decrease.
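For a typical web application, the containerization step can be as small as a minimal Dockerfile. The sketch below assumes a Python app served by gunicorn; the file names, module path, and port are hypothetical:

```dockerfile
# Minimal container recipe for an existing Python web app; the app
# itself is left unmodified. File names and port are hypothetical.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:application"]
```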

4. Serverless Application Refactoring

This approach has similar issues as containerization as it changes the development and operations platform, which requires learning new tools and skills. Some modifications are required to make the application work effectively and take advantage of serverless systems on the public cloud. 

Unlike containers however, serverless platforms don't provide portability, so lock-in is a major downside to this approach.

You can refactor your applications in any of these ways, but minimum viable refactoring is advisable for most cases. Refactoring is a highly variable activity, dependent on the current application's complexity, and during the discovery assessment process it's not possible to predict exactly how long an application refactor will take. It could be around three to six months per application, depending on complexity and previous experience.

Hence, a targeted timeline, refactoring in parts, and checking progress against collected data are some best practices to keep in mind when taking up the refactoring cloud migration approach. For these reasons, this approach is chosen by very few enterprises: those that have the time, money, and resources for it.

Looking to shift business-critical applications to or even between clouds? Just drop us a line and our expert team will be in touch.

Topics: Cloud, Architecture

Why Platform as a Service (PaaS) is the answer to high-performance hosting

Posted by Kimi Mahajan on Sep 24, 2019 3:10:00 PM

Running, compiling or configuring your web application on a single virtual server instance can be complex as well as time-consuming.

However, with new technologies emerging and evolving, the entire cloud computing process is getting simplified.

Let's understand how Forbes has termed Platform as a Service (PaaS) the dominant cloud service model, and why it stands as the best-suited solution for your high-performance hosting needs.

Understanding Platform as a Service

The PaaS service delivery model evolved from the Software as a Service (SaaS) cloud offering. It allows customers to make use of virtualized servers, renting rather than purchasing them, to design, develop, test, deploy, and host a web application.

PaaS vendors offer the following along with the cloud offering:

  1. Specific software development tools such as a source code editor, a debugger, a compiler, and other essential tools that developers need to build their applications.
  2. Middleware, which acts as an intermediary between user-facing applications and the machine's operating system.
  3. An operating system for developers to build an application on.
  4. Databases to store data, which developers can administer and maintain.
  5. Infrastructure to manage servers, storage, and physical data centers.

Why choose PaaS over IaaS and SaaS?

Before comparing PaaS with Infrastructure as a Service (IaaS) and SaaS, it is important to understand what each service means and how it helps users achieve their goals.

Let’s understand each one by comparing them with modes of transportation.

On- premises IT infrastructure is like owning a car. When you own a car, you take the responsibility for its maintenance.

IaaS is like renting a car. You choose the car as per your own preference and drive it wherever you wish. And when you want an upgrade, you simply rent a different car. SaaS is like taking public transport, wherein you share your ride with fellow passengers on a common route.

However, PaaS can be thought of as opting for a cab: you don't drive the car yourself, but pay the driver to take you to your destination.

Now that you understand what each means, let's compare IaaS, PaaS and SaaS on the basis of which services you manage (✔) and which you don't (╳).

[Table: IaaS vs. PaaS vs. SaaS, showing which layers (such as the operating system) you manage versus the vendor manages. Examples of IaaS: AWS, Cisco Metapod, Microsoft Azure. Examples of PaaS: AWS Elastic Beanstalk, Windows Azure, Google App Engine. Examples of SaaS: Gmail, Google Docs, GoToMeeting]

As per Gartner, the global public cloud services market is expected to grow to over $383 billion by 2020.

Perfectly suited to software developers, PaaS helps them deploy, test, and manage applications without needing all the related infrastructure.

It's very different from traditional forms of web hosting, like shared or Virtual Private Server hosting, where the developer has to ensure the production environment is good enough to host the application, and set up the application server, database, run-time platform, server configuration and more, before beginning to code.

With HTTP caching servers, PaaS ensures faster application loading and eliminates issues like latency and downtime even if one server goes down. Applications can be deployed to the servers with a single command. It is useful for high-traffic websites (when your server may be under heavy load) which have performance issues in a shared environment.

PaaS can be thought of as a multi-server, high-performance solution that distributes web traffic across multiple servers, keeping your site performance at its peak.
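The load-distribution idea can be sketched as a toy round-robin router (plain Python, not any particular PaaS vendor's balancer; the server names are made up):

```python
import itertools

# Toy round-robin distribution of requests across servers, skipping a
# failed server: the behavior a multi-server PaaS setup provides for you.
servers = ["app-1", "app-2", "app-3"]
healthy = {"app-1", "app-2", "app-3"}
rotation = itertools.cycle(servers)

def route(request_id):
    for _ in range(len(servers)):
        server = next(rotation)
        if server in healthy:
            return server, request_id
    raise RuntimeError("no healthy servers")

healthy.discard("app-2")          # one server goes down...
assigned = [route(i)[0] for i in range(4)]
print(assigned)                    # ...traffic still flows
# → ['app-1', 'app-3', 'app-1', 'app-3']
```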

High-speed hosting not only improves the user experience of your site, it also has a positive impact on search engine ranking, and users are likely to stay longer on the site when site speed and resource delivery are quick.

Here are 5 advantages that PaaS offers over other cloud offerings:

1. Helps Build Applications Quickly
PaaS allows developers to build applications more quickly than they could if they had to build, configure, and provision their own platforms and backend infrastructure. With PaaS vendors providing web servers, storage, and networking resources, developers gain instant access to a complete software development environment, without any need to configure or maintain it, and can focus on delivering projects speedily.

2. Minimal Development and Cost-Effective Approach

PaaS services offer templates and code libraries that enable rapid development by providing prebuilt backend infrastructure and other resources. They bring new capabilities to your in-house development team without hiring additional staff, thereby reducing the costs associated with building applications from scratch.

3. Easy Collaboration on Pre-Built Sophisticated Tools
PaaS offers an advantage over traditional hosting in the way it lets developers in distributed teams collaborate. It allows them to create applications using pre-built, otherwise expensive development tools to develop, test and iterate.

4. Scalability and Future-Proofing

The reusable code not only facilitates easy app development and deployment but also increases the opportunity for scalability. This allows businesses to scale and modify their product or processes efficiently and focus on core business initiatives rather than maintaining underlying IT infrastructure.

5. Cross-Language Support

PaaS cloud services allow developers to build applications in multiple programming languages.

How is PaaS different from Serverless Computing?

PaaS and serverless computing are similar in that the developer only has to worry about the code, while the vendor handles all backend processes. However, the two differ as shown below:

[Table: Serverless Computing vs. PaaS]
  • Scaling: serverless computing scales automatically; a PaaS application will not scale unless programmed to
  • Startup time: serverless code runs on demand; a PaaS application is running most of the time to be available to users
  • Development tools: serverless platforms do not provide development tools/frameworks; PaaS provides development tools/frameworks
  • Pricing model: serverless billing is precise, per execution; PaaS pricing is not as precise

Any demerits?

However, before deciding to opt for PaaS, it is important to understand your business needs in order to find a solution that is a good fit.

Firstly, the choice of PaaS provider should be made wisely, as you might not be able to switch vendors after an application is built. Vendors may not all support the same languages, libraries, APIs, architecture, or operating system used to build and run applications. Although it is possible to switch PaaS providers, the process can be time-consuming and may even mean rebuilding the application to fit the new platform.

Another thing to keep in mind is that the external vendor will store most or all of an application's data, along with hosting its code, and may actually store the databases via a third party. So it is important to test the security measures of the service provider and to know their security and compliance protocols before making the decision.

Srijan can help you evaluate the various options and make the truly strategic choice of opting for PaaS, potentially delivering more with better functionality. Contact us to get the conversation started.

Topics: Cloud, Agile, Architecture

Headless Drupal - What it means for marketers

Posted by Kimi Mahajan on Sep 17, 2019 3:56:00 PM

If your website is on the right CMS, it becomes easy to create marketing campaigns, drive leads, and tell your brand's story to the world. However, making content available and accessible on every new device in the market is a challenge for marketers.

Headless Drupal may sound like exactly what a marketer needs: a platform that helps content reach any device a user owns. Yet it poses some significant problems for the marketer. Let's understand them in detail.

Revisiting Headless Drupal

A traditional Drupal site has a back-end (which stores the content) and a front-end (which decides how that content is delivered). As there is now no limit to the devices users can access, brands need to go beyond delivering content only on websites and web apps.

With a pure headless CMS, the tightly coupled front-end is removed, and content is delivered through an API anywhere and on any device (commonly referred to as API-first).
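For example, Drupal core's JSON:API module serves content at paths like /jsonapi/node/article, and any front end parses the same payload. The sketch below parses an abridged sample response rather than calling a live site; the article data is made up:

```python
# A client on any device consumes the same JSON:API payload. This sample
# response is abridged and hypothetical; real responses carry more fields.
sample_response = {
    "data": [
        {
            "type": "node--article",
            "id": "0a1b2c3d-aaaa-bbbb-cccc-0123456789ab",
            "attributes": {"title": "Autumn Campaign", "status": True},
        }
    ]
}

def titles(response):
    """Pull article titles out of a JSON:API collection response."""
    return [item["attributes"]["title"] for item in response["data"]]

print(titles(sample_response))  # → ['Autumn Campaign']
```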

Headless Drupal functions faster than traditional Drupal, enabling highly responsive websites and a rich user experience.

When the user interface is decoupled from the CMS, the logic for displaying content on each device is on the front-end and its native tools are responsible for controlling the user experience.

How Headless Benefits Marketers?

It is important for marketers to be where their customers are and send the right communication, on the right channel, at the right time. Here are 3 benefits of headless Drupal for marketers:

1. Platform Independent Communication

A headless Drupal CMS offers great flexibility to marketers, as they can deliver one piece of content in multiple formats: to desktops, smartphones, apps, VR devices, smart speakers, and smart appliances. It saves marketers a lot of time previously spent creating and optimizing content for each device.

2. Freedom on Content Display
Marketers prefer headless as it offers choice over how content appears on the frontend, with extra security over traditional Drupal. JavaScript frameworks have gained traction due to the demand for more flexibility on the front end; their emphasis on client-side rendering offers a more engaging and dynamic user experience.

3. The Faster, The Better
Decoupled Drupal is also faster as the logic for displaying the content is decided by the front-end interface. As marketers are in a constant urge to impress the existing customers and at the same time attract new ones, a faster site helps them in engaging with customers as fast as possible.

Why it is Not Marketers’ First Choice?

Though headless Drupal has been beneficial for developers, is it valuable to marketers as well? Below are the reasons why marketers, despite its advantages, don't prefer headless Drupal.

1. No Preview Available

With no presentation layer in headless Drupal, marketers are not able to create and edit content with a WYSIWYG editor as they would with traditional Drupal. The most challenging part is that they can't preview their content before publishing it to their audience.

2. Dependency on Developers

With headless Drupal, development teams can create a custom-built front-end to customize the layout and entire design of individual pages.

The marketers will have to be fully dependent on developers to carry out tasks for conversion optimization purposes, which proves to be an inefficient solution for them.

3. Marketers Have to Manage Fragmented Environment

Today’s marketers have to engage with their audience in real-time, publish content in line with the latest trends, launch landing pages, deploy microsites, track progress, monitor data, collaborate with advertising campaigns, and much more.

A headless Drupal leaves marketers to manage content workflows, form building, and microsite deployments. Managing everything at such a scale soon creates an expensive and hard-to-manage ecosystem. Not only does it complicate the life of a marketer, it also gets in the way of creating a seamless and connected customer experience.

4. Impacts the SEO

Marketers lose standard SEO functionality when adopting headless Drupal for their content strategy, and will eventually have to invest additional time and cost in Drupal SEO development.

What It Means For Marketers?

Marketers can consider decoupling Drupal when they want to publish content on more than one platform, such as multiple websites or various front-end devices, or when they need real-time updates of a site whose performance would suffer with traditional Drupal.

However, if the requirement is simply a responsive website, headless Drupal won't be beneficial: it will slow down time to market, and the costs involved are too high.

Solution For Marketers - Progressive Decoupling

Decoupled Drupal loosely separates the back-end from the front-end, creating an architecture which serves perfectly to both developers and marketers simultaneously.

As a marketer, you benefit from its user-friendliness and API-driven omnichannel delivery capabilities. Separating the content layer from the presentation layer gives marketers an authoring experience that feels familiar, while the presentation layer above the API layer allows seamless integration and blending of different tools and technologies.

So to conclude, headless Drupal isn’t for everyone, and in many cases sticking with a traditional CMS or choosing decoupled Drupal is the best option.

If considering decoupled Drupal strategy seems intimidating, Srijan can help you connect with the experts to help drive your marketing strategy with it. Contact us to get the best out of Drupal.

Topics: Drupal, Planet Drupal, MarTech, Architecture

8 Best Practices To Keep in Mind While Building Data Lake

Posted by Kimi Mahajan on Sep 4, 2019 4:41:00 PM

Data lakes have brought new possibilities and extra transformational capabilities to enterprises to represent their data in a uniform and consumable way in a readily available manner.

However, with the increasing risk of data lakes turning into swamps and silos, it is important to define a usable data lake. One thing is clear when opting for a data lake for your enterprise: it's all about how it's managed.

To help data management professionals get the most from data lakes, let's look into the best practices for building the efficient data lake they're looking for.

Rising Era of Data Lakes

The challenges in storage flexibility, resource management, and data protection gave rise to the use of cloud-based data lakes.

As already detailed in our blog, What is a Data Lake - The Basics, a data lake is a central repository storing all structured, semi-structured and unstructured data in a single place.

The Hadoop file system (HDFS), a distributed file system, underpinned the first version of the data lake. With the increased popularity of data lakes, organizations face the bigger challenge of maintaining an ever-growing lake. If the data in a lake is not well curated, random, hard-to-manage information can flood it, leading to a data swamp.

Keeping Data Lakes Relevant

Data lakes have to capture data from the Internet of Things (IoT), social media, customer channels, and external sources such as partners and data aggregators, in a single pool. There is a constant pressure to develop business value and organizational advantage from all these data collections.

Data swamps can negate the task of data lakes and can make it difficult to retrieve and use data.

Here are best practices to keeping the data lake efficient and relevant at all times.


1. Understand the Business Problem, Admit Only Relevant Data

First and foremost, start with an actual business problem and answer the question: why should a data lake be built at all?

Having a clear objective as to why the data lake is required helps you stay focused, and works well to get the data job done quickly and easily.

A common misconception is that a data lake and a database are the same thing. The basics of a data lake should be clear, and it should be implemented only for the right use cases. It’s important to be sure about what a data lake can and cannot do.

Collecting data without a clear goal in mind can render that data irrelevant. A well-organized data lake easily turns into a data swamp when companies don’t set parameters about the kinds of data they want to gather and why.

Data most important to one department in an organization might not be relevant to another. When such conflicts arise over which kinds of data are most useful to the company at a given time, bringing everyone onto the same page about when, why, and how to acquire data is crucial.

Company leaders should adopt a future-oriented mindset for data collection.

Clearly defined goals about data usage help prevent overeagerness when collecting information.


2. Ensuring Correct Metadata For Search

It’s important for every bit of data in a data lake to carry information about itself (metadata). Creating metadata is quite common among enterprises as a way to organize data and prevent a data lake from turning into a data swamp.

It acts as a tagging system that helps people search for different kinds of data. Without metadata, people accessing the lake may not know how to search for the information they need.
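To make the idea concrete, here is a minimal sketch of such a tagging catalog in Python; the dataset paths, owners, and tags are purely illustrative:

```python
# A minimal metadata catalog: each dataset registered in the lake carries
# descriptive tags so users can discover it by search rather than by guesswork.
class MetadataCatalog:
    def __init__(self):
        self._entries = {}  # dataset path -> metadata dict

    def register(self, path, owner, tags, description=""):
        self._entries[path] = {
            "owner": owner,
            "tags": set(tags),
            "description": description,
        }

    def search(self, tag):
        """Return all dataset paths carrying the given tag."""
        return sorted(p for p, meta in self._entries.items() if tag in meta["tags"])

catalog = MetadataCatalog()
catalog.register("s3://lake/raw/clickstream", owner="web-team", tags=["web", "raw", "events"])
catalog.register("s3://lake/curated/sales", owner="finance", tags=["sales", "curated"])

print(catalog.search("raw"))  # -> ['s3://lake/raw/clickstream']
```

A real lake would back this with a dedicated catalog service rather than an in-memory dictionary, but the search-by-tag contract stays the same.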

3. Understand the Importance of Data Governance

A data lake should clearly define how data is to be treated and handled, how long it should be retained, and more.

Excellent data governance is what equips your organisation to maintain a high level of data quality throughout the entire data lifecycle.

The absence of rules stipulating how to handle the data might lead to data getting dumped in one place with no thought on how long it is required and why. It is important to assign roles to give designated people access to and responsibility for data.

Access-control permissions help users find data and optimize queries according to their roles, while the people assigned responsibility for governing the data reduce redundancies.

Making data governance a priority as soon as a company starts collecting data is crucial; it ensures the data has a systematic structure and management principles applied to it.
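The role assignments described above can be sketched as a simple permission table; the roles, lake zones, and actions here are hypothetical examples, not a prescribed scheme:

```python
# Illustrative role-based access control for lake zones: each role maps to the
# actions it may perform on a given zone of the lake.
PERMISSIONS = {
    "data_engineer": {"raw": {"read", "write"}, "curated": {"read", "write"}},
    "analyst":       {"curated": {"read"}},
    "steward":       {"raw": {"read"}, "curated": {"read", "write", "delete"}},
}

def is_allowed(role, zone, action):
    """Check whether a role may perform an action on a lake zone."""
    return action in PERMISSIONS.get(role, {}).get(zone, set())

print(is_allowed("analyst", "curated", "read"))  # True
print(is_allowed("analyst", "raw", "read"))      # False
```

In practice these rules would live in the platform's own access-control layer, but the principle is the same: every role gets an explicit, auditable set of permissions per zone.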


4. Make Automation Mandatory

An organization needs to apply automation to maintain a data lake before it turns into a data swamp. Automation is becoming increasingly crucial for data lakes and can help achieve the identified goals in each of the phases below:

  • Ingestion Phase

A data lake should not create development bottlenecks for data ingestion pipelines and rather allow any type of data to be loaded seamlessly in a consistent manner.

Early ingestion and late processing allow integrated data to become available quickly for operations, reporting, and analytics. However, there may be a lag between data updates and new insights being produced from the ingested data.

Change Data Capture (CDC) automates data ingestion and makes it much easier for a data store to accept changes from a source database. CDC updates only the changed records instead of reloading entire tables. Though CDC ensures correct record updates, those records still need to be merged back into the main data store.
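A minimal sketch of the CDC idea, using in-memory SQLite databases as source and target and an `updated_at` watermark; the table and column names are illustrative:

```python
import sqlite3

# Sketch of change data capture via a "last updated" watermark: only rows
# modified since the previous sync are pulled and merged (upserted) into the
# target store, instead of reloading the whole table.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
tgt.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")

src.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Ada", 100), (2, "Bob", 150), (3, "Cal", 200)])

def sync_changes(source, target, since):
    """Merge rows changed after `since` into the target; return the new watermark."""
    changed = source.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?", (since,)
    ).fetchall()
    # "INSERT OR REPLACE" upserts on the primary key, re-merging changed rows.
    target.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", changed)
    return max((row[2] for row in changed), default=since)

watermark = sync_changes(src, tgt, since=120)  # picks up only Bob and Cal
print(tgt.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 2
print(watermark)  # 200
```

Production CDC tools usually read the database's transaction log rather than a timestamp column, but the watermark-and-upsert pattern above captures the essential savings over full reloads.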

  • Data Querying Phase

Databases running on Hive or NoSQL need to be streamlined to process data sets as large as the lake might hold. Users also need data visualization to know what exactly to query.

The workaround is to use OLAP cubes or in-memory data models that scale to the level of use the data lake sees.
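As an illustration of an in-memory rollup in the OLAP spirit, the sketch below pre-aggregates a tiny fact table along two dimensions; the records and dimension names are made up:

```python
from collections import defaultdict

# Sketch of an in-memory OLAP-style cube: pre-aggregating a fact table along
# two dimensions (region x product) so users can explore totals without
# scanning the raw records on every query.
facts = [
    {"region": "north", "product": "widget", "sales": 120},
    {"region": "north", "product": "gadget", "sales": 80},
    {"region": "south", "product": "widget", "sales": 200},
    {"region": "north", "product": "widget", "sales": 30},
]

cube = defaultdict(int)
for row in facts:
    cube[(row["region"], row["product"])] += row["sales"]

print(cube[("north", "widget")])  # 150

# Rolling up one dimension gives total sales per region.
by_region = defaultdict(int)
for (region, _product), total in cube.items():
    by_region[region] += total
print(dict(by_region))  # {'north': 230, 'south': 200}
```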

  • Data Preparation Phase

When data in the cloud is not arranged and cleaned, but lumped together with no one knowing what is linked to what or what kinds of insights the business is looking for, automating the processing of raw data becomes confusing and error-prone. Teams need clear goals for what the data lake is supposed to answer.

  • Uniform Operations Across Platforms

Data lakes must be able to generate insights through ad hoc analytics efficiently, to make the business more competitive and drive customer adoption. This can be achieved by creating data pipelines that let data scientists run their queries on data sets, work with different data sets, and compare the results over a series of iterations to make better judgment calls. Since the lake is likely to access data from multiple cloud sources, these pipelines must play well with all of those sources.
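One way such a pipeline might be composed, assuming simple generator-based transforms and mock sources (all names and records here are illustrative):

```python
# Sketch of a small composable data pipeline: records from several (mock)
# sources flow through a shared chain of transforms, so the same query logic
# works regardless of where the data originated.
def from_sources(*sources):
    """Merge records from any number of source iterables into one stream."""
    for source in sources:
        for record in source:
            yield record

def normalize(records):
    """Coerce raw string amounts into floats so downstream steps can compare them."""
    for r in records:
        yield {**r, "amount": float(r["amount"])}

def only_large(records, threshold=100.0):
    """Keep only records at or above the threshold."""
    for r in records:
        if r["amount"] >= threshold:
            yield r

crm_events = [{"source": "crm", "amount": "250.0"}, {"source": "crm", "amount": "40"}]
web_events = [{"source": "web", "amount": "120.5"}]

pipeline = only_large(normalize(from_sources(crm_events, web_events)))
results = list(pipeline)
print(len(results))  # 2
```

Because each stage is a plain generator, a new source or transform can be swapped in without rewriting the rest of the chain, which is the property the ad hoc iterations above depend on.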


5. Data Cleaning Strategy

A data lake can become a data swamp unintentionally, unless the enterprise adheres to a strict plan for regularly cleaning its data.

Data riddled with errors or redundancies is of no use. It loses its accountability and leads companies to incorrect conclusions, and it might take months or even years before someone realizes the data is inaccurate, if anyone ever does.

Enterprises need to go a step further and decide which specific tasks they should perform regularly to keep the data lake clean. Restoring a data lake that has already turned into a swamp can be overwhelming.
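A regular cleaning pass might look like the following sketch, which drops records with missing required fields and collapses duplicates; the field names are illustrative:

```python
# Sketch of a routine cleaning pass: drop records with missing fields and
# collapse duplicates, so errors and redundancies don't accumulate into a swamp.
def clean(records, key="id"):
    seen = set()
    cleaned = []
    for r in records:
        if r.get(key) is None or r.get("value") is None:
            continue              # invalid: a required field is missing
        if r[key] in seen:
            continue              # redundant: duplicate key already kept
        seen.add(r[key])
        cleaned.append(r)
    return cleaned

raw = [
    {"id": 1, "value": 10},
    {"id": 1, "value": 10},    # duplicate
    {"id": 2, "value": None},  # broken record
    {"id": 3, "value": 7},
]
print(clean(raw))  # [{'id': 1, 'value': 10}, {'id': 3, 'value': 7}]
```

Running a pass like this on a schedule, and logging what it rejects, is far cheaper than the forensic effort of restoring a lake that has already become a swamp.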


6. Flexibility & Discovery with Quick Data Transformation

A data lake should allow for flexible data refinement policies, auto data discovery and provide an agile development environment.

Many data lakes are deployed to handle large volumes of web data and can capture large data collections.

Out-of-the-box transformations should be ready for use in the native environment. One should be able to collect accurate statistics and load-control data, and feed those statistics into an operational dashboard for better insight into processes.


7. Enhancing Security and Operations Visibility

User authentication, user authorization, encryption of data in motion, and encryption of data at rest are all needed to keep your data safe and to securely manage the data in the lake.

The data lake solution should be able to provide real-time operations monitoring and debug capabilities and notify with real-time alerts on new data arrivals. In order to extract the most value out of your data, you need to be able to adapt quickly and integrate your data seamlessly.


8. Make Data Lake Multipurpose

A single lake should typically fulfill multiple architectural purposes, such as data landing and staging, archiving for detailed source data, sandboxing for analytics data sets, and managing operational data sets.

Being multipurpose, it may need to be distributed over multiple data platforms, each with unique storage or processing characteristics.

The data lake has come on strong in recent years because it fits today’s data and the way many users want to organize and use it. Its ability to ingest data for both operations and analytics lets it evolve alongside the enterprise’s requirements for business analytics and operations.

Are you interested in exploring how data lakes can be best utilized for your enterprise? Contact us to get the conversation started.

Topics: Data Engineering & Analytics, Architecture

