Recently I had a requirement to “provide a collaboration space on a Big Data pipeline where consumers of data could view and do stuff with their data”. That’s a bit vague, but I started thinking about possible options and decided to investigate a notebook solution.
I’m a big fan of Jupyter notebooks after doing a data mining exercise with Python to learn more about data science. If you are interested, this one on Pluralsight was a beginners’ course that ran machine learning on the Titanic data to predict who would survive, based on a variety of features such as fare paid, cabin number, gender and family size. Apparently the Titanic dataset competition is a classic on Kaggle, but this is a new area for me.
On a side note, it’s interesting to see how the lines between software roles are blurring: data scientists are learning just enough coding to be dangerous, software engineers (like me) are fiddling with data science techniques (just enough to be dangerous), DevOps are coding and coders are DevOpsing.
Anyway, back to the collaboration requirement. At the core of collaboration is security: how do you lock down data so that only the intended audience can see it? We chose Azure AD B2B to allow different organizations to access the system with the flexibility of Azure RBAC. So far it’s been a pleasant experience, although the Azure Portal’s UI for AD is a bit buggy and flat, so we ended up putting our own UI on top of it to map to our application hierarchy (and spare the user the pain of using the portal for managing users and permissions). On the data side of things, we’re providing a Data Lake to allow for the scale of data expected (infinite is a big call; let’s just say that with a lot of IoT devices streaming data every few seconds, just the raw input data starts to pile up quickly). Access is controlled using RBAC and ACLs, and the data is partitioned into folders by consumer. This solves the requirement of storing the vast raw input data and being able to store transformed or aggregated data according to the output requirements (which is normally where the real value-add for customers is).
We also implemented row-level security (RLS) in Azure SQL for multi-tenanted, in-place viewing of transformed, aggregated data by multiple consumers (I’ll save that for another post), which integrated well with AD B2B. This gave us the flexibility for any consumer with an AD-compatible tool to connect to the platform and self-serve (for example, they could connect with Azure Storage Explorer to the data lake or blob store and browse their own data). On the relational database side, any SQL client supporting AD integration (e.g. SSMS) would allow the user to connect to the SQL DB and see their transformed data.
Great so far, but what if the user wanted to run some jobs on their data or munge it with other data in an environment that was familiar to them? They’d need to download or copy the data to their own system, which means lost time and stale data. Jupyter Notebooks is an open-source, visual tool with many integrations that allow users to do just this. It has integration with Git, supports sharing and forking, and works with a variety of languages. Of course, I’m not a fan of spinning up VMs and fiddling around with systems when I could be spending that time on adding business value (or playing with my son!), so luckily Azure provides a serverless option in the form of Azure Notebooks. It’s in preview but seems pretty solid, albeit rough around the edges from a UX point of view. Best of all, there are no servers to manage: PaaS all the way! It also means one less UI I have to write and one less thing my users have to learn.
Next question: as a user, how do you connect Azure Notebooks to the platform’s Azure Data Lake and Azure SQL DB without giving out usernames, passwords or tokens? Ideally we want to use the same AD credentials that we have used throughout the platform. Log in to Notebooks with your Microsoft Live ID (don’t get me started on the UX for the Microsoft login flow) and you will be greeted with your new workspace. Start a “New Library” (connect to GitHub if you want versioning). It’s public by default (watch out for storing secrets and things) but you can make it private if you wish.
Install the following SDKs for Python (assuming version 3.7+)
!pip install azure-mgmt-resource
!pip install azure-mgmt-datalake-store
!pip install azure-datalake-store
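Before moving on, it can be worth confirming that the installs actually landed in the notebook kernel. The helper below is a small sketch, not part of the original notebook: `missing_modules` is a hypothetical name, and the module paths in `REQUIRED` are my assumption of what the three packages above expose (only `azure.datalake.store` is confirmed by the imports used later in this post).

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be found in the current kernel."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "azure") is itself missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Assumed module paths for the three pip packages installed above.
REQUIRED = [
    "azure.mgmt.resource",
    "azure.mgmt.datalake.store",
    "azure.datalake.store",
]

# An empty list means everything is importable.
print(missing_modules(REQUIRED))
```

If the list is not empty, re-running the `!pip install` cells (and restarting the kernel) is the usual first thing to try.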
Import pandas and the Data Lake library, then get an authentication token
This is the part that needs improvement. There are two bits of semi-sensitive information that you will need to expose to your end users to facilitate this, which won’t please your security teams: the Azure AD Tenant ID (find this in the Azure Portal or use the Azure CLI) and the actual instance name of your data lake. With AD B2B at least we’re not exposing usernames and passwords…
import pandas as pd
from azure.datalake.store import core, lib, multithread
token = lib.auth(tenant_id='<AzureADTenantId>', resource='https://datalake.azure.net/')
As soon as you run this you’ll be prompted to open a new browser window and enter the device code given:
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XYZZ to authenticate.
This will link your login with the authentication request and return a token for the user that you log on with in that browser window. Now… you are probably thinking: why can’t I just pass through the AD credentials of the person that is currently logged on to Azure Notebooks? It’s a good question and it’s on my list to investigate. I have not yet found a way around this; however, it does still meet the requirement of AD B2B authentication.
Check that you have a token
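One way to check, sketched here rather than taken from the original notebook, is to summarise the token without printing the secret itself. The helper name `summarize_token` is hypothetical, and the key names it reads (`access`, `tenant`, `time`, `expiresIn`) are my assumption about the dictionary that `azure-datalake-store` keeps on the credential returned by `lib.auth`; adjust them if your version differs. The example runs on a made-up dictionary so it works without signing in.

```python
import time


def summarize_token(token_info, now=None):
    """Describe a token dictionary without revealing the access token."""
    now = time.time() if now is None else now
    issued = token_info.get("time")          # assumed: epoch seconds at issue
    lifetime = token_info.get("expiresIn")   # assumed: lifetime in seconds
    seconds_left = None
    if issued is not None and lifetime is not None:
        seconds_left = int(issued + int(lifetime) - now)
    return {
        "has_access_token": bool(token_info.get("access")),
        "tenant": token_info.get("tenant"),
        "seconds_left": seconds_left,
    }


# Made-up values purely for illustration; not a real token.
example = {
    "access": "redacted",
    "tenant": "<AzureADTenantId>",
    "time": 1000.0,
    "expiresIn": 3600,
}
summary = summarize_token(example, now=1600.0)
print(summary)
```

With a real sign-in, the call would look something like `summarize_token(token.token)`, assuming the credential object exposes its raw dictionary as `.token`.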
Make an ADLS File System client
adlsFileSystemClient = core.AzureDLFileSystem(token, store_name='<DataLakeName>')
print(adlsFileSystemClient.ls())
And that’s all there is to it. Now you can view the contents of the file system, which will be limited by the ACLs you set on the Data Lake.
Depending on the permissions of your logged-on user, you can read data, run transformations and make directories:
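Here is a minimal sketch of those three operations; it is not the exact set of notebook cells from this post. The functions only rely on `open`, `mkdir` and `ls`, which the `AzureDLFileSystem` client provides. `InMemoryFS`, the file path and the CSV contents are all hypothetical stand-ins so the example can run without a Data Lake.

```python
import csv
import io


class InMemoryFS:
    """Hypothetical stand-in for the Data Lake client so the sketch runs offline."""

    def __init__(self, files):
        self.files = dict(files)  # path -> bytes
        self.dirs = set()

    def open(self, path, mode="rb"):
        return io.BytesIO(self.files[path])

    def mkdir(self, path):
        self.dirs.add(path)

    def ls(self, path=""):
        return sorted(set(self.files) | self.dirs)


def read_csv_rows(fs, path):
    """Read a CSV file through any client exposing open(path, mode)."""
    with fs.open(path, "rb") as handle:
        data = handle.read()
    return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


def mean_of_column(rows, column):
    """A trivial transformation: average one numeric column."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


# Made-up data: two device readings in one consumer's folder.
fs = InMemoryFS({"/consumer-a/readings.csv": b"device,temp\na,20.5\nb,21.5\n"})
rows = read_csv_rows(fs, "/consumer-a/readings.csv")   # read data
average = mean_of_column(rows, "temp")                 # run a transformation
fs.mkdir("/consumer-a/output")                         # make a directory
print(average, fs.ls())
```

Against the real platform you would pass `adlsFileSystemClient` instead of the stand-in, and whether each call succeeds depends on the ACLs on that folder.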
You can then connect to your Data Lake with Azure Storage Explorer and confirm that the new directory has been added.
That successfully demonstrates exposing your Data Lake to end users with Azure AD B2B (I guess it is worth pointing out this will work with normal Azure AD too – we just happen to be using the B2B part of it).
Let me know if you have any questions/issues…