Transcript
Hello everyone.
My name is Jen Taghi.
I'm a software engineer at Salesforce.
I'm part of the Slack engineering team, and today I'm going to be talking about scaling enterprise development with cloud IDEs: security and performance at scale.
So this presentation is inspired by a project that I led at Slack, which involved the development of a new remote development environment platform that allowed engineers here to move from developing on their laptops to a shared remote development environment.
If you've used products like Gitpod and other cloud IDEs, you might be familiar with the idea.
But this is about building something from the ground up that meets your unique needs.
Alright, so with that, I'll get started.
I think the first question is: why do we even need this?
What are the challenges that are unique to enterprise software development?
We have to navigate huge code bases and distributed teams.
Everybody's working on certain subsets of the code base, which makes local development extremely slow and hard to manage.
Oftentimes, if you're working on your laptop, it might take you minutes, if not hours, to be done with your builds, because your local laptop just doesn't have the hardware or the compute capacity required to build these large code bases.
Even with the move away from monoliths to microservices, you still have extremely complicated code bases and development environments, and that requires extremely heavy toolchains.
If you're doing iOS development, Android or web development, or some ML/AI work, you need to install and set up those tools, and you need to make sure that they continue to work, and that is a huge undertaking.
So that's why this move to cloud IDEs is preferred.
Cloud IDEs are remote, cloud-hosted development environments.
They're usually accessed via a remote IDE that is connected to your machine using a secure SSH channel.
So you get access to a terminal, just as you would on your laptop.
You can run commands, Git commands, and any other local CLI tools that you would run on your laptop; you'll do it on the virtual machine.
You get your own development environment, so you can use any regular browser to test your changes.
So the idea is that this offloads the heavy computing to the cloud, which scales really well, thereby making for a much more pleasant development experience.
So I think I did cover this, but just to enumerate it further: the motivation for moving to cloud IDEs is mostly to navigate large code bases and these heavy workloads, to be able to work in a more performant, efficient way.
But I think the more important point is reliability.
If you're working off of your laptop, there's going to be some toolchain that breaks with every patch, every update that you make.
And everyone will have their own unique set of challenges.
So you end up with things like, it works on my machine but not on yours.
The idea is to have a consistent development experience across the enterprise, where the development setup is more reliable because the VM and the OS version there don't change, and if they change, they change for everyone.
That means remediation and changes can be made to make sure your toolchain stays stable.
So let's talk about the particular problem at Slack.
We already had a rather complex development setup.
But then what happens when you acquire new companies, which is the case here, Slack itself is an acquisition, and you're integrating with other novel products that were not built as part of your code base?
Now you're left with features that have a footprint in multiple code bases, often in different GitHub accounts, even different GitHub Enterprise accounts.
You still need to provide your developers a manageable development environment that navigates these different code bases, and you want to allow them to do end-to-end testing so that development can be smooth.
It's going to be tough if you make changes to one repo, then wait for it to sync, then make changes to the other side and use some feature flags or some versioning to test things together; that's going to slow you down considerably.
So to solve that problem in particular, we had a product where development required managing both code bases.
What we did was build a custom development environment where we had two Docker containers running both code bases side by side, and then we relied on port-to-port communication so that the two containers could talk to each other.
That gave us an environment where the product could be tested holistically, and we had the toolchains to manage both code bases as well.
This was really transformational, in that suddenly you could iterate and develop features for this product much quicker.
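As a rough sketch of the two-containers-side-by-side idea, here is what it might look like with the Python Docker SDK. The image names, ports, network name, and environment variable are hypothetical, not our exact setup.

```python
# Minimal sketch: two code bases in two containers on one dev VM, talking to
# each other over a shared bridge network (port-to-port communication).
import docker

client = docker.from_env()

# A user-defined bridge network lets the containers resolve each other by name.
client.networks.create("remote-dev", driver="bridge")

backend = client.containers.run(
    "acme/backend-dev:latest",          # code base #1, e.g. the services repo
    name="backend",
    network="remote-dev",
    ports={"8080/tcp": 8080},           # exposed so a browser can reach it
    detach=True,
)

frontend = client.containers.run(
    "acme/frontend-dev:latest",         # code base #2, e.g. the web client repo
    name="frontend",
    network="remote-dev",
    ports={"3000/tcp": 3000},
    environment={"BACKEND_URL": "http://backend:8080"},  # port-to-port link
    detach=True,
)
```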
What are the benefits over traditional local setups?
For one, scalability and speed: like I said, you get high-end compute, and it's much more elastic.
Depending on your use case, the VM that is provisioned to you might have much higher compute, versus when you're doing something simple, like maybe some HTML and JavaScript development, you might get a more lightweight machine.
It depends on what you're doing.
It makes for a consistent development experience.
You have your entire development team working on a setup that is consistent, so everyone's experience is the same.
It leads to a much more secure and auditable development environment, because you're not having to pull confidential company artifacts to your local laptop; rather, you're working on a VM that is much more hardened when it comes to security practices.
You have the principle of least privilege, and IAM role-based access controls are enabled there.
And most importantly, it leads to faster onboarding.
You have new engineers who are not spending time setting up their laptops.
With one CLI command, they get a VM that is fully provisioned, fully set up with everything required for development, so they can be productive from day one.
So what does a high-level architecture overview for these remote development environments look like?
Each developer gets a personal remote workspace with isolated resources: it shares underlying infrastructure with the other VMs, but it doesn't share any logic or data.
So it's completely isolated, and it's as good as working on your own laptop, except it's faster.
You're connected to these VMs through TLS-encrypted tunnels, usually through secure SSH.
The VMs that power this development experience are usually built from standard base images.
So what you do is, even if you're supporting three or four flavors of development environments, you can identify the commonalities: maybe you need Linux installed on all of them, maybe you need some other binaries installed on all of them.
You can distill all of the common parts and bake those into an image, and you use these base images to build these development environments.
That just means those things are hardened and not left to be dealt with by the developer, which means you're going to have a much smoother development experience.
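As an illustration of the "bake the common parts into a base image" step, here is a minimal boto3 sketch; in practice a pipeline like Packer or EC2 Image Builder is more typical, and the instance ID and image name below are placeholders.

```python
# Sketch: snapshot a builder instance, on which the common toolchain (OS
# packages, shared binaries, Docker, etc.) has already been installed, into a
# reusable base AMI for all dev environment flavors.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.create_image(
    InstanceId="i-0123456789abcdef0",            # placeholder builder instance
    Name="remote-dev-base-2024-06-01",           # placeholder image name
    Description="Common dev toolchain baked in for all environment flavors",
)
print("Base AMI:", resp["ImageId"])
```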
Again, by default it is going to be a much more secure experience, just by the fact that you're not using your own laptop for development.
Most of these development platforms are going to have SSO enabled.
These VMs are mostly going to be in your corp intranet, behind the firewall, so without proper access they're not going to be accessible to anyone.
And again, the usual privileges and access control mechanisms will also be available here.
In terms of performance optimization, you're not going to have cold starts, because you'll be able to pre-build things and have things like a remote cache.
Usually within a couple of minutes your machine is ready, because with the shared VM pool you will have prebuilt images that have those dependencies already installed.
You will have a warm pool of VMs ready, where these environments are ready to go: depending on your usage, you might have a set of VMs that are already provisioned and are just handed to the developer when requested.
So you don't really need to wait for the developer to issue the command before you start building these machines.
At Slack, we used AWS Auto Scaling groups, and with that we have sub-two-minute, around 90-second, startup times.
Like I said, you can cache dependencies and builds, so you don't have to do everything every time, only when things change.
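For the warm-pool idea specifically, here is a minimal boto3 sketch against an AWS Auto Scaling group; the group name and sizes are hypothetical, not Slack's actual configuration.

```python
# Sketch: keep a pool of pre-initialized instances attached to the dev-VM
# Auto Scaling group so new workspaces start in seconds, not minutes.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_warm_pool(
    AutoScalingGroupName="remote-dev-workspaces",   # hypothetical ASG name
    MinSize=5,                       # always keep a few instances warmed up
    PoolState="Stopped",             # cheaper than keeping them running
    InstanceReusePolicy={"ReuseOnScaleIn": True},
)
```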
So what is the developer flow while using these remote development environments?
First, you're going to request a new development environment.
You'll do it with a simple CLI command: you'll go to your terminal and maybe you'll write something like git remote dev, your branch name, and some configuration.
Let's say you want to do frontend development: you'll say git remote dev, the branch name, and frontend, or backend, ML, or any other configuration that you want to provide.
Depending on that configuration, the system, at least our platform at Slack, automatically provisions these VMs.
It's going to install all the required dependencies.
So if the development environment requires Xcode utilities, or cryptographic utilities for key encryption or data encryption, Git pre-commit and post-commit hooks, any linters that you might need, or any other tools, those kinds of things will be installed.
Usually you have to write a script for these, and every enterprise will have things that are specific to them.
At Slack, we had our own shell scripts and Chef recipes, so that when these VMs came up, the volumes were mounted, the secrets were fetched from the secret store, and everything that needed to be set up and installed on these machines was done.
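To make the provisioning steps concrete, here is an illustrative Python sketch of the kind of script that runs when a VM comes up. The commands, paths, secret name, and the fetch-secret helper are hypothetical stand-ins for the shell scripts and Chef recipes described above.

```python
# Illustrative provisioning sketch: mount volumes, fetch secrets, run the
# configuration-management step for the requested flavor, check out the branch.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def provision(flavor: str, branch: str):
    # 1. Mount the shared volume that holds caches and prebuilt artifacts.
    run(["mount", "/dev/xvdf", "/mnt/dev-cache"])

    # 2. Fetch secrets needed by the toolchain from the secret store
    #    ("fetch-secret" is a hypothetical internal CLI).
    run(["fetch-secret", "--name", "github-deploy-key", "--out", "/etc/dev/keys"])

    # 3. Run the config-management step for this flavor (hooks, linters,
    #    Xcode or ML toolchains, and so on).
    run(["chef-client", "--runlist", f"recipe[remote-dev::{flavor}]"])

    # 4. Check out the requested branch so the developer can start immediately.
    run(["git", "-C", "/workspace/app", "checkout", branch])

if __name__ == "__main__":
    provision(flavor="frontend", branch="my-feature-branch")
```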
Okay.
When it's provisioned, which, like I mentioned, takes about 90 seconds, your remote environment is ready for use.
You can connect to it via SSH from VS Code, Cursor, wherever.
You get access to a terminal, so it's like working on your own laptop, except you have access to a VM which is much more reliable and much more powerful.
And then you can code away.
You get a branch checked out with the branch name you provided, and you can start checking things in and merging code, because the VM is also integrated with Git, with your source control system.
At Slack at least, we have dynamic provisioning, which means that in the config, in the CLI command, we indicate what kind of development environment we need.
If it is frontend development, maybe you'll get a machine with different parameters, maybe an m-family xlarge or something; I'm talking about Amazon EC2 instance types.
And if you indicate you need machine learning, data science, or backend, you can predefine how many CPU cores and how many gigabytes of RAM should be provisioned.
According to your enterprise development needs, you can set these parameters.
So this gives you that level of control: if you're doing something really compute intensive, you can ask for a bigger, much more powerful machine.
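Here is a small sketch of what flavor-based dynamic provisioning could look like: map the requested flavor to an instance type before launching. The instance types and the AMI ID are illustrative, not Slack's real values.

```python
# Sketch: choose the EC2 instance type from the requested environment flavor.
import boto3

FLAVORS = {
    "frontend": "m5.xlarge",     # lighter: web builds, JS tooling
    "backend":  "m5.2xlarge",    # heavier: service builds, integration tests
    "ml":       "g4dn.2xlarge",  # GPU-backed for ML / data-science work
}

def launch_dev_vm(flavor: str) -> str:
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # the baked base image (placeholder)
        InstanceType=FLAVORS[flavor],
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "dev-flavor", "Value": flavor}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]
```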
In terms of accessing code on these VMs: obviously, once everything is set up, the first thing you want to do is download your code, right?
You want to check out your Git repository.
There are a few ways to do that; on laptops you basically run your Git commands.
One of the options, at least when we were thinking about this at Slack, we were evaluating these two approaches.
One is SSH agent forwarding.
When we work on a laptop, we basically use our SSH keys, which are added to GitHub, and we are able to fetch, pull, push, and merge our GitHub repo.
So we just forward our SSH credentials to the VM, and then we run these commands from the VM; it uses our SSH credentials from the laptop.
It's done over the encrypted SSH connection, so this is completely safe.
And it doesn't require any other setup, because it's as if you are issuing these Git commands from your local laptop.
The other way to authenticate to GitHub to be able to use the code is to create a GitHub OAuth app.
It is a managed way of doing things, with very easy setup and integration.
But you do have to be mindful of token scopes and revocation, making sure the tokens are not expiring, and renewing your secrets and keys.
And with tokens there is always the potential of misuse if they're compromised.
So we ended up going with SSH agent forwarding for all these reasons.
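As a small sketch of the agent-forwarding approach, here is what it might look like with paramiko: the Git command runs on the remote dev VM but authenticates to GitHub with the keys held by the laptop's local ssh-agent. The hostname, user, and repo path are hypothetical.

```python
# Sketch: forward the local ssh-agent into a remote session so Git on the VM
# can use the laptop's SSH keys without copying them anywhere.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("devbox.corp.example.com", username="jdoe")

session = client.get_transport().open_session()
paramiko.agent.AgentRequestHandler(session)   # forward the local ssh-agent
session.set_combine_stderr(True)              # capture git's full output

session.exec_command("git -C /workspace/app pull origin main")
print(session.recv(4096).decode())
```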
Now, because you're building this development environment yourself, you can build some unique features that cater to your use cases.
I'm going to talk about one of them here.
One of the features that we implemented for this is frontend grafting.
Frontend grafting allows us to basically graft, or put, the frontend assets or bundles, meaning the HTML, JavaScript, or React, all of that, from another VM onto, let's say, a different development environment, or even staging or prod code.
Oftentimes we build something, but we are not able to test it with production-like data or shapes or configuration, and the development environments we use just lack that sort of data to be able to confidently test things.
So we built a grafting mechanism to test with real-world data, where you can go to prod and then, using special query params, you are able to say: fetch all the frontend assets and bundles from this VM, the remote development environment that you are using to build the frontend.
Again, this is all secure; you have to be within the company firewall, so there's no potential of misuse here, but it's a very innovative way of using a frontend that is served from one environment and a backend that is from another environment.
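To make the grafting idea concrete, here is a minimal Flask sketch, not the actual Slack implementation: if a request carries a hypothetical query param naming a dev VM on an internal domain, the bundles are served from that VM instead of the ones deployed with this environment.

```python
# Sketch: graft frontend bundles from a dev VM onto another environment via a
# query param. The param name, internal domain, and paths are hypothetical.
from urllib.parse import urljoin
from flask import Flask, request, redirect, send_from_directory

app = Flask(__name__)
ALLOWED_SUFFIX = ".dev.corp.example.com"   # hypothetical internal-only domain

@app.route("/assets/<path:bundle>")
def serve_bundle(bundle):
    graft_host = request.args.get("frontend_host")  # e.g. jdoe-dev-42.dev.corp.example.com
    if graft_host and graft_host.endswith(ALLOWED_SUFFIX):
        # Redirect the browser to the grafted VM's copy of the bundle.
        return redirect(urljoin(f"https://{graft_host}/assets/", bundle))
    # Default: serve the bundle that shipped with this environment.
    return send_from_directory("static/assets", bundle)
```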
And there are obvious benefits here.
For our product, we had to do it this way, because the frontend, or at least some of the assets, lives in one code base and the backend services are in another.
So that was one innovative solution that we came up with.
Building our own custom remote development environment is what enabled us to build this; no other off-the-shelf cloud IDE would provide it.
So yeah, these development environments are integrated with the usual enterprise systems that you need for development: version control, PR workflows; I already mentioned that.
When the VMs come up, you use Visual Studio Code, you already have your Git code checked out, so you're able to look at Git history, create branches and forks, do the usual things that you do, create pull requests, all sorts of things.
They're also integrated with the CI/CD systems; because you are building it yourself, it's like a bare-metal VM and you can do anything on it.
You can have integration with CI/CD pipelines, you can run your tests, issue commands to run tests in Jenkins if that's what you use, and you can trigger CI workflows from the cloud IDE.
Any other tool that you might need, let's say custom tools for code reviews, issue trackers like Jira, triggers or hooks, any other dashboards, everything can be accessed, because it's basically another computer that you've been given access to.
Then there is the integration with enterprise secret management.
At most enterprise software companies, there is some sort of secret store, and keys are pulled from the secret store at runtime and rotated at a cadence.
That integration is also done, and we did that for our use case.
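As a sketch of that secret-store integration, here is what pulling a credential at runtime could look like, using AWS Secrets Manager as an example; the secret name is hypothetical and your enterprise may use a different store entirely.

```python
# Sketch: fetch a rotated credential from the secret store at VM startup
# instead of baking it into the image.
import json
import boto3

def load_dev_secret(name: str = "remote-dev/github-app-key") -> dict:
    sm = boto3.client("secretsmanager", region_name="us-east-1")
    resp = sm.get_secret_value(SecretId=name)
    # The store rotates this on a cadence; we always read the latest version.
    return json.loads(resp["SecretString"])
```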
There also needs to be an operational playbook and a rollout strategy.
If you build this from the ground up, it's not like you suddenly announce it and it's GA; you have to treat it like a product that just happens to be used internally.
So maybe start with a small pilot team.
Get them to use it, because developers are your users: as they're using it, they're going to have feedback that you can incorporate into the product lifecycle.
You can prioritize the features that developers need the most.
Maybe they need some tooling that is missing, or maybe they need linters, which is extremely important, or the ability to run tests from these IDEs.
That's the kind of feedback you will get.
And you're going to onboard these teams in phases, and that will allow you the time and the feedback necessary to build something useful.
You should have predefined playbooks for operational tasks, because this is going to be a new VM: maybe some of the developers have only worked on Windows or Mac, and suddenly they have to work with a Unix or Ubuntu OS, depending on what your VM has.
So you need to have these operational tasks, these scripts, maybe even Docker lifecycle commands, everything documented in a playbook so that developers can follow it.
That is connected to the training and documentation bit, and as you iterate you can add more features, because there is some barrier to entry here.
There is a considerable initial investment, so you are not going to be able to build everything at once.
You have to prioritize what parts of the code base or the development flow you can support first, and that is where the feedback is going to be important.
Obviously, for something like this, you have VMs and these essentially auto-scaling groups that are being shared, but you want to make sure you do it in a cost-effective manner.
So what are some strategies you can use to make sure there is no overspend?
There are a few strategies that we use.
These development environments are provisioned on demand and they are ephemeral.
By default they will have a lifecycle of, let's say, a week or two weeks, and then you get warnings, and then they're killed, and that capacity is available for someone else.
Beyond that, dev environments can suspend, or go to sleep, when idle time is detected, like 30 minutes or one hour, things like that.
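Here is a small sketch of such an idle-suspend policy: stop dev VMs whose last recorded activity is older than a threshold. The tag name, timestamp format, and 30-minute cutoff are illustrative; any activity signal (SSH sessions, IDE heartbeats) would work.

```python
# Sketch: suspend idle dev VMs. Assumes each instance carries a hypothetical
# "dev-last-active" tag holding an ISO-8601 UTC timestamp, e.g.
# "2024-06-01T12:00:00+00:00".
from datetime import datetime, timedelta, timezone
import boto3

IDLE_CUTOFF = timedelta(minutes=30)

def suspend_idle_vms():
    ec2 = boto3.client("ec2", region_name="us-east-1")
    now = datetime.now(timezone.utc)
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": ["dev-last-active"]},
                 {"Name": "instance-state-name", "Values": ["running"]}],
    )["Reservations"]
    for res in reservations:
        for inst in res["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            last_active = datetime.fromisoformat(tags["dev-last-active"])
            if now - last_active > IDLE_CUTOFF:
                ec2.stop_instances(InstanceIds=[inst["InstanceId"]])
```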
You also have to be mindful of what size of instances you're using; that's where the config comes in.
It's important that you map the hardware you're providing to the developer's use case: if they're doing some simple frontend development, you have to use the right-size instance, to make sure you're not just throwing hardware at a problem that doesn't require it.
This is multi-tenant by default, because behind the scenes it's basically one machine using virtualization: you have multiple VMs, and they're all running these Docker containers that help you develop.
And you need to keep an eye on the cloud costs.
You have to have billing dashboards, and you need to have alerts set up to make sure you are not crossing your budgets.
And if you are, you might need to provision more capacity.
It's a good problem to have if you get that kind of adoption, but it certainly requires preparedness.
And like I said, realistically speaking, you cannot just throw more hardware at things.
While you want faster performance for your developers, you also have to balance it with cost.
Every organization will stumble upon the right trade-off for them, depending on where they are, but it's just something to keep in mind.
Having these enterprise cloud IDE platforms, because this is a consistent development experience, provides parity.
You can enforce policy guardrails, you can have auditing, and you can have access and compliance management.
For every audit, because you're creating an audit trail, you can be SOC 2 compliant by default.
Admins will be able to restrict the images, the machine sizes, and the network access.
There's a lot more control that you will have.
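A tiny sketch of what those admin guardrails might look like in practice: validate a requested workspace against allow-lists before provisioning and emit an audit record. The allowed values and the audit sink are hypothetical.

```python
# Sketch: policy guardrails plus an audit trail for provisioning requests.
import json
import logging

ALLOWED_IMAGES = {"remote-dev-base-2024-06-01"}                 # admin-approved AMIs
ALLOWED_SIZES = {"m5.xlarge", "m5.2xlarge", "g4dn.2xlarge"}     # admin-approved sizes

audit_log = logging.getLogger("devenv.audit")

def validate_request(user: str, image: str, size: str) -> None:
    if image not in ALLOWED_IMAGES:
        raise ValueError(f"image {image!r} is not approved by admins")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"instance size {size!r} exceeds policy")
    # Every provisioning request lands in the audit trail.
    audit_log.info(json.dumps({"event": "provision", "user": user,
                               "image": image, "size": size}))
```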
When the development is happening on these remote development environments, there is going to be some level of developer autonomy for sure, because eventually what you want to do is give developers a VM with a terminal where they can install things if they need to for their use case.
Let's say there is an IDE extension they want, or they want a custom linter or some other package; they can do all of that.
So there's that level of autonomy too, but the core pieces of the product are going to be hardened.
So let's talk about some of the lessons learned and some recommendations that came out of this project.
Because there is some investment, there is some barrier to entry here; it's probably more suited to bigger code bases.
For it to be a justifiable decision, you have to have real issues scaling your laptop-based development flow.
If you're having those issues, this is the right choice for you.
But if it is still manageable, it might not be justified right away, because the effort is non-trivial; though as your company and your code base grow, eventually you will end up using this.
Always iterate with feedback.
Prioritize the things that matter to the development team, to the developers in your organization.
Provide environment flavors, which means having different categories of development environments where you have the right size of VM and compute resources available for the kind of work you're doing; that will help you be more cost effective.
Just as a side note, Uber created six flavors of these dev environments to cater to different needs.
At Slack, we also have at least four that I can think of, for different things.
Some real-world success stories: I'm proud to say that at Slack we've been very successful with this.
I don't think there are many developers who still use local laptop development for this product in particular, because it's just very difficult to manage those two code bases on a single laptop.
So that's around a 90% reduction, which has been very successful for us.
And some other benchmarks: around a 75% build time reduction, because you are using a bigger machine, and multiple machines that are able to have cached builds and a remote cache.
So you're able to reduce that time, which is a huge boost.
Then there is a case study that came out of Uber, where they had an internal devpod system that allows choosing large machines with up to 48 CPU cores.
That must be for a specific use case, and obviously a laptop can never scale to that kind of requirement.
And the biggest plus is that your engineers are productive from day one, because they're not spending time just setting up machines.
So to conclude, I think the key takeaway is that moving to enterprise development with cloud IDEs and remote development environments is going to speed up your engineering while making your development workflows extremely secure.
It'll enable you to leverage powerful infrastructure and have centralized control to solve all those reliability issues where things work on one developer's laptop but not another's.
Those kinds of problems you can completely take out by moving to remote development environments.
And you can architect it and cater it to your own requirements; that's the thing.
It's a very transformative project: if it is successful, it's a complete overhaul of your development experience.
It might even have long-lasting changes to your engineering culture, because if it improves development velocity and productivity, it's a good thing for the organization overall.
That's all I had for today.
Thank you for allowing me the opportunity to present at Conf42 Platform Engineering.
Thank you.