Land Technologies

Internal python libraries with CodeArtifact and GitHub Actions

Posted by Elliot Iddon on June 13, 2022 · 4 min read · Tags: aws, cdk, codeartifact, python, pypi

How we solved secure access to internal Python libraries without a VPN in the data team.

The Problem

The initial problem/How did we get here?

We had a growing bundle of shared “library” code in a Python monorepo, which was bundled into each of the builds. This was great for reducing code duplication in our projects, but as the project grew this strategy presented some problems.

Since the shared libraries were version controlled with the main repository, you always got the latest version of the library, whether you liked it or not! If there was a breaking change, you just had to accept it and fix your code.

You also got all the libraries. As those added an ever-growing list of dependencies, you were faced with resolving the dependency collisions between your build and the shared libraries, which was a continually moving target.

The solution was to extract each of these shared “libraries” into a true library in its own repository. Unfortunately, GitHub doesn’t yet provide a hosted solution for Python like it does for Node. We landed on using git clone based dependencies, which pip supports. This resolved the primary problems of getting all the libraries and the latest version whether you liked it or not, but it also had its own problems.

Git dependencies were painful to work with. It was complicated to pass the credentials to Docker builds, and it meant we needed a 2-stage build to avoid burning them into the image. For reasons I no longer remember, some places needed an HTTPS checkout with a GitHub personal access token and some places needed an SSH key, which meant a nasty cat+grep invocation to switch between the two.

There also wasn’t a way to do nice version pinning: you either had to take a specific tag or the latest version. There was no way to loosely pin and take point releases.
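To make "loosely pin and take point releases" concrete, here is a minimal sketch of what a compatible-release specifier such as `private-lib~=1.4.2` means once a library is served from a PyPI-style index. It is a simplified illustration for purely numeric versions, not pip's actual resolver; the function names are hypothetical.

```python
def parse(version: str) -> tuple:
    """Parse a purely numeric version such as "1.4.2" into (1, 4, 2)."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(candidate: str, pin: str) -> bool:
    """Simplified check for a "~=pin" specifier.

    "~=1.4.2" accepts any 1.4.x with x >= 2; "~=1.4" accepts any 1.x with x >= 4.
    Pre-releases, post-releases and epochs are deliberately ignored here.
    """
    pinned = parse(pin)
    if len(pinned) < 2:
        raise ValueError("~= needs at least two version segments")
    found = parse(candidate)

    # Zero-pad so that "1.4" and "1.4.0" compare as equal.
    width = max(len(found), len(pinned))
    found_padded = found + (0,) * (width - len(found))
    pinned_padded = pinned + (0,) * (width - len(pinned))

    # Everything except the last pinned segment must match exactly.
    prefix = pinned[:-1]
    return found_padded[: len(prefix)] == prefix and found_padded >= pinned_padded


if __name__ == "__main__":
    for version in ["1.4.1", "1.4.2", "1.4.9", "1.5.0"]:
        print(version, compatible_release(version, "1.4.2"))
```

A git tag dependency can only express the equivalent of `==1.4.2`, so every point release means editing the requirement by hand.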

We wanted to move to a PyPI-style repository, but there wasn’t an obvious place to host it. We need our internal libraries to be accessible only by authenticated clients, but we don’t operate a VPN, so we need to secure access at the protocol level rather than the network level. Our authenticated clients are CI, dev machines, and Docker containers on CI/dev machines.

CodeArtifact

Enter AWS CodeArtifact, which solves all these problems and gives us some extra goodies. Deploying was a snap since we already do infrastructure as code with CDK. This also makes it easy to manage permissions on the repo since they are managed by IAM and we already have AWS identities for CI and dev machines.

We didn’t need any additional infrastructure. We don’t have to manage EC2 instances, networking or S3 buckets for our library code.

An unexpected boon was the ability to easily enable cross-account read access. We have a single PyPI repository that lives in our primary AWS account, and the appropriate identities in any of our nominated secondary accounts (which we use for things like development environments) can get read access to it. This means that whichever of our AWS accounts we are using to develop a Python project, we can get a token for the single PyPI repository. This eliminates worries about replicating releases between accounts and provides a nice experience for the end user, since they don’t have to worry about which account they are reading from.
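As a rough illustration of cross-account read access, the sketch below builds a repository resource policy that lets other accounts read packages. It is not our actual CDK code: the account ID `444455556666` is a placeholder, the function name is hypothetical, and the action list is a plausible read-only subset rather than a definitive one. Callers in the secondary accounts would also need permission to obtain an authorization token from the domain.

```python
import json

# A read-only subset of CodeArtifact repository actions (assumed sufficient
# for pip installs; adjust to your own requirements).
READ_ACTIONS = [
    "codeartifact:DescribeRepository",
    "codeartifact:GetRepositoryEndpoint",
    "codeartifact:ListPackages",
    "codeartifact:ListPackageVersions",
    "codeartifact:ReadFromRepository",
]


def cross_account_read_policy(account_ids):
    """Build a resource policy granting read access to the given AWS accounts."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{account}:root" for account in account_ids]
                },
                "Action": READ_ACTIONS,
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # 444455556666 stands in for a secondary (e.g. development) account.
    print(json.dumps(cross_account_read_policy(["444455556666"]), indent=2))
```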

CodeArtifact provides a built-in mechanism to treat the public PyPI as an upstream, so we can have a single index URL that provides both our internal and public dependencies. As an added benefit, the public dependencies are cached, so we are insulated from packages going missing in the upstream repository.

Enabling pip install from CodeArtifact is simple with the AWS CLI:

aws codeartifact login --tool pip --domain my_domain --domain-owner 111122223333 --repository my_repo
pip install private-lib==1.0.0 # now searches in my_repo first
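Where the AWS CLI isn't available (inside a Docker build, for example), the same effect can be had by handing pip an index URL that embeds a short-lived authorization token. The sketch below shows the shape of that URL; the function name and the `eu-west-1` region are assumptions for illustration, and the token is expected to come from something like `aws codeartifact get-authorization-token`.

```python
from urllib.parse import quote


def codeartifact_index_url(domain, domain_owner, region, repository, token):
    """Build a pip index URL for a CodeArtifact PyPI repository.

    The username is always "aws"; the password is the authorization token.
    """
    host = f"{domain}-{domain_owner}.d.codeartifact.{region}.amazonaws.com"
    # Percent-encode the token so it is safe in the userinfo part of the URL.
    return f"https://aws:{quote(token, safe='')}@{host}/pypi/{repository}/simple/"


if __name__ == "__main__":
    # "TOKEN" is a stand-in; real tokens are long and expire.
    print(
        codeartifact_index_url(
            "my_domain", "111122223333", "eu-west-1", "my_repo", "TOKEN"
        )
    )
```

The resulting URL can be passed to pip with `--index-url` or the `PIP_INDEX_URL` environment variable. Since it contains a credential, it should be supplied at build time rather than written into an image layer.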

[Figure: CodeArtifact and its clients]

Reusable GitHub Action to deploy

With reads solved for our uses, we were left to create a mechanism to publish to our new repository. We use GitHub Actions for our automated build jobs, so we produced a workflow for a single one of our libraries that:

  1. Checks out the repository
  2. Sets up AWS CLI
  3. Connects pip and twine config to CodeArtifact
  4. Installs the build and publish tools
  5. Builds the library
  6. Pushes it to CodeArtifact
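Steps 3 to 6 above can be sketched as a small script. This is an illustration of the sequence under stated assumptions, not the contents of the published workflow: the function names are hypothetical, and it assumes the AWS CLI is already installed and authenticated (steps 1 and 2 are left to the CI system). It relies on `aws codeartifact login --tool twine` registering a twine repository named `codeartifact`.

```python
import glob
import subprocess


def login_command(tool, domain, domain_owner, repository):
    """Command that points pip or twine at the CodeArtifact repository."""
    return [
        "aws", "codeartifact", "login",
        "--tool", tool,
        "--domain", domain,
        "--domain-owner", domain_owner,
        "--repository", repository,
    ]


def publish_steps(domain, domain_owner, repository):
    """Commands for steps 3 to 5: configure tools, install them, build."""
    return [
        login_command("pip", domain, domain_owner, repository),
        login_command("twine", domain, domain_owner, repository),
        ["python", "-m", "pip", "install", "build", "twine"],
        ["python", "-m", "build"],
    ]


def publish(domain, domain_owner, repository):
    """Run every step, then upload whatever the build left in dist/ (step 6)."""
    for command in publish_steps(domain, domain_owner, repository):
        subprocess.run(command, check=True)
    # subprocess does not expand shell globs, so collect the artifacts here.
    artifacts = sorted(glob.glob("dist/*"))
    subprocess.run(
        ["twine", "upload", "--repository", "codeartifact", *artifacts],
        check=True,
    )
```

Calling `publish("my_domain", "111122223333", "my_repo")` from the root of a library checkout would run the whole sequence and stop at the first failing command.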

We wanted to reuse this workflow in each of our library repositories to cut down on copy-paste and reduce the corresponding maintenance burden, but discovered that a reusable workflow can only be called if it lives in the same repository as the caller or in a public repository.

This turned out to be a blessing in disguise as in concert with the DevOps team we were able to sanitise the account specifics out of the workflow and release it in a public repo under the MIT licence. Now you can deploy your library to CodeArtifact with just a little bit of configuration.

name: Your library deploy workflow
on:
  release:
    types: [published]
jobs:
  build-n-publish:
    uses: landtechnologies/codeartifact-workflows/.github/workflows/python_codeartifact_push.yml@main
    with:
      repository: YOUR_REPOSITORY_NAME
      domain: YOUR_DOMAIN_NAME
      domain_owner: YOUR_AWS_ACCOUNT_ID
    secrets:
      AWS_ACCESS_KEY_ID: YOUR_GITHUB_SECRET_FOR_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY: YOUR_GITHUB_SECRET_FOR_SECRET_ACCESS_KEY
