The Person API is a web service that allows people to programmatically interact with common identity data about a person. We on the API Team within DoIT Enterprise Integration released this API in the summer of 2022 and have continued to iterate on it, adding new features and data. In this blog post, I’m going to share what we use to deploy and operate the Person API, and why we made some of the choices we did.
One of the benefits of using an API is that it allows the API provider and consumer to change their internal architecture without affecting the API integration layer. Given that, the information in this post will become outdated as the Person API continues to evolve.
Runtime Architecture
The Person API is written in Python and is deployed on Google Cloud Platform (GCP). We chose Python for several reasons:
- We already had a few other services written in Python, so we could benefit from existing knowledge.
- Python is a popular language, making it easy to troubleshoot issues or get ideas from online resources. Its popularity also helps when we need to add additional members to our team.
- We found a library which helps implement the JSON:API specification in Python.
- Python runs well in serverless function platforms like Google Cloud Functions (more on that later).
To run our code, we already knew we wanted to use a cloud platform such as Amazon Web Services (AWS). All our new services are deployed to a cloud platform by default, since that allows us to provision and interact with infrastructure through APIs, command line interfaces, and the platforms’ associated SDKs. Being able to manage infrastructure programmatically, without sending an email or scheduling a meeting, provides a huge boost in efficiency. When we plan our sprints, we try to increase the predictability of our work by decreasing the number of dependencies we have on other work or teams. Cloud platforms are a key part of reducing those cross-team dependencies.
Our team had already been running several other services in AWS, but we decided to use GCP for the Person API because our new API management platform, Apigee, was also deployed in GCP. Besides the work we did on Apigee, our team didn’t have much GCP experience, so we figured the Person API would be a good opportunity to see how it compares to AWS. We also liked the idea of reducing network hops by running the Person API in the same region (i.e. localized collection of data centers) as Apigee, since all traffic to the Person API is proxied through Apigee.
Cloud platforms also offer serverless computing, which allows us to run our code without managing servers. A natural evolution of virtual machines and containers, serverless offerings like Google Cloud Functions let you run code in the context of a single HTTP request, and you’re only billed for the time the code executes and the memory and CPU it needs to run (among other metrics). This provides a huge benefit in scaling, since we don’t have to guess how many servers we need to run an application. GCP takes care of scaling for us, and we just get billed based on how much the Person API is being used. We also don’t need to worry about keeping servers patched and running.
So far, I’ve described how the Person API uses Apigee and GCP functions, but what about the data? We use Person Hub, an on-premises data source at UW-Madison that holds common identity data and takes care of merging people who come from multiple data sources (e.g. SIS and HRS) into one identity. However, in the interest of reducing network hops and overall latency, we deployed a database in GCP to serve as an Operational Data Store (ODS) for Person Hub. This means that when the Person API gets called, the entire request is served from within one GCP region. We chose GCP’s US central region in Council Bluffs, Iowa, given it is physically closest to Madison.
For the ODS, we use GCP’s Cloud SQL offering with MySQL. Cloud SQL is a managed service, meaning that GCP takes care of the underlying infrastructure to deploy and manage MySQL, and it automates backups. We just choose how big of an instance we want, and GCP takes care of the rest. In the initial implementation of the Person API, we tried Firestore, GCP’s serverless NoSQL offering, but found many limitations that made it difficult to implement the Person API using JSON:API. The data behind the Person API, like JSON:API itself, is very relational, making a traditional relational database a good choice.
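As a loose illustration of that relational shape (not the actual Person API schema; the tables and columns here are made up), a person with related name records could be modeled like this with SQLAlchemy:

```python
# Hypothetical illustration of the relational shape of person data.
# Not the actual Person API schema: table and column names are made up.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Person(Base):
    __tablename__ = "person"
    id = Column(String(36), primary_key=True)  # stable person identifier
    names = relationship("Name", back_populates="person")


class Name(Base):
    __tablename__ = "name"
    id = Column(Integer, primary_key=True, autoincrement=True)
    person_id = Column(String(36), ForeignKey("person.id"), nullable=False)
    type = Column(String(32))    # e.g. "preferred" or "legal"
    first = Column(String(128))
    last = Column(String(128))
    person = relationship("Person", back_populates="names")


# In practice this would point at the Cloud SQL (MySQL) instance;
# SQLite keeps the sketch self-contained.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
```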
Making and Deploying Changes
Given the amount of control and scalability GCP affords us, we are able to make multiple production deployments every sprint, without any downtime. None of these deployments happen without rigorous automated tests and peer review, though. We use GitLab to store our code, facilitate team review and feedback, and test and deploy the Person API.
We use a single main branch in the Person API repository and use environment variables during deployment to differentiate between environments. The main branch and our production environment are always at the same version. When making a change, we branch off the main branch (i.e. create a “working branch”), make our changes, and create a merge request. In GitLab, a merge request is a tool that makes it easy to visualize the changes someone wants to merge into the main branch and deploy to production. Merge requests also provide an easy way for other team members to give feedback on a change, allowing comments on specific lines of code if needed.
When a merge request is created, a pipeline runs to test the code, deploy it to the development environment in GCP, and then run some post-deployment tests. Any broken step in that process (e.g. a single failing test) stops the whole pipeline until the error is corrected and committed to the working branch, at which point the pipeline runs again.
The first step in the pipeline is to run pre-deployment tests on our code. This includes:
- Unit tests and integration tests on all our Python code. In addition to the hundreds of automated tests themselves, we use a tool to measure our code coverage to make sure that we are testing a large portion of our code. If that tool detects we are testing under 90% of our code, it fails the pipeline.
- Running our code through linting and formatting checkers.
- Validating that the example responses in our OpenAPI specification align with the schemas documented in the specification (a sketch of this kind of check follows this list).
- Validating our Apigee proxy configuration.
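As a rough sketch of that example-versus-schema check (not necessarily the tooling we actually use), the jsonschema library can validate each documented example against its response schema. The file name, content type, and spec layout below are assumptions:

```python
# Sketch: validate each documented example response against its schema.
# Assumes a spec named "openapi.yaml" with inline (non-$ref) schemas and
# JSON:API ("application/vnd.api+json") responses; also note that OpenAPI
# schemas are not strictly identical to JSON Schema.
import yaml
from jsonschema import validate

with open("openapi.yaml") as f:
    spec = yaml.safe_load(f)

for path, path_item in spec.get("paths", {}).items():
    for method, operation in path_item.items():
        if not isinstance(operation, dict):
            continue  # skip path-level keys like "parameters"
        for status, response in operation.get("responses", {}).items():
            content = response.get("content", {}).get("application/vnd.api+json", {})
            if "schema" in content and "example" in content:
                # Raises jsonschema.ValidationError on a mismatch.
                validate(instance=content["example"], schema=content["schema"])
                print(f"OK: {method.upper()} {path} {status}")
```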
Next, we deploy the code to our development environment. While GitLab does the work of initiating the deployment, the main tool we use here is Terraform. Terraform is what’s known as an Infrastructure as Code (IaC) tool, commonly used with cloud platforms. IaC is the practice of declaring the desired state of your infrastructure as code. We see several benefits from this practice, and from Terraform as our preferred IaC tool:
- We are able to ensure that our production and non-production environments are deployed in an identical way, increasing our confidence that a successful deployment to our development environment will also be successful in production.
- We are able to track changes to our infrastructure just like we track changes to our code, by using merge requests and version control.
- Terraform takes care of determining whether a change to a resource can be applied in place or requires the resource to be destroyed and recreated. It also implicitly knows how resources depend on each other, so it deploys them in the correct order.
- IaC is a great disaster recovery tool, since we could use our Terraform code to redeploy the Person API in a brand new environment, if needed. Further, since all the data for the Person API is sourced from Person Hub, we could easily bring the Person API back online if our entire GCP account was destroyed.
After Terraform finishes, a tool called Liquibase runs to manage the schema for the Person API database. This is known as the practice of database migrations, which, similar to IaC, allows us to declare and manage the structure of our database in version control. Liquibase runs our database migrations to make sure our tables and users are set up correctly.
Next, we run a couple post-deployment tests:
- Load testing runs to see how the API performs under pressure. We simulate a high amount of API traffic to make sure the changes we made didn’t have an adverse effect on the API’s performance (a sketch of such a test follows this list).
- We use a tool called Dredd to test the API by calling it and comparing the responses against the OpenAPI specification. This ensures that our API behaves the way its documentation says it will.
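As one example of what such a load test can look like, here’s a sketch using Locust (purely for illustration; the paths, weights, and lack of authentication are simplifications):

```python
# Sketch of a load test using Locust (https://locust.io); shown for
# illustration only. Run with: locust -f loadtest.py --host=<API base URL>
from locust import HttpUser, task, between


class PersonApiUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests.
    wait_time = between(1, 3)

    @task(3)
    def list_people(self):
        # /people gets the most traffic, so weight it more heavily.
        self.client.get("/people")

    @task(1)
    def get_person(self):
        # Hypothetical person id; a real test would use known test data.
        self.client.get(
            "/people/00000000-0000-0000-0000-000000000000", name="/people/{id}"
        )
```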
If the pipeline passes and the merge request is approved by at least one other team member, the original author merges the working branch into the main branch and initiates a deployment to production. The production deployment process is very similar in that it runs our Terraform and Liquibase files against our production infrastructure.
Code Layout
The whole Person API is broken down into several parts in its repository.
API Layer
The API layer is responsible for receiving a request, gathering the necessary data from the Cloud SQL database, and returning it to the client. We use Google Cloud Functions to translate a request into the necessary interactions with the database, serialize the response into the JSON:API format, and return the response to the client. We use one Function for each endpoint (e.g. /people, /people/{id}, /people/{id}/names, etc.), which gives us better control of each endpoint’s configuration and makes it easier to look through logs. The /people endpoint gets the most usage, so we optimize that function for the highest traffic and the most memory- and CPU-intensive operations.
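To make that concrete, here’s a minimal sketch of what a single-endpoint HTTP function can look like with the Python Functions Framework. The fetch_person() helper, attribute names, and URL parsing are simplified placeholders; the real functions rely on the JSON:API library mentioned earlier and are more involved:

```python
# Minimal sketch of an HTTP Cloud Function for a /people/{id}-style endpoint.
# fetch_person() is a hypothetical stand-in for the Cloud SQL lookup.
import json

import functions_framework


def fetch_person(person_id):
    # Placeholder for a Cloud SQL query; returns None if the person is unknown.
    return {"id": person_id, "first_name": "Bucky", "last_name": "Badger"}


@functions_framework.http
def person(request):
    person_id = request.path.strip("/").split("/")[-1]
    record = fetch_person(person_id)
    if record is None:
        return (
            json.dumps({"errors": [{"status": "404", "title": "Not Found"}]}),
            404,
            {"Content-Type": "application/vnd.api+json"},
        )

    # Serialize into a (simplified) JSON:API document.
    body = {
        "data": {
            "type": "people",
            "id": record["id"],
            "attributes": {
                "firstName": record["first_name"],
                "lastName": record["last_name"],
            },
        }
    }
    return (json.dumps(body), 200, {"Content-Type": "application/vnd.api+json"})
```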
Apigee is the first location where an API request lands. Apigee handles authentication, authorization, applying policies such as quotas, and routing the request to the appropriate Google Cloud Function.
We also have a separate “Keep Alive” function which periodically sends requests to the Person API. This helps reduce “cold starts”, one of the downsides of a serverless function architecture: sometimes a request gets routed to an instance of a function that hasn’t started up yet, resulting in a slower response time. The keep-alive function helps prevent clients from experiencing those slower responses.
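The keep-alive function itself can be tiny; here’s a rough sketch, with a made-up base URL and endpoint list:

```python
# Sketch of a scheduled "keep alive" function that pings Person API endpoints
# so warm instances are available. The base URL and endpoint list are made up.
import functions_framework
import requests

BASE_URL = "https://api.example.wisc.edu/person/v1"  # hypothetical
ENDPOINTS = ["/people", "/people/123", "/people/123/names"]


@functions_framework.http
def keep_alive(request):
    # Invoked periodically (e.g. by Cloud Scheduler) over HTTP.
    for path in ENDPOINTS:
        try:
            requests.get(BASE_URL + path, timeout=10)
        except requests.RequestException:
            # A failed ping isn't fatal; the next scheduled run will retry.
            pass
    return ("ok", 200)
```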
ETL Layer
The ETL layer is responsible for synchronizing data from Person Hub to the Cloud SQL database for the Person API. We primarily source data from a change log in Person Hub, which is a table that indicates whether a person was created, updated, deleted, or merged (merges happen when two or more people/identities are combined into one record). We run a process every minute to handle the changes that happened since the last run. Each row in the change log has a sequence number that we use to keep track of where we left off.
We also have the ability to add new people, update existing people, or remove deleted people in the Cloud SQL database in an ad-hoc manner. This is helpful to resynchronize the database in the event of an issue/bug or in disaster recovery situations.
For normal change log processing, we use Cloud Scheduler to initiate a Cloud Run process every minute. Cloud Run is another serverless offering in GCP. It’s similar to Cloud Functions, but has some more flexibility for longer-running processes. For each new row in the change log table, a message is sent to Pub/Sub, which is GCP’s asynchronous messaging/queuing service. The message contains the change log information as well as the current state of the person in Person Hub. A downstream Cloud Function consumes the message and makes the necessary changes in the Cloud SQL database. That function also sends a message for webhook processing, which we’ll cover next.
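Here’s a hedged sketch of that consuming function. Pub/Sub delivers each message to the function as a CloudEvent with a base64-encoded payload; the message fields and the apply_change() helper are hypothetical:

```python
# Sketch of the Pub/Sub-triggered Cloud Function that applies a change-log
# event to Cloud SQL. Message fields and apply_change() are hypothetical.
import base64
import json

import functions_framework


def apply_change(change):
    # Placeholder for the insert/update/delete against Cloud SQL.
    print(f"Applying {change['operation']} for person {change['person_id']}")


@functions_framework.cloud_event
def process_change(cloud_event):
    # Pub/Sub wraps the message data, base64-encoded, inside the CloudEvent.
    payload = base64.b64decode(cloud_event.data["message"]["data"])
    apply_change(json.loads(payload))
    # The real function also publishes a follow-up message for webhook processing.
```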
By using Pub/Sub and a Cloud Function to process change log events, we can more easily scale in the event of a large number of changes. Instead of needing to scale vertically by increasing timeouts and memory/CPU resources, we can lean on Google Cloud Functions to automatically start new instances during periods with many changes, such as term changeovers. This allows us to scale horizontally without manual intervention or unused capacity. Hundreds of change log events can be processed at the same time by “fanning out” events through Pub/Sub to multiple instances of the Cloud Function.
The only constraints we implement are (1) grouping by person so that two functions aren’t processing events for a given person at the same time and (2) limiting the maximum number of Cloud Function instances that can run at a given time to prevent overloading the Cloud SQL database with too many connections, since each Cloud Function instance initiates a new connection to the database.
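One common way to implement that per-person grouping with Pub/Sub is ordering keys; it’s shown below as an illustrative assumption rather than our exact mechanism, and the project, topic, and field names are made up:

```python
# Sketch of publishing change-log rows to Pub/Sub with per-person ordering
# keys so that one person's events are delivered in order. Project, topic,
# and message fields are hypothetical.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-gcp-project", "person-changes")


def publish_change(row):
    # row: a dict built from a change-log row plus the person's current state.
    data = json.dumps(row).encode("utf-8")
    # Messages sharing an ordering key are delivered to the subscriber in order.
    future = publisher.publish(topic_path, data, ordering_key=row["person_id"])
    future.result()  # block until the publish succeeds
```

The cap on concurrent Cloud Function instances (constraint 2) is typically set through the function’s max instances deployment setting rather than in code.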
Webhooks
While processing changes from the change log, the Cloud Function that updates our Cloud SQL database also initiates a webhook event to our webhook topic. This is a separate Pub/Sub topic that has subscriptions for each application that is registered to consume webhooks. Messages sent to the webhook topic are distributed to a downstream Cloud Function, which then forwards the message to the HTTP endpoint managed by the webhook subscriber. Webhooks allow Person API consumers to be notified when a person changes, rather than calling the Person API on a scheduled basis to load a large population of people.
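Here’s a sketch of that forwarding function. How each subscriber’s endpoint is configured is an assumption (shown as an environment variable); the real wiring may differ:

```python
# Sketch of the webhook-forwarding Cloud Function. The subscriber endpoint is
# assumed to come from an environment variable set per deployment.
import base64
import os

import functions_framework
import requests


@functions_framework.cloud_event
def forward_webhook(cloud_event):
    payload = base64.b64decode(cloud_event.data["message"]["data"])
    endpoint = os.environ["WEBHOOK_ENDPOINT"]  # hypothetical configuration
    response = requests.post(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
    # Raising on failure lets Pub/Sub retry delivery.
    response.raise_for_status()
```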
Metrics
We have a few internal Cloud Functions for updating metrics that we use for monitoring and operations. The main metric we track is the difference between Person Hub and the Person API Cloud SQL database. There should be no difference in the number of people between the two databases, so if there is one, we log the specific differences and send an alert to our team for further investigation.
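Stripped down, that check is a count comparison plus an alert; the helpers below are hypothetical placeholders:

```python
# Stripped-down sketch of the Person Hub vs. Cloud SQL consistency check.
# The count and alert helpers are hypothetical placeholders; the real job
# also logs which specific records differ.
def count_people_in_person_hub() -> int:
    raise NotImplementedError  # query the on-premises Person Hub


def count_people_in_cloud_sql() -> int:
    raise NotImplementedError  # query the Person API's Cloud SQL database


def send_alert(message: str) -> None:
    raise NotImplementedError  # notify the team


def check_person_counts() -> bool:
    hub_count = count_people_in_person_hub()
    ods_count = count_people_in_cloud_sql()
    if hub_count != ods_count:
        send_alert(
            f"Person count mismatch: Person Hub={hub_count}, Cloud SQL={ods_count}"
        )
    return hub_count == ods_count
```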
Networking
There are some general networking resources used by most of our infrastructure. We have a VPC (Virtual Private Cloud) which contains a Cloud NAT (Network Address Translation) gateway that handles all our outbound traffic to Person Hub, so that our traffic comes from a static IP address. We also use a Serverless VPC Access connector, which allows Cloud Run and Cloud Functions to communicate with our Cloud SQL database. The Cloud SQL database runs on a private internal subnet, so there is no way to access it from outside the VPC.
Summary
The Person API has been a great experience in learning new technologies and dealing with distributed systems. There’s a lot of knowledge our team will be able to reuse for future APIs, even ones deployed outside of GCP. Feel free to contact api@doit.wisc.edu if you have any questions about the details I’ve described in this post or about the Person API in general. More information about the Person API, including how to get access, can be found on the Developer Portal.
– Jared Kosanovic
Enterprise Integration – Technical Lead
Division of Information Technology