Wednesday, 31 July 2024

How to use HashiCorp Cloud as a remote storage for Terraform state file



The Terraform state file keeps track of the infrastructure that is under Terraform's control. Terraform compares the resource configuration files against it to work out which resources need to be added, changed or deleted. If the state file is lost, Terraform no longer knows about the existing resources and will try to re-create all of them.
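To make the comparison concrete, here is a minimal sketch in Python. It is a toy model and not Terraform's actual algorithm: the function name plan_changes and the dictionary layout are assumptions made only for illustration.

```python
def plan_changes(config: dict, state: dict) -> dict:
    """Compare desired resources (config) with recorded ones (state)."""
    to_add = sorted(addr for addr in config if addr not in state)
    to_delete = sorted(addr for addr in state if addr not in config)
    to_change = sorted(
        addr for addr in config
        if addr in state and config[addr] != state[addr]
    )
    return {"add": to_add, "change": to_change, "delete": to_delete}


# Hypothetical resources, keyed by resource address
config = {
    "local_file.foo": {"content": "hello"},
    "local_file.bar": {"content": "new"},
}
state = {
    "local_file.foo": {"content": "hello"},
}

# Only the resource missing from the state is planned for creation
print(plan_changes(config, state))

# With an empty (lost) state, every configured resource looks new
print(plan_changes(config, {}))
```

The second call shows why losing the state hurts: with nothing recorded, every resource in the configuration ends up in the "add" list.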

By default, the Terraform state file (terraform.tfstate) is stored locally, on the machine where we initialize Terraform. This carries two risks. The file, which may contain sensitive data, can be accidentally committed and pushed to a remote repository, which is a security risk. It can also be deleted by accident, which can be a painful experience - see Lessons learned after losing the Terraform state file | Trying things.

To minimize the chance of losing the Terraform state file and to let multiple contributors work on the same infrastructure in parallel, we should define remote storage for it. We can store it in an AWS S3 bucket, Google Cloud Storage etc., but one option that has a free tier and also includes a shared state locking mechanism is Terraform Cloud (now called HCP Terraform).

Here are the steps which explain how to do it.

Sign Up for HashiCorp Cloud Platform (HCP):
  • Go to Terraform Cloud (https://app.terraform.io/) and create an account.
  • Create an organization (e.g. terraform-states) and a workspace (e.g. remote-state-demo) within Terraform Cloud. Workspaces are where state files are stored and managed.
Configure Terraform Cloud Backend:
  • Add the following backend configuration to a file such as terraform.tf:

terraform {
  backend "remote" {
    organization = "terraform-states"

    workspaces {
      name = "remote-state-demo"
    }
  }
}

Login to Terraform Cloud:

$ terraform login
Terraform will request an API token for app.terraform.io using your browser.

If login is successful, Terraform will store the token in plain text in
the following file for use by subsequent commands:
    /home/<user>/.terraform.d/credentials.tfrc.json

Do you want to proceed?
  Only 'yes' will be accepted to confirm.

  Enter a value: yes


---------------------------------------------------------------------------------

Terraform must now open a web browser to the tokens page for app.terraform.io.

If a browser does not open this automatically, open the following URL to proceed:
    https://app.terraform.io/app/settings/tokens?source=terraform-login


---------------------------------------------------------------------------------

Generate a token using your browser, and copy-paste it into this prompt.

Terraform will store the token in plain text in the following file
for use by subsequent commands:
    /home/<user>/.terraform.d/credentials.tfrc.json

Token for app.terraform.io:
  Enter a value: Opening in existing browser session.



Retrieved token for user <tf_user>


---------------------------------------------------------------------------------

                                          -                                
                                          -----                           -
                                          ---------                      --
                                          ---------  -                -----
                                           ---------  ------        -------
                                             -------  ---------  ----------
                                                ----  ---------- ----------
                                                  --  ---------- ----------
   Welcome to HCP Terraform!                       -  ---------- -------
                                                      ---  ----- ---
   Documentation: terraform.io/docs/cloud             --------   -
                                                      ----------
                                                      ----------
                                                       ---------
                                                           -----
                                                               -


   New to HCP Terraform? Follow these steps to instantly apply an example configuration:

   $ git clone https://github.com/hashicorp/tfc-getting-started.git
   $ cd tfc-getting-started
   $ scripts/setup.sh

During this process, a Terraform Cloud token generation page opens in the browser.

terraform login should automatically pick up the token and save it. If that fails, we can copy the token from the page and paste it into the credentials file manually:

/home/<user>/.terraform.d/credentials.tfrc.json:

{
  "credentials": {
    "app.terraform.io": {
      "token": "1kLiQ....h3A"
    }
  }
}
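As a quick sanity check before running terraform init, we could verify that the credentials file contains a token for the host. The following is a minimal sketch that assumes the file layout shown above; the helper name has_token is hypothetical and not part of Terraform.

```python
import json
from pathlib import Path


def has_token(credentials_path: Path, host: str = "app.terraform.io") -> bool:
    """Return True if the credentials file holds a non-empty token for host."""
    try:
        data = json.loads(credentials_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return False
    if not isinstance(data, dict):
        return False
    token = data.get("credentials", {}).get(host, {}).get("token", "")
    return bool(token)


if __name__ == "__main__":
    # Default location reported by "terraform login" on Linux
    path = Path.home() / ".terraform.d" / "credentials.tfrc.json"
    print("token found" if has_token(path) else "no token, run: terraform login")
```

The check only looks at the presence of a token; it does not validate the token against Terraform Cloud.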


This authentication is necessary for the next step:

Initialize the Backend:
  • Run terraform init to initialize the backend configuration.
If we don't log in to Terraform Cloud first, we'll get:

$ terraform init
Initializing HCP Terraform...
│ Error: Required token could not be found
│ 
│ Run the following command to generate a token for app.terraform.io:
│     terraform login

If we're authenticated with Terraform:

$ terraform init
Initializing the backend...

Successfully configured the backend "remote"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing provider plugins...
- Finding latest version of hashicorp/local...
- Installing hashicorp/local v2.5.1...
- Installed hashicorp/local v2.5.1 (signed by HashiCorp)
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Let's assume we have the following resource:

main.tf:

resource "local_file" "foo" {
  filename = "${path.cwd}/temp/foo.txt"
  content = "This is a text content of the foo file!"
}


We can now see the plan:

$ terraform plan
Running plan in the remote backend. Output will stream here. Pressing Ctrl-C
will stop streaming the logs, but will not stop the plan running remotely.

Preparing the remote plan...

To view this run in a browser, visit:
https://app.terraform.io/app/terraform-states/remote-state-demo/runs/run-nbxxG2TBxSYGEgCm

Waiting for the plan to start...

Terraform v1.9.3
on linux_amd64
Initializing plugins and modules...

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # local_file.foo will be created
  + resource "local_file" "foo" {
      + content              = "This is a text content of the foo file!"
      + content_base64sha256 = (known after apply)
      + content_base64sha512 = (known after apply)
      + content_md5          = (known after apply)
      + content_sha1         = (known after apply)
      + content_sha256       = (known after apply)
      + content_sha512       = (known after apply)
      + directory_permission = "0777"
      + file_permission      = "0777"
      + filename             = "/home/tfc-agent/.tfc-agent/component/terraform/runs/run-nbxxG2TBxSYGEgCm/config/temp/foo.txt"
      + id                   = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.


Notice that the plan runs in the remote backend, and the file path is the one on the remote Terraform Cloud machine. This is because we left our workspace on the organization's default Execution Mode, which is Remote, so all resources would be created on the remote machine. That is not what we want: the remote should hold only the state file. Therefore we need to change the workspace's Execution Mode setting to Local:




We can now plan the configuration again (after executing terraform init so that the new Execution Mode gets picked up):


$ terraform plan
local_file.foo: Refreshing state... [id=db5ca40b5588d44e9ec6c1b4005e11a6fd0c910e]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # local_file.foo will be created
  + resource "local_file" "foo" {
      + content              = "This is a text content of the foo file!"
      + content_base64sha256 = (known after apply)
      + content_base64sha512 = (known after apply)
      + content_md5          = (known after apply)
      + content_sha1         = (known after apply)
      + content_sha256       = (known after apply)
      + content_sha512       = (known after apply)
      + directory_permission = "0777"
      + file_permission      = "0777"
      + filename             = "/home/<user>/...hcp-cloud-state-storage-demo/temp/foo.txt"
      + id                   = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.


We can now execute terraform apply, and the changes will be made on the local machine.

If we create a resource on the remote (cloud), we can see it in the web console:




If we create a resource on the remote (cloud) by mistake, we can make Terraform forget it by removing it from the state. Note that this only drops the resource from the state; it does not destroy the resource itself:

$ terraform state list
local_file.foo

$ terraform state rm local_file.foo
Removed local_file.foo
Successfully removed 1 resource instance(s).


All revisions of the state file are listed in Terraform Cloud. 






We can also roll back to one of the previous versions:










After this we need to unlock the state file:








Monday, 29 July 2024

Kubernetes Ingress Service


Ingress is a more flexible and powerful solution for managing external access to services within a Kubernetes cluster than a Kubernetes LoadBalancer Service.

It provides:
  • load balancing
  • SSL termination
  • name-based virtual hosting

Ingress controllers can be configured to handle traffic more efficiently and securely.


Manifest example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80
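
To illustrate how a rule like the one above selects a backend, here is a small Python sketch of host matching plus pathType: Prefix matching (compared path element by path element). It is a toy model under simplifying assumptions, not how any real Ingress controller is implemented; the functions prefix_matches and route are hypothetical names.

```python
def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match: /foo matches /foo/bar but not /foobar."""
    rule_parts = [p for p in rule_path.split("/") if p]
    request_parts = [p for p in request_path.split("/") if p]
    return request_parts[: len(rule_parts)] == rule_parts


def route(rules, host, path):
    """Return (service, port) of the longest matching rule, or None."""
    candidates = [
        r for r in rules
        if r["host"] == host and prefix_matches(r["path"], path)
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: len(r["path"]))
    return best["service"], best["port"]


# Mirrors the single rule from the manifest above
rules = [
    {"host": "myapp.example.com", "path": "/", "service": "my-service", "port": 80},
]

print(route(rules, "myapp.example.com", "/orders/42"))  # ('my-service', 80)
print(route(rules, "other.example.com", "/"))           # None
```

With path / and pathType Prefix, every request for myapp.example.com is sent to my-service on port 80, while requests for other hosts do not match this rule at all.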

---

Friday, 26 July 2024

Introduction to CI/CD Pipeline Development

CI (Continuous Integration) and CD (Continuous Delivery/Continuous Deployment) are practices in modern software development aimed at improving the process of integrating, testing, and deploying code changes. 

image source: A Crash Course in CI/CD - ByteByteGo Newsletter



Continuous Integration (CI)


Continuous Integration is a software development practice where developers regularly merge (integrate) their code changes into a shared repository multiple times a day. Each merge triggers an automated build and testing process to detect integration issues early.

Key Aspects:

  • Frequent Commits: Developers commit code changes frequently, at least daily.
  • Automated Builds: Each commit triggers an automated build to compile the code.
  • Automated Testing: Automated tests (unit, integration, functional) are run to verify that the new code does not break existing functionality.
    • Unit tests verify that code changes didn't introduce any regression at the function level. They should be fast and should run on the dev machine before the code is pushed to the remote (e.g. as part of git commit hooks) and also on the CI server. Running them on the CI server ensures that they are always executed and that they run in a non-local environment, so there is no chance of an "it works on my machine" problem.
    • Integration tests verify that all components/modules of the product work together. Example:
      • API endpoint /createOrder indeed creates an order with all attributes and this can be verified by verifying /getOrder response content
  • Immediate Feedback: Developers receive immediate feedback on the build and test status, allowing them to address issues promptly.
  • Shared Repository: All code changes are merged into a central repository (e.g., Git).
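
The /createOrder and /getOrder example above could look roughly like this as an automated check. This is only a sketch: OrderService is a hypothetical in-memory stand-in, whereas a real integration test would call the running API over HTTP.

```python
import itertools


class OrderService:
    """In-memory stand-in for a service exposing /createOrder and /getOrder."""

    def __init__(self):
        self._orders = {}
        self._ids = itertools.count(1)

    def create_order(self, item: str, quantity: int) -> int:
        order_id = next(self._ids)
        self._orders[order_id] = {"id": order_id, "item": item, "quantity": quantity}
        return order_id

    def get_order(self, order_id: int) -> dict:
        return self._orders[order_id]


def test_created_order_is_returned_by_get_order():
    service = OrderService()
    order_id = service.create_order(item="book", quantity=2)
    # Verify creation through the read endpoint, with all attributes
    order = service.get_order(order_id)
    assert order == {"id": order_id, "item": "book", "quantity": 2}


test_created_order_is_returned_by_get_order()
print("integration check passed")
```

The important part is that the test verifies the write through a different endpoint than the one that performed it, which is what makes it an integration check rather than a unit test.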

Stages: 

  • Locally (on dev machine):
    • code is added to the repository locally, to the feature branch
    • as part of the commit, unit tests are run locally
    • code is pushed to the remote
  • CI server:
    • detects new commit
    • runs unit tests
    • builds the output binary, package or Docker image
    • runs integration tests
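
The CI server stages above can be sketched as a list of steps that run in order and stop at the first failure. This is an illustrative model only; the stage functions are placeholders and the last stage is a hypothetical addition that shows a stage being skipped.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order; stop at the first failing one. Returns a log."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later stages are skipped, giving fast feedback
    return log


stages = [
    ("unit tests", lambda: True),
    ("build", lambda: True),
    ("integration tests", lambda: False),  # simulated failure
    ("publish artifact", lambda: True),    # hypothetical stage, never reached
]

for line in run_pipeline(stages):
    print(line)
```

Ordering the cheap, fast stages first means that most failures are reported before the slower stages start.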

Benefits:

  • Improve the feedback loop
    • Faster feedback on business decisions is another powerful side effect of CI. Product teams can test ideas and iterate product designs faster with an optimized CI platform. Changes can be rapidly pushed and measured for success. Bugs or other issues can be quickly addressed and repaired. [What is Continuous Integration | Atlassian]
      • Early detection of bugs and integration issues: before new code is merged, it must pass the CI test suite, which guards against new regressions.
      • Reduced integration problems.
      • Improved code quality.
      • Faster development cycles.
  • Enables Scaling
    • CI enables organizations to scale in engineering team size, codebase size, and infrastructure. By minimizing code integration bureaucracy and communication overhead, CI helps build DevOps and agile workflows. It allows each team member to own a new code change through to release. CI enables scaling by removing any organizational dependencies between development of individual features. Developers can now work on features in an isolated silo and have assurances that their code will seamlessly integrate with the rest of the codebase, which is a core DevOps process. [What is Continuous Integration | Atlassian]

Challenges:

  • Adoption and installation
  • Technology learning curve

Best Practices:

  • Test Driven Development (TDD) - the practice of writing out the test code and test cases before doing any actual feature coding.
  • Pull requests and code reviews 
    • Pull requests:
      • critical practice to effective CI
      • created when a developer is ready to merge new code into the main codebase
      • notifies other developers of the new set of changes that are ready for integration
      • an opportune time to kick off the CI pipeline and run the set of automated approval steps. An additional, manual approval step is commonly added at pull request time, during which a non-stakeholder engineer performs a code review of the feature
    • Pull requests and code reviews foster passive communication and knowledge sharing within an engineering team. This helps guard against technical debt.
  • Optimizing pipeline speed
    • Given that the CI pipeline is going to be a central and frequently used process, it is important to optimize its execution speed. 
    • It is a best practice to measure the CI pipeline speed and optimize as necessary.

Continuous Delivery (CD)


Continuous Delivery is a software development practice where code changes are automatically built, tested, and prepared for a release to production. It extends CI by ensuring that the codebase is always in a deployable state, but the actual deployment to production is done manually.

Key Aspects:

  • Automated Deployment Pipeline: Code changes go through an automated pipeline, including build, test, and packaging stages.
  • Deployable State: The codebase is always ready for deployment to production.
  • Manual Release: Deployment to production is triggered manually, ensuring final checks and balances.
  • Staging Environment: Changes are deployed to a staging environment for final validation before production.

Benefits:

  • Reduced deployment risk.
  • Faster and more reliable releases.
  • High confidence in code quality and stability.
  • Easier and more frequent releases.

Continuous Deployment (CD)


Continuous Deployment takes Continuous Delivery a step further by automatically deploying every code change that passes the automated tests to production without manual intervention.

Key Aspects:

  • Automated Deployment: Every code change that passes all stages of the pipeline (build, test, package) is automatically deployed to production.
  • Monitoring and Alerting: Robust monitoring and alerting systems are essential to detect and respond to issues quickly.
    • latency
    • performance
    • resource utilization
    • KPIs/Key business parameters e.g. number of new installs, number of install re-tries, number of engagements, monetization, usage of various features
    • errors/failures/warnings in logs
  • Rollbacks and Roll-forwards: Mechanisms to roll back or roll forward changes in case of failures.
    • ideally, rollbacks would be automated
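
An automated rollback can be sketched as "deploy, check health, revert if unhealthy". The snippet below is a simplified illustration; the version strings and the is_healthy callback are assumptions, and a real system would base the health check on the monitoring signals listed above.

```python
def deploy_with_rollback(current: str, candidate: str, is_healthy) -> str:
    """Deploy candidate; return the version left running after the health check."""
    running = candidate            # switch to the new version
    if not is_healthy(running):    # e.g. error rate or latency above a threshold
        running = current          # automated rollback to the last good version
    return running


# Hypothetical versions: v1.5.0 fails its health check, so v1.4.0 stays live
print(deploy_with_rollback("v1.4.0", "v1.5.0", lambda v: v != "v1.5.0"))  # v1.4.0
print(deploy_with_rollback("v1.4.0", "v1.5.0", lambda v: True))           # v1.5.0
```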

Benefits:

  • Accelerated release cycle.
  • Immediate delivery of new features and bug fixes.
  • Continuous feedback from the production environment.
  • Higher customer satisfaction due to faster updates.


image source: Azure CI CD Pipeline Creation with DevOps Starter



CI/CD Pipeline


A CI/CD pipeline is a series of automated processes that help deliver new software versions more efficiently.

The typical stages include:

  • Source Code Management
    • Developers commit code to a shared repository.
  • Build
    • The code is compiled and built into a deployable format (e.g., binary, Docker image).
  • Automated Testing
    • Automated tests are run to ensure the code functions correctly (unit tests, integration tests, functional tests).
  • Packaging
    • The build artifacts are packaged for deployment.
  • Deployment
    • Continuous Delivery: Artifacts are deployed to a staging environment, and deployment to production is manual.
    • Continuous Deployment: Artifacts are automatically deployed to production.
  • Monitoring
    • The deployed application is monitored for performance and errors.
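
The difference between the two deployment variants comes down to a single gate before production. A minimal sketch, where the function name and the mode strings are assumptions for illustration:

```python
def release_to_production(mode: str, checks_passed: bool, approved: bool = False) -> bool:
    """Decide whether a build goes to production.

    mode "delivery":   needs a manual approval after the checks pass
    mode "deployment": goes out automatically once the checks pass
    """
    if not checks_passed:
        return False
    if mode == "delivery":
        return approved
    if mode == "deployment":
        return True
    raise ValueError(f"unknown mode: {mode}")


print(release_to_production("delivery", checks_passed=True))                 # False
print(release_to_production("delivery", checks_passed=True, approved=True))  # True
print(release_to_production("deployment", checks_passed=True))               # True
```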

CI/CD Pipeline Development: Build and maintain a continuous integration/continuous deployment (CI/CD) pipeline to automate the testing and deployment of code changes.


image source: EP71: CI/CD Pipeline Explained in Simple Terms



Tools for CI/CD


  • CI Tools: 
    • Jenkins
    • GitHub Actions - a widely used CI/CD platform with a free tier
      • Linting & Testing
    • GitLab CI
    • CircleCI
    • Travis CI
  • CD Tools: 
    • Spinnaker
    • ArgoCD
    • Tekton
    • AWS CodePipeline
  • Testing Tools: 
    • JUnit
    • Selenium
    • Cypress
    • pytest
  • Build Tools:
    • Maven
    • Gradle
    • npm
    • Docker
  • Monitoring Tools: 
    • Prometheus
    • Grafana
    • ELK Stack

By implementing CI/CD practices, development teams can achieve faster delivery cycles, higher code quality, and a more reliable deployment process.


How to enhance automation and scalability in CI/CD?



Enhancing automation and scalability in CI/CD practices involves implementing strategies and tools that streamline processes, reduce manual intervention, and ensure that the system can handle increasing workloads effectively. Here are some key practices to achieve these goals:

CI/CD Practices for Enhanced Automation and Scalability:

  • Automated Testing:
    • Unit Testing: Automated tests for individual units of code ensure that changes don’t break functionality.
    • Integration Testing: Tests that verify the interaction between different parts of the application.
    • End-to-End (E2E) Testing: Simulate real user scenarios to ensure the application works as expected.
    • Continuous Testing: Running tests automatically on every code change.
  • Pipeline as Code:
    • Define CI/CD pipelines using code (e.g., YAML files) stored in version control.
    • This makes it easier to track changes, review pipeline modifications, and replicate environments.
    • Example: TeamCity can store build configurations as Kotlin code in a dedicated repository
  • Infrastructure as Code (IaC):
    • Use tools like Terraform, Ansible, or CloudFormation to manage infrastructure.
    • IaC allows for automated provisioning, scaling, and management of environments.
  • Containerization:
    • Use Docker or similar containerization technologies to create consistent environments.
    • Containers ensure that applications run the same way regardless of where they are deployed, simplifying scaling and deployment.
  • Orchestration:
    • Use Kubernetes or other orchestration tools to manage containerized applications.
    • Orchestration tools help in scaling applications automatically based on demand.
  • Parallel Execution:
    • Run tests and build processes in parallel to reduce overall pipeline execution time.
    • This is especially useful for large test suites and complex builds.
  • Caching:
    • Implement caching for dependencies, build artifacts, and other frequently used resources to speed up the CI/CD pipeline.
    • Cache mechanisms reduce the time required for repetitive tasks.
  • Artifact Management:
    • Use artifact repositories like JFrog Artifactory or Nexus to store build artifacts.
    • Proper artifact management ensures reliable and consistent deployments.
  • Environment Consistency:
    • Ensure development, testing, staging, and production environments are as similar as possible.
    • Consistent environments reduce the likelihood of environment-specific bugs.
  • Monitoring and Logging:
    • Implement monitoring and logging throughout the CI/CD pipeline.
    • Use tools like Prometheus, Grafana, ELK Stack, or Splunk to gain insights and quickly identify issues.
  • Feature Toggles:
    • Use feature toggles to control the release of new features without deploying new code.
    • This allows for safer and more controlled feature releases and rollbacks.
  • Scalable Architecture:
    • Design applications to be stateless and horizontally scalable.
    • Use microservices architecture to break down applications into smaller, manageable services that can be scaled independently.
  • Automated Rollbacks:
    • Implement automated rollback mechanisms in case of deployment failures.
    • This ensures quick recovery from failed deployments without manual intervention.
  • Security Automation:
    • Integrate security checks into the CI/CD pipeline using tools like Snyk, OWASP ZAP, or Aqua Security.
    • Automated security scanning helps in identifying vulnerabilities early.
By adopting these practices, organizations can achieve a highly automated, reliable, and scalable CI/CD pipeline that supports rapid and safe software delivery.
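
As an illustration of the feature toggle practice, here is a minimal sketch. The FeatureToggles class and the flag name new_checkout are hypothetical; real systems usually load flags from configuration or from a dedicated flag service.

```python
class FeatureToggles:
    """Minimal in-memory toggle store."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)  # unknown flags default to off

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def checkout_page(toggles: FeatureToggles) -> str:
    # Both code paths are deployed; the flag decides which one runs
    if toggles.is_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"


toggles = FeatureToggles({"new_checkout": False})
print(checkout_page(toggles))      # old checkout flow
toggles.set("new_checkout", True)  # release the feature without a new deployment
print(checkout_page(toggles))      # new checkout flow
```

Turning the flag off again acts as a quick rollback of the feature, without redeploying the previous version.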

Example Workflow for Automation and Scalability


Here’s a high-level example of a CI/CD workflow that incorporates some of these practices:

github-actions-example.yaml:

name: CI/CD Pipeline
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Run the build and tests against several Node.js versions in parallel
        node-version: [18, 20, 22]
    steps:
    - uses: actions/checkout@v4
    - name: Cache dependencies
      uses: actions/cache@v4
      with:
        path: ~/.npm
        key: ${{ runner.os }}-node-${{ matrix.node-version }}-${{ hashFiles('**/package-lock.json') }}
        restore-keys: |
          ${{ runner.os }}-node-${{ matrix.node-version }}-
    - name: Setup Node.js
      uses: actions/setup-node@v4
      with:
        node-version: ${{ matrix.node-version }}
    # npm ci installs exactly what package-lock.json specifies
    - run: npm ci
    - run: npm test

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
    # Each job starts on a fresh runner, so the code must be checked out again
    - uses: actions/checkout@v4
    - name: Deploy to Staging
      run: |
        # Deployment commands
        echo "Deploying to staging environment..."
    - name: Automated Tests on Staging
      run: |
        npm ci
        npm run test:e2e

  promote:
    needs: deploy
    runs-on: ubuntu-latest
    # Promote only on pushes, so that pull requests never reach production
    if: github.event_name == 'push'
    steps:
    - name: Deploy to Production
      run: |
        # Production deployment commands
        echo "Deploying to production environment..."
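
The cache key in the workflow above combines the OS, the Node.js version and a hash of the lock file, so the cache is reused only while the dependencies stay the same. The idea can be sketched in Python as follows; this is an illustration of the key structure, not GitHub's implementation, and the sample values are made up.

```python
import hashlib


def cache_key(os_name: str, node_version: str, lockfile_bytes: bytes) -> str:
    """Build a key shaped like '<os>-node-<version>-<hash of lock file>'."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{os_name}-node-{node_version}-{digest}"


def restore_prefix(os_name: str, node_version: str) -> str:
    """Fallback prefix: matches older caches for the same OS and Node.js version."""
    return f"{os_name}-node-{node_version}-"


# Hypothetical lock file contents before and after a dependency change
old_lock = b'{"lockfileVersion": 3, "packages": {}}'
new_lock = b'{"lockfileVersion": 3, "packages": {"left-pad": {}}}'

key_old = cache_key("Linux", "20", old_lock)
key_new = cache_key("Linux", "20", new_lock)

print(key_old == cache_key("Linux", "20", old_lock))       # True: same inputs, same key
print(key_old == key_new)                                  # False: lock file changed
print(key_new.startswith(restore_prefix("Linux", "20")))   # True: prefix still matches
```

When the exact key misses, the restore-keys prefix lets the job start from the most recent cache for the same OS and Node.js version instead of downloading everything from scratch.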







Friday, 12 July 2024

Introduction to Containers

 


Hardware Virtualization


Let's review the layers of service abstractions:


source: The New Era of Cloud Computing: SaaS, IaaS and PaaS | LinkedIn


Infrastructure as a service (IaaS):

  • uses Virtual Machines to virtualize the hardware
  • allows us to share compute resources with other developers
  • each developer can:
    • deploy their own operating system (OS)
    • configure the underlying system resources such as disk space, disk I/O, or networking
    • install their favorite runtime, web server, database or middleware
    • build their applications in a self-contained environment with access to underlying hardware (RAM, file systems, networking interfaces, etc.)
    • The smallest unit of compute is an app with its VM
      • If we want to scale the app, we also have to scale the VM, which is resource- and time-consuming

Shortcomings of (using only) hardware virtualization


The flexibility listed above comes at a cost:
  • The guest OS might be large (several gigabytes) and take a long time to boot
  • As demand for our application increases, we have to copy an entire VM and boot the guest OS for each instance of our app, which can be slow and costly

OS virtualization

  • PaaS: abstraction of the OS
  • IaaS: abstraction of hardware


Container:

  • gives the independent scalability of workloads, as in PaaS, and an abstraction layer over the OS and hardware, as in IaaS
  • an invisible box around our code and its dependencies with limited access to its own partition of the file system and hardware
  • only requires a few system calls to create 
  • starts as quickly as a process
  • All that's needed on each host is:
    • OS kernel that supports containers
    • container runtime
In essence, the OS is being virtualized. It scales like PaaS, but gives us nearly the same flexibility as IaaS. This makes code ultra portable, and the OS and hardware can be treated as a black box. 

We can go from development to staging, to production, or from our laptop to the Cloud without changing or rebuilding anything. As an example, let's say we want to scale a web server. With a container we can do this in seconds and deploy dozens or hundreds of them depending on the size of our workload on a single host. That's just a simple example of scaling one container, running the whole application on a single host. 

However, we'll probably want to build our applications using lots of containers, each performing their own function like microservices. If we build them this way and connect them with network connections, we can make them modular, deploy easily and scale independently across a group of hosts. The hosts can scale up and down and start and stop containers as demand for our app changes or as hosts fail.


Containers: Docker overview


Let's say we need to deploy a stack of various technologies: 
  • Web server Node.js Express
  • MongoDB
  • Redis messaging system
  • Ansible as orchestration tool
If we deployed them directly on a bare-metal host or a VM, each of these components would need to be compatible with the host's hardware, OS and installed libraries and dependencies. This is usually not the case, and the problem is therefore known as the "matrix from hell".

Docker helps prevent these dependency issues. E.g. we can run each of these components in its own container, which contains the libraries and dependencies that the component is compatible with.

Docker runs on top of the OS (Windows, macOS, Linux etc.).

Containers are completely isolated environments. They have their own processes, network interfaces, mounts...just like virtual machines, except that they all share the same OS kernel (which interfaces with the hardware).

Docker adds an abstraction layer over LXC (LinuX Containers). Docker is like an extension of LXC. [LXC vs Docker: Why Docker is Better | UpGuard]

Ubuntu, Fedora, SUSE and CentOS share the same OS kernel (Linux) but have different software (GUI, drivers, compilers, file systems, ...) above it. This custom software is what differentiates the distributions from each other.

Docker containers share the underlying OS kernel. For example, Docker on Ubuntu can run any flavour of Linux which runs on the same Linux kernel as Ubuntu. This is why we can't run a Windows-based container on Docker running on a Linux OS: they don't share the same kernel.
Hypervisor:
  • Abstracts away hardware for the virtual machines so they can run an operating system
  • Coordinates between the machine's physical hardware and virtual machines.
Container engine (e.g. Docker Engine):
  • Abstracts away an operating system so containers can run applications
  • Coordinates between the operating system and (Docker) containers
  • Docker containers are process-isolated and don't require a hardware hypervisor. This means Docker containers are much smaller and require far fewer resources than a VM.

Unlike hypervisors, Docker is not meant to virtualize and run different operating systems and kernels on the same hardware.

The main purpose of Docker is to containerise applications, ship them and run them.

In case of Docker we have: 
  • Containers (one or more) containing:
    • Application
    • Libraries & Dependencies 
  • Docker
  • OS
  • Hardware
In case of virtual machine we have:
  • Virtual Machines (one or more) containing:
    • Application
    • Libraries & Dependencies
    • OS
  • Hypervisor
  • OS
  • Hardware
Docker uses less processing power and less disk space, and has a faster boot time than VMs.

Docker containers share the same kernel, while different VMs are completely isolated from each other. We can, for example, run a VM with Linux on a host with Windows.

Many companies release and ship their software products as Docker images, published on Docker Hub, a public Docker registry.

We can run each application from the example above in its own container (the image names here are illustrative):

$ docker run nodejs
$ docker run mongodb
$ docker run redis
$ docker run ansible

A Docker image is a template used to create one or more containers.

Containers are running instances of images that are isolated and have their own environments and set of processes.

A Dockerfile describes the image.


References: 

Google Cloud Fundamentals: Core Infrastructure | Coursera

Tuesday, 9 July 2024

Google Cloud storage options

 


Most applications need to store data, e.g. media to be streamed or sensor data from devices.
Different applications and workloads require different storage and database solutions.

Google Cloud has storage options for different data types:
  • structured
  • unstructured
  • transactional
  • relational

Google Cloud has five core storage products:
  • Cloud Storage (like AWS S3)
  • Cloud SQL
  • Spanner
  • Firestore
  • Bigtable



(1) Cloud Storage


Object Storage


Let's first define Object Storage.

Object storage is a computer data storage architecture that manages data as “objects” and not as:
  • a file and folder hierarchy (file storage) or 
  • as chunks of a disk (block storage)


These objects are stored in a packaged format which contains:
  • binary form of the actual data itself
  • relevant associated meta-data (such as date created, author, resource type, and permissions)
  • globally unique identifier. These unique keys are in the form of URLs, which means object storage interacts well with web technologies. 

Data commonly stored as objects include:
  • video
  • pictures
  • audio recordings


Cloud Storage:

  • Service that offers developers and IT organizations durable and highly available object storage
  • Google’s object storage product
  • Allows customers to store any amount of data, and to retrieve it as often as needed
  • Fully managed scalable service

Cloud Storage Uses


Cloud Storage has a wide variety of uses. A few examples include:
  • serving website content
  • storing data for archival and disaster recovery
  • distributing large data objects to end users via Direct Download
Its primary use is wherever binary large-object (BLOB) storage is needed for:
  • online content such as videos and photos
  • backup and archived data
  • storage of intermediate results in processing workflows

Buckets


Cloud Storage files are organized into buckets.

A bucket needs:
  • globally unique name
  • specific geographic location for where it should be stored
    • An ideal location for a bucket is where latency is minimized. For example, if most of our users are in Europe, we probably want to pick a European location, so either a specific Google Cloud region in Europe, or else the EU multi-region
The storage objects offered by Cloud Storage are immutable, which means that we do not edit them, but instead a new version is created with every change made. Administrators have the option to either allow each new version to completely overwrite the older one, or to keep track of each change made to a particular object by enabling “versioning” within a bucket. 
  • With object versioning:
    • Cloud Storage will keep a detailed history of modifications (overwrites or deletes) of all objects contained in that bucket
    • We can list the archived versions of an object, restore an object to an older state, or permanently delete a version of an object, as needed
  • Without object versioning:
    •  by default new versions will always overwrite older versions
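
The versioning behaviour described above can be switched on from the command line. This is a minimal sketch, assuming BUCKET_NAME is set as in the provisioning steps later in this post; the function names are hypothetical and only wrap the gcloud calls so they can be reviewed before being run:

```shell
# Enable object versioning on the bucket.
# Assumes BUCKET_NAME is already exported in the shell.
enable_versioning() {
  gcloud storage buckets update "gs://$BUCKET_NAME" --versioning
}

# List live and noncurrent (archived) versions of the objects in the bucket.
list_all_versions() {
  gcloud storage ls --all-versions "gs://$BUCKET_NAME"
}
```

Calling enable_versioning and then list_all_versions in Cloud Shell should show each object name followed by a generation number once objects have been overwritten.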

Access Control


In many cases, personally identifiable information may be contained in data objects, so controlling access to stored data is essential to ensuring security and privacy are maintained. Using IAM roles and, where needed, access control lists (ACLs), organizations can conform to security best practices, which require each user to have access and permissions to only the resources they need to do their jobs, and no more than that. 

There are a couple of options to control user access to objects and buckets:
  • For most purposes, IAM is sufficient. Roles are inherited from project to bucket to object.
  • If we need finer control, we can create access control lists. Each access control list consists of two pieces of information:
    • scope, which defines who can access and perform an action. This can be a specific user or group of users
    • permission, which defines what actions can be performed, like read or write
Because storing and retrieving large amounts of object data can quickly become expensive, Cloud Storage also offers lifecycle management policies:
  • For example, we could tell Cloud Storage to delete objects older than 365 days; or to delete objects created before January 1, 2013; or to keep only the 3 most recent versions of each object in a bucket that has versioning enabled 
  • Having this control ensures that we’re not paying for more than we actually need
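
The three example rules above could be written as a lifecycle configuration file. This is a minimal sketch: the file name lifecycle.json is arbitrary, and the third rule assumes that "keep only the 3 most recent versions" means deleting any version that already has at least 3 newer versions.

```shell
# Write a lifecycle configuration with the three example rules.
# The file name lifecycle.json is an arbitrary choice for this sketch.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    { "action": { "type": "Delete" }, "condition": { "age": 365 } },
    { "action": { "type": "Delete" }, "condition": { "createdBefore": "2013-01-01" } },
    { "action": { "type": "Delete" }, "condition": { "numNewerVersions": 3 } }
  ]
}
EOF

# Sanity check: count the rules that were written
grep -c '"type": "Delete"' lifecycle.json
```

Assuming BUCKET_NAME is set as in the provisioning steps, the file could then be applied to the bucket with:

$ gcloud storage buckets update gs://$BUCKET_NAME --lifecycle-file=lifecycle.json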

Storage classes and data transfer


There are four primary storage classes in Cloud Storage:
  • Standard storage
    • considered best for frequently accessed or hot data
    • great for data that's stored for only brief periods of time
  • Nearline storage
    • Best for storing infrequently accessed data, like reading or modifying data on average once a month or less 
    • Examples may include data backups, long term multimedia content, or data archiving. 
  • Coldline storage
    • A low cost option for storing infrequently accessed data. 
    • However, compared to Nearline storage, Coldline storage is meant for reading or modifying data at most once every 90 days.
  • Archive storage
    • The lowest cost option used ideally for data archiving, online backup and disaster recovery
    • It's the best choice for data that we plan to access less than once a year because it has higher costs for data access and operations and a 365-day minimum storage duration
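
The rules of thumb above can be sketched as a small shell helper. This is only an illustration: the function name is hypothetical, and the thresholds (30, 90 and 365 days between accesses) are taken from the class descriptions above rather than from any official tool.

```shell
# Suggest a storage class from the expected number of days between accesses.
# Thresholds follow the descriptions above; the function name is hypothetical.
suggest_storage_class() {
  days_between_accesses=$1
  if [ "$days_between_accesses" -ge 365 ]; then
    echo "ARCHIVE"
  elif [ "$days_between_accesses" -ge 90 ]; then
    echo "COLDLINE"
  elif [ "$days_between_accesses" -ge 30 ]; then
    echo "NEARLINE"
  else
    echo "STANDARD"
  fi
}

suggest_storage_class 7     # data read weekly
suggest_storage_class 45    # data read roughly every six weeks
suggest_storage_class 400   # data read less than once a year
```

The printed value could be passed to the --default-storage-class flag of gcloud storage buckets create when the bucket is provisioned.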

Characteristics that apply across all of these storage classes:
  • unlimited storage
  • no minimum object size requirement
  • worldwide accessibility and locations
  • low latency and high durability
  • a uniform experience which extends to security tools and APIs
  • geo-redundancy if data is stored in a multi-region or dual-region. This means placing physical servers in geographically diverse data centers to protect against catastrophic events and natural disasters, and load balancing traffic for optimal performance. 

Autoclass


Cloud Storage also provides a feature called Autoclass, which automatically transitions objects to appropriate storage classes based on each object's access pattern. The feature:
  • moves data that is not accessed to colder storage classes to reduce storage costs
  • moves data that is accessed to Standard storage to optimize future accesses
Autoclass simplifies and automates cost saving for our Cloud Storage data. 

Cloud Storage has no minimum fee because we pay only for what we use. Prior provisioning of capacity isn't necessary.

Data Encryption


Cloud Storage always encrypts data on the server side before it's written to disk, at no additional charge. Data traveling between a customer's device and Google is encrypted by default using HTTPS/TLS (Transport Layer Security). 


Data Transfer into Google Cloud Storage


Regardless of which storage class we choose, there are several ways to bring data into Cloud Storage:

  • Online Transfer
    • by using gcloud storage, which is the Cloud Storage command from the Cloud SDK
    • by using the drag-and-drop option in the Cloud console if accessed through the Google Chrome web browser
  • Storage transfer service
    • Enables us to import large amounts of online data into Cloud Storage quickly and cost effectively 
    • Useful if we have to upload terabytes or even petabytes of data 
    • Lets us schedule and manage batch transfers to Cloud Storage from:
      • another Cloud provider
      • a different Cloud Storage region
      • an HTTPS endpoint
  • Transfer Appliance
    • A rackable, high capacity storage server that we lease from Google Cloud
    • We connect it to our network, load it with data, and then ship it to an upload facility where the data is uploaded to Cloud Storage
    • We can transfer up to a petabyte of data on a single appliance
  • Moving data in internally from other Google Cloud services, as Cloud Storage is tightly integrated with other Google Cloud products and services. For example, we can:
    • import and export tables to and from both BigQuery and Cloud SQL
    • store App Engine logs, files for backups, and objects used by App Engine applications like images
    • store instance startup scripts, Compute Engine images, and objects used by Compute Engine applications
We should consider using Cloud Storage if we need to store immutable blobs larger than 10 megabytes, such as large images or movies. This storage service provides petabytes of capacity with a maximum unit size of 5 terabytes per object. 


Provisioning Cloud Storage Bucket


We can use e.g. Google Cloud console >> Activate Cloud Shell:









Then execute the following commands in it.

Create environment variables containing the location and bucket name:

$ export LOCATION=EU
$ export BUCKET_NAME=my-unique-bucket-name

or we can use the project ID as it is globally unique:

$ export BUCKET_NAME=$DEVSHELL_PROJECT_ID

To create a bucket with CLI:

$ gcloud storage buckets create -l $LOCATION gs://$BUCKET_NAME

We might be prompted to authorize execution of this command.


To download an item from a bucket to the local host:

$ gcloud storage cp gs://cloud-training/gcpfci/my-excellent-blog.png my-excellent-blog.png

To upload a file from a local host to the bucket:

$ gcloud storage cp my-excellent-blog.png gs://$BUCKET_NAME/my-excellent-blog.png

To modify the Access Control List of the object we just created so that it's readable by everyone:

$ gsutil acl ch -u allUsers:R gs://$BUCKET_NAME/my-excellent-blog.png



We can check the bucket and the image in it in the Google Cloud console:






(2) Cloud SQL


It offers fully managed relational databases as a service, including:
  • MySQL
  • PostgreSQL
  • SQL Server 

It’s designed to hand off mundane, but necessary and often time-consuming, tasks to Google, like 
  • applying patches and updates
  • managing backups
  • configuring replication

Cloud SQL:
  • Doesn't require any software installation or maintenance
  • Can scale up to 128 processor cores, 864 GB of RAM, and 64 TB of storage. 
  • Supports automatic replication scenarios, such as from:
    • Cloud SQL primary instance
    • External primary instance
    • External MySQL instances
  • Supports managed backups, so backed-up data is securely stored and accessible if a restore is required. The cost of an instance covers seven backups
  • Encrypts customer data when on Google’s internal networks and when stored in database tables, temporary files, and backups
  • Includes a network firewall, which controls network access to each database instance

Cloud SQL instances are accessible by other Google Cloud services, and even external services. 
  • Cloud SQL can be used with App Engine using standard drivers like Connector/J for Java or MySQLdb for Python. 
  • Compute Engine instances can be authorized to access Cloud SQL instances, and we can configure the Cloud SQL instance to be in the same zone as our virtual machine
  • Cloud SQL also supports other applications and tools, like:
    • SQL Workbench
    • Toad
    • other external applications using standard MySQL drivers

Provisioning Cloud SQL Instance





SQL >> Create Instance:



...and then choose values for the following properties:
  • Database engine:
    • MySQL
    • PostgreSQL
    • SQL Server
  • Instance ID - arbitrary string e.g. blog-db
  • Root user password: arbitrary string (There's no need to obscure the password because we use mechanisms to connect that aren't open access to everyone)
  • Choose a Cloud SQL edition:
    • Edition type:
      • Enterprise
      • Enterprise Plus
    • Choose edition preset:
      • Sandbox
      • Development
      • Production
  • Choose region - This should be the same region and zone into which we launched the Compute Engine VM instance. The best performance is achieved by placing the client and the database close to each other.
  • Choose zonal availability
    • Single zone - In case of outage, no failover. Not recommended for production.
    • Multiple zones (Highly available) - Automatic failover to another zone within your selected region. Recommended for production instances. Increases cost.
  • Select Primary zone



During DB creation:



Once DB instance is created:



DB has root user created:


Default networking:





 Now we can:
  • see its Public IP address (e.g. 35.204.71.237)
  • Add User Account
    • username
    • password
  • set Connections
    • Networking >> Add a Network
      • Choose between Private IP connection and a Public IP connection
      • set Name
      • Network: <external_IP_of_VM_Instance>/32 (if a Public IP connection is chosen, use the VM instance's external IP address)
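
The console steps above have rough command-line equivalents. The following is a minimal sketch, not the exact setup used in this post: the region, database version, tier and IP address are placeholder assumptions, the function names are hypothetical, and depending on the gcloud version an additional --edition flag may be needed.

```shell
# Hypothetical values for this sketch; adjust them to your project
INSTANCE_ID=blog-db
REGION=europe-west4
VM_EXTERNAL_IP=203.0.113.10   # documentation-only placeholder address

# Create a single-zone MySQL instance; $1 is the root user password
create_instance() {
  gcloud sql instances create "$INSTANCE_ID" \
    --database-version=MYSQL_8_0 \
    --tier=db-f1-micro \
    --region="$REGION" \
    --availability-type=zonal \
    --root-password="$1"
}

# Add a user account; $1 is the username, $2 is the password
add_user() {
  gcloud sql users create "$1" --instance="$INSTANCE_ID" --password="$2"
}

# Authorize the VM's external IP address for a public IP connection.
# Note: this replaces the existing list of authorized networks.
authorize_vm() {
  gcloud sql instances patch "$INSTANCE_ID" \
    --authorized-networks="$VM_EXTERNAL_IP/32"
}
```

In Cloud Shell these would be called in order, for example create_instance with a chosen root password, then add_user with a username and password, then authorize_vm.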

Adding a user:


After user is added:



Adding a new network:


After new network is added:





(3) Spanner


Spanner:
  • Fully managed relational database service that scales horizontally, is strongly consistent, and speaks SQL
  • Service that powers Google’s $80 billion business (Google’s own mission-critical applications and services)
  • Especially suited for applications that require:
    • SQL relational database management system with joins and secondary indexes
    • built-in high availability
    • strong global consistency
    • high numbers of input and output operations per second (tens of thousands of reads and writes per second or more)

The horizontal scaling approach, sometimes referred to as "scaling out," entails adding more machines to further distribute the load of the database and increase overall storage and/or processing power. [A Guide To Horizontal Vs Vertical Scaling | MongoDB]

We should consider using Cloud SQL or Spanner if we need full SQL support for an online transaction processing system. 

Cloud SQL provides up to 64 terabytes, depending on machine type, and Spanner provides petabytes. 

Cloud SQL is best for web frameworks and existing applications, like storing user credentials and customer orders. If Cloud SQL doesn’t fit our requirements because we need horizontal scalability, not just through read replicas, we should consider using Spanner. 

(4) Firestore


Firestore is a flexible, horizontally scalable, NoSQL cloud database for mobile, web, and server development. 

With Firestore, data is stored in documents and then organized into collections. Documents can contain complex nested objects in addition to subcollections. Each document contains a set of key-value pairs. For example, a document to represent a user has the keys for the firstname and lastname with the associated values. 

Firestore’s NoSQL queries can then be used to retrieve:
  • individual, specific documents or 
  • all the documents in a collection that match our query parameters
Queries can include multiple, chained filters and combine filtering and sorting options. They're also indexed by default, so query performance is proportional to the size of the result set, not the dataset. 

Firestore uses data synchronization to update data on any connected device. However, it's also designed to make simple, one-time fetch queries efficiently. It caches data that an app is actively using, so the app can write, read, listen to, and query data even if the device is offline. When the device comes back online, Firestore synchronizes any local changes back to Firestore. 

Firestore leverages Google Cloud’s powerful infrastructure: 
  • automatic multi-region data replication
  • strong consistency guarantees
  • atomic batch operations
  • real transaction support
We should consider Firestore if we need massive scaling and predictability together with real time query results and offline query support. This storage service provides terabytes of capacity with a maximum unit size of 1 megabyte per entity. Firestore is best for storing, syncing, and querying data for mobile and web apps. 


(5) Bigtable

Bigtable:
  • Google's NoSQL big data database service
  • The same database that powers many core Google services, including Search, Analytics, Maps, and Gmail
  • Designed to handle massive workloads at consistent low latency and high throughput, so it's a great choice for both operational and analytical applications, including Internet of Things, user analytics, and financial data analysis. 

When deciding which storage option is best, we should choose Bigtable if: 
  • We work with more than 1TB of semi-structured or structured data
  • Data is fast with high throughput, or it’s rapidly changing
  • We work with NoSQL data. This usually means transactions where strong relational semantics are not required
  • Data is a time-series or has natural semantic ordering
  • We work with big data, running asynchronous batch or synchronous real-time processing on the data
  • We are running machine learning algorithms on the data

Bigtable can interact with other Google Cloud services and third-party clients. 

Using APIs, data can be read from and written to Bigtable through a data service layer like:
  • Managed VMs
  • HBase REST Server
  • Java Server using the HBase client
Typically this is used to serve data to applications, dashboards, and data services. 

Data can also be streamed in through a variety of popular stream processing frameworks like:
  • Dataflow Streaming
  • Spark Streaming
  • Storm
And if streaming is not an option, data can also be read from and written to Bigtable through batch processes like:
  • Hadoop MapReduce
  • Dataflow
  • Spark
Often, summarized or newly calculated data is written back to Bigtable or to a downstream database.

We should consider using Bigtable if we need to store a large number of structured objects. Bigtable doesn’t support SQL queries, nor does it support multi-row transactions. This storage service provides petabytes of capacity with a maximum unit size of 10 megabytes per cell and 100 megabytes per row. Bigtable is best for analytical data with heavy read and write events, like AdTech, financial, or IoT data. 


--- 

BigQuery hasn’t been mentioned in this section because it sits on the edge between data storage and data processing. The usual reason to store data in BigQuery is so we can use its big data analysis and interactive querying capabilities, but it’s not purely a data storage product.