May 25, 2021
BTEX 2021: Artificial Intelligence 101
At CDW's BTEX 2021 virtual event, Michael Traves, Principal Field Solutions Architect at CDW Canada, presented on the uses of artificial intelligence and machine learning in a business environment. Here are some of the highlights.
What is artificial intelligence, and how is it used in business?
"Artificial intelligence is broadly defined as any simulation of human intelligence," says Traves. Machine learning is typically what we find customers engaged in. It's an approach to artificial intelligence that involves building models, running training computations and inferencing, and it involves a fair amount of data. Depending on how much data, that's where the concept of deep learning comes in.
When it comes to deep learning, Traves describes it as more computationally intensive, going beyond what a CPU would provide. This is where customers would require graphics processing units (GPUs) or specialized hardware to meet their needs.
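As a minimal sketch of that CPU-to-GPU shift, here is what device selection looks like in PyTorch; the framework choice, model shape and batch are illustrative assumptions, since the talk doesn't prescribe tooling.

```python
# Minimal PyTorch sketch of moving deep learning work onto a GPU when one
# is available; the model and data here are placeholders for illustration.
import torch
import torch.nn as nn

# Fall back to CPU when no GPU is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
x = torch.randn(32, 128, device=device)  # a batch of 32 feature vectors
logits = model(x)  # the forward pass runs on whichever device holds the tensors
print(f"Ran forward pass on: {device}")
```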
Defining data science
"Data science is the scientific methods, algorithms and systems used to extract insights and knowledge from Big Data," says Traves. You're dealing with very large data sets, machine learning and deep learning tasks, and statistical methods.
Data science encompasses AI, machine learning and deep learning. Deep learning is a subset of machine learning, machine learning is a subset of AI, and the field spans a number of application areas, including fraud detection, sentiment analysis and speech-to-text.
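To make one of those areas concrete, here is a toy sentiment-analysis sketch in scikit-learn; the framework choice and the tiny fabricated training set are illustrative assumptions, not from the presentation.

```python
# Toy sentiment-analysis sketch with scikit-learn; the example sentences
# and labels are made up purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great service, very happy", "slow and unhelpful support",
         "love this product", "terrible experience, would not recommend"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["happy with the support"]))  # -> [1]
```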
Foundational technology of artificial intelligence
Here are some of the key hardware components that power AI technology:
Graphics processing unit (GPU): A processor specially designed for the types of calculations needed in computer graphics; efficient for programming problems with parallelization in the deep learning and machine learning space.
Field programmable gate array (FPGA): This general-purpose device can be reprogrammed at the logic gate level.
Application-specific integrated circuit (ASIC): Designed to be very effective for one application only.
Tensor Processing Unit (TPU): The name of Google's architecture for machine learning.
NVIDIA BlueField Data Processing Unit (DPU): Combines a ConnectX network adapter with an array of Arm cores, offering purpose-built hardware acceleration engines with full data centre infrastructure-on-chip programmability.
Machine learning from development to production
"We try to help customers on this journey from development to production," says Traves. Many may be starting with a workstation that you're putting a GPU in. That may be a workstation on your desk, or one you've purchased specifically for this purpose and are sharing within a workgroup. But eventually, as you start moving from that experimentation phase into doing models and iterations on training runs, you're going to want to move something into the data centre, probably with a different class of GPU.
That's where we get into training larger systems and networking those large systems together so they can run jobs in parallel. This means multiple computational loads, multiple GPUs and a shared data repository they're going to be accessing to run workloads in parallel with each other.
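As a rough illustration of that parallel-training setup, here is a sketch using PyTorch DistributedDataParallel; the framework, the synthetic dataset and the launch command are assumptions, since the talk doesn't name specific tooling.

```python
# Sketch of data-parallel training across GPUs with PyTorch DDP.
# Launch one process per GPU with e.g.: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")             # one worker process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Synthetic stand-in for the shared data repository all workers read from
    dataset = TensorDataset(torch.randn(1024, 128),
                            torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)       # shards the data per worker
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(nn.Linear(128, 10).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()                     # gradients sync across workers here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```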
Once we've done that model training at scale, we'll want to do inferencing, which is taking the results of a model that has been successful and pushing that out to your end users to consume. There's not as much data there; it's on-demand and client-facing, maybe on a webpage that users are accessing, or maybe on your cell phone.
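A minimal sketch of what that client-facing layer might look like, using Flask to serve predictions over HTTP; the framework and the exported model path are illustrative assumptions.

```python
# Minimal inference-endpoint sketch with Flask; "model.pt" is a
# hypothetical path to a trained, TorchScript-exported model.
from flask import Flask, jsonify, request
import torch

app = Flask(__name__)
model = torch.jit.load("model.pt")  # load the trained model once at startup
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Inference payloads are small and on-demand, unlike training data
    features = torch.tensor(request.json["features"])
    with torch.no_grad():
        scores = model(features)
    return jsonify({"prediction": scores.argmax(-1).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```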
What makes up an artificial intelligence platform
When it comes to building out an AI platform, there are a few things to keep in mind. You'll want to consider workload types, orchestration, platform types, data sources, portability and scale.
Workload Types
- Interactive (user is creating something with a notebook), ML pipeline (submit the job with the data and it runs against the cluster) or inferencing (client-facing)
- Development, training and production environments
Orchestration
- Scheduling workloads to run on a particular node vs. orchestration, i.e. having it run wherever resources happen to be available
- How to provide auditability so that users are getting access to the equipment in a fair way
Platform Type
- Development workstation (at home or in the office), or at the edge on a mobile device
- On-premises cluster (more cost-effective for training), public cloud (tools have been provided for you, but more costly than maintaining your own environment), managed platform (someone else is running this for you, providing the tools as a service)
Portability and Scale
- Containers can be deployed anywhere at scale (on-prem, on your desktop, in the cloud) while running the same code
- Able to leverage serverless functions (see the sketch after this list)
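To illustrate the serverless point, here is a sketch of an inference function written in the AWS Lambda handler style; the provider, event shape and stand-in scoring logic are all assumptions for illustration.

```python
# Sketch of a serverless inference function in the AWS Lambda handler
# style; the payload format and scoring logic are placeholders.
import json

def handler(event, context):
    # A real function would load a trained model once, outside the
    # handler, and reuse it across invocations.
    features = json.loads(event["body"])["features"]
    score = sum(features) / len(features)  # stand-in for real model inference
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```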
Data sources
- Data gravity: where is the bulk of the data that you're using? If there's a lot of data, computation should be run close to the data
- How am I going to get that data to the public cloud, if that's where the tools are?
- Data engineering: how to bring data into an environment to do training, i.e. ETL (transform data into a format that's usable), batch updates of the data and stream processing updates of data (see the sketch after this list)
- Every time you update data, you have to decide whether you want to rerun your model, and validate that before you push into production
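As a concrete example of the batch ETL step, here is a short pandas sketch; the file names and feature transformations are made up for illustration.

```python
# Hypothetical batch ETL sketch: extract raw records, transform them into
# model-ready features, load them where training jobs can read them.
import numpy as np
import pandas as pd

raw = pd.read_csv("raw_transactions.csv")        # extract the latest batch
raw = raw.dropna(subset=["amount", "merchant"])  # drop unusable records
raw["log_amount"] = np.log1p(raw["amount"])      # transform into a usable feature
features = pd.get_dummies(raw[["log_amount", "merchant"]], columns=["merchant"])
features.to_parquet("training_batch.parquet")    # load into the training environment
```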
Kubernetes vs. Slurm: job scheduling or orchestration?
"Kubernetes is very good for inferencing," says Traves. "Slurm is more about scheduling, and there's stronger auditability around who gets to use what." A research or use-case cluster might use Slurm, whereas a production-facing cluster might use Kubernetes. Traves provides a breakdown of the two platforms, as follows:
Pros and Cons of Kubernetes
- Orchestrates scheduling, management and health of containers (e.g. Docker)
- Excellent for web services (e.g. Inference Server)
- Not made with AI training in mind
- Should be extended with a platform for AI training (e.g. Kubeflow)
- More difficult to configure user access, permissions and security
Benefits of Slurm for training jobs
- Schedules jobs to run on a subset of cluster resources (see the submission sketch after this list)
- Excellent for AI training
- Meant for highly performant work, e.g. multi-node jobs leveraging InfiniBand networking
- Closely tied to *nix systems; easy to integrate with existing auth and security mechanisms
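For a sense of how Slurm scheduling looks in practice, here is a sketch that writes and submits a batch script from Python; the resource requests and the train.py script are placeholders.

```python
# Sketch of submitting a multi-node GPU training job to Slurm; the
# job name, resource values and training script are placeholders.
import subprocess

SBATCH_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00
srun python train.py
"""

with open("train_job.sh", "w") as f:
    f.write(SBATCH_SCRIPT)

# sbatch queues the job; Slurm decides which cluster nodes run it and when
subprocess.run(["sbatch", "train_job.sh"], check=True)
```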
What is MLOps?
"MLOps is a practice between data scientists, DevOps and machine learning engineers," says Traves. It's designed to increase automation, which is what CI/CD does in the DevOps space. It's all about automating: infrastructure as code, being able to push code through a pipeline, and when your code changes, you push the updates through.
Code automatically gets tested at each step and pushed to your test, dev and production environments when it makes sense. You're leveraging the software development lifecycle: continuous integration/continuous delivery, orchestration, and monitoring what's happening at each step in the process, providing feedback to your developers. It's all the same concepts you deal with in a DevOps environment, with the machine learning, which is really the model, at the front.
You want to use significantly more data, and the parameters associated with it, and that's what you end up pushing through and integrating on.
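One way to picture that pipeline is a retrain-and-validate gate that only promotes a model when it beats the one in production; this scikit-learn sketch uses a synthetic dataset, and the metric, threshold and storage details are illustrative assumptions.

```python
# Sketch of an automated promotion gate in an ML pipeline: retrain on new
# data, then promote the candidate only if it beats the production model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)
candidate_acc = accuracy_score(y_test, candidate.predict(X_test))

PRODUCTION_ACC = 0.85  # assumed metric for the model currently serving traffic

if candidate_acc > PRODUCTION_ACC:
    print(f"Promote: {candidate_acc:.3f} beats production {PRODUCTION_ACC:.3f}")
else:
    print(f"Keep production model; candidate scored {candidate_acc:.3f}")
```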
How MLOps makes life easier for data scientists
"As a data scientist, you shouldn't have to learn about InfiniBand, hundred-gig networking or all-flash storage," says Traves. As opposed to becoming a platform engineer around machine learning, what you really want to do is run your models and consume services. You want things to be easy, predictable and repeatable. You want it to be automated, and when you change your model or data, you want to be able to iterate on that without really changing anything else.
MLOps is about building that workflow to allow for experimentation, so you can iterate and retrain as necessary. It provides a framework for data scientists and analysts to submit jobs, and it packages all of that in a uniform container that you can take and run wherever data happens to be accessible, with access to the right resources, so that things are packaged and available.
As a data scientist, you don't want to be an ML platform engineer. So the platform should be supported by Operations or a managed platform that lives on-prem or in the cloud. Operations may have the foundational knowledge, but they're really pulling from software engineering and DevOps methodologies.
Make sure to bookmark this page for more coverage of BTEX 2021.