{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to GPU computing with CUDA and PyTorch\n",
    "\n",
    "- How GPU computing works.\n",
    "- NVIDIA hardware and CUDA.\n",
    "- Host and Device parts of code.\n",
    "- Threads, blocks, warps, SIMT.\n",
    "- GPU memory management.\n",
    "- Code compilation and run.\n",
    "- Neural networks quick introduiction.\n",
    "- Jupyter hub access.\n",
    "- Pytorch exercises with neural networks.\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HOW GPU accelerated computing works\n",
    "\n",
    "- Application runs essentially on a CPU\n",
    "- Some procedures can run on GPU\n",
    "- Data is sent back and forth between the devices over PCIe bus\n",
    "- CPU and GPU run cycles independently\n",
    "\n",
    "![GPU as PCIedevice](img/CPU_GPU_pcie.png)\n",
    "\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Type of compute tasks to run on GPU\n",
    "\n",
    "Single Instruction Multiple Data (SIMD) simulations can efficiently run across many light weight threads on GPU Stream Multiprocessors.\n",
    "\n",
    "For example, vector sum:\n",
    "\n",
    "\n",
    "![Single instruction, multiple data](img/Single_instruction_multiple_data.png)\n",
    "\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GPU computing cons and pros\n",
    "\n",
    "Two types of programming environments: CUDA and OpenAcc (less performance efficient).\n",
    "\n",
    "Pros: \n",
    "- computing power for Single Instruction Multiple Data tasks thanks to available many GPU threads.\n",
    "- Many applications are already developed and ported into CUDA: Matlab, Comsol, VASP, python wrappers (pycuda).\n",
    "- Good for neural network comuting since it is based on the tensor algebra.\n",
    "\n",
    "Cons:\n",
    "- Not beneficial for non SIMD computing.\n",
    "- Nontrivial programming in CUDA.\n",
    "- Hardware - CUDA compatibility and dependency on Nvidia support. Relatively old GPU hardware is no longer supported by latest CUDA releases.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Host and Device parts of a code\n",
    "\n",
    "- Data is initialized and may be computed on the CPU (host part)\n",
    "- CPU calls to allocate memory space for DATA on GPU (host part)\n",
    "- CPU sends data to the GPU (host part)\n",
    "- CPU invokes GPU part of the code, kernel, to execute (device part)\n",
    "- GPU executes its code (device part)\n",
    "- CPU receives data from GPU (host part)\n",
    "\n",
    "Example: `hello.cu`, without data send/receive. The _global_ is the device directive\n",
    "``` c\n",
    "#include<stdio.h> \n",
    "#include<stdlib.h>\n",
    "\n",
    "//Kernel definition (device part):\n",
    "   __global__ void print_from_gpu(void) {\n",
    "       printf(\"Hello World! from thread [%d,%d]From device\\n\", threadIdx.x,blockIdx.x);\n",
    "}\n",
    "\n",
    "//Host part:\n",
    "int main(void) {\n",
    "       printf(\"Hello World from host!\\n\");\n",
    "    // Kernel invocation with 2 threads\n",
    "       print_from_gpu<<<1,2>>>();\n",
    "       cudaDeviceSynchronize();\n",
    "       return 0;\n",
    "}\n",
    "```\n",
    "\n",
    "Example: `vector_add.cu` code snipped with data send/receive:\n",
    "``` c\n",
    "// Device code\n",
    "__global__ void VecAdd(float* A, float* B, float* C, int N)\n",
    "{\n",
    "    int i = blockDim.x * blockIdx.x + threadIdx.x;\n",
    "    if (i < N)\n",
    "        C[i] = A[i] + B[i];\n",
    "}\n",
    "            \n",
    "// Host code\n",
    "int main()\n",
    "{\n",
    "    int N = ...;\n",
    "    size_t size = N * sizeof(float);\n",
    "\n",
    "    // Allocate input vectors h_A and h_B in host memory\n",
    "    float* h_A = (float*)malloc(size);\n",
    "    float* h_B = (float*)malloc(size);\n",
    "\n",
    "    // Initialize input vectors\n",
    "    ...\n",
    "\n",
    "    // Allocate vectors in device memory\n",
    "    float* d_A;\n",
    "    cudaMalloc(&d_A, size);\n",
    "    float* d_B;\n",
    "    cudaMalloc(&d_B, size);\n",
    "    float* d_C;\n",
    "    cudaMalloc(&d_C, size);\n",
    "\n",
    "    // Copy vectors from host memory to device memory\n",
    "    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);\n",
    "    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);\n",
    "\n",
    "    // Invoke kernel\n",
    "    int threadsPerBlock = 256;\n",
    "    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;\n",
    "    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);\n",
    "\n",
    "    // Copy result from device memory to host memory\n",
    "    // h_C contains the result in host memory\n",
    "    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);\n",
    "\n",
    "    // Free device memory\n",
    "    cudaFree(d_A);\n",
    "    cudaFree(d_B);\n",
    "    cudaFree(d_C);\n",
    "            \n",
    "    // Free host memory\n",
    "    ...\n",
    "}\n",
    "```\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Threads, blocks, warps, and grid\n",
    "\n",
    "- Single computing operation on GPU is done by a `thread`.\n",
    "- Threads are grouped into a `block`. \n",
    "- The number of threads in a block is defined by the code developer.\n",
    "- Each block performs the same operation on all its threads.\n",
    "- CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called `warps`. \n",
    "- Data for each thread may be different.\n",
    "- The blocks are scheduled and run independently of each other. \n",
    "- Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional `grid` of thread blocks.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "![Threads and blocks](img/grid-of-thread-blocks.png)\n",
    "\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Blocks on Streaming Multiprocessors (SM)\n",
    "\n",
    "A GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.\n",
    "\n",
    "\n",
    "\n",
    "![Blocks computed on SMT processors](img/automatic-scalability.png)\n",
    "\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Thread indexing\n",
    "\n",
    "- Each thread in the grid is uniquely identified by its index, `ix`. \n",
    "- There is also the thread index within a block, `threadIdx.x`.\n",
    "- Each block within the grid can be identified by its own index, `blockIdx.x`\n",
    "- The relationship between the indexes:\n",
    "``` c\n",
    "int i = blockIdx.x * blockDim.x + threadIdx.x;\n",
    "```\n",
    "where `blockDim.x` is the size of the block.\n",
    "\n",
    "- For convenience of computing multi-dimensional data, the indexes can be defined as one-dimensional, two-dimensional, or three-dimensional:\n",
    "\n",
    "``` c\n",
    "int ix = blockIdx.x * blockDim.x + threadIdx.x;\n",
    "int iy = blockIdx.y * blockDim.y + threadIdx.y;\n",
    "int iz = blockIdx.z * blockDim.z + threadIdx.z;\n",
    "```\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Memory Hierarchy\n",
    "\n",
    "- CUDA threads may access data from multiple memory spaces during their execution.\n",
    "- Each thread has private `local memory`. \n",
    "- Each thread block has `shared memory` visible to all threads of the block and with the same lifetime as the block. \n",
    "- All threads have access to the same `global memory`.\n",
    "- Data from the host arrives into and departs from the `global memory`. The `global memory` is much slower than the other two.\n",
    "- There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "![Different memory types](img/memory-hierarchy.png)\n",
    "\n",
    "\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## The main challenge in CUDA computing is memory management\n",
    "\n",
    "- Data from the host arrives into the `global memory`, the slowest type of memory on th GPU.\n",
    "- One should avoid using the `global memory` for thread computing. Copy data into the `shared memory`, compute,\n",
    " then copy results back into the `global memory`.\n",
    "- Another challenge often happens due to noncalescent data alignment in `global memory`. Data is read in `warps` of size 32. For unstructured data, it may take multiple warp cycles to load the data into the `shared memory`.\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example: matrix multiplication\n",
    "\n",
    "[C] = [A] x [B] \n",
    "\n",
    "In terms of matrix elements:\n",
    "\n",
    "$c_{ij} = \\Sigma_{k} a_{ik} b_{kj}$\n",
    "\n",
    "Review and comparison:\n",
    "- Computing on the `global memory`. \n",
    "- Computing on the `shared memory`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "![Matrix multiplication](img/matrix-multiplication-without-shared-memory.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Thread computing on the `global memory`.\n",
    "\n",
    "``` c\n",
    "// Matrices are stored in row-major order:\n",
    "// M(row, col) = *(M.elements + row * M.width + col)\n",
    "typedef struct {\n",
    "    int width;\n",
    "    int height;\n",
    "    float* elements;\n",
    "} Matrix;\n",
    "\n",
    "// Thread block size\n",
    "#define BLOCK_SIZE 16\n",
    "\n",
    "// Forward declaration of the matrix multiplication kernel\n",
    "__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);\n",
    "\n",
    "// Matrix multiplication - Host code\n",
    "// Matrix dimensions are assumed to be multiples of BLOCK_SIZE\n",
    "void MatMul(const Matrix A, const Matrix B, Matrix C)\n",
    "{\n",
    "\n",
    "    // Load A and B to device memory\n",
    "    Matrix d_A;\n",
    "    d_A.width = A.width; d_A.height = A.height;\n",
    "    size_t size = A.width * A.height * sizeof(float);\n",
    "    cudaMalloc(&d_A.elements, size);\n",
    "    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);\n",
    "    Matrix d_B;\n",
    "    d_B.width = B.width; d_B.height = B.height;\n",
    "    size = B.width * B.height * sizeof(float);\n",
    "    cudaMalloc(&d_B.elements, size);\n",
    "    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);\n",
    "\n",
    "    // Allocate C in device memory\n",
    "    Matrix d_C;\n",
    "    d_C.width = C.width; d_C.height = C.height;\n",
    "    size = C.width * C.height * sizeof(float);\n",
    "    cudaMalloc(&d_C.elements, size);\n",
    "\n",
    "    // Invoke kernel\n",
    "    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);\n",
    "    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);\n",
    "    \n",
    "    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);\n",
    "   \n",
    "    // Read C from device memory\n",
    "    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);\n",
    "\n",
    "    // Free device memory\n",
    "    cudaFree(d_A.elements);\n",
    "    cudaFree(d_B.elements);\n",
    "    cudaFree(d_C.elements);\n",
    "}\n",
    "// Matrix multiplication kernel called by MatMul()\n",
    "__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)\n",
    "{\n",
    "    // Each thread computes one element of C\n",
    "    // by accumulating results into Cvalue\n",
    "    float Cvalue = 0;\n",
    "    int row = blockIdx.y * blockDim.y + threadIdx.y;\n",
    "    int col = blockIdx.x * blockDim.x + threadIdx.x;\n",
    "    for (int e = 0; e < A.width; ++e)\n",
    "      Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];\n",
    "      C.elements[row * C.width + col] = Cvalue;\n",
    "}\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing with the `shared memory`\n",
    "\n",
    "Threads in their own blocks compute the product on the submatrices, then copy the result into the `global memory`.\n",
    "\n",
    "\n",
    "\n",
    "![Matrix multiplication by blocks](img/matrix-multiplication-with-shared-memory.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``` c\n",
    "// Matrices are stored in row-major order:\n",
    "// M(row, col) = *(M.elements + row * M.stride + col)\n",
    "typedef struct {\n",
    "    int width;\n",
    "    int height;\n",
    "    int stride; \n",
    "    float* elements;\n",
    "} Matrix;\n",
    "\n",
    "// Get a matrix element\n",
    "__device__ float GetElement(const Matrix A, int row, int col)\n",
    "{\n",
    "    return A.elements[row * A.stride + col];\n",
    "}\n",
    "\n",
    "// Set a matrix element\n",
    "__device__ void SetElement(Matrix A, int row, int col,\n",
    "                           float value)\n",
    "{\n",
    "    A.elements[row * A.stride + col] = value;\n",
    "}\n",
    "\n",
    "// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is\n",
    "// located col sub-matrices to the right and row sub-matrices down\n",
    "// from the upper-left corner of A\n",
    " __device__ Matrix GetSubMatrix(Matrix A, int row, int col) \n",
    "{\n",
    "    Matrix Asub;\n",
    "    Asub.width    = BLOCK_SIZE;\n",
    "    Asub.height   = BLOCK_SIZE;\n",
    "    Asub.stride   = A.stride;\n",
    "    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row\n",
    "                                         + BLOCK_SIZE * col];\n",
    "    return Asub;\n",
    "}\n",
    "\n",
    "// Thread block size\n",
    "#define BLOCK_SIZE 16\n",
    "\n",
    "// Forward declaration of the matrix multiplication kernel\n",
    "__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);\n",
    "\n",
    "// Matrix multiplication - Host code\n",
    "// Matrix dimensions are assumed to be multiples of BLOCK_SIZE\n",
    "void MatMul(const Matrix A, const Matrix B, Matrix C)\n",
    "{\n",
    "    // Load A and B to device memory\n",
    "    Matrix d_A;\n",
    "    d_A.width = d_A.stride = A.width; d_A.height = A.height;\n",
    "    size_t size = A.width * A.height * sizeof(float);\n",
    "    cudaMalloc(&d_A.elements, size);\n",
    "    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);\n",
    "    Matrix d_B;\n",
    "    d_B.width = d_B.stride = B.width; d_B.height = B.height;\n",
    "    size = B.width * B.height * sizeof(float);\n",
    "    cudaMalloc(&d_B.elements, size);\n",
    "    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);\n",
    "\n",
    "    // Allocate C in device memory\n",
    "    Matrix d_C;\n",
    "    d_C.width = d_C.stride = C.width; d_C.height = C.height;\n",
    "    size = C.width * C.height * sizeof(float);\n",
    "    cudaMalloc(&d_C.elements, size);\n",
    "\n",
    "    // Invoke kernel\n",
    "    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);\n",
    "    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);\n",
    "    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);\n",
    "\n",
    "    // Read C from device memory\n",
    "    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);\n",
    "\n",
    "    // Free device memory\n",
    "    cudaFree(d_A.elements);\n",
    "    cudaFree(d_B.elements);\n",
    "    cudaFree(d_C.elements);\n",
    "}\n",
    "\n",
    "// Matrix multiplication kernel called by MatMul()\n",
    " __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)\n",
    "{\n",
    "    // Block row and column\n",
    "    int blockRow = blockIdx.y;\n",
    "    int blockCol = blockIdx.x;\n",
    "\n",
    "    // Each thread block computes one sub-matrix Csub of C\n",
    "    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);\n",
    "\n",
    "    // Each thread computes one element of Csub\n",
    "    // by accumulating results into Cvalue\n",
    "    float Cvalue = 0;\n",
    "\n",
    "    // Thread row and column within Csub\n",
    "    int row = threadIdx.y;\n",
    "    int col = threadIdx.x;\n",
    "\n",
    "    // Loop over all the sub-matrices of A and B that are\n",
    "    // required to compute Csub\n",
    "    // Multiply each pair of sub-matrices together\n",
    "    // and accumulate the results\n",
    "    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {\n",
    "\n",
    "        // Get sub-matrix Asub of A\n",
    "        Matrix Asub = GetSubMatrix(A, blockRow, m);\n",
    "\n",
    "        // Get sub-matrix Bsub of B\n",
    "        Matrix Bsub = GetSubMatrix(B, m, blockCol);\n",
    "\n",
    "        // Shared memory used to store Asub and Bsub respectively\n",
    "        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];\n",
    "        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];\n",
    "\n",
    "        // Load Asub and Bsub from device memory to shared memory\n",
    "        // Each thread loads one element of each sub-matrix\n",
    "        As[row][col] = GetElement(Asub, row, col);\n",
    "        Bs[row][col] = GetElement(Bsub, row, col);\n",
    "\n",
    "        // Synchronize to make sure the sub-matrices are loaded\n",
    "        // before starting the computation\n",
    "        __syncthreads();\n",
    "        // Multiply Asub and Bsub together\n",
    "        for (int e = 0; e < BLOCK_SIZE; ++e)\n",
    "            Cvalue += As[row][e] * Bs[e][col];\n",
    "\n",
    "        // Synchronize to make sure that the preceding\n",
    "        // computation is done before loading two new\n",
    "        // sub-matrices of A and B in the next iteration\n",
    "        __syncthreads();\n",
    "    }\n",
    "\n",
    "    // Write Csub to device memory\n",
    "    // Each thread writes one element\n",
    "    SetElement(Csub, row, col, Cvalue);\n",
    "}\n",
    "```\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Access a GPU VM (Exercise)\n",
    "\n",
    "SSH to one of the VMs, which have Nvidia T4, CUDA installed and PyTorch configured. For the user name and password, use your Engineering account.\n",
    "\n",
    "- GPU VM 1:\n",
    "   ```bash\n",
    "   ssh -p 60097 -L 8000:localhost:8000  user_name@172.16.26.101\n",
    "   ```\n",
    "\n",
    "- GPU VM 2:\n",
    "  ```bash\n",
    "   ssh -p 60100 -L 8000:localhost:8000  user_name@172.16.26.102\n",
    "   ```\n",
    "\n",
    "- GPU VM 3:\n",
    "  ```bash\n",
    "   ssh -p 60094 -L 8000:localhost:8000  user_name@172.16.26.103\n",
    "   ```\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Setup the environment for CUDA `nvcc` code compilation (Exercise)\n",
    "\n",
    "- CUDA 13.2 Toolkit is already installed on the dedicated virtual desktops.\n",
    "- CUDA compiler, `nvcc` is part of the Toolkit\n",
    "\n",
    "- Need to set the PATH environment variable to access the Nvidia toolkit: in your ```.bashrc``` add the two lines:\n",
    "```bash\n",
    "export PATH=/usr/local/cuda-13.2/bin${PATH:+:${PATH}}\n",
    "export LD_LIBRARY_PATH=/usr/local/cuda-13.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}\n",
    "```\n",
    "- Run ```bash source .bashrc```\n",
    "\n",
    "Check if the nvidia driver is loaded and the operating system can communicate with the GPU:\n",
    "```bash\n",
    "nvidia-smi\n",
    "```\n",
    "\n",
    "Create directory CUDA:\n",
    "```bash\n",
    "mkdir CUDA\n",
    "```\n",
    "Copy ```deviceQuery.cpp``` source code into directory CUDA, and compile it:\n",
    "```bash\n",
    "cd CUDA\n",
    "nvcc -o deviceQuery deviceQuery.cpp -I /usr/local/cuda-samples/Common\n",
    "```\n",
    "Run the executable to see the prameters of the GPU card on your desktop:\n",
    "```bash\n",
    "./deviceQuery\n",
    "```\n",
    "\n",
    "<hr>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compile and run `hello.cu` (Exercise)\n",
    "\n",
    "in directory CUDA, download a tar ball with cuda source codes, and extract the files:\n",
    "```bash\n",
    "wget http://linuxcourse.rutgers.edu/2024/html/lessons/HPC_3/GPU.tgz\n",
    "tar -zxvf GPU.tgz\n",
    "```\n",
    "Compile and run `hello.cu`:\n",
    "```bash\n",
    "nvcc -o hello hello.cu\n",
    "./hello\n",
    "```\n",
    "Edit file `hello.cu`, and change the number of blocks to 4 and threads per block to 8 in the line that starts the device code:\n",
    "``` c \n",
    "print_from_gpu<<<4,8>>>();\n",
    "```\n",
    "Recompile and run the code:\n",
    "```bash\n",
    "nvcc -o hello hello.cu\n",
    "./hello\n",
    "```\n",
    "See how the output from the blocks and the threads changes.\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compile and run the matrix multiplication codes (Exercise)\n",
    "\n",
    "A routine that computes the wallclock time for the device function has been added in the both\n",
    " matrix multiplication codes discussed above. \n",
    "\n",
    "Compile and run the code with `global memory` computation:\n",
    "```bash\n",
    "nvcc -o matrix_mult  matrix_mult.cu \n",
    "./matrix_mult\n",
    "```\n",
    "Compile and run the code with `shard memory` computation:\n",
    "```bash\n",
    "nvcc -o matrix_mult_shared matrix_mult_shared.cu\n",
    "./matrix_mult_shared\n",
    "```\n",
    "Compare the printed out wall clock times of the device part. Notice that `matrix_mult_shared` runs about 2.6 times faster than `matrix_mult`.\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  CUDA enabled programming environments\n",
    "\n",
    "For example: Matlab and Python\n",
    "\n",
    "CUDA support in Python comes in PyTorch, TensorFlow, pybind11, etc.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "## Perceptron\n",
    "\n",
    "\n",
    "![Perceptron](img/bhu.webp)\n",
    "\n",
    "Neural network model: $\\hat{y} = f(\\Sigma w_i\\;x_i + b)$ \n",
    "\n",
    "Input parameters: $X = (x_1, x_2, ..., x_n)$ \n",
    "\n",
    "Weights: $W = (w_1, w_2, ..., w_n)$ \n",
    "\n",
    "Bias: $b$\n",
    "\n",
    "Nonlinear activation function: $f$ \n",
    "\n",
    "True data: $y$\n",
    "\n",
    "Modeled result: $\\hat{y}$\n",
    "\n",
    "\n",
    "Loss or error $\\delta y = y -  \\hat{y}$\n",
    "\n",
    "Iterative training update of the weights:  $w_i = w_i + \\eta \\; (y - \\hat{y})$. Also called back propagation.\n",
    "\n",
    "Iterative training update of the bias: $b = b + \\eta \\; (y - \\hat{y})$.\n",
    "There are different algorithms that can be used for computing $\\eta$. \n",
    "\n",
    "The training finalizes the model with $(w_1, w_2, ..., w_n)$ and $b$. \n",
    "\n",
    "After training, the model can be used for predicting the final outome for new input parameters.\n",
    " \n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "## Neural networks\n",
    "\n",
    "In practical cases, the perceptrons are combined into neural networks.\n",
    "\n",
    "\n",
    "![Neural networks](img/_neural_network.webp)\n",
    "\n",
    "Python based PyTorch is a suit of methods for computing with neural networks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "## Jupyter hub access\n",
    "\n",
    "\n",
    "Navigate your browser to:\n",
    "```bash\n",
    "http://localhost:8000\n",
    "```\n",
    "\n",
    "Login to the Jupyter hub with the same Engineering account."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "## Exercises with Pytorch\n",
    "\n",
    "In your home directory of the GPU VM, create directory Python and download ```cuda_Pytorch_sonar.ipynb``` into the directory:\n",
    "\n",
    "```bash\n",
    "mkdir Python\n",
    "cd Python\n",
    "wget --no-check-certificate https://linuxcourse.rutgers.edu/Files/cuda_Pytorch_sonar.ipynb\n",
    "```\n",
    "\n",
    "Navigate your Juputer lab to the file and load it in the cells.\n",
    "\n",
    "Alternatively, you can copy/paste the code below into a Jupyter cell.\n",
    "\n",
    "We utilize the ```Sonar, Mines vs. Rocks``` dataset downloaded from University of Chicago.\n",
    "It is directly downloadable into Pandas dataframe.\n",
    "\n",
    "By using ```StratifiedKFold``` module, we split the dataset into 80% training and 20% testing.\n",
    "\n",
    "We utilize two models:\n",
    "- with 1 hidden layer\n",
    "- with 3 hidden layers\n",
    "\n",
    "Apply the Mean Square Errors and Adam optimizer for back propagation.\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The \"Sonar, Mines vs Rocks\" code on GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "\n",
    "# Import Pytorch \n",
    "import torch\n",
    "\n",
    "# If available, assign the device to GPU and print the GPU type\n",
    "if torch.cuda.is_available():\n",
    "    device = torch.device(\"cuda\")\n",
    "    print(f\"GPU Device: {torch.cuda.get_device_name(0)}\")\n",
    "else:\n",
    "    device = torch.device(\"cpu\")\n",
    "\n",
    "# Import neural networks and optimizers from the Torch:\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "\n",
    "# Import the modules for data preprocessing:\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import StratifiedKFold, train_test_split\n",
    "\n",
    "# Import Sonar dataset module with experimental sonar measurements.\n",
    "from ucimlrepo import fetch_ucirepo \n",
    "\n",
    "connectionist_bench_sonar_mines_vs_rocks = fetch_ucirepo(id=151)\n",
    "\n",
    "# Pull the sonar data into pandas dataframes \n",
    "X = connectionist_bench_sonar_mines_vs_rocks.data.features \n",
    "y = connectionist_bench_sonar_mines_vs_rocks.data.targets \n",
    "\n",
    "# metadata \n",
    "print(connectionist_bench_sonar_mines_vs_rocks.metadata) \n",
    "  \n",
    "# variable information \n",
    "print(connectionist_bench_sonar_mines_vs_rocks.variables) \n",
    "\n",
    "# Digitize the output values:\n",
    "y = y.replace('R',float(1))\n",
    "y = y.replace('M',float(0))\n",
    "\n",
    "# Check the data type of X and y:\n",
    "type(X)\n",
    "type(y)\n",
    "\n",
    "# Preprocess pandas dataframes into numpy arrays: \n",
    "X = X.values\n",
    "y = y['class'].values\n",
    "y = y.astype(np.float32)\n",
    "\n",
    "# Again, check the data type of X and y:\n",
    "type(X)\n",
    "type(y)\n",
    "\n",
    "# Import the numpy arrays into the torch tensors\n",
    "X = torch.FloatTensor(X)\n",
    "y = torch.FloatTensor(y)\n",
    "\n",
    "# Again, check the data type of X and y:\n",
    "type(X)\n",
    "type(y)\n",
    "\n",
    "# Define a one layer neural model. It works OK here.\n",
    "class sonar_neural_1(nn.Module):\n",
    "    def __init__(self,input_size,hidden_size1,output_size):\n",
    "        super().__init__()\n",
    "        self.layer1 = nn.Linear(input_size, hidden_size1)\n",
    "        self.act1 = nn.ReLU()\n",
    "        self.output = nn.Linear(hidden_size1, output_size)\n",
    " \n",
    "    def forward(self, x):\n",
    "        x = self.layer1(x)\n",
    "        x = self.act1(x)\n",
    "        x = self.output(x)\n",
    "        return x\n",
    "\n",
    "# Define a 3 layer neural model. It works better here than the above.\n",
    "class sonar_neural_3(nn.Module):\n",
    "    def __init__(self,input_size,hidden_size,output_size):\n",
    "        super().__init__()\n",
    "        self.layer1 = nn.Linear(input_size, hidden_size)\n",
    "        self.act1 = nn.ReLU()\n",
    "        self.layer2 = nn.Linear(hidden_size, hidden_size)\n",
    "        self.act2 = nn.ReLU()\n",
    "        self.layer3 = nn.Linear(hidden_size, hidden_size)\n",
    "        self.act3 = nn.ReLU()\n",
    "        self.output = nn.Linear(hidden_size, 1)\n",
    " \n",
    "    def forward(self, x):\n",
    "        x = self.act1(self.layer1(x))\n",
    "        x = self.act2(self.layer2(x))\n",
    "        x = self.act3(self.layer3(x))\n",
    "        x = self.output(x)\n",
    "        return x\n",
    "\n",
    "# k-fold data splitter. Split the X and y into the training and testing subsets: 80% and 20%, accordingly.\n",
    "kfold = StratifiedKFold(n_splits=5, shuffle=True)\n",
    "\n",
    "for train_index, test_index in kfold.split(X, y):\n",
    "    X_train, X_test = X[train_index], X[test_index]\n",
    "    y_train, y_test = y[train_index], y[test_index]\n",
    "\n",
    "# Load the training data onto the GPU device\n",
    "X_train = X_train.to(device)\n",
    "y_train = y_train.to(device)\n",
    "\n",
    "# Choose one of the two models defined above\n",
    "model = sonar_neural_1(input_size = 60, hidden_size1=120, output_size= 1).to(device)\n",
    "#model = sonar_neural_3(input_size = 60, hidden_size=30, output_size= 1).to(device)\n",
    "\n",
    "# Set the criterion function that drives the weight updates\n",
    "criterion = nn.MSELoss()  # Mean Square Errors\n",
    "\n",
    "# set the optimizer a mathematical algorithm used to minimize the error (loss)\n",
    "optimizer=   optim.Adam(model.parameters(),lr = 0.0001)\n",
    "\n",
    "# Increases the tensor's dimensionality by one without changing its underlying data\n",
    "y_train = y_train.unsqueeze(1)\n",
    "y_test = y_test.unsqueeze(1)\n",
    "\n",
    "# Proceed with iterative training\n",
    "losses = []\n",
    "num_epochs = 3000\n",
    "for epoch in range(num_epochs):\n",
    "  #Forward pass\n",
    "  y_pred = model(X_train)\n",
    "  loss = criterion(y_pred,y_train)\n",
    "\n",
    "  #Backward Pass\n",
    "  optimizer.zero_grad()\n",
    "  loss.backward()\n",
    "  optimizer.step()\n",
    "\n",
    "  if epoch % 100 == 0:\n",
    "    print(f'Epoch [{epoch}/100], Loss: {loss.item(): .4f}')\n",
    "  losses.append(loss.item())\n",
    "\n",
    "# Plot the loss value vs iterations\n",
    "import matplotlib.pyplot as plt\n",
    "plt.plot(range(num_epochs),losses)\n",
    "plt.xlabel('Iteration')\n",
    "plt.ylabel('Loss')\n",
    "plt.title('Loss over itrations')\n",
    "plt.show()\n",
    "\n",
    "# The backtest on the 20% of the data:\n",
    "with torch.no_grad():\n",
    "  y_test_pred = model(X_test.to(device)).to(device)\n",
    "  test_loss = criterion(y_test_pred,y_test.to(device))\n",
    "  print(f'Test Loss: {test_loss.item():.4f}')\n",
    "\n",
    "# The accuracy of the model\n",
    "mae = torch.abs(y_test_pred - y_test.to(device)).mean()\n",
    "print('Mean Absolute Error:',mae.item())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "References: \n",
    "\n",
    "[Cuda programming guide](https://docs.nvidia.com/cuda/cuda-programming-guide/)\n",
    "\n",
    "[Perceptron](https://www.geeksforgeeks.org/deep-learning/what-is-perceptron-the-simplest-artificial-neural-network/)\n",
    "\n",
    "[Neural networks](https://www.geeksforgeeks.org/deep-learning/neural-networks-a-beginners-guide/)\n",
    "\n",
    "[Using Pytorch in machine learning](https://machinelearningmastery.com/use-pytorch-deep-learning-models-with-scikit-learn/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}