I have worked on several compilers since graduate school.

    * XLA: performance/scalability/model parallelism
    * LLVM: SamplePGO, debug info of optimized code, IR/Codegen optimizations
    * GCC: AutoFDO
    * Open64: SampleFDO

    I currently lead machine learning performance work focused on TPUs.

    * Optimize model and codegen performance for Google-internal training and inference workloads.
    * Optimize TPU performance for MLPerf training and inference submissions.
    * Research and apply high-performance techniques to improve ML efficiency and scalability.