Machine Learning for Large Scale Code Analysis

for the KTH TCS Seminar series

on March 26, 2019

by Hugo Mougard

Talk Plan

Introduction
Overview of source{d} stack
Latest ML team project

Introduction

source{d}

~40 employees, full-remote
ML team: 7 employees
Open-core company
Goal: make ML on Code easy

Introduction

ML on Code

Harder than Computer Vision or NLP

Less tooling
Fewer models
Less visibility

Stack Overview

Code as Data

Overview

3 sub-problems

Data Retrieval
Data Processing
Language Analysis

Code as Data

Data Retrieval

2 projects

rovers: Crawl the internet for git repositories
borges: Download all known repositories

Code as Data

Data Retrieval

Some gory details

51M repositories downloaded
26M siva files
47M left to go
~500TB total size
cluster with 2k threads, 11TB RAM and 1PB storage

Code as Data

Data Processing

gitbase: Expose git repos as SQL databases
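
To make the "git repos as SQL databases" idea concrete, here is a minimal sketch. It does not use gitbase itself: it builds a toy in-memory SQLite table with Python's standard `sqlite3` module, and the `commits` table with its three columns is a simplified assumption for illustration, not a statement of the real schema.

```python
import sqlite3

# Toy stand-in for a "commits" table (simplified, assumed schema; not gitbase).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "repository_id TEXT, commit_hash TEXT, commit_author_name TEXT)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("repo-a", "a1", "alice"),
        ("repo-a", "a2", "bob"),
        ("repo-a", "a3", "alice"),
        ("repo-b", "b1", "carol"),
    ],
)


def commits_per_author(connection, repository_id):
    """Count commits per author in one repository, most active first."""
    rows = connection.execute(
        "SELECT commit_author_name, COUNT(*) AS n "
        "FROM commits WHERE repository_id = ? "
        "GROUP BY commit_author_name "
        "ORDER BY n DESC, commit_author_name",
        (repository_id,),
    )
    return rows.fetchall()


top_authors = commits_per_author(conn, "repo-a")
```

Once history is exposed as tables, questions such as "who commits most to this repository" reduce to ordinary `GROUP BY` queries.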

Code as Data

Language Analysis

bblfsh: Turn language specific ASTs into UASTs (Universal ASTs)

Code as Data

Universal ASTs Specification

Types shared across languages

Identifier
String
QualifiedIdentifier
Comment
Block
…
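
As a rough illustration of shared node types, the sketch below maps a few nodes of Python's own AST to language-neutral `(type, value)` pairs. It uses only the standard `ast` module (Python 3.8+), not bblfsh, and the `to_universal` helper and its mapping are hypothetical.

```python
import ast


def to_universal(source):
    """Map a few Python AST nodes to language-neutral (type, value) pairs."""
    nodes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Any variable or function name becomes a generic identifier.
            nodes.append(("Identifier", node.id))
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            # String literals become a generic string node.
            nodes.append(("String", node.value))
    return nodes


universal_nodes = to_universal("greeting = 'hello'\nprint(greeting)")
```

A second front end for another language could emit the same pair types, which is what lets downstream analysis stay language-agnostic.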

Code as Data

Supported Languages

Bash
C++
C#
Go
Java
JavaScript
PHP
Python
Ruby
TypeScript

Code as Data

Demo time 😎

This way

Machine Learning on Code

Built on Code as Data

apollo: Duplicate detection
tmsc & snippet-ranger: Repository topic modeling
id2vec: Identifier embedding
ml: Vendor/garbage detection

Assisted Code Review

Lookout

Latest ML Team Project

Assisted code review

3 initial targets

Formatting
Typos
Best practices

Latest ML Team Project

format-analyzer

Goal: automate formatting

Model existing style in repositories
Apply modeled styles to new code

Latest ML Team Project

Design choices

Must explain false positives

Must not have false positives

Latest ML Team Project

First experiment

Unsupervised learning with explainable rules

Learn a tree model on existing code
Transform the tree to rules
Apply modeled styles to new code

Latest ML Team Project

Problem statement & Features

This way

Latest ML Team Project

Model

Decision tree or random forest
Bayesian hyper-parameter optimization
Rule = conjunction of conditions on a branch
Condition filtering
Prediction filtering (constant UAST invariant)
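
The "rule = conjunction of conditions on a branch" step can be sketched as follows. The tree is hand-built rather than learned, and the feature names, thresholds, and labels are all hypothetical. The helpers only show how root-to-leaf paths become rules and how those rules are applied.

```python
# A hand-built toy tree: internal nodes test a feature against a threshold,
# leaves carry the predicted formatting class. All names are hypothetical.
toy_tree = {
    "feature": "prev_token_is_comma", "threshold": 0.5,
    "left": {"label": "no_space"},
    "right": {
        "feature": "inside_parens", "threshold": 0.5,
        "left": {"label": "newline"},
        "right": {"label": "space"},
    },
}


def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf branch."""
    if "label" in node:
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    left = tree_to_rules(
        node["left"], conditions + ((feature, "<=", threshold),)
    )
    right = tree_to_rules(
        node["right"], conditions + ((feature, ">", threshold),)
    )
    return left + right


def apply_rules(rules, sample):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(
            (sample[f] <= t) if op == "<=" else (sample[f] > t)
            for f, op, t in conditions
        ):
            return label
    return None


rules = tree_to_rules(toy_tree)
prediction = apply_rules(rules, {"prev_token_is_comma": 1, "inside_parens": 1})
```

Because every prediction traces back to one explicit conjunction, the rule that fired can be shown to the developer as the explanation.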

Latest ML Team Project

Evaluation

Reproduction task: ~94.3% precision

Latest ML Team Project

Ways to improve

More expressive model
Autoregressive model
Use more data than a single repository to train

Latest ML Team Project

Second experiment

Ongoing, early stage. Goal: use expressive models

Generative model AST → Code
Meta-learning on numerous repositories

Latest ML Team Project

Problem statement

Predict formatting characters before each leaf of the AST

Inter-leaves formatting
Sequence prediction
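
To show what the prediction targets look like, the sketch below pairs each token with the whitespace that precedes it. It is a simplification: a crude regex tokenizer stands in for the leaves of a real AST, and `formatting_before_tokens` is a hypothetical helper.

```python
import re

# Crude tokens: runs of word characters, or a single punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def formatting_before_tokens(code):
    """Pair each token with the whitespace that precedes it in the source."""
    pairs, previous_end = [], 0
    for match in TOKEN.finditer(code):
        # Everything between two consecutive tokens is formatting.
        pairs.append((code[previous_end:match.start()], match.group()))
        previous_end = match.end()
    return pairs


snippet = "if (x) {\n    y = 1;\n}"
pairs = formatting_before_tokens(snippet)
```

Each first element of a pair is a short character sequence (empty, a space, a newline plus indentation), which is why the task reads naturally as sequence prediction per leaf.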

Latest ML Team Project

Model

GNN encoder
RNN decoder for each leaf
Property: short dependencies
Property: more autoregressive

Latest ML Team Project

Training

Supervised learning?

On 1 repo: can be small
On many repos: style is not consistent
But model is expressive: more data is good

Latest ML Team Project

Meta-learning

2 steps

Learn how to learn style on many repositories
Learn style on a specific repository
Property: can use data from many repositories
Property: can model small repos

Latest ML Team Project

Multi-task Learning

Each repository is considered as a task $\mathcal{T}_i$

$$\min_\theta\sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})$$

Latest ML Team Project

Meta-learning

Model-Agnostic Meta-Learning approach

Minimize loss AFTER having optimized for a task $\mathcal{T}_i$

$$\min_\theta\sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i})$$

$\theta'_i$ is optimized from $\theta$ on $\mathcal{T}_i$ at each iteration
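
A minimal sketch of this objective on a toy problem, under strong assumptions: the model is a single scalar $\theta$, each task loss is the quadratic $(\theta - c_i)^2$ with a hypothetical per-repository optimum $c_i$, and the inner optimization is one gradient step. The meta-gradient is obtained by the chain rule through that inner step.

```python
def loss_grad(theta, target):
    """Gradient of the task loss L(theta) = (theta - target) ** 2."""
    return 2.0 * (theta - target)


def maml_step(theta, task_targets, inner_lr=0.1, outer_lr=0.1):
    """One meta-update: adapt to each task, then descend the adapted loss."""
    meta_grad = 0.0
    for target in task_targets:
        # Inner step: theta'_i is optimized from theta on task i.
        adapted = theta - inner_lr * loss_grad(theta, target)
        # Chain rule through the inner step:
        # d(adapted)/d(theta) = 1 - 2 * inner_lr for this quadratic loss.
        meta_grad += loss_grad(adapted, target) * (1.0 - 2.0 * inner_lr)
    return theta - outer_lr * meta_grad / len(task_targets)


theta = 0.0
targets = [1.0, 2.0, 6.0]  # one hypothetical optimum per repository
for _ in range(200):
    theta = maml_step(theta, targets)
```

In this toy setting the meta-learned $\theta$ settles at the mean of the task optima (3.0 here). With non-quadratic losses and a real network, the post-adaptation objective generally yields a different initialization than plain multi-task training.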

Latest ML Team Project

Results

For the next seminar!

Conclusion

Lots of Open Source Software: sourced.tech
Git made easy with SQL interface
Universal ASTs
Fun with tree & meta-learning

Thank you for your attention!

Questions & Discussion

Hugo Mougard <hugo@sourced.tech>