Machine Learning for Large Scale Code Analysis

on April 18, 2019

by Hugo Mougard

Talk Plan

Introduction
Overview of source{d} stack
Latest ML team project

Introduction

source{d}

~40 employees, full-remote
ML team: 7 employees
Open-core company
Goal: make ML on Code easy

ML on Code

Harder than Computer Vision, NLP

Less tooling
Fewer models
Less visibility

Stack Overview

Code as Data

Overview

3 sub-problems

Data Retrieval
Data Processing
Language Analysis

Data Retrieval

2 projects

rovers: Crawl the internet for git repositories
borges: Download all known repositories

Data Retrieval

Some gory details

51M repositories downloaded
26M siva files
47M left to go
~500TB total size
cluster with 2k threads, 11TB RAM and 1PB storage

Data Processing

gitbase: Expose git repos as SQL databases
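
To give a feel for what "git repos as SQL databases" means in practice, here is a minimal sketch. It does not talk to gitbase: it uses Python's built-in sqlite3 and a hand-filled `commits` table as a stand-in so that it runs anywhere. The table layout, column names and rows are assumptions made for illustration, not gitbase's actual schema.

```python
import sqlite3

# Stand-in for a gitbase connection: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits (repository_id TEXT, commit_hash TEXT, author_name TEXT)"
)
# Hypothetical rows; a real setup would expose actual git history.
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("repo-a", "a1", "alice"),
        ("repo-a", "a2", "bob"),
        ("repo-a", "a3", "alice"),
        ("repo-b", "b1", "carol"),
    ],
)


def commits_per_author(connection, repository_id):
    """Return (author, commit count) pairs for one repository, most active first."""
    query = """
        SELECT author_name, COUNT(*) AS n
        FROM commits
        WHERE repository_id = ?
        GROUP BY author_name
        ORDER BY n DESC, author_name ASC
    """
    return connection.execute(query, (repository_id,)).fetchall()


print(commits_per_author(conn, "repo-a"))
```

The point is that questions about history (who commits most, which files change together) become ordinary SQL aggregations.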

Language Analysis

bblfsh: Turn language specific ASTs into UASTs (Universal ASTs)

Universal ASTs Specification

Types shared across languages

Identifier
String
QualifiedIdentifier
Comment
Block
…
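
The idea of shared types can be sketched with a toy mapping from one language-specific AST to universal labels. This is not how bblfsh is implemented: it only uses Python's standard `ast` module, covers two of the types listed above, and the function name `universal_nodes` is hypothetical.

```python
import ast


def universal_nodes(source):
    """Return (universal_type, text) pairs for a few Python AST node kinds."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Python-specific `Name` nodes become the shared `Identifier` type.
            found.append(("Identifier", node.id))
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            # String literals become the shared `String` type.
            found.append(("String", node.value))
    return found


print(sorted(universal_nodes('x = "hi"\nprint(x)')))
```

A driver for another language would emit the same labels from its own native node kinds, which is what lets downstream tools stay language-agnostic.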

Supported Languages

Bash
C++
C#
Go
Java
JavaScript
PHP
Python
Ruby
TypeScript

Demo time 😎

This way

Machine Learning on Code

Built on Code as Data

apollo: Duplicate detection
tmsc & snippet-ranger: Repository topic modeling
id2vec: Identifier embedding
ml: Vendor/garbage detection

Assisted Code Review

Lookout

Latest ML Team Project

Assisted code review

3 initial targets

Formatting
Typos
Best practices

format-analyzer

Goal: automate formatting

Model existing style in repositories
Apply modeled styles to new code

Design choices

Must explain false positives

Must not have false positives

First experiment

Unsupervised learning with explainable rules

Learn a tree model on existing code
Transform the tree to rules
Apply modeled styles to new code

Problem statement & Features

This way

Model

Decision tree or random forest
Bayesian hyper-parameter optimization
Rule = conjunction of conditions on a branch
Condition filtering
Prediction filtering (constant UAST invariant)
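
"Rule = conjunction of conditions on a branch" can be sketched as follows. This is an illustration only, not the format-analyzer code: the tuple-based tree encoding, the feature names (`prev_is_brace`, `indent_level`) and the labels are all hypothetical.

```python
# Internal node: (feature, threshold, left, right); go left when value <= threshold.
# Leaf: a string naming the predicted formatting.
TOY_TREE = (
    "prev_is_brace", 0.5,
    "space",
    ("indent_level", 1.5, "newline+4", "newline+8"),
)


def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf branch into (list of conditions, prediction)."""
    if isinstance(node, str):
        return [(list(conditions), node)]
    feature, threshold, left, right = node
    rules = tree_to_rules(left, conditions + ((feature, "<=", threshold),))
    rules += tree_to_rules(right, conditions + ((feature, ">", threshold),))
    return rules


def apply_rules(rules, sample):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in rules:
        if all(
            sample[f] <= t if op == "<=" else sample[f] > t
            for f, op, t in conditions
        ):
            return prediction
    return None


rules = tree_to_rules(TOY_TREE)
for conditions, prediction in rules:
    print(" AND ".join(f"{f} {op} {t}" for f, op, t in conditions), "->", prediction)
```

Because each rule is a plain list of conditions, it can be shown to a reviewer as the explanation for a suggestion, and individual conditions or whole rules can be filtered afterwards.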

Evaluation

Reproduction task: ~94.3% precision
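
As a reminder of what the metric measures, here is a small sketch of precision on a reproduction task. It assumes that the model may abstain on a token (represented by `None`), so that only the predictions actually made are scored; the slides do not spell out the exact protocol, and the function name and data below are hypothetical.

```python
def precision(predicted, actual):
    """Fraction of the predictions made (non-None) that match the original code."""
    made = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    if not made:
        return 0.0
    return sum(p == a for p, a in made) / len(made)


# Three predictions are made, two of them match the original formatting.
print(precision([" ", None, "\n", " "], [" ", "\n", "\n", ""]))
```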

Ways to improve

Better inductive bias
More expressive model
Use more data than a single repository to train

Inductive bias

Tweak of the general algorithm

→ helps to learn some problems

No bias

Video

Images

Locality is key

Demo

Text

Need to handle sequences of variable lengths

Demo

Text

Large vocabularies. To handle them:

You shall know a word by the company it keeps

Demo
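
The "company it keeps" idea can be sketched by counting which tokens appear next to each other; embedding methods build on statistics of this kind. This is a minimal illustration, not the demo from the talk: the function name, the window size and the toy token stream are assumptions.

```python
from collections import Counter, defaultdict


def cooccurrences(tokens, window=1):
    """Count, for each token, the tokens seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, token in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts[token][tokens[j]] += 1
    return counts


tokens = "for i in range for j in range".split()
counts = cooccurrences(tokens)
# `i` and `j` never co-occur, yet they keep the same company.
print(counts["i"] == counts["j"])
```

Tokens with similar context counts end up close to each other, which is how a model can treat `i` and `j` alike without having seen them together.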

Reasoning

Memory and pointers

Inductive bias for MLonCode

Which tool will be efficient?

Bimodality

Code = two separate channels:

Algorithmic channel: Computers, humans
Descriptive channel: Humans

Algorithmic channel

→ Close to parsing. Demo 1, Demo 2.

New inductive biases

code2vec

Second experiment

Ongoing, early stage. Goal: use expressive models

Generative model AST → Code
Meta-learning on numerous repositories

Problem statement

Predict formatting characters before each leaf of the AST

Inter-leaf formatting
Sequence prediction
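
The prediction targets can be sketched as (formatting before, leaf) pairs. The snippet below is an assumption-laden illustration: it uses a crude regular-expression tokenizer as a stand-in for real AST leaves, and the names `TOKEN` and `formatting_targets` are hypothetical.

```python
import re

# Optional whitespace (the formatting to predict) followed by one token
# (a word, or a single punctuation character) standing in for an AST leaf.
TOKEN = re.compile(r"(\s*)(\w+|[^\w\s])")


def formatting_targets(code):
    """Return (formatting_before, leaf_text) pairs, in source order."""
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(code)]


pairs = formatting_targets("if x:\n    y = 1")
for formatting, leaf in pairs:
    print(repr(formatting), "->", leaf)
```

Each formatting string is itself a sequence of characters (spaces, newlines, indentation), which is why the task is framed as sequence prediction per leaf. Concatenating every pair gives back the original code.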

Model

GNN encoder
RNN decoder for each leaf
Property: short dependencies
Property: more autoregressive

Results

For the next seminar!

Conclusion

Lots of Open Source Software: sourced.tech
Git made easy with SQL interface
Universal ASTs
Fun with trees & meta-learning

Thank you for your attention!

Questions & Discussion

Hugo Mougard <hugo@sourced.tech>