PhD Seminar: Learning from Large Codebases

The event has ended

Where: Faculty of Computer Science, Kochnovskii proezd, 3, room 205

When: 18:10 – 19:30 on January 10, 17, 24, and 31  

Veselin Raychev's (ETH Zurich) dissertation, Learning from Large Codebases, received an Honorable Mention for the 2016 ACM Doctoral Dissertation Award. The key findings of the dissertation will be discussed at the four meetings of the PhD Seminar on January 10, 17, 24, and 31.

As the size of publicly available codebases has grown dramatically in recent years, so has the interest in developing programming tools that solve software tasks by learning from these codebases. Yet, the problem of learning from programs has turned out to be harder than expected and thus, up to now, there has been little progress in terms of practical tools that benefit from the availability of these massive datasets. This dissertation focuses on new techniques that learn probabilistic models from large datasets of programs, as well as new tools based on these probabilistic models which improve software development.

The thesis presents three new software systems (JSNice, Slang, and DeepSyn) that learn from large datasets of programs and provide likely solutions to previously unsolved programming tasks including deobfuscation, static type prediction for dynamic languages, and code synthesis. All three of these systems were trained on thousands of open-source projects and answer real-world queries in seconds and with high precision. One of these systems, JSNice, was publicly released and is already widely used in the JavaScript community.

An important ingredient of the thesis is leveraging static analysis techniques to extract semantic representations of programs and building powerful probabilistic models (e.g., conditional random fields) over these semantics. The net result is predictions of better precision than those of approaches whose models are learned directly over program syntax.

Finally, the dissertation presents a new framework for addressing the problem of program synthesis with noise. Using this framework, the thesis shows how to construct programming-by-example (PBE) engines that handle incorrect examples, and introduces a new learning approach based on approximate empirical risk minimization. On top of the framework, a new code synthesis system (DeepSyn) is developed that generalizes prior work and provides state-of-the-art precision.

January 10

During the first seminar, Natalia Korepanova and Kirill Neklyudov will introduce the motivation, challenges, and applications of learning from large codebases considered in the thesis; they will also outline its key contributions.

The second part of the seminar will be devoted to a new approach to probabilistic prediction of program properties and to the JSNice system for deobfuscation (renaming variables to meaningful names) and prediction of optional type annotations in JavaScript code. The approach is based on learning a probabilistic model from existing codebases already annotated with program properties (e.g., names of variables). The term "program property" here encompasses both classic semantic properties of programs and syntactic program elements. The core technical insight is transforming the input program into a representation that enables us to formulate the problem of inferring program properties as structured prediction with conditional random fields (CRFs), a powerful undirected graphical model successfully used in a wide variety of applications including computer vision, information retrieval, and natural language processing. The thesis shows for the first time how to leverage CRFs in the context of programs. At the seminar, we will present the theoretical foundations of this approach, its realization in the JSNice system, the challenges of learning, and how the author proposes to train and evaluate the quality of the approach on program code from public repositories (a small illustrative sketch of the CRF formulation follows the topic list below).

Notation and definitions
Introduction
Discriminative models for predicting program properties
Prediction algorithm
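To make the structured-prediction idea concrete, here is a minimal sketch over a toy dependency network. The node names, relations, and feature weights below are hypothetical illustrations, not the thesis's code: real systems learn the weights from "Big Code" and use greedy or beam-search inference rather than brute force.

```python
# A minimal sketch of CRF-style structured prediction over a program's
# dependency network, in the spirit of JSNice. Everything below (graph,
# candidates, weights) is made up for illustration.

from itertools import product

# Unknown properties (e.g., minified variable names) to predict jointly.
unknown = ["a", "b"]
# Candidate labels (meaningful names) for each unknown node.
candidates = {"a": ["i", "len"], "b": ["len", "step"]}
# Known nodes with fixed labels (e.g., API/field names visible in the code).
known = {"k": "length"}

# Edges of the dependency network with a relationship kind, e.g.
# ("a", "k", "L<R") might mean "a appears left of k in a comparison".
edges = [("a", "k", "L<R"), ("a", "b", "L+=R")]

# Learned pairwise feature weights: (label_left, label_right, relation) -> score.
weights = {
    ("i", "length", "L<R"): 2.0,
    ("len", "length", "L<R"): 0.5,
    ("i", "step", "L+=R"): 1.5,
    ("i", "len", "L+=R"): 0.1,
}

def score(assignment):
    """Total score of a joint assignment: sum of edge feature weights."""
    labels = {**known, **assignment}
    return sum(weights.get((labels[u], labels[v], rel), 0.0)
               for u, v, rel in edges)

# Exact MAP inference by brute force over all joint assignments.
best = max((dict(zip(unknown, combo))
            for combo in product(*(candidates[n] for n in unknown))),
           key=score)
print(best, score(best))  # -> {'a': 'i', 'b': 'step'} 3.5
```

The key design point is that labels are not predicted in isolation: related program elements are labeled jointly, so evidence from a known node (here, the API name "length") propagates through the network to its neighbors.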

January 17

Speaker: Gennadii Fedin, PhD student at the Department of Mathematics, Faculty of Economic Sciences

To accomplish various tasks, programmers increasingly rely on the rich functionality provided by numerous libraries and frameworks. Unfortunately, a typical API can involve hundreds of classes with dozens of methods each and often requires specific sequences of operations to be invoked to perform a single task. Recent years have seen increasing interest in code search, recommendation, and completion systems.

The thesis addresses this problem by proposing a new approach to building probabilistic models of code. The idea is to reduce code completion to the natural-language-processing problem of predicting sentence probabilities, and to use regularities found in sequences of method invocations to predict and synthesize a likely method invocation sequence that completes the code. Two models are presented: an n-gram language model and a recurrent neural network. They were trained and tested on a massive number of code snippets obtained from GitHub and other repositories. The models are implemented in the Slang system, which programmers can use to complete missing details in their code (a toy sketch of the n-gram idea follows the slides link below).
Slides
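As a rough illustration of the n-gram idea, here is a minimal bigram model over method invocation sequences. The toy corpus and API names are hypothetical; Slang itself trains far richer models on sequences extracted with static analysis.

```python
# A minimal sketch of a bigram language model over method invocation
# sequences, in the spirit of Slang. The corpus below is made up.

from collections import Counter, defaultdict

# Each "sentence" is the sequence of API calls made on one object.
corpus = [
    ["new_File", "exists", "createNewFile", "getName"],
    ["new_File", "exists", "delete"],
    ["new_File", "createNewFile", "getName"],
]

# Count bigrams: how often does `nxt` follow `prev`?
bigrams = defaultdict(Counter)
for seq in corpus:
    for prev, nxt in zip(seq, seq[1:]):
        bigrams[prev][nxt] += 1

def complete(prefix, length=2):
    """Greedily extend a call sequence with the most likely next calls."""
    seq = list(prefix)
    for _ in range(length):
        followers = bigrams.get(seq[-1])
        if not followers:
            break
        seq.append(followers.most_common(1)[0][0])
    return seq

# Completion query: what likely follows `new File(...); f.exists();`?
print(complete(["new_File", "exists"]))
# -> ['new_File', 'exists', 'createNewFile', 'getName']
```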

 

January 24

Speakers: Tatiana Makhalova and Anna Sokolova, PhD students at the Faculty of Computer Science

In recent years, a lot of effort has been focused on programming by example (PBE), i.e., generating programs from input-output examples. The main problem with the proposed techniques is their inability to deal with incorrect examples. We will discuss an approach to building a noise-resistant program synthesizer and consider its key components, the dataset sampler and the program generator, as well as some aspects of its implementation (called BitSyn). We will go into detail on the two variations of the dataset sampler that are introduced and analysed in the thesis. Special attention will be paid to the performance of BitSyn under different noise settings, e.g., on a dynamically enlarged set of examples with bounded noise, or on a fully observable set of examples that contains an unknown number of errors (a minimal sketch of the noise-resistance idea follows the links below).
Program synthesis with noise
The case of bounded noise and implementation
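Here is a minimal sketch of the noise-resistance idea, under the simplifying assumption of a tiny DSL of 8-bit operations standing in for BitSyn's instruction set: rather than demanding that a candidate program satisfy every example, the synthesizer minimizes the empirical risk, i.e., the fraction of examples it violates.

```python
# A minimal sketch of PBE with noise via empirical risk minimization.
# The DSL, examples, and search are illustrative stand-ins, not the
# thesis's actual BitSyn implementation.

from itertools import product

# Candidate single-step transformations over 8-bit values.
OPS = {
    "neg":     lambda x: (-x) & 0xFF,
    "not":     lambda x: (~x) & 0xFF,
    "shl1":    lambda x: (x << 1) & 0xFF,
    "shr1":    lambda x: x >> 1,
    "and_x-1": lambda x: x & ((x - 1) & 0xFF),  # clears the lowest set bit
}

def run(prog, x):
    for op in prog:
        x = OPS[op](x)
    return x

def risk(prog, examples):
    """Empirical risk: fraction of I/O examples the program gets wrong."""
    return sum(run(prog, i) != o for i, o in examples) / len(examples)

def synthesize(examples, max_len=2):
    """Return the shortest candidate program minimizing empirical risk."""
    best, best_risk = None, float("inf")
    for n in range(1, max_len + 1):
        for prog in product(OPS, repeat=n):
            r = risk(prog, examples)
            if r < best_risk:
                best, best_risk = prog, r
    return best, best_risk

# Examples for "clear the lowest set bit", with one corrupted pair (12 -> 9).
examples = [(0b1010, 0b1000), (0b0111, 0b0110), (12, 9)]
prog, r = synthesize(examples)
print(prog, r)  # -> ('and_x-1',) 0.333...  (the noisy example is outvoted)
```

With a correct majority of examples, the corrupted pair is simply outvoted; the dataset sampler discussed above plays the complementary role of choosing which examples the program generator should fit.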

 

January 31

Speakers: Dmitry Aldunin, PhD student at the School of Business Informatics, and Pavel Sulimov, PhD student at the Faculty of Computer Science

We address the problem of choosing an intermediate representation for a “Big Code” system and formulate it as a program synthesis problem. We will discuss how the concepts presented at the previous seminar are applied to handling noise in training data (e.g., some programs from “Big Code” contain bugs or are badly written and should be ignored) and performing fast and scalable search for the best intermediate representation.

We will also demonstrate how these concepts are combined with the probabilistic models discussed at the previous seminars (a small sketch of the idea follows the slides link below).
Learning a synthesizer with “Big Code”
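To give a flavor of how choosing an intermediate representation can be cast as synthesis, here is a minimal sketch: candidate context-extraction functions (hypothetical stand-ins for programs in DeepSyn's DSL; the toy data is likewise made up) are scored by how well the conditional model each induces predicts held-out code, and the best one is kept.

```python
# A minimal sketch of representation choice as program synthesis:
# enumerate candidate context extractors, fit a conditional count model
# under each, and keep the one that best predicts held-out code.

from collections import Counter, defaultdict

# Toy "Big Code": token sequences from training programs (some may be noisy).
train = [list("abcabcabc"), list("abcabcab"), list("aacaba")]  # last one buggy
held_out = [list("abcabc")]

# Candidate intermediate representations: history -> conditioning context.
REPRS = {
    "last1": lambda h: tuple(h[-1:]),
    "last2": lambda h: tuple(h[-2:]),
    "const": lambda h: (),          # ignore history entirely
}

def fit(repr_fn, data):
    """Count model: P(next token | repr_fn(history)), as raw counts."""
    model = defaultdict(Counter)
    for seq in data:
        for i in range(1, len(seq)):
            model[repr_fn(seq[:i])][seq[i]] += 1
    return model

def accuracy(repr_fn, model, data):
    """Held-out accuracy of the model's top-1 next-token prediction."""
    hits = total = 0
    for seq in data:
        for i in range(1, len(seq)):
            ctx = model.get(repr_fn(seq[:i]))
            hits += bool(ctx) and ctx.most_common(1)[0][0] == seq[i]
            total += 1
    return hits / total

# "Synthesis" loop: the best-scoring representation wins.
best = max(REPRS, key=lambda r: accuracy(REPRS[r], fit(REPRS[r], train), held_out))
print(best)  # -> 'last1' (ties with 'last2' here; both beat 'const')
```

The same held-out scoring is what lets the approach tolerate noise: a buggy or badly written training program that the learned model fails to explain simply contributes little to the chosen representation.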