Code IntelliSense | LSP | LSIF

2019-07-16

Tech

TL;DR

LSIF is a file format for precomputed code intelligence data. It provides fast and precise code intelligence, but the data needs to be generated periodically and uploaded to an index repository. Alibaba OSS (Object Storage Service, a low-cost solution for backup and archiving of infrequently accessed data) can be used to transfer the massive amounts of data involved.
LSP standardizes the protocol for how tools and servers communicate, so a single language server can be reused in multiple development tools, and tools can support languages with minimal effort.

LSIF

LSIF defines a format that language servers or standalone tools emit to describe that a tuple such as ['textDocument/hover', 'file:///Users/username/sample.ts', {line: 0, character: 10}] resolves to a particular hover result. The data can then be taken and persisted into a database.

LSP requests are position based, however results often only vary for ranges and not for single positions.

LSIF uses graphs to emit this information. In the graph, an LSP request is represented using an edge. Documents, ranges, or requests (for example, the hover) are represented using vertices. This format has the following benefits:

For a given code range, there can be different results. For a given identifier range, a user is interested in the hover value, the location of the definition, or to Find All References. LSIF therefore links these results with the range.
Extending the format with additional request types or results can easily be done by adding new edge or vertex kinds.
It is possible to emit data as soon as it is available. This enables streaming rather than having to store large amounts of data in memory. For example, emitting data for a document should be done for each file as the parsing progresses.

For the hover example, the emitted LSIF graph data looks as follows:

// a vertex representing the document
{ id: 1, type: "vertex", label: "document", uri: "file:///Users/username/sample.ts", languageId: "typescript" }
// a vertex representing the range for the identifier bar
{ id: 4, type: "vertex", label: "range", start: { line: 0, character: 9}, end: { line: 0, character: 12 } }
// an edge saying that the document with id 1 contains the range with id 4
{ id: 5, type: "edge", label: "contains", outV: 1, inV: 4}
// a vertex representing the actual hover result
{ id: 6, type: "vertex", label: "hoverResult", result: { contents: [ { language: "typescript", value: "function bar(): void" } ] } }
// an edge linking the hover result to the range.
{ id: 7, type: "edge", label: "textDocument/hover", outV: 4, inV: 6 }
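
To make the graph traversal concrete, here is a minimal sketch of how a consumer could answer the hover request from the five elements above. The type and function names (LsifElement, lookupHover, inRange) are made up for illustration, and the sketch only understands the simplified element shapes shown in this example, not the full LSIF format.

```typescript
// Hypothetical minimal types covering only the elements used in the hover example.
interface TextPosition { line: number; character: number }
type LsifElement =
  | { id: number; type: "vertex"; label: "document"; uri: string; languageId: string }
  | { id: number; type: "vertex"; label: "range"; start: TextPosition; end: TextPosition }
  | { id: number; type: "vertex"; label: "hoverResult"; result: unknown }
  | { id: number; type: "edge"; label: "contains" | "textDocument/hover"; outV: number; inV: number };

const dump: LsifElement[] = [
  { id: 1, type: "vertex", label: "document", uri: "file:///Users/username/sample.ts", languageId: "typescript" },
  { id: 4, type: "vertex", label: "range", start: { line: 0, character: 9 }, end: { line: 0, character: 12 } },
  { id: 5, type: "edge", label: "contains", outV: 1, inV: 4 },
  { id: 6, type: "vertex", label: "hoverResult", result: { contents: [{ language: "typescript", value: "function bar(): void" }] } },
  { id: 7, type: "edge", label: "textDocument/hover", outV: 4, inV: 6 },
];

// True when pos lies inside the half-open interval [start, end).
function inRange(pos: TextPosition, start: TextPosition, end: TextPosition): boolean {
  const afterStart = pos.line > start.line || (pos.line === start.line && pos.character >= start.character);
  const beforeEnd = pos.line < end.line || (pos.line === end.line && pos.character < end.character);
  return afterStart && beforeEnd;
}

function lookupHover(elements: LsifElement[], uri: string, pos: TextPosition): unknown {
  // Step 1: find the document vertex for the URI.
  let docId: number | undefined;
  for (const e of elements) {
    if (e.label === "document" && e.uri === uri) docId = e.id;
  }
  if (docId === undefined) return undefined;

  // Step 2: follow "contains" edges to the ranges of that document.
  const rangeIds = new Set<number>();
  for (const e of elements) {
    if (e.label === "contains" && e.outV === docId) rangeIds.add(e.inV);
  }

  // Step 3: pick the range that covers the requested position.
  let rangeId: number | undefined;
  for (const e of elements) {
    if (e.label === "range" && rangeIds.has(e.id) && inRange(pos, e.start, e.end)) rangeId = e.id;
  }
  if (rangeId === undefined) return undefined;

  // Step 4: follow the "textDocument/hover" edge to the hover result vertex.
  let resultId: number | undefined;
  for (const e of elements) {
    if (e.label === "textDocument/hover" && e.outV === rangeId) resultId = e.inV;
  }
  for (const e of elements) {
    if (e.label === "hoverResult" && e.id === resultId) return e.result;
  }
  return undefined;
}
```

Any position inside the range (characters 9 to 11 on line 0) resolves to the same hover result, which is the point of storing results per range rather than per position.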

SG Docs

The idea of LSIF is to use a compiler frontend to pre-compute code intelligence data in a project-specific build environment and then upload that data to Sourcegraph. A one-shot command line tool that runs in the proper build environment and writes data to a file is simpler than implementing a long-lived remote LSP server.

Kythe

The Kythe project was founded to provide and support tools and standards that encourage interoperability among programs that manipulate source code. At a high level, the main goal of Kythe is to provide a standard, language-agnostic interchange mechanism, allowing tools that operate on source code — including build systems, compilers, interpreters, static analyses, editors, code-review applications, and more — to share information with each other smoothly.

Kythe grew out of our experience creating a large-scale semantic index of cross-references for the enormous, multi-lingual internal codebase at Google.

Goals of Kythe

The best way to view Kythe is as a “hub” for connecting tools for various languages, clients and build systems. By defining language-agnostic protocols and data formats for representing, accessing and querying source code information as data, Kythe allows language analysis and indexing to be run as services. This, in turn, enables lightweight (“thin”) composition of analysis tools with client tools such as editors, IDEs, and code browsers.

A hub-and-spoke model reduces the overall work to integrate L languages, C clients, and B build systems from a worst-case of O(L×C×B) — combinatorial in the size of the ecosystem — to O(L+C+B): Implementing Kythe compatibility for a given compiler, editor, or build system is, roughly, a constant up-front cost for each component, after which that component can interoperate with all the existing pieces directly.
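
As a quick worked example of the two cost models, the sketch below counts integrations for a hypothetical ecosystem of 5 languages, 4 clients, and 3 build systems. The function names and the numbers are illustrative only.

```typescript
// Integrations needed when every (language, client, build system) combination is wired up directly.
function pairwiseIntegrations(languages: number, clients: number, buildSystems: number): number {
  return languages * clients * buildSystems;
}

// Integrations needed when each component only targets the shared hub format.
function hubIntegrations(languages: number, clients: number, buildSystems: number): number {
  return languages + clients + buildSystems;
}

// Hypothetical ecosystem: 5 languages, 4 clients, 3 build systems.
console.log(pairwiseIntegrations(5, 4, 3)); // 60
console.log(hubIntegrations(5, 4, 3)); // 12
```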

To make this model work, Kythe provides a language-agnostic graph structure to capture build-system and compiler metadata, as well as semantic information about source code such as cross-references (e.g., definitions and their usages, type information, and cross-language associations). By design, the Kythe graph schema is liberal and extensible — we’ve defined a number of useful subgraphs, but new node and edge kinds are structured so that the graph can easily be extended without recourse to a central authority.

What Kythe Provides

The core of the Kythe project centers around three themes, which are embodied in our open-source tools and supported by the Kythe team at Google together with any interested contributors:

Language-agnostic graph storage format: Kythe defines a simple, flexible, and portable graph representation that is easy to emit from an instrumented compiler, and for clients to consume
Graph schema: Kythe provides a simple, extensible graph schema for a variety of interesting semantic cross-reference data in various languages.
Analyzers, tools and examples: The Kythe project provides several open-source tools for generating and manipulating Kythe data, including indexers for C++, Java, and Go; a self-contained server that can use Kythe data to answer cross-reference queries; and some UI example code that shows how some of these pieces can be glued together.

What Kythe Requires

Essentially all that is needed to participate in the Kythe ecosystem is for a tool to consume and/or emit data in the Kythe format, and – where appropriate – to follow the Kythe schema. You’ll need:

To plug in a Language: A compiler that can be instrumented to produce an indexer that emits Kythe data about source code in that Language.
To plug in a build system: A tool that can “extract” compilation information from the build process, allowing a Language-specific analyzer to be run on the code and its dependencies.
To plug in a UI tool: Any tool that can consume Kythe graph artifacts can use Kythe data to answer questions about code.
To build a service or other analysis on Kythe data: A tool or service can quickly convert Kythe graph data into tabular or other structured formats for quick serving, graph exploration, visualization, etc

IDE Vendors & LSP

VSCode

Language support is typically complex and uses a lot of storage because a program is an AST… It makes sense to have potentially resource-killing programming language services run as independent processes to separate them from VS Code…. Therefore, an IDE just needs to implement one integration of the protocol to use all programming language servers “for free”, as long as they are using the same protocol.

VSCode runs extensions in a separate process and communicates with this so-called “extension host” via RPC. Extensions are isolated and the core of the IDE is protected.

Editors’ declarative Language features

Syntax highlighting
Snippet completion
Bracket matching / autoClosing
Comment toggling
Auto indentation

IntelliJ

https://www.reddit.com/r/rust/comments/6fs5q9/language_servers_and_ides/

…No, we plan to implement most of the language analysis from scratch. This is a lot of work, but the benefits are substantial. We would be able to leverage IntelliJ Platform infrastructure for incremental analysis and indexing. With our own analysis we can provide more flexible quick fixes, intentions and typing assistance. The same applies to the formatter (that is also from scratch, but we plan to support running rustfmt as an action). It is necessary for proper working of almost any feature which is modifying source code.

https://discuss.kotlinlang.org/t/any-plan-for-supporting-language-server-protocol/2471

(Yole) …We have no plans to support LSP at this time. Instead, we’re focusing on providing the best possible Kotlin development experience in our own IDE, IntelliJ IDEA. …The LSP doesn’t allow to build outstanding support for a language, it allows to build the “least common denominator” support only… We can develop our product far more efficiently if we can build the features we need as part of our product directly, not as extensions to a third-party protocol. Also, the quality of experience of people developing Kotlin in IntelliJ IDEA is far more important to us than the usefulness of our open source code to the community of developers not using IntelliJ IDEA.

Sourcegraph

Code Intel on Code Hosting:

Hover tooltips with documentation and type signatures
- Hover tooltips allow you to quickly glance at the type signature and accompanying documentation of a symbol definition without having to context switch to another source file (which may or may not be available while browsing code).
Go to definition
- When you select ‘Go to definition’ from the hover tooltip, you will be navigated directly to the definition of the symbol.
Find references
- When you select ‘Find references’ from the hover tooltip, a panel will be shown at the bottom of the page that lists all of the references found for both precise (LSIF or language server) and basic results (from search heuristics). Note: results are sometimes truncated.
- Search heuristics: case-sensitive, word-boundary plain-text search (the code is not parsed into an AST, so for precise code intelligence, use LSIF)
Symbol Search
- Use Ctags / LSIF to index symbols. These documentSymbols are used for the symbol sidebar (outline / structure in an IDE), which categorizes declarations by type (variable, function, interface, etc.). Clicking on a symbol jumps to the line where it is defined.

about JSON-RPC Protocol:

An important observation is that the JSON-RPC protocol is fully asynchronous. Responses to clients can be sent out of order and without time restriction. This motivates the correct use of the id parameter, which can be used to map previously sent requests to incoming responses. The protocol currently assumes that one server serves one tool. There is currently no support in the protocol to share one server between different tools.
By default, the protocol assumes that the server is started and closed by the client. Hence, the lifetime of a language server is fully determined by its user.
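
The id-based correlation described above can be sketched as a small table of pending requests on the client side. The class name PendingRequests and its send callback are assumptions for illustration; a real client would also deal with timeouts and cancellation.

```typescript
interface ResponseMessage {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

interface PendingEntry {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

// Hypothetical helper: hands out ids and matches responses to requests by id.
class PendingRequests {
  private nextId = 1;
  private readonly pending = new Map<number, PendingEntry>();
  private readonly send: (message: object) => void;

  constructor(send: (message: object) => void) {
    this.send = send;
  }

  get pendingCount(): number {
    return this.pending.size;
  }

  request(method: string, params: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      // Remember who is waiting for this id before the message leaves.
      this.pending.set(id, { resolve, reject });
      this.send({ jsonrpc: "2.0", id, method, params });
    });
  }

  // Responses may arrive in any order; the id decides which promise settles.
  handleResponse(response: ResponseMessage): void {
    const entry = this.pending.get(response.id);
    if (entry === undefined) return; // unknown or already settled id
    this.pending.delete(response.id);
    if (response.error !== undefined) entry.reject(new Error(response.error.message));
    else entry.resolve(response.result);
  }
}
```

Because the lookup is keyed by id, a response to the second request can be handled before the response to the first one without any confusion.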

about Data Model:

Another key feature of LSP is the lack of any real data model for code. The protocol has no notion of namespaces, class hierarchies, definitions, or references.
In the case of LSP, the input is merely a filename, line number, and column (the location of the reference), and the output is a filename, line number, and column (the location of the definition).
LSP does not attempt to model the semantic relationships in code at all. But if we are trying to build Code Intelligence, shouldn’t our protocol be aware of at least the basic semantic relationships in code? The answer is no for two reasons:
- Coming up with a language-agnostic data model is hard. Some languages are imperative. Others are functional. Some languages permit complex class hierarchies. Other languages don’t even have inheritance. Coming up with a data model that is both general enough to encompass every language feature in the wild and specific enough to be useful for answering user queries is incredibly tricky.
- Naming is hard and naming things in code is no exception. As an exercise, try coming up with a truly unique universal identifier for a symbol in code.
None of this precludes building a semantic data model on top of LSP. In fact, at Sourcegraph, we’ve done exactly that for some of our more advanced features.
Extensibility: … The creators of LSP foresaw that in the future, people would desire new functionality out of language servers beyond what was defined in the original spec. Sourcegraph needs to support features like cross-repository jump-to-definition and global usage examples that no editor or IDE currently offers.

IndexStoreDB

IndexStoreDB is a source code indexing library for use with sourcekit-lsp. It provides a composable and efficient query API for looking up source code symbols, symbol occurrences, and relations. IndexStoreDB uses the libIndexStore library, which lives in swift-clang, for reading raw index data. Raw index data can be produced by compilers such as Clang and Swift using the -index-store-path option. IndexStoreDB enables efficiently querying this data by maintaining acceleration tables in a key-value database built with LMDB.

sourcekit-lsp: LSP implementation for Swift and C-based languages. Sourcekit-LSP is built on top of sourcekitd and clangd for high-fidelity language support, and provides a powerful source code index as well as cross-language support.

sourcekit-lsp uses a global index called IndexStoreDB to provide features that cross file or module boundaries, such as jump-to-definition or find-references. To efficiently create an index of your source code we use a technique called “indexing while building”. When the project is compiled for debugging using swift build, the compiler (swiftc or clang) automatically produces additional raw index data that is read by our indexer. Producing this information during compilation saves work and ensures that any time the project is built the index is updated and fully accurate.

LSP Protocol

The current protocol specification defines that the lifetime of a server is managed by the client (e.g. a tool like VSCode or Emacs). It is up to the client to decide when to start (process-wise) and when to shutdown a server.

LSP Base Protocol

The header part consists of colon-separated key-value pairs. The following header fields are currently supported:

Content-Length: xx
Content-Type: application/vscode-jsonrpc; charset=utf-8

The content part is described using JSON-RPC. Example:

Content-Length: ...\r\n
\r\n
{
 "jsonrpc": "2.0",
 "id": 1,
 "method": "textDocument/didOpen",
 "params": {
     ...
 }
}
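
A minimal sketch of the framing shown above, assuming UTF-8 content and using the standard TextEncoder and TextDecoder globals. The function names encodeMessage and decodeMessage are made up for this note. The detail worth noticing is that Content-Length counts bytes, not characters.

```typescript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// Frame a JSON-RPC message: header part, blank line, then the JSON content part.
function encodeMessage(message: object): Uint8Array {
  const body = encoder.encode(JSON.stringify(message));
  const header = encoder.encode(`Content-Length: ${body.length}\r\n\r\n`);
  const framed = new Uint8Array(header.length + body.length);
  framed.set(header, 0);
  framed.set(body, header.length);
  return framed;
}

// Try to read one message from the buffer. Returns undefined while the buffer is still incomplete.
function decodeMessage(buffer: Uint8Array): { message: unknown; rest: Uint8Array } | undefined {
  // Locate the \r\n\r\n sequence that separates header and content.
  let headerEnd = -1;
  for (let i = 0; i + 3 < buffer.length; i++) {
    if (buffer[i] === 13 && buffer[i + 1] === 10 && buffer[i + 2] === 13 && buffer[i + 3] === 10) {
      headerEnd = i;
      break;
    }
  }
  if (headerEnd < 0) return undefined;

  // Parse the colon-separated header fields and pick out Content-Length.
  let length = -1;
  for (const line of decoder.decode(buffer.subarray(0, headerEnd)).split("\r\n")) {
    const colon = line.indexOf(":");
    if (colon > 0 && line.slice(0, colon).trim().toLowerCase() === "content-length") {
      length = Number(line.slice(colon + 1).trim());
    }
  }
  if (!Number.isInteger(length) || length < 0) throw new Error("missing or invalid Content-Length header");

  const bodyStart = headerEnd + 4;
  if (buffer.length < bodyStart + length) return undefined; // content not fully received yet
  const body = decoder.decode(buffer.subarray(bodyStart, bodyStart + length));
  return { message: JSON.parse(body), rest: buffer.subarray(bodyStart + length) };
}
```

Returning the unread remainder lets a caller keep feeding a stream buffer into decodeMessage until it returns undefined.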

Message types

Request Message
Response Message
Notification Message

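Other:

Implementation-dependent protocol messages use a method that starts with $/
If such a method is not supported, a request is answered with a MethodNotFound error
To cancel a request, send method: $/cancelRequest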
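
A rough sketch of how a receiver might treat these messages. The function names are hypothetical; the error codes are the JSON-RPC MethodNotFound code (-32601) and the LSP RequestCancelled code (-32800). Notifications carry no id, so an unsupported $/ notification is simply dropped, while an unsupported $/ request gets an error response.

```typescript
const METHOD_NOT_FOUND = -32601; // defined by JSON-RPC
const REQUEST_CANCELLED = -32800; // defined by LSP

interface IncomingMessage {
  jsonrpc: "2.0";
  id?: number | string; // present on requests, absent on notifications
  method: string;
  params?: unknown;
}

interface ErrorResponse {
  jsonrpc: "2.0";
  id: number | string;
  error: { code: number; message: string };
}

// Ids of in-flight requests that the peer asked to cancel.
const cancelledIds = new Set<number | string>();

// Hypothetical fallback for "$/" messages that no regular handler claimed.
function handleDollarMessage(message: IncomingMessage): ErrorResponse | undefined {
  if (message.method === "$/cancelRequest") {
    const params = message.params as { id: number | string };
    cancelledIds.add(params.id);
    return undefined; // notifications never get a response
  }
  if (message.id === undefined) return undefined; // unsupported notification: ignore
  return {
    jsonrpc: "2.0",
    id: message.id,
    error: { code: METHOD_NOT_FOUND, message: `Method not found: ${message.method}` },
  };
}

// A request handler can call this before replying; a cancelled request still needs a response.
function cancelledResponse(id: number | string): ErrorResponse | undefined {
  if (!cancelledIds.delete(id)) return undefined;
  return { jsonrpc: "2.0", id, error: { code: REQUEST_CANCELLED, message: "Request cancelled" } };
}
```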

Basic JSON structures

URI
Text Documents
Position
Range
Location
Diagnostic
Command
…

Initialize Request [C + S]

The initialize request is sent as the first request from the client to the server.

Until the server has responded to the initialize request with an InitializeResult, the client must not send any additional requests or notifications to the server.
In addition the server is not allowed to send any requests or notifications to the client until it has responded with an InitializeResult, with the exception that during the initialize request the server is allowed to send the notifications window/showMessage, window/logMessage and telemetry/event as well as the window/showMessageRequest request to the client.
The initialize request may only be sent once.

Request:

method: initialize
params: InitializeParams

interface InitializeParams {
    /**
     * The process Id of the parent process that started the server.
     * Is null if the process has not been started by another process.
     * If the parent process is not alive then the server should exit.
     */
    processId: number | null;
    rootPath?: string | null;
    rootUri: DocumentUri | null;
    initializationOptions?: any;
    capabilities: ClientCapabilities;
    trace?: 'off' | 'messages' | 'verbose';
    /**
     * The workspace folders configured in the client when the server starts.
     * This property is only available if the client supports workspace folders.
     * It can be `null` if the client supports workspace folders but none are
     * configured.
     *
     * Since 3.6.0
     */
    workspaceFolders?: WorkspaceFolder[] | null;
}

Response:

result: InitializeResult

interface InitializeResult {
    /**
     * The capabilities the language server provides.
     */
    capabilities: ServerCapabilities;
}
interface ClientCapabilities {
    /**
     * Workspace specific client capabilities.
     */
    workspace?: WorkspaceClientCapabilities;
    /**
     * Text document specific client capabilities.
     */
    textDocument?: TextDocumentClientCapabilities;
    /**
     * Experimental client capabilities.
     */
    experimental?: any;
}
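
The ordering rules for the initialize handshake can be sketched as a small client-side guard. InitializationGate and buildInitializeRequest are hypothetical names, and the sketch only encodes the rules quoted above: initialize is sent once, and nothing else is sent until the InitializeResult has arrived.

```typescript
// Build a bare initialize request with empty client capabilities (illustrative values only).
function buildInitializeRequest(id: number, processId: number | null, rootUri: string | null) {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: { processId, rootUri, capabilities: {}, trace: "off" },
  };
}

// Hypothetical guard a client could consult before writing any message to the server.
class InitializationGate {
  private state: "uninitialized" | "initializing" | "initialized" = "uninitialized";

  // Throws when sending the given method would violate the handshake order.
  beforeSend(method: string): void {
    if (method === "initialize") {
      if (this.state !== "uninitialized") throw new Error("initialize may only be sent once");
      this.state = "initializing";
      return;
    }
    if (this.state !== "initialized") {
      throw new Error(`cannot send ${method} before the InitializeResult has arrived`);
    }
  }

  // Call when the response to the initialize request has been received.
  onInitializeResult(): void {
    this.state = "initialized";
  }
}
```

After onInitializeResult the client would typically send the initialized notification and then start opening documents.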

Initialized Notification [C]

method: initialized

The server can use the initialized notification for example to dynamically register capabilities.

Shutdown Request [C + S] & Exit Notification [C]

ShowMessage Notification [S]

The showMessage notification is sent from a server to a client to ask the client to display a particular message in the user interface.

method: window/showMessage

ShowMessage Request [S + C]

Sent from a server to a client to display a particular message in the user interface. In addition to the show message notification the request allows to pass actions and to wait for an answer from the client.

method: window/showMessageRequest

interface ShowMessageRequestParams {
    /**
     * The message type. See {@link MessageType}
     */
    type: number;
    /**
     * The actual message
     */
    message: string;
    /**
     * The message action items to present.
     */
    actions?: MessageActionItem[];
}
interface MessageActionItem {
    /**
     * A short title like 'Retry', 'Open Log' etc.
     */
    title: string;
}

LogMessage Notification [S]

method: window/logMessage

Telemetry Notification [S]

method: telemetry/event

Register Capability & Unregister Capability [S + C]

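The client/registerCapability request is sent from the server to the client to register for a new capability on the client side. The client/unregisterCapability request is sent from the server to the client to unregister a previously registered capability.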

{
    "method": "client/registerCapability",
    "params": {
        "registrations": [
            {
                "id": "79eee87c-c409-4664-8102-e03263673f6f",
                "method": "textDocument/willSaveWaitUntil",
                "registerOptions": {
                    "documentSelector": [
                        { "language": "javascript" }
                    ]
                }
            }
        ]
    }
}

Workspace folder requests [S + C]

Many tools support more than one root folder per workspace. An example of this is VS Code’s multi-root support.
The workspace/workspaceFolders request is sent from the server to the client to fetch the current open list of workspace folders.

DidChangeWorkspaceFolders Notification [C]

DidChangeConfiguration Notification [C]

Configuration Request [S + C]

The workspace/configuration request is sent from the server to the client to fetch configuration settings from the client.
The request can fetch several configuration settings in one round-trip.
The order of the returned configuration settings corresponds to the order of the passed ConfigurationItems.
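
A small sketch of the client side of this request, assuming settings are kept in a flat map keyed by section name (the map and the function name answerConfiguration are illustrative). The result array has one entry per requested item, in the same order, with null where the client has no value.

```typescript
interface ConfigurationItem {
  scopeUri?: string;
  section?: string;
}

// Answer a workspace/configuration request: one result per item, in request order.
function answerConfiguration(items: ConfigurationItem[], settings: Map<string, unknown>): unknown[] {
  return items.map((item) => {
    const section = item.section;
    if (section === undefined || !settings.has(section)) return null; // nothing to offer for this item
    return settings.get(section);
  });
}

// Hypothetical settings store and a request for three sections in one round-trip.
const clientSettings = new Map<string, unknown>([
  ["typescript.format", { insertSpaces: true }],
  ["editor.tabSize", 4],
]);
const answers = answerConfiguration(
  [{ section: "editor.tabSize" }, { section: "missing.section" }, { section: "typescript.format" }],
  clientSettings,
);
console.log(JSON.stringify(answers)); // [4,null,{"insertSpaces":true}]
```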

DidChangeWatchedFiles Notification [C]

The workspace/didChangeWatchedFiles notification is sent from the client to the server when the client detects changes to files watched by the Language Client.
It is recommended that servers register for these file events using the registration mechanism.

Servers are allowed to run their own file watching mechanism and not rely on clients to provide file events. However this is not recommended due to the following reasons:

in our experience getting file watching on disk right is challenging, especially if it needs to be supported across multiple OSes.
file watching is not free, especially if the implementation uses some sort of polling and keeps a file tree in memory to compare time stamps (as for example some node modules do)
a client usually starts more than one server. If every server runs its own file watching it can become a CPU or memory problem
in general there are more server than client implementations. So this problem is better solved on the client side

Workspace Symbols Request [C + S]

method: workspace/symbol

Workspace Exec Command [C + S]

method: workspace/executeCommand

DidChangeTextDocument Notification [C]

method: textDocument/didChange

interface DidChangeTextDocumentParams {
    /**
     * The document that did change. The version number points
     * to the version after all provided content changes have been applied.
     */
    textDocument: VersionedTextDocumentIdentifier;
    /**
     * The actual content changes. The content changes describe single state changes to the document.
     * So if there are two content changes c1 and c2 for a document in state S then
     * c1 move the document to S' and c2 to S''
     */
    contentChanges: TextDocumentContentChangeEvent[];
}
/**
 * An event describing a change to a text document. If range and rangeLength are omitted
 * the new text is considered to be the full content of the document.
 */
interface TextDocumentContentChangeEvent {
    /**
     * The range of the document that changed.
     */
    range?: Range;
    /**
     * The length of the range that got replaced.
     */
    rangeLength?: number;
    /**
     * The new text of the range/document.
     */
    text: string;
}
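
To illustrate how a server might apply these change events, here is a minimal sketch. The names TextPosition, TextRange, offsetAt, and applyChanges are made up for this note, and the sketch assumes "\n" line endings. LSP character offsets count UTF-16 code units, which matches JavaScript string indices. Changes are applied in order, so each range refers to the document state produced by the previous change.

```typescript
interface TextPosition { line: number; character: number }
interface TextRange { start: TextPosition; end: TextPosition }
interface ContentChange { range?: TextRange; rangeLength?: number; text: string }

// Convert a line/character position into a string offset (assumes "\n" line endings).
function offsetAt(text: string, pos: TextPosition): number {
  let offset = 0;
  for (let line = 0; line < pos.line; line++) {
    const newline = text.indexOf("\n", offset);
    if (newline === -1) return text.length; // position is past the last line
    offset = newline + 1;
  }
  let lineEnd = text.indexOf("\n", offset);
  if (lineEnd === -1) lineEnd = text.length;
  return Math.min(offset + pos.character, lineEnd); // clamp to the end of the line
}

// Apply changes one after another: c1 moves S to S', c2 moves S' to S''.
function applyChanges(text: string, changes: ContentChange[]): string {
  for (const change of changes) {
    if (change.range === undefined) {
      text = change.text; // no range: the text is the full new content
      continue;
    }
    const start = offsetAt(text, change.range.start);
    const end = offsetAt(text, change.range.end);
    text = text.slice(0, start) + change.text + text.slice(end);
  }
  return text;
}

const before = "function foo() {}\nfoo();\n";
const after = applyChanges(before, [
  { range: { start: { line: 0, character: 9 }, end: { line: 0, character: 12 } }, text: "bar" },
  { range: { start: { line: 1, character: 0 }, end: { line: 1, character: 3 } }, text: "bar" },
]);
console.log(JSON.stringify(after)); // "function bar() {}\nbar();\n"
```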

WillSaveWaitUntilTextDocument Request [C + S]

method: textDocument/willSaveWaitUntil

DidSaveTextDocument Notification [C]

method: textDocument/didSave

DidCloseTextDocument Notification [C]

method: textDocument/didClose

PublishDiagnostics Notification [S]

method: textDocument/publishDiagnostics

Diagnostics are “owned” by the server so it is the server’s responsibility to clear them if necessary. The following rule is used for VS Code servers that generate diagnostics:

if a language is single file only (for example HTML) then diagnostics are cleared by the server when the file is closed.
if a language has a project system (for example C#) diagnostics are not cleared when a file closes.
when a project is opened all diagnostics for all files are recomputed (or read from a cache).
when a file changes it is the server’s responsibility to re-compute diagnostics and push them to the client.
if the computed set is empty it has to push the empty array to clear former diagnostics.
newly pushed diagnostics always replace previously pushed diagnostics.
there is no merging that happens on the client side.
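
The replace-not-merge rule is easy to see from the client side. Below is a minimal sketch of a client-side store, with a simplified Diagnostic shape and a hypothetical DiagnosticStore class: every publish for a URI overwrites the previous set, and an empty array clears it.

```typescript
// Simplified diagnostic: only the fields needed for this illustration.
interface Diagnostic {
  range: { start: { line: number; character: number }; end: { line: number; character: number } };
  message: string;
}

interface PublishDiagnosticsParams {
  uri: string;
  diagnostics: Diagnostic[];
}

// Hypothetical client-side store keyed by document URI.
class DiagnosticStore {
  private readonly byUri = new Map<string, Diagnostic[]>();

  // Newly pushed diagnostics always replace the previous ones for that URI.
  publish(params: PublishDiagnosticsParams): void {
    if (params.diagnostics.length === 0) this.byUri.delete(params.uri); // empty array clears
    else this.byUri.set(params.uri, params.diagnostics);
  }

  get(uri: string): Diagnostic[] {
    return this.byUri.get(uri) ?? [];
  }
}
```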

Completion Request [C + S]

Language servers usually run in a separate process and clients communicate with them in an asynchronous fashion. Additionally, clients usually allow users to interact with the source code even if request results are pending. We recommend the following implementation pattern to prevent clients from applying outdated response results:

if a client sends a request to the server and the client state changes in a way that the result will be invalid it should cancel the server request and ignore the result. If necessary it can resend the request to receive an up to date result.
if a server detects a state change that invalidates the result of a request in execution the server can error these requests with ContentModified. If clients receive a ContentModified error, it generally should not show it in the UI for the end-user. Clients can resend the request if appropriate
if servers end up in an inconsistent state they should log this to the client using the window/logMessage notification. If they can’t recover from this, the best they can do right now is to exit themselves. We are considering an extension to the protocol that allows servers to request a restart on the client side
if a client notices that a server exits unexpectedly it should try to restart the server. However, clients should be careful not to restart a crashing server endlessly. VS Code for example doesn’t restart a server if it crashes 5 times in the last 180 seconds.
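
The restart rule in the last item can be sketched as a sliding-window counter. This is only an illustration of the rule as stated (5 crashes within 180 seconds), not VS Code's actual implementation; the class name RestartPolicy and its method are assumptions.

```typescript
// Hypothetical policy: refuse to restart once maxCrashes crashes fall inside the time window.
class RestartPolicy {
  private crashTimes: number[] = [];
  private readonly maxCrashes: number;
  private readonly windowMs: number;

  constructor(maxCrashes = 5, windowMs = 180 * 1000) {
    this.maxCrashes = maxCrashes;
    this.windowMs = windowMs;
  }

  // Record a crash at the given timestamp and report whether a restart is still allowed.
  recordCrash(nowMs: number): boolean {
    this.crashTimes.push(nowMs);
    // Keep only the crashes that are still inside the sliding window.
    this.crashTimes = this.crashTimes.filter((t) => nowMs - t < this.windowMs);
    return this.crashTimes.length < this.maxCrashes;
  }
}
```

Passing the timestamp in explicitly, instead of reading the clock inside the class, keeps the policy easy to test.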