This post discusses my personal experience building an LLM application (BlogGPT) using the LangChain framework. The goal was to create an app that generates the description and tags for the Jekyll Front Matter block of a Markdown blog post, meeting my personal needs.
    Often, we pay less attention to the not-so-exciting parts of blog writing when using a static site generator (such as Jekyll, Hexo, or Hugo): the blog metadata. However, including metadata fields can make the blog content more SEO-friendly (e.g. with the help of the Jekyll SEO plugin), and LLMs are particularly good at summarization and reasoning.

    To create a functional app for this purpose, I needed to learn some dark magic of LLM application development. Initially, I had hoped to build a solid application by copying and pasting sample code from frameworks like LangChain and Semantic Kernel. However, after experimenting with several lines of demo code, I realized that I needed to rethink the entire engineering process. Here’s what I learned from the experience.

    The User Interface layer

    Firstly, it is important to clarify how I (and other potential users) would interact with the app. The most straightforward way is through a CLI tool, which can be integrated with existing Jekyll workflows.

    A VS Code extension or IntelliJ plugin would offer fancier features, such as inline completion of blog content and a chat UI similar to Notion AI, but these would require more effort and may deviate from the original purpose of developing an LLM app.

    Creating a web app was also tempting, especially after reviewing the projects listed in awesome-langchain, many of which are powered by Streamlit and have an elegant WebUI. However, to focus on the LLM aspects, I decided to build a CLI version first and consider adding a WebUI in future updates if this tiny project gains users’ attention.
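    For reference, here is a minimal sketch of what the CLI surface could look like. The flag names (-f for the post file, -q for an optional question) match the usage shown later in this post; the rest of the argument parsing is illustrative rather than BlogGPT’s actual implementation.

        import argparse

        # Hypothetical argument parser mirroring the flags used in the examples later in this post
        parser = argparse.ArgumentParser(
            description="Generate Front Matter description and tags for a Markdown blog post")
        parser.add_argument("-f", "--file", required=True, help="path to the Markdown post")
        parser.add_argument("-q", "--question", help="optional question to ask about the post")
        args = parser.parse_args()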

    The Application stack

    The engineering choices are straightforward:

    • The project is written in Python since it utilizes LangChain-Python. Dependency management is handled by Poetry.
    • The models used from OpenAI are text-davinci-003 for text generation and text-embedding-ada-002 for embeddings (see the sketch after this list).
    • To avoid additional dependencies, FAISS is utilized as an in-memory vector index.
    • Gitpod is used for a standardized cloud workspace, and it also eliminates the burden of CLI distribution. This also addresses users’ concerns about a CLI binary requiring their OpenAI key: the CLI can be run transparently from the source code in a development environment.
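    As a reference point, here is a minimal sketch of wiring up the models listed above with LangChain’s Python API; the model names come from this post, while the temperature value is just an illustrative choice.

        from langchain.llms import OpenAI
        from langchain.embeddings import OpenAIEmbeddings

        # text-davinci-003 for text generation, text-embedding-ada-002 for embeddings
        llm = OpenAI(model_name="text-davinci-003", temperature=0)
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")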

    Understanding the Pattern and Harnessing the Power of LangChain

    There are numerous articles that discuss common patterns for using LLMs together with orchestration frameworks like LangChain and LlamaIndex for knowledge generation and reasoning. However, my development experience with BlogGPT was not as straightforward.

    Data Loading & Transformation:

    • This part involves reading a markdown file from the local file system and splitting the document into smaller chunks that are “semantically” relevant.
    • LangChain offers various loader and splitter classes, but combining them can be a bit tricky. For example, it may seem reasonable to use the UnstructuredMarkdownLoader to strip the markdown formatting while preserving the markdown structure with the MarkdownHeaderTextSplitter. However, they are not intended to be used together.
    • After some debugging and experimenting, I found that a combination of UnstructuredMarkdownLoader and RecursiveCharacterTextSplitter (sketched below) provides relatively stable results with my testing datasets. Tuning the chunk_size / chunk_overlap parameters feels like a matter of chance, as different blogs behave in different ways. For example, a technical blog with code blocks needs to avoid splitting them, and sometimes it is necessary to drop the Table of Contents (TOC) and extra metadata, since they can subtly affect the LLM’s output.
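    To make that combination concrete, here is a sketch of the loading and splitting step; the chunk_size and chunk_overlap values are illustrative, not BlogGPT’s final settings.

        from langchain.document_loaders import UnstructuredMarkdownLoader
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        # Load the post and strip the markdown formatting, then split it into overlapping chunks
        docs = UnstructuredMarkdownLoader("../blog.md").load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        splits = splitter.split_documents(docs)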

    Data Summarization:

    • This part involves sending document chunks, along with summarization instruction prompts, to the LLM. The main technical challenge is the LLM’s context size limitation. LangChain simplifies the developer’s life by providing PromptTemplate and built-in summarization chains.
    • However, I quickly realized that the summarization output is not suitable for a blog’s description metadata, since it carries a strong “ChatGPT” flavor and fails to capture the author’s tone. Thus, some form of prompt engineering is required.
    • Note that both the map_prompt and the reduce_prompt matter if you need to fine-tune the LLM’s output. I tried turning some instructions uppercase and changing delimiters, and spent a significant amount of time debugging the intermediate and combined outputs; the sketch after this list shows how these prompts plug into the built-in map-reduce chain.

        # Imports for the prompt template and structured output parser used below
        from langchain.prompts import PromptTemplate
        from langchain.output_parsers import ResponseSchema, StructuredOutputParser

        map_prompt = PromptTemplate(
            template="write a concise summary of the following:\n\n\"{text}\"\n\n"
                     "CONCISE SUMMARY WITH THE AUTHOR'S TONE IN THE ORIGINAL LANGUAGE:",
            input_variables=["text"],
        )
        # Create an instance of the output parser and configure the response schema
        response_schemas = [
            ResponseSchema(name="summary",
                           description="PROVIDE A CONCISE SUMMARY IN THE ORIGINAL LANGUAGE "
                                       "WITH NO MORE THAN 3 SENTENCES AND USE THE AUTHOR'S TONE"),
            ResponseSchema(name="keywords",
                           description="NO MORE THAN 5 KEYWORDS RELATED TO THE TEXT"),
        ]
        output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
      
        # Define the PromptTemplate and set the output parser
        reduce_prompt = PromptTemplate(
            template="Write a concise summary of the following:\n\n\"{text}\"\n\n{instructions}",
            input_variables=["text"],
            partial_variables={"instructions": output_parser.get_format_instructions()},
            output_parser=output_parser,
        )
      
    • To make things more challenging, I hoped to output both “description” (summarization) and “tags” (keywords) in a single prompt completion, while also specifying the output in JSON format, so I could pipe the results to other native tools. I was aware of LangChain’s capabilities (e.g. template variables, output parsers) for these tasks. However, I encountered various issues with the LLM, such as not respecting the specified number of keywords, missing commas or quotation marks between JSON fields, and incomplete JSON output.
    • This has made me question the suitability of LLMs for a production-level application, and my optimism has slowly diminished after these hands-on experiments. That said, I’m not building a mission-critical tool with millions of requests, and I hope that both LLMs and the application frameworks will make further progress toward more stable LLM-based applications.
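    For completeness, here is a sketch of how the two prompts above can be wired into LangChain’s built-in map-reduce summarization chain; the splits variable is assumed to come from the loading step, and retry logic for malformed JSON is omitted.

        from langchain.llms import OpenAI
        from langchain.chains.summarize import load_summarize_chain

        chain = load_summarize_chain(
            OpenAI(temperature=0),
            chain_type="map_reduce",
            map_prompt=map_prompt,
            combine_prompt=reduce_prompt,
        )
        raw_output = chain.run(splits)
        # StructuredOutputParser raises an exception if the LLM returns malformed JSON,
        # which is exactly the kind of instability described above
        result = output_parser.parse(raw_output)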

    Question Answering:

    • This part follows the Retrieval-Augmented Generation pattern: providing contextual data (similarity search results from the vector store) along with the query. Again, the quality of the document chunks matters most, since the LLM is just doing in-context learning.
    • Since Q&A is not BlogGPT’s primary goal, and many existing GPT-powered applications already focus on this area, I stopped at a working version.

        def build_embedding(self, force_rebuild: bool = False) -> None:
            """
            Build embeddings using FAISS index
            :param force_rebuild: Rebuild the embeddings even if a previous index exists
            """
            embeddings = OpenAIEmbeddings()
            if force_rebuild or not os.path.exists(self.faiss_file):
                db = FAISS.from_documents(self.splits, embeddings)
                db.save_local(self.faiss_file)
                self.db = db
            else:
                self.db = FAISS.load_local(self.faiss_file, embeddings)

        def search(self, query: str, **kargs: Any) -> List[Document]:
            """
            Search by embedding similarity
            :param query: query term
            :param kargs: search args
            :return: matched Documents
            """
            return self.db.similarity_search(query, **kargs)

        def question_answer(self, query: str, **kargs: Any) -> str:
            """
            Question answering over the post index
            :param query: query term
            :return: answer
            """
            retriever = self.db.as_retriever(**kargs)
            qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
            return qa.run(query)
      
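    Putting the pieces together, a hypothetical caller could look like the following; the PostIndex class name and its constructor are assumptions for illustration, since only the methods above appear in this post.

        # PostIndex is a hypothetical name for the class that holds the methods above
        index = PostIndex("../blog.md")
        index.build_embedding()
        print(index.question_answer("What's the author's thoughts about LLM application development"))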

    Summarization

    Let’s ask BlogGPT to summarize the article:

    ➜ python blog_gpt.py -f ../blog.md | jq
    {
      "summary": "This post is about the author's experience building an LLM application (BlogGPT) using the LangChain framework. The author experimented with UnstructuredMarkdownLoader and RecursiveCharacterTextSplitter to find a relatively stable result with their testing datasets, and tried to output both \"description\" (summarization) and \"tags\" (keywords) in a single prompt completion. However, they encountered various issues with LLM, and have stopped at a worked version for Q&A since it is not the primary goal of BlogGPT.",
      "keywords": "LLM, LangChain, Poetry, OpenAI, FAISS, Gitpod, UnstructuredMarkdownLoader, RecursiveCharacterTextSplitter, JSON, Q&A"
    }
    

    And what are my thoughts about LLM application development?

    ➜ python blog_gpt.py -f ../blog.md -q "What's the author's thoughts about LLM application development"
    The author learned that LLM application development requires some dark magic and careful engineering. They needed to think through the user interface layer, such as a CLI tool, a VS code extension, or a web app, and choose the appropriate application stack, such as UnstructuredMarkdownLoader and RecursiveCharacterTextSplitter, OpenAI models, Poetry for dependency management, and Gitpod for a cloud workspace. Lastly, they needed to understand the common patterns for utilizing LLMs, as well as the data loading and transformation process.