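Andrej Karpathy published a video tutorial on LLM tokenization and then posted a challenge idea: a workflow that automatically converts the video into a book chapter or a blog post. In response, someone at Anthropic posted that they had tried it with Claude 3.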

Fun LLM challenge that I'm thinking about: take my 2h13m tokenizer video and translate the video into the format of a book chapter (or a blog post) on tokenization. Something like:

1. Whisper the video
2. Chop up into segments of aligned images and text
3. Prompt engineer an LLM…
— Andrej Karpathy (@karpathy) February 22, 2024

Claude 3 Opus is great at following multiple complex instructions.

To test it, @ErikSchluntz and I had it take on @karpathy's challenge to transform his 2h13m tokenizer video into a blog post, in ONE prompt, and it just... did it

Here are some details: pic.twitter.com/ABmMvIkoQ0
— Emmanuel Ameisen (@mlpowered) March 4, 2024

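Only the prompt used with Claude 3 was published, not the code, so someone else reproduced the result and released it together with the code. It looked like a fun LLM project and I wanted to run it myself, but signing up for Claude 3 required a Business tax ID, so I could not use it. This post is about swapping in GPT-4o instead.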

Medium article for the Claude 3 version: Using Claude 3 to Transform a Video Tutorial Into a Blog Post | by Yann-Aël Le Borgne | Mar, 2024 | Towards AI

GitHub: GitHub - Yannael/video2blogpost

Code with GPT-4o swapped in: GitHub - satojkovic/vid2blog

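Workflow

The basic workflow follows the Medium article for the Claude 3 version.

Download the video and transcript
Split into chapters
Generate a Markdown blog post per chapter with the LLM
Merge the chapters into one article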

1. Downloading the video and transcript

The video and transcript are fetched from YouTube using pytube. As a bonus, I also added a progress bar (progress_bar).

import pytube
from tqdm import tqdm

progress_bar = None  # module-level handle shared with the pytube callback


def progress_function(stream, chunk, bytes_remaining):
    # pytube calls this for every downloaded chunk
    progress_bar.update(len(chunk))


def download_video(video_id, output_path):
    global progress_bar
    youtube = pytube.YouTube(
        f"https://www.youtube.com/watch?v={video_id}",
        on_progress_callback=progress_function,
    )
    stream = youtube.streams.get_highest_resolution()
    # Size the bar to the stream so tqdm can show real progress
    progress_bar = tqdm(total=stream.filesize, unit="B", unit_scale=True, desc="Downloading")
    video_path = stream.download(output_path=output_path, filename=video_id + ".mp4")
    progress_bar.close()
    return video_path

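2. Splitting into chapters

To account for the context window limit and the cost of the LLM API, the original article splits the whole video into chapters and picks images from each chapter by uniform sampling. The GPT-4o version uses this approach as is. The chapters specified on the YouTube video are kept as a list, and for each chapter the transcript between its start and end time, together with up to 10 frames, is saved into a per-chapter folder.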
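As an illustration of how a chapter list like this could be turned into data, here is a minimal sketch. It assumes each line has the form `HH:MM:SS title` and produces dicts with a `timestamp` key in seconds, which is the key `chop_up_in_chapters` reads. The function name `parse_chapters` and the shortened sample string are hypothetical and not taken from the repository.

```python
def parse_chapters(chapters_text):
    """Turn 'HH:MM:SS title' lines into a list of {"timestamp", "title"} dicts."""
    chapters = []
    for line in chapters_text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once: the first token is the timestamp, the rest is the title
        timestamp, _, title = line.partition(" ")
        hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
        chapters.append(
            {"timestamp": hours * 3600 + minutes * 60 + seconds, "title": title.strip()}
        )
    return chapters


# Shortened sample for illustration; the real list covers the whole video
SAMPLE_CHAPTERS = """
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
"""

chapters_list = parse_chapters(SAMPLE_CHAPTERS)
print(chapters_list[1])
```

Since `chop_up_in_chapters` takes the next entry's timestamp as the end of the current chapter, a list of N entries yields N - 1 chapters, so the last entry effectively acts as an end marker.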

# Chapter info
CHAPTERS_24 = """
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
...

def chop_up_in_chapters(
    chapters_list,
    video_path,
    transcript,
    chapters_dir,
    timestamps_screenshots_list_seconds=None,
):
    n_chapters = len(chapters_list) - 1
    print(f"Number of chunks: {n_chapters}")

    for current_chapter in range(n_chapters):
        output_dir = os.path.join(chapters_dir, str(current_chapter))
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        current_chunk_start_time = chapters_list[current_chapter]["timestamp"]
        current_chunk_end_time = chapters_list[current_chapter + 1]["timestamp"] - 1
        print(
            f"Chapter {current_chapter}; Start: {current_chunk_start_time}, End: {current_chunk_end_time}"
        )

        get_text_chapter(
            transcript, current_chunk_start_time, current_chunk_end_time, output_dir
        )

        if timestamps_screenshots_list_seconds is not None:
            get_frames_chapter(
                video_path,
                current_chunk_start_time,
                current_chunk_end_time,
                output_dir,
                timestamps_screenshots_list_seconds[current_chapter],
            )
        else:
            get_frames_chapter(
                video_path, current_chunk_start_time, current_chunk_end_time, output_dir
            )
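The helpers `get_text_chapter` and `get_frames_chapter` are not shown in the post. The sketch below only illustrates the two ideas they rely on: uniform sampling of at most 10 timestamps inside a chapter, and selecting the transcript entries that fall within the chapter. The function names and the transcript layout (a list of dicts with `start` in seconds and `text`) are assumptions for illustration, not the repository's actual implementation.

```python
def uniform_timestamps(start_s, end_s, max_frames=10):
    """Return up to max_frames evenly spaced whole-second timestamps in [start_s, end_s]."""
    duration = end_s - start_s
    if duration <= 0:
        return [start_s]
    # Never ask for more frames than there are whole seconds in the chapter
    n_frames = min(max_frames, duration + 1)
    if n_frames == 1:
        return [start_s]
    return [start_s + (i * duration) // (n_frames - 1) for i in range(n_frames)]


def transcript_between(transcript, start_s, end_s):
    """Join the text of entries whose start time falls inside [start_s, end_s].

    Assumes each entry looks like {"start": <seconds>, "text": <str>}.
    """
    return " ".join(
        entry["text"] for entry in transcript if start_s <= entry["start"] <= end_s
    )


# Hypothetical transcript entries for illustration
sample_transcript = [
    {"start": 0.0, "text": "Hi everyone."},
    {"start": 12.5, "text": "Today we cover tokenization."},
    {"start": 351.0, "text": "Let's look at the web app."},
]

print(uniform_timestamps(0, 349))
print(transcript_between(sample_transcript, 0, 349))
```

The sampled timestamps could then be passed to a video reader to grab one frame each; that part is left out here because it depends on the video library in use.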

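3. Generating a Markdown blog post per chapter with the LLM

The prompt (prompt_instructions) is almost the same as the one written for Claude 3. However, GPT-4o tends to return the entire markdown response wrapped in a code block (opening with ```markdown), so I added the following line to the prompt as a small fix.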

- output valid markdown without wrapping the entire content in a code block
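The post handles the wrapping problem in the prompt. As a complementary safeguard, the response could also be cleaned after the fact. The sketch below is an assumption of mine, not part of the original code: `strip_markdown_wrapper` is a hypothetical helper that removes a single outer fence tagged `markdown` when it encloses the whole response, and leaves everything else untouched.

```python
FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def strip_markdown_wrapper(text):
    """Remove one outer fence tagged 'markdown' if it wraps the whole text."""
    stripped = text.strip()
    lines = stripped.splitlines()
    wrapped = (
        len(lines) >= 2
        and lines[0].strip() == FENCE + "markdown"
        and lines[-1].strip() == FENCE
    )
    if wrapped:
        # Drop the opening and closing fence lines, keep the body
        return "\n".join(lines[1:-1]).strip()
    return stripped


wrapped_response = FENCE + "markdown\n# Title\n\nBody text\n" + FENCE
print(strip_markdown_wrapper(wrapped_response))
```

This is a simple heuristic: a response that legitimately starts with a `markdown`-tagged block and ends with another fenced block would be stripped incorrectly, so it is best treated as a fallback rather than a replacement for the prompt instruction.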

In addition, the 10 images (screenshots_as_messages) and the transcript are combined as the content and placed in messages.

# Generate the prompt for the current chapter
prompt_generate_markdown = get_prompt_as_messages(chapter, CHAPTERS_DIR)

# Create a message by calling GPT-4o with the prompt
message = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    max_tokens=4000,
    messages=prompt_generate_markdown,
)
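The post does not show what happens to `message` afterwards. With the OpenAI Python SDK's chat completions response, the generated text is available as `message.choices[0].message.content`. A minimal sketch of saving it per chapter might look like the following, where `save_chapter_markdown`, the file name `markdown.md`, and the stand-in response object are all assumptions for illustration.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_chapter_markdown(message, output_path):
    """Write the text of the first choice to output_path and return it."""
    markdown = message.choices[0].message.content or ""
    Path(output_path).write_text(markdown, encoding="utf-8")
    return markdown


# Stand-in with the same attribute layout as a chat completion response,
# so the sketch runs without an API key
fake_message = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="# Chapter 0\n\nText"))]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "markdown.md"
    saved = save_chapter_markdown(fake_message, target)
    written = target.read_text(encoding="utf-8")

print(saved)
```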

def get_prompt_as_messages(chapter_id, chapters_dir):
    folder_path = os.path.join(chapters_dir, str(chapter_id))
    with open(os.path.join(folder_path, "transcript.txt"), "r") as f:
        transcript = f.read()

    screenshots = sorted(glob.glob(os.path.join(folder_path, "*.jpg")))
    screenshots_as_messages = get_screenshots_as_messages(screenshots)
    prompt_as_messages = [
        {
            "role": "system",
            "content": prompt_instructions,
        },
        {
            "role": "user",
            "content": screenshots_as_messages
            + [{"type": "text", "text": f"<transcript>\n{transcript}\n</transcript>"}],
        },
    ]
    ]

    return prompt_as_messages

def get_screenshots_as_messages(screenshots):
    """
    The function iterates over all screenshots in order to describe each of them with two messages:
    - a text message that specifies the timestamp for the screenshot
    - an image message containing its base64-encoded representation
    """
    screenshots_as_messages = []
    for screenshot in screenshots:
        # Read and base64-encode the image, closing the file afterwards
        with open(screenshot, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")
        screenshots_as_messages.extend(
            [
                {
                    "type": "text",
                    "text": f"The timestamp for the following image is {Path(screenshot).stem}",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_b64}",
                    },
                },
            ]
        )

    return screenshots_as_messages

4. Merging the chapters into one article

Finally, the per-chapter markdown files are merged into a single blog post.
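The merge step is not shown in the post, so here is a minimal sketch of one way it could work. It assumes each chapter folder is named by its index, as in `chop_up_in_chapters`, and contains a file called `markdown.md`. That file name and the function `merge_chapters` are hypothetical.

```python
import os
import tempfile


def merge_chapters(chapters_dir, n_chapters, output_path, filename="markdown.md"):
    """Concatenate <chapters_dir>/<i>/<filename> for i in 0..n_chapters-1 into one file."""
    parts = []
    for chapter in range(n_chapters):
        chapter_file = os.path.join(chapters_dir, str(chapter), filename)
        with open(chapter_file, "r", encoding="utf-8") as f:
            parts.append(f.read().strip())
    # Separate chapters with a blank line so headings render correctly
    merged = "\n\n".join(parts) + "\n"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(merged)
    return merged


# Small self-contained demo using a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for i, text in enumerate(["# Intro\n", "# Tokenization by example\n"]):
        os.makedirs(os.path.join(tmp_dir, str(i)))
        with open(os.path.join(tmp_dir, str(i), "markdown.md"), "w", encoding="utf-8") as f:
            f.write(text)
    merged = merge_chapters(tmp_dir, 2, os.path.join(tmp_dir, "blogpost.md"))

print(merged)
```

Iterating over the numeric index rather than sorting folder names keeps chapter 10 after chapter 9, which a plain string sort would not.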

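Summary

The generated blog post reflects the whole video, so GPT-4o was able to deliver an application equivalent to the Claude 3 version. That said, the output markdown still has spots where the generation quality falls short: for example, Chapter 1 and Chapter 2 contain the same content (Common Issue appears twice), and some titles are near duplicates. There seems to be room for improvement here, but an app that generates a text blog post from a video is an interesting example of a multimodal LLM application.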
