Vibeo Concept Notes

Vibe Video: thoughts on video production in an agentic world

Updated
14 min read

Just run the code, or yourself.

The term "vibe" is showing up everywhere. It started with Vibe Coding, and now you can vibe almost everything.

My recent projects have focused a lot on agentic video production, and I find the term "vibe video", or vibeo, a fun way to describe this experience.

Agentic environment vs. agent workflow

One may readily think of products that take a simple prompt and generate a 1-minute video. This usually entails a well-engineered workflow, say

  • 1) generate storyline;
  • 2) generate storyboard (image gen);
  • 3) use I2V to generate video clips;
  • 4) generate BGM;
  • 5) generate narration;
  • 6) assemble them all together.

Although one may use agent frameworks, e.g. ADK, langgraph, crewAI, to implement this workflow, it is not "agentic" under the hood. A workflow is only customizable up to a certain level, limited by its parameter space. In an agentic environment, we expect the assistant to be more autonomous and to use a multi-step approach with reflection to get undefined or unforeseen things done.
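To make the "limited by its parameter space" point concrete, here is a minimal sketch of such a fixed workflow. Everything in it is hypothetical (the `WorkflowParams` fields, the step names); each string merely stands in for a real model call. Note that the caller can tune the parameters, but can never change the order or the nature of the steps.

```python
from dataclasses import dataclass


@dataclass
class WorkflowParams:
    """The whole customization surface of the workflow (hypothetical fields)."""
    prompt: str
    num_scenes: int = 5
    bgm_style: str = "ambient"
    narration_voice: str = "default"


def run_workflow(params: WorkflowParams) -> list[str]:
    """Return the hard-coded step sequence; each entry stands in for a model call."""
    steps = [f"storyline({params.prompt!r})"]
    # Step 2: one storyboard image per scene.
    for i in range(1, params.num_scenes + 1):
        steps.append(f"storyboard_image(scene={i})")
    # Step 3: one image-to-video clip per scene.
    for i in range(1, params.num_scenes + 1):
        steps.append(f"i2v_clip(scene={i})")
    steps.append(f"bgm(style={params.bgm_style!r})")
    steps.append(f"narration(voice={params.narration_voice!r})")
    steps.append("assemble()")
    return steps
```

A request like "make it a single long take" has no parameter to land on here, which is exactly where an agentic environment is expected to do better.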

How to be agentic?

From the human perspective:

  • Declarative, not imperative
  • Issues, not solutions
  • Context, not tool specs
  • Behaviour patterns, not instructions

From the assistant perspective:

  • A super agent as the core (running a ReAct-like flow)
  • A mechanism of tool discovery (not tool spec)
  • A mechanism of knowledge discovery (not a priori)
  • Accurate vision feedback
  • Flexibility to customize behavioural patterns
  • Capability to learn in the context

From the environment perspective:

  • Safeguarding, or even better sandboxing
  • Support interruption
  • Session management: versioning, truncation, save/ load, forking -- go back in history when session is polluted
  • User notification: when intervention / approval is required

Example

Let's give one video generation example. Below are the "pseudo chats".

Example: A long take video generation

User: let's make a 40s long take of this cat exploring a dreamy world. [Declarative].

Assistant: discovered tools ffmpeg and veo_cli in the environment. [Tool Discovery].

Assistant: run veo_cli --help to get the help doc. Realized that veo generates 8-second clips. [Tool Discovery]

Assistant: So the plan is to generate 5x clips and use ffmpeg to assemble them. Ask for approval. [User Notification].

User: How do you ensure it is a long take? [Issues]

Assistant: A long take means that there is only one scene and one shot. With sophisticated camera movement tracking our subject, we can experience a scene without interruption. So here is the revised plan. I will generate 5x clips. Clip i uses the last frame of clip i-1 as its input image. This ensures a seamless switch from clip i-1 to clip i. I need to make sure the tool supports this image + text input. [Knowledge discovery]

Assistant: run veo_cli submit --help. It sees the arguments support giving an image and a text as input. [Tool discovery]

Assistant: Given the above observation, here is the final plan, ..... (omitted).... [ReAct] , [User Notification].

User: proceed.

Assistant: veo_cli submit {folder 1} , wait for the operation to finish and download the video. (30 seconds passed)

Assistant: veo_cli submit {folder 2} , wait for the operation to finish and download the video. (40 seconds passed)

User: [Interrupt]

User: it takes a few seconds to minutes for veo to generate a clip. We can submit all the tasks in parallel and wait. [Context]

Assistant: That is a great call out! Let me blablabla. [ReAct]

Assistant: If I submit the operations in parallel, how can I know the first frame before I have the last clip? Ah, I can generate the storyboard first, and use the first+last frame generation, so that the clips can connect seamlessly and we can accelerate our video production process.

Assistant: veo_cli submit --help again to validate the parameters. [Tool discovery].

Assistant: veo_cli submit {folder 3} , ... , veo_cli submit {folder 4}, ... , veo_cli submit {folder 5} ... [ReAct]

Assistant: ffmpeg ...

Assistant: Here is the final video! final-video.mp4

User: I want you to think like a senior film director. Identify possible next steps, and let me review. [Behavioural pattern]

Assistant: Sure, as a senior film director, blabla, ... First, I need to check if the final output is a long take. Second, I'll try to identify if there are any glitches, especially at the boundaries between clips.

Assistant: reading final-video.mp4 . [Vision feedback]

Assistant: I have identified that the last clip is not a long take. Instead, it is a two-shot video with a camera reposition. I will regenerate it, and then assemble the final video. [ReAct]

Assistant: rm final-video.mp4.

User: [Interrupt]

User: Do not remove files in our working folder. Always suffix the files properly so we can reproduce the whole set of experiments. [Behaviour pattern].

Assistant: Noted.

Assistant: generate clip5.mp4...

Assistant: assemble with ffmpeg, output as final-video-v2.mp4.

User: reflect on what we have done above and take note in note.md.
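The two plans the assistant arrives at in the chat above can be sketched as data. This is only an illustration: the file names (`cat.png`, `keyframe0.png`, ...) and the dictionary layout are assumptions of mine, not veo_cli's actual interface. The chained plan forces sequential generation, while the keyframe plan has no dependency between clips and can be submitted in parallel.

```python
import math


def clips_needed(total_seconds: int, clip_seconds: int) -> int:
    """Number of fixed-length clips required to cover the target duration."""
    return math.ceil(total_seconds / clip_seconds)


def chained_plan(n: int) -> list[dict]:
    """Sequential plan: clip i starts from the last frame of clip i-1."""
    plan = []
    for i in range(1, n + 1):
        first = "cat.png" if i == 1 else f"clip{i - 1}_last_frame.png"
        depends_on = [] if i == 1 else [f"clip{i - 1}.mp4"]
        plan.append({"clip": f"clip{i}.mp4", "first_frame": first,
                     "depends_on": depends_on})
    return plan


def keyframe_plan(n: int) -> list[dict]:
    """Parallel plan: storyboard keyframes 0..n are generated up front,
    and clip i runs from keyframe i-1 to keyframe i."""
    return [
        {"clip": f"clip{i}.mp4",
         "first_frame": f"keyframe{i - 1}.png",
         "last_frame": f"keyframe{i}.png",
         "depends_on": []}
        for i in range(1, n + 1)
    ]
```

For the 40s target with 8-second clips, `clips_needed(40, 8)` gives the 5 clips from the chat. In the keyframe plan, adjacent clips share a frame (the last frame of clip i is the first frame of clip i+1), which is what keeps the cut seamless.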

The super agent

I am currently exploring gemini-cli as this super agent. Many similar alternatives have appeared since the initial launch of gemini-cli.

Advantages of this choice:

  • gemini-cli is by nature a super agent, disguised by the "CLI" part in its name.
  • The CLI, aka the shell environment, has decades of history and offers a rich collection of tools dating back to the UNIX era.
  • stdin/stdout are a natural communication protocol. This is more flexible and lightweight than other protocols like OpenAPI, MCP and A2A, because today's LLMs have very good language capability.
    • The downside is that this interface is sometimes error prone, and easily polluted by outputs on non-happy paths.
  • Tool discovery and tool integration can be done "on the fly". My practice is to maintain one session for the main work and another session for tool development. Once I finish developing or fixing my tools, I only need to go back to the main session and run a --help to update the context.
  • gemini-cli already integrates Google services like search, and is likely to integrate more going forward.
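The "run a --help to update the context" practice can be sketched as a small helper. This is my own illustration, not how gemini-cli implements discovery; `discover_tool` is a hypothetical name, and it assumes the tool follows the common `--help` convention.

```python
import shutil
import subprocess
from typing import Optional


def discover_tool(name: str, subcommand: Optional[str] = None,
                  timeout_s: float = 10.0) -> Optional[str]:
    """Return the help text of a CLI tool, or None if it is not on PATH.

    The returned text is what would be appended to the agent's context.
    """
    path = shutil.which(name)
    if path is None:
        return None
    cmd = [path] + ([subcommand] if subcommand else []) + ["--help"]
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout_s)
    # Some tools print help to stderr, so keep both streams.
    return (result.stdout + result.stderr).strip()
```

Calling it as `discover_tool("veo_cli", "submit")` would mirror the `veo_cli submit --help` step from the example chat. The timeout keeps a misbehaving tool from stalling the session.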

Challenges

I have intensively tested this in the past weeks and see a few key challenges. These observations were made on Sept 22, 2025.

Session

Ideally, we want a Git-like system to help organize our session (chat log). We can go back in history at any time, and fork from a good candidate.

Even better would be a system like Docker's layers, so that the disk state is maintained alongside the session. That gives us the flexibility to explore freely and find good alternatives.
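A Git-like session can be sketched as a tree of turns where every turn points to its parent. This is a minimal sketch of the idea only; `SessionTree` and its methods are hypothetical names, and it does not cover the disk-state layering.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    id: int
    parent: Optional[int]  # None for the first turn of the session
    role: str
    text: str


class SessionTree:
    """Chat log stored as a tree, so any earlier turn can become a fork point."""

    def __init__(self) -> None:
        self.turns: dict[int, Turn] = {}
        self._ids = itertools.count(1)

    def append(self, parent: Optional[int], role: str, text: str) -> int:
        """Add a turn under `parent`; appending to an old turn creates a fork."""
        turn_id = next(self._ids)
        self.turns[turn_id] = Turn(turn_id, parent, role, text)
        return turn_id

    def history(self, head: int) -> list[str]:
        """Walk from `head` back to the root and return the texts in order."""
        texts = []
        current: Optional[int] = head
        while current is not None:
            turn = self.turns[current]
            texts.append(turn.text)
            current = turn.parent
        return texts[::-1]
```

When a session gets polluted, we pick the last good turn and append a new turn under it. The polluted branch stays in the tree for reference, but it is no longer part of the history that gets sent to the model.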

Behaviour pattern

Behaviour patterns can be set in GEMINI.md, or as part of the session. In my experience, the instructions at the beginning of each session are the strongest. On the other hand, instructions given at the head of a session (the most recent turns) are quite intermittent. When working with gemini-cli, a behaviour pattern set at the head is often only reflected in the immediate task. Once the immediate task is finished, it falls back to the overall pattern, derived largely from the early conversation, and somewhat from GEMINI.md.

For example, at one time, I asked gemini-cli to be my DevOps engineer from the beginning. Then I realized that there were some product gaps we needed to address. I used the conversation to set a new persona, and asked it to iterate on the product step by step. It turned out that gemini-cli finished one task as product designer and immediately switched back to the original "DevOps engineer" role, proactively building a Dockerfile and supporting scripts. I tried to reiterate the new role of product designer and added a new feature. After the new feature was fixed, gemini-cli switched to the DevOps engineer again. No matter how hard I tried, it just kept working on unrequested DevOps tasks, instead of helping me iterate on the product.

This phenomenon feels very much like how humans learn. As parents, you often need to set the right behavioural patterns from a very young age. Otherwise, the behaviours that take root in the kid become harder and harder to fix later.

In Session Tool Build

Recall my practice described above:

My practice is to maintain a session of main work and another session of tool development. Once I finish developing / fixing my tools, I only need to go back to the main session and run a --help to update the context.

I tried in-session tool builds a few times. That is, upon hitting a certain feature gap, I asked gemini-cli to write the tool itself.

This soon became a daunting process, where gemini-cli went through many iterations troubleshooting the tool, and the garbled output polluted the session. As a consequence, it became stuck in a loop, repeating similar errors again and again. It took about 20 minutes for what could usually be done in one step with a thinking model.

The trouble did not end there. After the tool was successfully built, I tried to steer gemini-cli back to the main task, and realized that the behaviour patterns we set earlier on were weakened, and the most relevant context was "lost in the middle". I had to manually write some context in order to steer the conversation back on track.

The in-session build is a current challenge. I work around it with two sessions: one for the main task and one for tool building, where the tool building session is intended to be short-lived.

In an ideal world, the super agent needs to work like a human: identify the gaps, build tools just in time, and use the tools to get things done.

Python indentation

One very frequent error I observed in coding tasks is Python indentation...

The line replace tool sometimes does not work as intended, and results in indentation errors. Once the first indentation error appears, it becomes more and more difficult for the agent to self-fix.
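One possible guard, sketched below, is to validate every line replacement before it is written to disk, so the first indentation error never lands in the file. This is not gemini-cli's replace tool; `replace_line` is a hypothetical helper, and it assumes the new line should keep the indentation of the line it replaces.

```python
import ast


def replace_line(source: str, line_no: int, new_code: str) -> str:
    """Replace one line (1-indexed), keeping the old line's indentation.

    Raises SyntaxError (IndentationError is a subclass) if the result no
    longer parses, so the caller can reject the edit instead of saving it.
    """
    lines = source.splitlines()
    old = lines[line_no - 1]
    indent = old[: len(old) - len(old.lstrip())]
    lines[line_no - 1] = indent + new_code.strip()
    candidate = "\n".join(lines) + "\n"
    ast.parse(candidate)  # parse only; nothing is executed
    return candidate
```

Rejecting a bad edit up front gives the model one clear error message tied to one edit, rather than a broken file that later edits have to work around.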

File lineage

The concept is borrowed from "data lineage".

When we work with gemini-cli, the file system is the main database. I find gemini-cli is good at maintaining the file lineage up to a certain level, like dozens of folders and hundreds of files.

The challenge arises in two scenarios:

  • When the conversation goes on too long, and we need to apply /compress multiple times.
  • When we exit the conversation and start a new session next time.

To work around the second issue, I often ask gemini-cli to take detailed notes in note.md, but the information in note.md is often not enough for the next gemini-cli session to pick up the full context.

I tried forward engineering, like "record each input/ output/ prompt/ error/ observations", and this became too much "micro management", resulting in very rigid notes that follow the exact schema but miss some key information.

I also tried the open-ended way of asking Gemini to just take notes. I find it often takes notes in a "weekly report" style. That is, the majority of the information is about happy cases, and it lacks the details to reproduce the issues, or even the details for the next gemini-cli session to reproduce the happy paths.

This challenge also reflects our human behaviour: you take good notes during research, but can hardly comprehend them three months later.

Why is file lineage important? Take the example of generating a slideshow video with background music:

  • Generate a background image (of a page). theme.txt -> bg.png.
  • Use the text box tool to write some text on the image. bg.png -> bg_with_text.png
  • Use ffmpeg to zoom and jitter the image and output a video clip. bg_with_text.png -> clip-1.mp4
  • Assemble the video clip-1.mp4 and the audio music-1.mp3. -> clip-1-with-bgm.mp4.
  • Repeat the above for other pages.
  • Concatenate all the clips. -> final-output.mp4

Suppose you finish one generation, watch the final video, and realize that the theme of clip-2 is wrong. You fix the theme text file for clip-2 and let gemini know. It should regenerate clip-2's bg.png, bg_with_text.png and video clip, and then assemble the final output. It does not have to touch the other clips' artifacts.
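The expectation above is a downstream traversal over a dependency graph. Here is a minimal sketch, assuming the lineage is recorded as "input -> outputs" edges. The per-page file names (`theme-2.txt`, `bg-2.png`, ...) are my own suffixing convention for illustration, not something the environment prescribes.

```python
from collections import deque


def build_lineage(pages: int) -> dict[str, list[str]]:
    """Edges 'input -> outputs' for the slideshow example, one chain per page."""
    edges: dict[str, list[str]] = {}
    for p in range(1, pages + 1):
        edges[f"theme-{p}.txt"] = [f"bg-{p}.png"]
        edges[f"bg-{p}.png"] = [f"bg_with_text-{p}.png"]
        edges[f"bg_with_text-{p}.png"] = [f"clip-{p}.mp4"]
        edges[f"clip-{p}.mp4"] = [f"clip-{p}-with-bgm.mp4"]
        edges[f"music-{p}.mp3"] = [f"clip-{p}-with-bgm.mp4"]
        edges[f"clip-{p}-with-bgm.mp4"] = ["final-output.mp4"]
    return edges


def downstream(changed: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first list of every artifact that must be regenerated."""
    stale: list[str] = []
    visited = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in visited:
                visited.add(child)
                stale.append(child)
                queue.append(child)
    return stale
```

With three pages, `downstream("theme-2.txt", build_lineage(3))` lists only page 2's chain plus `final-output.mp4`. Changing `music-2.mp3` instead would skip the image and clip steps and only redo the BGM assembly and the final concatenation.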

This is a common and natural expectation when working with Gemini CLI. In practice, it can derive the data lineage well for a small set of files and short conversations. For longer conversations and larger sets of files, it is a challenge.

I suspect we can use GEMINI.md to build some behavioural patterns, e.g. writing a structured files.md, or even maintaining a database of file lineage. We are not there yet, but will certainly circle back at some point, as it is a key part of the vibeo environment.

When there is an open question, gemini-cli has various choices. It can build a tool (with its excellent coding capability), or use something existing in the environment, or search the Internet for discussions and best practices.

I find each gemini-cli session is inclined to a certain resolution path. This is also reminiscent of us humans: some kids like to ask around when there is a challenge; some kids like to craft their own tools; some kids like to learn before they act...

I do not have an ideal solution for it right now. I often need to interrupt a task and point gemini-cli in the right direction.

Network bottleneck

In the LLM and agentic world, the network traffic becomes highly asymmetric. A common traffic pattern is that the input tokens are 100x larger than the output tokens. The reason is that most super agent implementations pass the entire chat log to the LLM to decide the next action. The output may be a very concise response text or function call, but the input is a very long text, with the past chat history, including function calls and responses.

This is not a big problem when you work in normal settings like an office network or a home network.

It can become a huge problem on the go. I recently tried vibeo during a business trip, where I often used cellphone tethering to get network access. The roaming network turned out to be highly asymmetric in the opposite direction: the download speed was about 5 Mbps, but the upload speed was a few hundred kbps. Because the upload speed is limited and the input token size is large, the round trip of each LLM decision becomes very long, and soon goes beyond an effective working threshold.
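A back-of-envelope estimate shows the scale of the problem. The numbers below are assumptions for illustration only: roughly 4 bytes of request payload per token, a 200k-token chat history, and a 300 kbps uplink. The sketch also ignores compression, protocol overhead and any server-side caching.

```python
def transfer_seconds(tokens: int, bits_per_second: float,
                     bytes_per_token: float = 4.0) -> float:
    """Rough time to move `tokens` over a link (assumed bytes-per-token ratio)."""
    return tokens * bytes_per_token * 8 / bits_per_second


# Assumed scenario: long session on a tethered roaming connection.
upload_s = transfer_seconds(tokens=200_000, bits_per_second=300_000)
# Assumed scenario: output is 100x smaller than input, on a 5 Mbps downlink.
download_s = transfer_seconds(tokens=2_000, bits_per_second=5_000_000)

print(f"upload: {upload_s:.1f}s, download: {download_s:.3f}s")
```

Under these assumptions, every single decision step pays roughly 20 seconds just to upload the history, while the download side is negligible. A multi-step ReAct loop multiplies that cost by the number of steps.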

Self interruption during long tool call

Differentiating between timeout, error and work-in-progress is a challenge for LLMs. They do not have a built-in sense of real-world elapsed time. Under the hood, the super agent is designed like a state machine. It takes the state (e.g. chat history, context) and an event (new message), and decides the next action.

The event may be the model's output, or the response of a function call. In the latter case, there is no good timeout mechanism. Ideally, a super agent would have a sense, like we humans do, of which types of tasks take which order of magnitude of time. When a tool call takes too long, it should proactively notify the user, who decides whether to wait or interrupt. Preferably, the agent can make some suggestions based on knowledge, experience or observed anomalies.

There are two solutions:

  1. We can improve the super agent by wrapping every tool call. At periodic check-ins, we pass the interim output (stdout/stderr) and the elapsed time to the model and ask it to make a decision.
  2. On the other hand, we can also establish a best practice for tool builders by introducing a default timeout for any tool call. That is, make sure a tool called by the LLM returns within a reasonable amount of time, e.g. 10 seconds.
    1. For truly long-running tasks, we should decouple them into three parts:
      1. Part 1: task submit operation
      2. Part 2: task status and task get-result operations
      3. Part 3: task daemon
    2. Part 1 and part 2 are supposed to be invoked by the agent, and it is up to the agent (subject to user-steered behaviour patterns) to decide when and how frequently to call those tools.
    3. Part 3 can be a long-running service in the agentic environment, or even better, one started on demand when the agent submits a task.

In the vibeo environment, we adopted the second approach for long-running tasks. One example is veo_cli's built-in submit/status/download sub-commands, where the "daemon" is purely conceptual: it is the veo API itself in this case.
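The first solution (a wrapper with periodic check-ins) can be sketched as follows. This is a sketch under my own assumptions, not gemini-cli's implementation: `run_with_checkins` and the `decide` callback are hypothetical names, and `decide` stands in for the model call that receives the elapsed time and the interim output.

```python
import subprocess
import tempfile
import time
from pathlib import Path
from typing import Callable


def run_with_checkins(cmd: list[str], interval_s: float,
                      decide: Callable[[float, str], str]) -> tuple[str, str]:
    """Run `cmd`; every `interval_s` seconds ask `decide` to wait or interrupt.

    `decide(elapsed_seconds, interim_output)` returns "wait" or "interrupt".
    Returns (status, full_output) with status "finished" or "interrupted".
    """
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "tool.log"
        with open(log_path, "wb") as log:
            # stdout and stderr go to one log file that we can read mid-run.
            proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
            start = time.monotonic()
            status = "finished"
            while True:
                try:
                    proc.wait(timeout=interval_s)
                    break
                except subprocess.TimeoutExpired:
                    elapsed = time.monotonic() - start
                    interim = log_path.read_text(errors="replace")
                    if decide(elapsed, interim) == "interrupt":
                        proc.kill()
                        proc.wait()
                        status = "interrupted"
                        break
        output = log_path.read_text(errors="replace")
    return status, output
```

The log is read through a separate file handle so that a check-in never disturbs the position the child process is writing at. In a real agent, `decide` could also surface a user notification instead of deciding on its own.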

Model fatigue

Let's use remaining context X% to represent how much of the maximum context window (say 1 million tokens) is still available. Every session starts at 100%, and X% drops gradually as the conversation progresses.

I have consistently observed, multiple times, a "model fatigue" phenomenon (a term borrowed from ultra running) when X% is lower than 50%.
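As a small worked example of the X% definition, the helper below computes the remaining context and maps it to what I have observed so far. The thresholds are just my anecdotal observations from this post, not a property documented for any model, and the function names are hypothetical.

```python
def remaining_context_pct(used_tokens: int, max_tokens: int = 1_000_000) -> float:
    """Remaining context X%, given tokens already consumed by the session."""
    return 100.0 * (max_tokens - used_tokens) / max_tokens


def fatigue_observation(pct: float) -> str:
    """Map X% to the anecdotal observations in this post (not a guarantee)."""
    if pct > 90:
        return "seldom observed"
    if pct < 50:
        return "observed"
    return "not reported"
```

A session that has consumed 600k of a 1M-token window is at 40% remaining, which is where forking into a fresh session with a good note starts to look attractive.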

For example:

  • User: Create a report 1
  • Model: Here is my plan, I will create folder report/report_1/, and then .... (other parts of the plan).
  • User: What is the full path of the report?
  • Model: It is under home/report/report_1/.
  • User: I want it to be with other reports we just created.
  • Model: Revised the report path to be home/project/report/report_1/.
  • User: go ahead.
  • ...
  • (some rounds of working)
  • ...
  • User: Create a report 2
  • Model: Here is my plan, I will create folder report/report_2/, and then .... (other parts of the plan).
  • User: What is the full path of the report?
  • Model: It is under home/report/report_2/.
  • ...

Note that after only a few rounds, the model forgets that the report needs to be put in the same folder as the other reports.

If this conversation is conducted when X% > 90%, we seldom observe this. The model can take note of the convention and revise its plan. Furthermore, the model may even spot the relative path issue when the user initially prompts "What is the full path of the report": this question clearly indicates that something is wrong, and a thinking model can reflect on it and revise the action. When X% < 40%, the model merely acts on the immediate ask without thinking enough about the "why" behind those user asks.

This is reminiscent of human "fatigue": a state where glucose is exhausted and the brain dials down the motor system to save energy for the most vital parts of the body. The runner at that time can only act on one thing. In the case of ultra running, they shuffle along one step at a time, without thinking too many steps ahead.