AIGC Weekly #1 (08/26/2024 - 09/02/2024)



"Artificial Intelligence is the new electricity." 
--- Andrew Ng



✦ This Week's Highlights

1X releases consumer-grade humanoid robot NEO Beta

OpenAI-backed robotics startup 1X has unveiled its latest consumer-grade home robot, the NEO Beta. The demonstrations were impressive, and showing the robot wearing clothing to conceal its mechanical components generated significant buzz. 1X anticipates delivering home robots to paying customers as early as 2025. Their robots feature a proprietary "tendon-driven" technology, emphasizing flexibility, safety, a low gear ratio, high-power motors, and a drive system akin to human muscles. The manufacturing process is highly vertically integrated, with in-house production from raw materials to finished products. Assembly is done in stages: core components -> subsystems -> final assembly -> validation testing. Based on the demonstrations, the robot is not yet capable of complex household chores; it can only assist with simple tasks like tidying up or fetching items. Still, with the support of large language models (LLMs), the number of impressive robotics companies is growing rapidly.


OpenAI Races to Launch 'Strawberry' Reasoning AI to Boost Chatbot Business

High-quality synthetic data has once again proven its significance.

OpenAI's logic is to use a sufficiently large, and very expensive, reasoning model (Strawberry) to produce high-quality synthetic data that will help train the next-generation flagship model, codenamed "Orion," which aims to surpass GPT-4. A portion of that synthetic data is also used to fine-tune and distill the current generation, keeping the previous-generation model (GPT-4o) on a path of steady, if incremental, improvement.

According to The Information, OpenAI may release a ChatGPT version of Strawberry this fall. Strawberry's reasoning ability is significantly enhanced compared to current models: it can genuinely convert thinking time into output quality, and its stronger logic should address language-related challenges more effectively. Sam Altman also mentioned that OpenAI has invited U.S. national security officials to begin testing its advanced models. The combination of Strawberry and high-quality synthetic data may reduce errors in Orion.

Strawberry might have used a method similar to Stanford's Quiet-STaR research. Revisiting that paper, Quiet-STaR improves a model's reasoning ability in three steps:

  1. Parallel rationale generation: Multiple rationales are generated in parallel at each token position of the input sequence. Each rationale has length t, with learned start and end tokens inserted at its beginning and end.
  2. Mixing post-rationale and base predictions: A mixing head takes the hidden state produced after each rationale and the hidden state from the original text tokens, and outputs a weight that determines how much the post-rationale prediction contributes to the next-token prediction.
  3. Optimizing rationale generation: The REINFORCE algorithm optimizes the rationale-generation parameters, increasing the probability of rationales that make the future text more likely.
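
To make step 2 concrete, here is a minimal PyTorch sketch of a mixing head, based on our reading of the paper rather than the authors' code: a small MLP maps the two hidden states to a scalar weight that interpolates between the post-rationale and base next-token logits.

```python
import torch
import torch.nn as nn

class MixingHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # Small MLP mapping the pair of hidden states to a scalar in (0, 1).
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, h_base, h_rationale, logits_base, logits_rationale):
        # h_*: (batch, hidden) hidden states without / with the thought;
        # logits_*: (batch, vocab) next-token logits from the same two passes.
        w = torch.sigmoid(self.mlp(torch.cat([h_base, h_rationale], dim=-1)))
        # w decides how much the post-rationale prediction is trusted.
        return w * logits_rationale + (1.0 - w) * logits_base
```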

100M Token Context Window: LTM-2-Mini

Magic has released LTM-2-mini, a model with a context window of 100 million tokens. This is equivalent to approximately 10 million lines of code or 750 novels.

Unlike traditional models that rely on fuzzy memory, LTM models can attend to up to 100M tokens of context during inference.

Existing long-context evaluation methods contain implicit semantic cues, which reduce the difficulty of evaluation and allow models like RNNs and SSMs to achieve good scores.

The Magic team proposed the HashHop evaluation method, which uses hash pairs to require models to store and retrieve the maximum possible amount of information, thereby improving the accuracy of evaluation.
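
As an illustration, here is a small Python sketch of HashHop-style evaluation data as we understand it from Magic's description (not their actual harness): chains of random hashes are written as shuffled pairwise assignments, and the model must hop across pairs to recover the final hash.

```python
import random
import secrets

def make_hashhop_example(num_chains: int = 100, hops: int = 2):
    # Each chain links hops+1 random hashes; nothing about a hash hints at
    # its neighbors, so the model cannot rely on semantic shortcuts.
    chains = [[secrets.token_hex(8) for _ in range(hops + 1)]
              for _ in range(num_chains)]
    # Flatten all adjacent pairs and shuffle so document order carries no signal.
    pairs = [(c[i], c[i + 1]) for c in chains for i in range(hops)]
    random.shuffle(pairs)
    context = "\n".join(f"{a} = {b}" for a, b in pairs)
    start, answer = chains[0][0], chains[0][-1]
    question = f"Starting from {start}, follow the assignments {hops} times."
    return context + "\n\n" + question, answer
```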

When processing ultra-long contexts, LTM-2-mini's sequence-dimension algorithm is far cheaper computationally than the attention mechanism of the Llama 3.1 405B model.


Qwen2-VL: To See the World More Clearly

There haven't been many high-quality open-source multimodal models within China, especially ones that support video understanding, but Alibaba has now open-sourced Qwen2-VL. The largest 72B variant remains closed-source for now, with only the smaller 2B and 7B models being made publicly available.

Built upon Qwen2, Qwen2-VL offers several key advancements over its predecessor, Qwen-VL:

  • Understanding images of various resolutions and aspect ratios: Qwen2-VL has achieved world-leading performance on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.
  • Comprehending long videos: Qwen2-VL can understand long videos and utilize them for applications like video-based question-answering, dialogue, and content creation.
  • Enabling visual intelligent agents for phones and robots: With its complex reasoning and decision-making capabilities, Qwen2-VL can be integrated into devices like phones and robots to perform automated actions based on visual environments and text instructions.
  • Multi-language support: To serve a global audience, in addition to English and Chinese, Qwen2-VL now supports understanding multi-language text within images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
  • Native dynamic resolution support: A significant architectural improvement in Qwen2-VL is its full support for native dynamic resolution. Unlike its predecessor, Qwen2-VL can handle image inputs of any resolution, converting images of different sizes into a dynamic number of tokens, with a minimum of only 4 tokens.
  • Multimodal rotational position embedding (M-ROPE): Another important innovation is M-ROPE. While traditional rotational position embeddings can only capture the positional information of one-dimensional sequences, M-ROPE decomposes the original rotational embeddings into three parts representing time, height, and width, enabling large language models to simultaneously capture and integrate the positional information of one-dimensional text sequences, two-dimensional visual images, and three-dimensional videos.
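
A simplified sketch of the M-ROPE idea in the last bullet (our interpretation, not Qwen2-VL's exact implementation): the rotary frequency channels are split into three groups, each rotated by a token's temporal, height, or width position. For plain text, all three position ids coincide, so the scheme reduces to ordinary 1-D RoPE.

```python
import torch

def mrope_angles(pos_t, pos_h, pos_w, dim: int, base: float = 10000.0):
    # pos_t / pos_h / pos_w: (seq,) integer positions along time, height, width.
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    n = inv_freq.numel()
    sections = [n // 3, n // 3, n - 2 * (n // 3)]  # split channels three ways
    freqs = []
    for pos, sec_freq in zip((pos_t, pos_h, pos_w),
                             torch.split(inv_freq, sections)):
        freqs.append(pos.float()[:, None] * sec_freq[None, :])  # (seq, sec)
    return torch.cat(freqs, dim=-1)  # (seq, dim/2): per-channel rotation angles

# For pure text, the three position ids are identical and this is 1-D RoPE:
seq = torch.arange(16)
angles = mrope_angles(seq, seq, seq, dim=64)
```

Applying cos/sin of these angles to the query/key channels then proceeds exactly as in standard RoPE; for image patches, the height and width positions come from the patch grid, and for video, the temporal position indexes frames.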

In summary, Qwen2-VL represents a significant advancement in the field of multimodal large language models. By offering strong capabilities in understanding and generating text and visual content, Qwen2-VL has the potential to power a wide range of applications, from content creation to robotic control.


✦ Other Updates

  • MiniMax has released a video generation model; it currently supports only text-to-video generation and is free to use.
  • Civitai, the largest image model sharing site overseas, has launched a new site called Civitai Green, which only contains safe images and models, with no explicit content.
  • Runway's Gen-3 video generation model now supports video extension, allowing videos to be extended up to 40 seconds.
  • AI telemarketing platform Bland AI has raised $22 million in funding. It supports conversations in any language or voice, allows users to customize their own customer service bots through agents, and can handle millions of calls simultaneously.
  • Midjourney has started developing hardware and is hiring. Last year, they recruited someone who previously worked on VisionPro at Apple.
  • Google's Gemini has been updated with a feature similar to GPTs and Claude Projects, called "Gems". It now also supports image generation using Imagen 3.
  • Google has released a new model called Gemini 1.5 Flash 8B. There is also a new version of Gemini 1.5 Pro, which features improved prompt response and coding capabilities.
  • Anthropic has published a page documenting how Claude's system prompts have changed over time, a welcome move for transparency.
  • XLabs has launched the video generation project "Deforum" based on FLUX. Deforum can generate rapidly changing animations by controlling prompts at different stages.
  • Cohere has released new versions of the Command R and R+ models, with improved performance in reasoning, coding, tool use, and multilingual retrieval-augmented generation (RAG). Prices have also been reduced compared to the previous versions.
  • Zhipu AI has open-sourced the CogVideoX-5B DiT video generation model, which is likely the first large-scale DiT open-source video model.

✦ Product Recommendations

Tolan: Your Personalized AI Companion

Tolan is a customizable, AI-powered companion that can engage in natural conversations and help you brainstorm. It features real-time voice interaction and offers a variety of cute alien avatars.


Clockwise: AI-Powered Time Management Calendar

Clockwise is an AI-powered calendar that magically optimizes your schedule, making it easy to find time for everything, even in the busiest of times. It learns your preferences and automatically adjusts your calendar to find the best time for meetings.


Unriddle: AI-Powered Research Assistant

Unriddle is a research tool backed by Y Combinator, designed to expedite the research paper reading and writing process for researchers and students. It offers AI-powered features for information retrieval, content understanding, and writing, supports multiple file formats, and provides collaboration and security features. Users can create AI assistants across multiple documents to quickly extract and summarize data by asking questions. Additionally, Unriddle offers advanced writing assistance such as paragraph polishing, translation, text summarization, outline generation, and citation from internet sources.


✦ Featured Articles

An Interview with Anthropic CEO Dario Amodei

Hosts Noah Smith and Eric Benz interviewed Anthropic CEO Dario Amodei. In the interview, Amodei shared his academic background and passion for artificial intelligence (AI), as well as his work at Anthropic. He discussed the trajectory of AI development, including the breakthroughs in deep learning and his contributions to GPT and reinforcement learning during his time at OpenAI. Amodei emphasized the importance of scaling laws, which suggest that AI performance improves as model size increases.

Amodei analyzed the competitive landscape of AI companies, suggesting that the future of AI might resemble the solar energy industry, where the technology’s value is immense but profits may be compressed. He also discussed the impact of AI on the workforce, including the compression of skill gradients and the reallocation of labor. Additionally, he explored the potential applications of AI in biotechnology and manufacturing, and how these technologies could potentially reshape the world.

In terms of safety and risks, Amodei discussed the possibility of AI autonomy and misuse, as well as national security implications. He mentioned the potential impact of AI on international relations, particularly in the context of US-China competition. He also raised concerns about AI's potential risks to individual privacy and surveillance.

Finally, Amodei expressed support for the upcoming California bill SB 1047, which aims to manage AI risks by requiring AI companies to develop safety plans. He argued that this bill strikes a balance between the need for speed and safety, and could foster progress in AI safety.


How Anthropic Built Artifacts

The development of Artifacts began with a crude prototype demonstrated by research scientist Alex Tamkin at a "WIP Wednesday" meeting. He aimed to accelerate development by reducing the cycle time for creating and reviewing HTML code.

Product designer Michael Wang then helped evolve this prototype into a production-ready experience. The tech stack included Streamlit, Node.js, React, Next.js, and Tailwind CSS.

Throughout the development of Artifacts, the team not only leveraged Claude to speed up software development but also utilized Claude's capabilities to solve coding problems and implement specific interaction patterns.

Security engineer Ziyad Edher ensured the model safety and product security of Artifacts. Despite the small team size, within three months, they successfully brought Artifacts from concept to product release.

The feature's success exceeded the development team's expectations and may signal a shift towards generative AI becoming a more collaborative tool.


Efficient Deep Learning: A Comprehensive Overview of Optimization Techniques

The article first introduces various data types (such as Int16/Int8/Int4, Float32, Float16, Bfloat16, TensorFloat32, E4M3, and E5M2) and their roles in memory consumption. It then analyzes the primary sources of memory consumption during model training: model state (including optimizer state, gradients, and parameters) and residual state (such as activations, temporary buffers, and memory fragmentation). The article further delves into quantization techniques, including symmetric and asymmetric linear quantization, as well as quantization timing, granularity, and handling of outliers.
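
To make the quantization discussion concrete, here is a minimal sketch of symmetric linear quantization to int8 (illustrative only, not any specific library's API; asymmetric schemes additionally use a zero-point to handle non-centered distributions):

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1            # 127 for int8
    scale = x.abs().max() / qmax          # one scale for the whole tensor
    q = torch.clamp((x / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale              # recover an approximation of x

w = torch.randn(4, 4)
q, s = quantize_symmetric(w)
print((w - dequantize(q, s)).abs().max())  # per-tensor quantization error
```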

In the section on Parameter-Efficient Fine-Tuning (PEFT), the article provides detailed explanations of LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA), and how they reduce memory and storage usage. The article also mentions techniques such as Flash Attention, gradient accumulation, 8-bit optimizers, and sequence packing, which contribute to improving training efficiency and reducing resource consumption. Additionally, the article introduces the use of torch.compile and how to optimize the attention mechanism through multi-query attention (MQA) and grouped-query attention (GQA).
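
A minimal LoRA sketch in the same spirit (a standard formulation of the paper's method, not the article's code): the frozen base weight is augmented with a trainable low-rank product B·A scaled by alpha/r, and since B starts at zero, training begins from the unmodified model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r           # B starts at zero, so the adapter
                                           # is a no-op at initialization
    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)  # only A and B are trainable
```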

Finally, the article discusses collective operations and distributed training methods, including data parallelism (DP), model parallelism, tensor parallelism, and pipeline parallelism, as well as the concept and workflow of Fully Sharded Data Parallelism (FSDP).
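
On the distributed side, a minimal sketch of wrapping a model with PyTorch's FSDP (the API as of recent PyTorch releases; assumes one process per GPU launched via torchrun, and the small Sequential model is a stand-in for a real network):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")            # one process per GPU, via torchrun
torch.cuda.set_device(dist.get_rank())     # assumes a single-node launch

model = nn.Sequential(                     # placeholder for a real model
    nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)                        # parameters, gradients, and
                                           # optimizer state sharded across ranks
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
```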


The Rise of the AI Engineer

This article primarily discusses the rise of the AI engineer role and its impact on the software engineering field. It emphasizes the distinctions between AI engineers and traditional machine learning engineers, highlighting the importance of AI engineers in productizing AI technologies and applying AI models.

With the increasing capabilities of foundation models and the availability of their APIs, the AI engineer role has rapidly emerged. AI engineers must not only master API documentation but also write software, even software for AI itself. The role has found a home everywhere from large corporations to innovative startups and independent hackers, playing a pivotal part in turning AI technology into real-world products. The number of AI engineers is projected to surpass that of traditional machine learning engineers, making it the most sought-after engineering position of the decade.

The article further analyzes the differences between AI engineers and machine learning engineers, noting that AI engineers typically do not engage in model training but focus on evaluating, applying, and productizing AI technologies. The tools and techniques used by AI engineers include the latest AI models, open-source tools, and automated agents. Although the role of AI engineers is becoming well-defined, there is still semantic debate in the market regarding the differences between AI and ML.


✦ Key Research

CSGO: Content-Style Composition in Text-to-Image Generation

This model achieves a high-quality combination of content and style in text-to-image generation. The research team constructed a large-scale dataset named IMAGStyle, comprising 210,000 image triplets, for training and research. The CSGO model explicitly separates content and style features through independent feature injection, enabling image-driven style transfer, text-driven stylized synthesis, and text-editing-driven stylized synthesis. The model requires no further fine-tuning for inference and retains the original text-to-image model's generation capabilities.


GameNGen: Diffusion Models Are Real-Time Game Engines

Google has released GameNGen, a diffusion-model-based real-time game engine that can simulate complex interactive environments in high quality over long horizons. It runs the classic game DOOM at over 20 frames per second on a single TPU.

Training is divided into two stages:

  1. Training an autonomous agent to interact with the environment and record the agent's training trajectories. These trajectories serve as input data for the generative model.
  2. Using the Stable Diffusion v1.4 model and conditioning it to generate images based on the agent's trajectories.
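
As a rough illustration of stage 2 (a toy abstraction of the setup described above, not Google's implementation): the denoising model receives the noisy next frame together with embeddings of the agent's recent frames and actions, so frame generation is conditioned on the trajectory.

```python
import torch
import torch.nn as nn

class TrajectoryConditionedDenoiser(nn.Module):
    def __init__(self, frame_dim=128, num_actions=8, history=4):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, frame_dim)
        # Input: noisy next frame + `history` past frames + `history` actions.
        in_dim = frame_dim * (1 + 2 * history)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, frame_dim))

    def forward(self, noisy_next, past_frames, past_actions):
        # noisy_next: (B, frame_dim); past_frames: (B, history, frame_dim);
        # past_actions: (B, history) integer action ids.
        a = self.action_emb(past_actions)                    # (B, history, D)
        cond = torch.cat([past_frames, a], dim=1).flatten(1)
        return self.net(torch.cat([noisy_next, cond], dim=1))  # predicted noise
```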

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

DeepSeek's new paper introduces a novel load balancing method named "Loss-Free Balancing". Traditional methods control load balancing by adding extra loss functions, which can interfere with model training. The new method avoids introducing additional losses and instead directly adjusts the routing scores of each expert, leading to better load balancing and model performance. Experiments demonstrate that this method achieves superior results on both 1B and 3B parameter models.
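
In code, the idea looks roughly like this (our sketch of the mechanism as the paper describes it, not DeepSeek's implementation): a per-expert bias is added to routing scores only for top-k expert selection, and after each batch the bias is nudged against the observed load, so no auxiliary-loss gradient ever touches the model.

```python
import torch

num_experts, k, lr_bias = 8, 2, 0.001
bias = torch.zeros(num_experts)       # routing-only bias, not a model parameter

def route(scores: torch.Tensor):
    # scores: (tokens, num_experts) affinities from the gating network.
    global bias
    topk = torch.topk(scores + bias, k, dim=-1).indices      # biased selection
    load = torch.zeros(num_experts).index_add_(
        0, topk.flatten(), torch.ones(topk.numel()))
    # Push bias down for overloaded experts, up for underloaded ones.
    bias = bias - lr_bias * torch.sign(load - load.mean())
    return topk
```

Because the bias only shifts which experts are selected and never enters the gating weights or loss, load balancing is achieved without perturbing the training gradients.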


Automated Design of Agentic Systems

This paper introduces a novel research direction called "Automated Design of AI Systems" (ADAS). Its goal is to enable AI to automatically design more powerful AI systems, rather than relying on human design. The researchers have developed an algorithm called "meta-agent search," which allows a "meta-agent" AI to continuously attempt to write new AI agent programs. Through experiments on multiple tasks, they found that this approach can discover AI systems that outperform those designed by humans. This method has the potential to significantly accelerate the development of AI systems, but safety concerns should also be addressed.


Subscribe Here 🔔