AI Newsletter #004 (2023/06/12-2023/06/18)

Big Companies:

OpenAI: Big API Updates
OpenAI has released some important updates to Chat Completions API and base models, including:
1) New function calling capability in Chat Completions API
2) New 16k context version of gpt-3.5-turbo (vs the standard 4k version)
3) Reducing the prices for GPT-3.5-turbo and text-embedding-ada-002 by 25% and 75% respectively

GitHub: Survey Finds 92% of Programmers Are Using AI Tools
A survey conducted by Microsoft’s GitHub found that 92% of programmers at large companies are using AI tools in their workflow, with 70% of respondents reporting benefits from using them. These tools are helping developers create and debug code more quickly with improved code quality and fewer production-level incidents, suggesting that code volume may not be the best metric for measuring productivity.

Mercedes: Adding ChatGPT to its infotainment system
Mercedes is introducing OpenAI’s ChatGPT conversational AI agent to its MBUX infotainment system in the U.S. Starting June 16, owners of MBUX-equipped models can join the beta program by saying “Hey Mercedes, I want to join the beta program.”

Meta: Releases Open-Source ‘MusicGen’
Meta has released an open source AI model, MusicGen, which can generate music from simple prompts. It is based on a transformer language model and can be demoed through Hugging Face’s API. Meta plans to release more open-source models, though it recognizes that AI can be an unfair competition for artists.

Amazing New Projects

AI-Generated QR Code
This week, AI-generated QR code received wide attention. You can combine QR code with any style you like and the generated QR code works! Stable Diffusion is used to create the styles and controlnet is used to keep the structure of QR code. This project is initiated by four college students from China.

Research of The Week

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

This week, Meta has launched the first AI model based on the LeCun world model concept. The model is called the Image Joint Embedding Predictive Architecture (I-JEPA), which learns by creating an internal model of the external world and compares abstract representations of images rather than comparing the pixels themselves.

I-JEPA has achieved excellent results in various computer vision tasks and is computationally efficient compared to other widely used computer vision models. Additionally, the representations learned by I-JEPA can be applied to various applications without requiring extensive fine-tuning.

For example, Meta trained a visual transformer model with 632 million parameters using 16 A100 GPUs within 72 hours. They also achieved state-of-the-art performance in low-shot classification on ImageNet, where each class had only 12 labeled samples. Other methods typically require 2 to 10 times more GPU hours and have higher error rates when trained with the same amount of data.

The related paper titled “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture” has been accepted at CVPR 2023. Of course, all training code and model checkpoints will be open-sourced.

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Meta states that Voicebox is the most versatile text-guided generative model for speech at scale.

Similar to image and text generation, Voicebox can create various styles of speech output, including generating output from scratch and modifying given samples. Voicebox can synthesize speech in six languages and perform tasks such as noise removal, content editing, style conversion, and diverse sample generation.

Before the introduction of Voicebox, AI models for speech generation required specific training data carefully prepared for each task. However, Voicebox only needs to learn from raw audio and accompanying transcriptions, and it can modify any part of a given sample.

Voicebox is based on a method called Flow Matching, which has been proven to improve diffusion models.

In terms of generation quality, Voicebox outperforms the current state-of-the-art (SOTA) English speech generation model VALL-E in terms of intelligibility (word error rate: 1.9% vs. 5.9%) and audio similarity (0.681 vs. 0.580), while being 20 times faster.

Can Large Language Models Infer Causation from Correlation?

Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g. commonsense knowledge).

In this work, the authors propose the first benchmark dataset to test the pure causal inference skills of large language models (LLMs). They formulated a novel task CORR2CAUSE, which takes a (set of) correlational statements and determines the causal relationship between the variables and evaluated 17 LLMs.

Through the experiments, researchers identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task.

TryOnDiffusion: A Tale of Two UNets

Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person.

A key challenge is to synthesize a photorealistic detailpreserving visualization of the garment, while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment details.

In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network.

The results look pretty natrual.

User studies. 15 non-experts were asked to select the best result or choose “hard to tell”. TryOnDiffusion significantly outperforms others in both studies.

Want to receive TechNavi Newsletter in email?