On the Fragility of LLM Benchmarks. A recent study involving researchers from Microsoft, DeepMind, and Mila highlights the fragile nature of LLM benchmarks when applied to real-world scenarios, specifically in solving math problems. The paper identifies a critical gap: scores on popular math benchmarks don't effectively predict an LLM's ability to engage in "second-hop" reasoning, where the solution to a subsequent math problem depends on the solution to a preceding one. It seems we're still scratching the surface of what these benchmarks can tell us about actual LLM capabilities in real-world applications.
California’s AI Legislative Landscape. Despite the demise of AI bill SB 1047, vetoed by Governor Gavin Newsom, AI legislation in California is far from stagnant. According to Luiza Jarovsky on Twitter, the state is witnessing a surge in AI-related laws, with 17 new pieces of legislation this year focusing on generative AI. It’s a sign of how quickly the technology is evolving, as well as of the gap left by the absence of federal law, and it may portend a high degree of bipartisan support for future AI regulation.
The Future of Call Centers. Is the call center as we know it, staffed by humans, dead, or merely evolving? With LLMs enhancing their audio capabilities to deliver high-quality, cost-effective, live dialogue — OpenAI's Realtime voice API runs at $9 per hour versus $15 for a human agent — the role of AI in customer service will soon be indisputable. Yet, can AI manage customer escalations or soothe an irate caller effectively? More testing is needed to assess the types and frequency of errors AI voice systems might make in different contexts.
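A back-of-the-envelope sketch of the cost comparison above, using the cited per-hour figures. The assumption that agent seats are billed for full hours, and the ~2,000-hour full-time year, are illustrative only:

```python
# Per-hour figures cited above (illustrative comparison only).
API_COST_PER_HOUR = 9.0     # OpenAI Realtime voice API
HUMAN_COST_PER_HOUR = 15.0  # human agent

def annual_savings(hours_per_year: float) -> float:
    """Savings from shifting human agent-hours to API hours,
    assuming each hour is fully utilized (a strong simplification)."""
    return (HUMAN_COST_PER_HOUR - API_COST_PER_HOUR) * hours_per_year

# One full-time agent seat, roughly 2,000 hours per year:
print(annual_savings(2000))  # 12000.0
```

Of course, this ignores the cost of the errors flagged above: a single badly handled escalation can outweigh an hour's savings.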
Google’s Open-Source Paradox. In light of the recent court injunction ordering Google to open Android to app store rivals, Ben Thompson argues in Stratechery that this is ultimately about Google struggling with Android’s open-source nature while craving the control that Apple enjoys over iOS. Google’s challenge lies in maintaining a balance between openness and control, a dance made more complicated by legal challenges:
“Google wants the same control of Android that Apple has, while operating a licensable operating system for third parties. However, you can’t have both! Apple is able to dictate terms for iOS to a much greater extent than Google can for Android precisely because they own the entire thing; Google wants the benefit of “openness” and commoditizing OEMs without giving up control. The only way to accomplish that, though, is through contracts and deals that were found to be illegal.”
LLMs Know When They Could Do Better. A new Stanford research paper looks at whether an LLM knows during inference time (while it is in the middle of generating an answer) if it can provide a better response to a user query. The researchers exploit this model awareness to improve LLM answer quality more efficiently, and even to self-correct mid-generation. Traditional methods for scaling LLM inference ("test-time") computation, such as Tree of Thought or Best of N, generate extra answers to improve performance. However, this can waste compute on easy questions as well as on questions too difficult to improve upon. To address this, the authors develop a new method called "capability-aware self-evaluations", which is faster while still improving performance. It utilizes an LLM's cache (its working memory) by asking the model to figure out for itself when a new generation (response) to a user query is needed.
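The control flow of the idea can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the function names are ours, the stand-in stubs replace real model calls, and the paper's actual check reuses the model's cached context rather than a hand-written rule:

```python
def generate(prompt: str, attempt: int) -> str:
    """Stand-in for an LLM call; real code would query the model."""
    return f"draft #{attempt} for: {prompt!r}"

def could_do_better(prompt: str, draft: str, attempt: int) -> bool:
    """Stand-in for the capability-aware self-evaluation. A real
    implementation asks the model itself, reusing its cached context,
    whether a fresh generation is likely to beat the current draft."""
    return attempt < 2  # toy rule: the first two drafts are improvable

def answer(prompt: str, max_attempts: int = 5) -> str:
    """Spend extra generations only while the model judges that a
    better answer is plausible, instead of a fixed Best-of-N budget."""
    draft = generate(prompt, attempt=0)
    for attempt in range(1, max_attempts):
        if not could_do_better(prompt, draft, attempt - 1):
            break  # model deems a retry wasteful: stop early
        draft = generate(prompt, attempt=attempt)
    return draft

print(answer("What is 17 * 23?"))  # stops after the third draft
```

The contrast with Best-of-N is the early exit: easy queries cost one generation, hopeless ones stop as soon as the model reports no headroom.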
Crypto is Power. The extent of the influence which crypto now wields in Washington is frightening to those who believe that corporate lobbying can subvert the democratic process, especially when it serves the interests of a few. Brian Armstrong of Coinbase – the largest crypto exchange in the United States – is an important figure in what is now an influential movement, notes Gil Duran in this New Republic piece: “A Public Citizen study last month found that crypto companies, which contributed less than $10 million to super PACs over the past two election cycles combined, have raised more than $200 million in 2024 — accounting for nearly half of all corporate contributions this cycle”. Duran argues that a much deeper ideology underpins Armstrong’s lobbying: “He believes the United States is in ‘slow decline’ and embraces the Network State”. Armstrong, along with a range of investors, one of whom is Sam Altman, supports the creation of these Network States, which exist outside of the law and have access to “the permissionless transfer of property”. A veritable Garden of Eden for the wealthy. We wonder what role AI will play in such a world.
P.S. We are hiring! Come work with us if you are good with LLMs, care about the public good, and want to contribute to our agenda of transparency and accountability in the corporate use of AI.
P.P.S. Ridiculous finale. LLMs for meat chopping? Fleischhacker is a German surname, literally meaning “meat chopper”. One thing which LLMs are surely helping “improve” is the surnames of bots on Twitter (along with their hyper-realistic photos). A recent follower of mine (@IlanStrauss):