TL;DR

Educated in several acronyms across the globe (UNISR, SFI, MIT), I am the co-founder of Bauplan, an agentic data infrastructure company based in SF.

I was the co-founder and CTO of Tooso, an AI startup providing search and recommendations to millions of users, before it was acquired by Coveo (TSX:CVO). I led Coveo’s AI from scale-up to IPO, and built out Coveo Labs, an R&D lab rooted in open science: our libraries, models and datasets have collected thousands of stars and garnered tens of millions of downloads.

Throughout my career, I have been fortunate to collaborate with incredible teams (e.g. Netflix, NVIDIA, Stanford, Univ. of Wisconsin-Madison), while working on products spanning multiple fields: Artificial Intelligence, Data Management, Information Retrieval, Computer Systems. My research contributions are often product-focused, and are memorable mostly for their titles (e.g. “Not all those who browse are lost”, “You don’t need a bigger boat”, “FaaS and Furious”).

While building my new startup, I moonlight as an Adj. Professor of ML Systems at NYU, which is only notable because it is the only job I have ever had that my parents understand.

Where is my mind?

I occasionally share code, ideas and teaching materials. Selected projects, talks, papers and datasets are highlighted below.

I recently started investing in startups, both directly and as an LP in AI funds: I’m always happy to chat with founders!

When stars align, I sometimes advise great teams on AI, Data, and IR: past engagements include Outerbounds (acquired by Anaconda), Objective (acquired by Upwork), and Plural (acquired by SAI360). If you think I can help, feel free to reach out.

Research

I have done research on a heterogeneous set of topics: Information Retrieval (e.g. RecSys, SIGIR), Machine Learning and model evaluation (WWW, NeurIPS), NLP (NAACL, ACL), data science (Nat. Sci. Rep., KDD), agentic AI and Large Language Models (ICML), data management (SIGMOD, VLDB), human-machine computation (HCOMP), and computer systems (Middleware, FAST). Our paper on cognitively inspired query embeddings won the Best Paper Award at NAACL 2021, and our talk on reproducible data pipelines on data lakes won the Best Presentation Award at DEEM (SIGMOD) 2024.

I was the lead organizer of Supporting Our AI Overlords at ACM CAIS, the first-ever research workshop at the intersection of AI agents and data systems. I have been a co-organizer of SIGIR eCom (2022, 2023) and EvalRS (2022, 2023), Industry Sponsorship Chair for CIKM 2022, Industry Chair at UMAP 2025, and I have been involved in various capacities in several top-tier events (e.g. EMNLP, ACL, SIRIP, ECONLP, ECNLP, PaPoC).

As a true Santa Fe Institute alumnus, I am an old-fashioned generalist, and I made tiny contributions to other fields mostly as an excuse to spend time with old friends: logic and computation, cellular automata, computational social sciences, networks, philosophy of mind, political science, digital ethics.

Finally, some of my projects have been patented, but to this day nobody seems to really know why.

Old stuff

In previous lives, I managed to get a Ph.D., simulate a pre-Columbian civilization, document biases in national elections and give an academic talk on video games. Some of my improbable “achievements” received ample press coverage and earned a few sparks of Hacker News front-page popularity.

Having built end-to-end data pipelines at garage, growth and IPO scale, I happily shared all my mistakes in a series of articles that introduced the concept of Reasonable Scale.

Some time before Brad Pitt’s movie, I led one of the first attempts to run sophisticated analytics for a professional basketball team, and spearheaded the first data science effort on Milan’s bike-sharing service (no bikers or bureaucrats were harmed during the project).

About this page

The content of jacopotagliabue.it is released under the BY-NC-ND license; my chibi was designed by the incredibly talented wisesnail.

Last update: June 2026.

Appendix

Friends in industry and academia often invite me to talk about things I (sort of) know. Highlights include keynotes at KDD, SIGIR, RecSys, CiE, VLDB, and SRDS, plus talks at NVIDIA, Lyft, Home Depot, Pinterest, IBM, Columbia, Berkeley, and many others.

My publication list is available on Google Scholar: selected projects, talks, papers and datasets are collected here for convenience.

Selected Open Source Projects

FashionCLIP is a fashion-aware model based on CLIP. As the first-ever industry-aware CLIP model, FashionCLIP spawned an open-source repo, two papers (Nat. Sci. Rep. and ACL), and a popular Hugging Face release, with more than 10 million downloads in the first few months.
“You don’t need a bigger boat” is an introduction to modern ML pipelines at the Reasonable Scale, first published at RecSys and then presented at the Stanford MLSys Lectures in 2021.
RecList is a testing library for recommender systems: RecList spawned a sponsored open source package, a competition, an open source hackathon, a white paper on evaluation and three articles (WWW, Nature Machine Intelligence, CIKM). RecList collected tens of thousands of dollars in donations as an open-source project, and it is among the testing libraries recommended by the ACM.

Selected Talks

From LLMs to agents, Columbia University, 11/25
How Do We Sleep at Night? Distributed Systems at Startup Speed, SRDS 2025 (Keynote), 10/25
Speedrunning The Lakehouse, VLDB 2025 (CDMS Keynote), 09/25
Applied R&D at startup scale, SIGIR 2023 (Keynote), 04/23
Wild Wild Tests, Arthur Ground Truth Series (Invited Talk), 04/23
Mo’ models, mo’ problems: CLIP-like models and generalization in modern e-commerce platforms, KDD 2022 (Ecom-Gen Keynote), 08/22
Recs at reasonable scale (slides, repo), NVIDIA RecSys Summit (Keynote), 07/22

Selected Papers

Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance (AAAI 2026)
Bauplan: Zero-Copy, Scale-Up FaaS for Data Pipelines (Middleware 2024)
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis (ICML 2024)
Reproducible Data Science over Data Lakes (SIGMOD 2024, Best Presentation Award)
Contrastive Language and Vision Learning of General Fashion Concepts (Nat. Sci. Rep. 2022)
The Embeddings That Came in From the Cold: Improving Vectors for New and Rare Products with Content-Based Inference (RecSys 2020)
Query2Prod2Vec: Grounded Word Embeddings for eCommerce (NAACL 2021, Best Paper Award)
Cellular Automata (SEP 2017)

Datasets and Data Challenges

EvalRS 2023, augmenting the EvalRS 2022 dataset with lyrics and sentiment information, plus a whole new set of tests. The dataset was used to organize the first-of-its-kind hackathon at KDD.
EvalRS 2022, packaging the LFM-1b dataset into an easy-to-use, “testable” format for our data challenge. The competition featured 150 participants in 50 teams, and was at the heart of a Nature Machine Intelligence editorial.
SIGIR eCom Challenge 2021, the largest and most complete session-based dataset (at the time), released for SIGIR 2021.
Shopper Intent Prediction, from our paper.

Aside from research and tutorials, our datasets have been successfully used by dozens of graduate students to defend their theses at Tilburg University and Politecnico in Milan.