When Text Helps Time Series (and When It Doesn’t)


Time Series Foundation Models: What You Need To Know

The recent emergence of Time Series Foundation Models (TSFMs) offers a powerful new tool for forecasting. Their effectiveness, however, is often constrained by an architectural design that analyzes each time series as an independent entity. This approach overlooks the rich, structured context available in most enterprise data warehouses, where a product’s sales history is interconnected with customer behavior, marketing campaigns, and related product lines. An interesting and powerful perspective, championed by pioneers in Graph Transformers, argues that the most significant predictive gains come from modeling this entire relational ecosystem, allowing different time series to inform one another. While this relational approach represents the frontier, understanding the current state of the art is a necessary first step for any practitioner.

Zero-Shot Forecasting: Powerful Baselines with Practical Limits

The key advantage of TSFMs like Amazon’s Chronos and Google’s TimesFM is their ability to provide a strong initial forecast instantly, even for data they have never seen before, by leveraging pre-training on vast, diverse datasets. This capability is strong enough that in one benchmark, TSFMs outperformed specialized, fully trained models on 10 of 21 datasets without any fine-tuning.

This out-of-the-box accuracy is not universal and hinges on how well your data matches the model’s training corpus. For idiosyncratic data — such as the electricity consumption of a single home — a small custom model with under 50,000 parameters has been shown to beat a fine-tuned 500M+ parameter TSFM. The key takeaway is to use zero-shot forecasts as a powerful starting point but always validate them against your specific data.
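
As a concrete starting point, here is a minimal sketch of that workflow: generate a zero-shot forecast with a compact pre-trained model and score it against a held-out window of your own data. It assumes the open-source `chronos-forecasting` package and its `ChronosPipeline` interface; the synthetic series and 24-step horizon are placeholders.

```python
# Minimal zero-shot baseline sketch; assumes the `chronos-forecasting`
# package (pip install chronos-forecasting) and PyTorch are installed.
import numpy as np
import torch
from chronos import ChronosPipeline

# Placeholder data: replace with your own series and horizon.
history = np.sin(np.arange(400) * 2 * np.pi / 24) + np.random.normal(0, 0.1, 400)
horizon = 24
context, holdout = history[:-horizon], history[-horizon:]

# Load a compact pre-trained TSFM and forecast with no fine-tuning.
pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")
samples = pipeline.predict(
    context=torch.tensor(context, dtype=torch.float32),
    prediction_length=horizon,
)  # shape: [series, num_samples, horizon]
point_forecast = samples[0].median(dim=0).values.numpy()

# Always validate the zero-shot output against your specific data.
mae = np.mean(np.abs(point_forecast - holdout))
print(f"Zero-shot MAE on holdout: {mae:.3f}")
```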


Choosing Your Path: Built-In Intelligence or External Tools

Integrating time-series analysis with LLMs involves a choice between two strategies: native integration and tool-calling.

“Native integration” builds time-series capabilities directly into the model for deep, joint reasoning between signals and text. This is crucial for complex tasks; for example, a specialized model using this approach achieved a 69.9% F1 score on sleep staging by jointly analyzing waveforms and patient notes, far exceeding text-only or image-based baselines.

“Tool-calling” uses a general LLM to orchestrate specialized external libraries. It’s a modular, efficient, and reliable approach that is more practical for most standard business forecasting.

The takeaway is clear: use tool-calling for efficiency and robustness. Reserve native integration for high-stakes scenarios like proactive healthcare or industrial diagnostics, where discovering subtle patterns from combined data drives critical outcomes.
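
To make the tool-calling pattern concrete, the sketch below shows a forecasting function and the JSON description an orchestrating LLM would be given; when the model requests the tool, the application executes it and returns the result. The function, its schema, and the seasonal-naive logic are illustrative stand-ins rather than any particular vendor's API (the schema simply follows the common function-calling convention).

```python
# Sketch of the tool-calling pattern: the LLM never ingests raw numbers as a
# modality; it decides when to call a specialized forecasting tool.
import numpy as np

def forecast_tool(values: list[float], horizon: int, season_length: int = 7) -> list[float]:
    """Seasonal-naive forecast: repeat the last observed seasonal cycle."""
    series = np.asarray(values, dtype=float)
    last_cycle = series[-season_length:]
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_cycle, reps)[:horizon].tolist()

# JSON description handed to the orchestrating LLM (shown in the common
# function-calling convention purely as an example).
FORECAST_TOOL_SCHEMA = {
    "name": "forecast_tool",
    "description": "Forecast the next `horizon` points of a numeric series.",
    "parameters": {
        "type": "object",
        "properties": {
            "values": {"type": "array", "items": {"type": "number"}},
            "horizon": {"type": "integer"},
            "season_length": {"type": "integer", "default": 7},
        },
        "required": ["values", "horizon"],
    },
}

# When the LLM responds with a tool call, the application executes it:
print(forecast_tool(values=[10, 12, 9, 11, 14, 20, 22] * 4, horizon=7))
```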

Purpose-Built vs. Adapted: Why Specialized Architectures Win

Despite the appeal of repurposing existing LLM infrastructure, models designed specifically for temporal data deliver superior performance. Benchmarks show that specialized time-series architectures, built to capture seasonality and trends, outperform adapted text models. The gap exists because these adaptations first translate continuous time-series values into discrete, text-like tokens, a process fundamentally mismatched to the data’s properties, such as its multiple scales and non-linguistic structure.

While adapted LLMs offer deployment convenience, the performance penalty typically justifies separate model infrastructure. One representative study found that models treating time series as a distinct modality achieve state-of-the-art results, while LLM adaptations lag significantly, especially on complex multivariate tasks.

Match the Model to the Mission

TSFMs are not monolithic. The landscape is divided into encoder-only models (like MOMENT and MOIRAI), decoder-only models (like TimesFM and Lag-Llama), and encoder-decoder architectures (like TimeGPT). Encoder-only designs excel at tasks requiring analysis of complete sequences — anomaly detection, classification, or pattern recognition across entire time windows. Decoder-only architectures generate forecasts autoregressively, similar to how language models produce text, making them natural for long-horizon sequential predictions. Encoder-decoder systems balance both capabilities but require more computational resources. 

For forecasting specifically, Transformer-based models that use “patching” techniques to break sequences into chunks, such as PatchTST and iTransformer, consistently deliver the best results. However, for general-purpose analysis including classification and imputation, CNN-based models like TimesNet prove more effective. The practical implication: teams should not assume a state-of-the-art forecasting model excels at anomaly detection. Your integration path, infrastructure requirements, and fine-tuning approach all depend on matching architecture to use case.

Model Efficiency: Why Smaller Often Beats Bigger

While the largest models often capture the most attention, benchmarks consistently reveal a clear trade-off between accuracy, computational cost, and inference speed. Smaller models are often more practical, as simple architectures like DLinear can match the performance of complex ones for a fraction of the cost. Compact TSFMs like Chronos Bolt Tiny (48M parameters) return forecasts in milliseconds, whereas the largest models can take seconds per forecast, too slow for real-time applications.

For large-scale operations in logistics or retail, a fast “good enough” model is far more valuable than a marginally more accurate one that fails to meet latency requirements. The best practice is to start with a smaller, efficient model and only scale up if the slight accuracy improvement is worth the significant increase in operational cost and latency.
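
A minimal harness for that decision might look like the sketch below, which measures both latency and accuracy for any forecasting callable. `forecast_fn` is a stand-in for whichever models you are comparing (say, a compact checkpoint versus a large one), not a specific library API.

```python
# Generic latency-vs-accuracy harness for the "start small, scale only if
# justified" practice above. `forecast_fn` is a stand-in callable.
import time
import numpy as np

def benchmark(forecast_fn, context: np.ndarray, holdout: np.ndarray, n_runs: int = 20):
    """Return (mean latency in ms, MAE) for one forecasting callable."""
    latencies, forecast = [], None
    for _ in range(n_runs):
        start = time.perf_counter()
        forecast = forecast_fn(context, len(holdout))
        latencies.append((time.perf_counter() - start) * 1000)
    mae = float(np.mean(np.abs(np.asarray(forecast) - holdout)))
    return float(np.mean(latencies)), mae

# Usage sketch (hypothetical callables wrapping a small and a large model):
# for name, fn in [("chronos-bolt-tiny", small_fn), ("large-tsfm", large_fn)]:
#     ms, mae = benchmark(fn, context, holdout)
#     print(f"{name}: {ms:.1f} ms/forecast, MAE={mae:.3f}")
```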

Implementation Workflow: From Zero-Shot Baseline to Fine-Tuning

If native integration suits your use case, here’s a practical implementation workflow. First, always benchmark a pre-trained TSFM in a zero-shot setting to establish a strong, immediate baseline. If more accuracy is needed, the next step is fine-tuning, but this should be approached with caution. Fine-tuning can be computationally expensive, and some studies show that on narrow, single-domain datasets, it can even degrade performance compared to the model’s zero-shot capabilities, likely due to a lack of data diversity.

A more modern alternative is “in-context learning,” where a model like Google’s TimesFM-ICF is provided with a few examples from the target domain within the prompt at inference time. This allows the model to adapt on the fly without any gradient updates, offering the accuracy of a specialized model without the operational overhead of managing separate fine-tuning jobs.
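
Before committing to fine-tuning or in-context adaptation, it is worth quantifying the gap against the zero-shot baseline on your own data. The rolling-origin backtest below is a minimal sketch; `forecast_fn` again stands in for any callable that maps a context window to a horizon-length forecast, not a specific library API.

```python
# Rolling-origin backtest sketch for the workflow above: score the zero-shot
# model first, then keep a fine-tuned (or in-context-adapted) variant only if
# it clearly wins on your own data.
import numpy as np

def rolling_backtest(series: np.ndarray, forecast_fn, horizon: int, n_folds: int = 5) -> float:
    """Average MAE over expanding-window splits."""
    errors = []
    for fold in range(n_folds, 0, -1):
        split = len(series) - fold * horizon
        context, actual = series[:split], series[split:split + horizon]
        forecast = np.asarray(forecast_fn(context, horizon))
        errors.append(np.mean(np.abs(forecast - actual)))
    return float(np.mean(errors))

# Usage sketch (hypothetical callables):
# zero_shot_mae = rolling_backtest(series, zero_shot_forecast, horizon=24)
# finetuned_mae = rolling_backtest(series, finetuned_forecast, horizon=24)
```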

The Value of Probabilistic Predictions

For many business applications, from supply chain management to financial risk analysis, understanding the uncertainty of a forecast is as critical as the point prediction itself. Many TSFMs, such as Chronos, MOIRAI, and Toto, are designed to be probabilistic, meaning they output a predictive distribution (e.g., a range of possible values and their likelihoods) rather than a single number. This allows downstream systems to make risk-aware decisions, such as calculating the optimal inventory level to hold to achieve a 95% service level. When evaluating models for such tasks, teams should prioritize those with native probabilistic capabilities and use metrics like the Continuous Ranked Probability Score (CRPS) in addition to standard error metrics like MAE.
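
As an illustration, the sketch below assumes a sample-based predictive distribution (an array of shape [num_samples, horizon], as sampling models such as Chronos produce) and shows both a risk-aware decision, stocking to the 95th percentile of predicted demand, and a sample-based CRPS estimate. The demand figures are synthetic.

```python
# Sketch: turning a probabilistic forecast into a risk-aware decision and a
# CRPS score. `samples` is assumed to be shaped [num_samples, horizon].
import numpy as np

def crps_from_samples(samples: np.ndarray, actual: np.ndarray) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|, averaged over the horizon."""
    term1 = np.mean(np.abs(samples - actual[None, :]), axis=0)
    term2 = 0.5 * np.mean(np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1))
    return float(np.mean(term1 - term2))

# Synthetic predictive samples for a 7-day demand forecast.
rng = np.random.default_rng(0)
samples = rng.normal(loc=100, scale=15, size=(200, 7))
actual = rng.normal(loc=100, scale=15, size=7)

# Risk-aware decision: stock to the 95th percentile of predicted demand
# to target roughly a 95% service level.
order_up_to = np.quantile(samples, 0.95, axis=0)
print("Order-up-to levels:", np.round(order_up_to, 1))
print("CRPS:", round(crps_from_samples(samples, actual), 2))
```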

Tokenization Strategy: Balancing Speed, Cost, and Precision

A TSFM’s processing method — how it “tokenizes” data — is key to its performance. Patch-based models like PatchTST and TimesFM treat chunks of data points as tokens, making them 40-60% more efficient than point-wise methods. Alternatively, frequency-domain or wavelet transforms create robust vocabularies better suited to complex patterns like spikes and non-stationarity.

This choice is critical for latency-sensitive applications like real-time monitoring. The trade-off is simple: patching offers speed at the potential cost of temporal precision. Therefore, teams requiring sub-second resolution may need to accept the higher cost of point-wise processing, a technical choice that becomes a key practical constraint when scaling to production workloads.
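
For intuition, here is a minimal illustration of what patching means in practice: the series is reshaped into fixed-length chunks so the model attends over one token per patch rather than one per data point. The patch length and series below are arbitrary examples.

```python
# Minimal illustration of patch tokenization.
import numpy as np

def to_patches(series, patch_len, stride=None):
    """Return an array of shape [num_patches, patch_len]; non-overlapping by default."""
    stride = stride or patch_len
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

series = np.arange(96, dtype=float)          # e.g. 96 hourly observations
patches = to_patches(series, patch_len=16)   # 6 tokens instead of 96
print(patches.shape)                         # (6, 16)
```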

Think in Time Scales: Unlocking Multi-Resolution Forecasting

Real-world time series exhibit structure at multiple temporal resolutions: hourly fluctuations, daily patterns, weekly cycles, seasonal trends. Processing data at multiple scales simultaneously — similar to how computer vision models analyze images at different resolutions — can improve forecasting accuracy by 20-30%. The Multi-Scale Fine-Tuning (MSFT) framework offers a practical implementation: freeze the model backbone, add lightweight adapter layers for each temporal scale, and blend their forecasts.

This parameter-efficient approach — often training less than 1% of model parameters — consistently beats naive fine-tuning and even state-of-the-art models trained from scratch. While adding 2-3 scales increases computational overhead, the trade-off is often justified for critical applications. If your data exhibits clear periodicities — retail with weekly shopping patterns, energy with daily load curves — prioritize multi-scale adapters over full-model fine-tuning or larger models.
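
The sketch below illustrates the multi-scale adapter pattern in PyTorch: a frozen backbone, one lightweight adapter per temporal scale, and learned blending weights. It is a simplified rendering of the idea rather than the MSFT implementation; the toy backbone and the hourly/daily/weekly scales are assumptions.

```python
# Simplified sketch of multi-scale adapters: freeze a pre-trained backbone,
# attach one small adapter per temporal scale, and blend their forecasts.
import torch
import torch.nn as nn

class MultiScaleAdapters(nn.Module):
    def __init__(self, backbone, hidden_dim, horizon, scales=(1, 24, 168)):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # freeze the backbone
            p.requires_grad = False
        self.scales = scales
        self.adapters = nn.ModuleList([nn.Linear(hidden_dim, horizon) for _ in scales])
        self.blend = nn.Parameter(torch.zeros(len(scales)))  # learned mixing weights

    def pool(self, x, scale):
        # Downsample by averaging within windows of length `scale`.
        if scale == 1:
            return x
        trimmed = x[:, : (x.shape[1] // scale) * scale]
        return trimmed.reshape(x.shape[0], -1, scale).mean(dim=-1)

    def forward(self, x):
        # x: [batch, length]; backbone maps a series to a hidden vector.
        outs = []
        for scale, adapter in zip(self.scales, self.adapters):
            hidden = self.backbone(self.pool(x, scale))   # [batch, hidden_dim]
            outs.append(adapter(hidden))                  # [batch, horizon]
        weights = torch.softmax(self.blend, dim=0)
        return sum(w * o for w, o in zip(weights, outs))

class ToyBackbone(nn.Module):
    """Stand-in encoder: embeds each point and mean-pools over time."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(1, hidden_dim)
    def forward(self, x):                                 # x: [batch, length]
        return self.proj(x.unsqueeze(-1)).mean(dim=1)

model = MultiScaleAdapters(ToyBackbone(32), hidden_dim=32, horizon=24)
print(model(torch.randn(4, 336)).shape)                   # torch.Size([4, 24])
```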

Deployment Choice: API Convenience vs. Open-Source Control

Choosing a TSFM involves a key decision between commercial APIs and open-source models.

Commercial APIs like TimeGPT offer plug-and-play convenience with a pay-per-forecast model ($1-3/million points), which is ideal for low-volume tasks. However, they can be expensive at scale, limit customization, and create data sovereignty issues.

In contrast, open-source models like MOMENT provide full control and are cost-effective at high volumes but demand in-house MLOps expertise.

Teams should calculate break-even volumes: APIs are best for prototypes and low-throughput needs, while open-source is superior for production scale, customization, and compliance.
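
A back-of-the-envelope break-even calculation might look like the sketch below. The API rate echoes the $1-3 per million points range above; the infrastructure and MLOps figures are placeholders to replace with your own estimates.

```python
# Break-even sketch for API vs. self-hosted deployment. All figures except
# the API rate range quoted above are placeholders.
def breakeven_points_per_month(api_cost_per_million: float = 2.0,
                               monthly_infra_cost: float = 1_500.0,
                               monthly_mlops_cost: float = 4_000.0) -> float:
    """Monthly forecast volume (in points) above which self-hosting is cheaper."""
    fixed_self_hosted = monthly_infra_cost + monthly_mlops_cost
    return fixed_self_hosted / api_cost_per_million * 1_000_000

print(f"Break-even: {breakeven_points_per_month():,.0f} points/month")
# With these placeholder numbers, roughly 2.75 billion points/month: below
# that, the API is cheaper; above it, open-source self-hosting wins.
```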

Custom Solutions: When Standard Models Can’t Handle Your Data

For complex, messy data, a general-purpose TSFM may not be enough. Datadog’s Toto, designed for observability data, is a prime example. This data features extreme non-stationarity, frequent outliers, and thousands of interacting metrics — challenges standard models fail to address. Toto’s architecture was purpose-built with innovations that allow it to adapt to constantly shifting data patterns and accurately model extreme outliers.

The resulting model not only dominates its specialized benchmark but also generalizes well enough to achieve top ranks on general forecasting leaderboards. This demonstrates that designing a model to solve a difficult, domain-specific problem can lead to broadly beneficial architectural innovations.

Time Series in Context: Forecasting From Relational Data

Adding textual data like news headlines to forecasts is appealing, but for most businesses, a more powerful source of context is already available: their own structured, relational data. A product’s sales figures, for instance, do not exist in isolation. They are linked to customer demographics, marketing campaigns, and transaction histories stored across multiple database tables. While a standard TSFM might model the sales sequence directly, it would miss these critical relational signals.

An alternative approach treats this entire relational ecosystem as a graph. In this paradigm, entities like ‘customers’ and ‘products’ become nodes, and their interactions, such as ‘purchases’ or ‘reviews,’ become edges. Specialized architectures, particularly Graph Transformers, are designed to process this structural information alongside the temporal data. This approach bypasses the brittle and time-consuming process of manual feature engineering. Instead of data scientists creating aggregate features (e.g., ‘average sales last month’), the model’s attention mechanism learns directly from the raw, related events, such as every individual transaction or customer review.

It learns how signals propagate through the network — for example, how a marketing campaign targeting a specific customer segment influences sales of related products over time. More powerfully, it allows different time series to inform one another. To forecast sales for one product, the model can automatically borrow strength from the historical patterns of related products, connected through a product taxonomy, leading to far more accurate and robust predictions.
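
As a sketch of what this looks like in practice, the snippet below turns hypothetical customer, product, and purchase tables into a typed graph whose edges are raw, timestamped events; a Graph Transformer or GNN would then learn directly over this structure rather than over hand-built aggregates. The table and column names are invented for illustration.

```python
# Sketch: turning relational tables into a graph. Rows become typed nodes;
# foreign-key links (here, purchases) become timestamped edges.
import pandas as pd
import networkx as nx

customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["retail", "pro"]})
products  = pd.DataFrame({"product_id": [10, 11], "category": ["shoes", "socks"]})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_id":  [10, 11, 10],
    "ts": pd.to_datetime(["2024-01-03", "2024-01-09", "2024-01-05"]),
    "amount": [89.0, 9.0, 89.0],
})

G = nx.MultiDiGraph()
for _, row in customers.iterrows():
    G.add_node(("customer", row.customer_id), segment=row.segment)
for _, row in products.iterrows():
    G.add_node(("product", row.product_id), category=row.category)
for _, row in purchases.iterrows():           # each raw event becomes an edge
    G.add_edge(("customer", row.customer_id), ("product", row.product_id),
               ts=row.ts, amount=row.amount)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```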

The practical implication is clear: before investing in enriching forecasts with external, unstructured text, teams should first ensure they are fully leveraging the relational context already present in their data warehouses. For many enterprise applications, the most valuable predictive lift comes not from what is written about the business, but from the patterns embedded within the structure of the business itself.
