I tried to create a company data API and failed… for now

Oct 2, 2025

For the Ombú project, I need an API to get data from publicly traded companies. At least to start, I’m interested in balance sheet, income statement, and cash flow data. Currently, there are several alternatives, but all are paid, and the free versions available for development are very limited. That’s why I decided to roll up my sleeves and try building my own API.

Every publicly traded company is required to report certain financial documents to the regulatory authority. In the U.S., that institution is the SEC. Among many other documents, you’ll find the ones I need: balance sheet, income statement, and cash flow statement (from now on, I’ll call them the three financial statements). Companies publish these documents annually and quarterly, submit them to the SEC, and the SEC makes them public.

First Attempt: Obviously, with AI

The SEC provides an API called companyFacts. For each company, identified by its CIK number, they publish a JSON file containing tons of information, including all the concepts reported in the financial statements. This file is very useful, but it’s not the official document reported by the company. It’s a JSON created by the SEC with processed data.

The advantage is that you get everything in a single file, all in the same format. But it has several complications.

{
  "Revenues": {
    "label": "Revenues",
    "description": "Amount of revenue...",
    "units": {
      "USD": [
        {
          "start": "2015-09-27",
          "end": "2016-09-24",
          "val": 215639000000,
          "accn": "0000320193-18-000145",
          "fy": 2018,
          "fp": "FY",
          "form": "10-K",
          "filed": "2018-11-05",
          "frame": "CY2016"
        }
      ]
    }
  }
}

We get the label and description, but how do we know which statement it belongs to? We could assume that if it has start and end dates, it represents a period and therefore belongs to the income statement or cash flow statement. Concepts with only an end date might be instant data, belonging to the balance sheet. But that’s just an assumption.

Another problem: how do we know if the concepts are totals or components? In the example, “Revenues” is clearly a total, but most of the time, it’s not that straightforward. In fact, it’s rarely straightforward.

Finally, each concept reported mixes everything: all fiscal years, all forms (10-K, 10-Q/A, 8-K, etc.).

My first approach was to pull companyFacts for the S&P 500, store each concept into a Postgres database, and then process the data to determine whether concepts were instant or periodic, while filtering by form and fiscal year.

Once I had all concepts, I defined metrics to calculate: Current Assets, Non-current Assets, Current Liabilities, Non-current Liabilities, Revenue, Cost of Goods Sold, Operating Expenses, Operating Cash Flow, and more. Each metric belonged to a statement (e.g., Current Assets → balance sheet, Revenue → income statement).

Finally, I used the OpenAI API in batch mode to classify each concept into a metric. After lots of prompt tuning, I realized this wasn’t the right path. Many concepts were correctly classified, but many weren’t, especially unusual ones. Comparing calculated metrics with real company data, I found some matched, some didn’t, and most importantly, results weren’t consistent across companies.

In the end, I learned a lot about how the SEC builds companyFacts and how to use the OpenAI API (especially in batch mode), but it wasn’t enough to build a reliable API.

Second Attempt: Discovering Company Submissions

Since AI wasn’t working, I dug deeper. I discovered that classification wasn’t even necessary to calculate metrics.

It turns out financial reports are standardized, or at least that’s the intention. Companies publish them in XBRL or iXBRL format, which are basically XML files with reporting concepts.

Even more interesting: companies also publish auxiliary files called Linkbases, which turned out to be exactly what I needed. These include:

*_cal.xml (Calculation Linkbase): Defines summations (e.g., Assets = Current Assets + Non-current Assets).
*_pre.xml (Presentation Linkbase): Defines the hierarchical order of concepts, how they’re displayed.
*_lab.xml (Label Linkbase): Contains human-readable labels (e.g., “Assets, Current”).

With these, you can reconstruct all concepts and calculate all metrics. But it’s not that simple. Companies publish these files every time they release a report, so you end up with multiple versions. You then need to standardize the concepts to rebuild the historical data.

No doubt this is the right path. Here lies all the information. It’s just a matter of reconstructing the reports and building the historical data.

Taking the Easy Way (for Now)

At this point, I knew where to get the information to build my API exactly as I wanted. But the process was going to take a lot of time. So, I concluded the best option for now was to start with a third-party API.

I found several options and chose AlphaVantage, for two reasons:

The simplicity of the API, which returns historical values for the most relevant metrics.
A free API key.

Soon after using it, I realized the free version only allows 25 requests per day! So, for now, I’ll stick with it, but at some point, I’ll need to switch.

The Ombú project has two pillars: the data, and how it’s displayed. In this post, I’ve focused on the data. But for this project, I want a fast, iterative development cycle, with weekly production releases.

That’s why I decided to pause the development of my own API, use AlphaVantage for now, build a usable UI, and later return to building this API.

👉 You can try the Ombú UI here: https://ombu-website.vercel.app
👉 I created a demo here: Analysis of Apple Inc.

Joaquín Bozzalla | bozzallajoaquin@gmail.com