Introduction to Working with the SEC’s EDGAR API

December 3, 2024

I. Introduction

The SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system contains a wealth of information that accountants and analysts rely on in their day-to-day duties. Numerous tasks require EDGAR data, from equity analysis to ratio analysis to developing industry benchmarks.

While most accountants are familiar with the EDGAR search function, the EDGAR API provides a way to programmatically access this information for virtually any accounting or accounting-adjacent task. There are third-party tools that fix many shortcomings of the EDGAR API, but they either cost money, lock you into opinionated code, or both. Knowing how to access this data directly from the source is still valuable, depending on the use case.

As useful as the EDGAR API can be, its documentation is lacking. So, let’s break down the basics of the EDGAR API and write some functions to help us navigate it!

II. Finding Unique CIK Numbers and Accessing the Submissions Metadata

The API is described at https://www.sec.gov/search-filings/edgar-application-programming-interfaces, and the first endpoint listed is “https://data.sec.gov/submissions/CIK##########.json.”

Before calling this endpoint (or any other EDGAR endpoint, for that matter), we need a way to retrieve a company’s unique CIK (Central Index Key) number. The file we need to do this is a JSON file located at https://www.sec.gov/files/company_tickers.json. As shown below, this file is a simple JSON structure covering the 10,000 or so stock tickers that have SEC filing requirements. Each entry contains the company’s name, stock ticker, and CIK number without any leading zeros.

Pictured below: sample of the data from company_tickers.json

json structure with sec ticker information
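In text form, the file’s shape looks roughly like the sketch below (the field names match the real file, but the two entries shown are only illustrative examples):

company_tickers = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
    # ...roughly 10,000 more entries keyed by sequential index strings
}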

We can pull this file into our program using the Python requests library. The SEC asks that every request include a User-Agent header identifying the requester (typically a name and email address), so we’ll define that as well.

headers = {"User-Agent" : "[Your email address here]"}

def get_tickers_dict():
    company_tickers = requests.get(
        "https://www.sec.gov/files/company_tickers.json", headers=headers
        ).json()
    return company_tickers

With the tickers JSON file accessible, I wrote a very rudimentary search function to help navigate it. In short, this function takes a company name as input, searches for a match in the tickers file, and adds the matching company to a results list. If there is only one match in the results list, then our search was specific enough and the company’s CIK is returned, zero-padded to ten digits. For example, inputting “Johnson” results in three potential matching companies, and the search loop will continue. Entering “Johnson & Johnson” will return the CIK for JNJ.

def company_search(tickers):
    searching = True
    while searching:
        company_name = input("Provide a Company Name: ").upper()
        search_results = []
        for company in tickers.values():
            # Case-insensitive substring match against the company title
            if company_name in company["title"].upper():
                d = {"title": company["title"], "cik": str(company["cik_str"]).zfill(10)}
                if d not in search_results:
                    search_results.append(d)
        if len(search_results) == 1:
            # Exactly one match: return the ten-digit, zero-padded CIK
            return search_results[0]["cik"]
        elif len(search_results) > 1:
            print(f"Multiple possible matches, re-enter from the following list: {search_results}")
        else:
            print("No matches found.")
            return

After writing the result of company_search() to a variable, we can call the submissions endpoint, again using Python requests.

def get_submissions(cik):
    company_submissions = requests.get(
        f"https://data.sec.gov/submissions/CIK{cik}.json", headers=headers
        ).json()
    return company_submissions

The submissions endpoint returns a variety of information about a given company’s filing history, and what we do next depends on what information we’re looking for. For example, the API returns metadata related to all of the company’s recent SEC filings.

Pictured below: example of the top level json returned by the submissions endpoint.

json structure with sec submissions information

Pictured below: the filing metadata available under the filings key. Note that each value is another object containing a list. The index of these lists identifies each unique filing, such that each filing has one accessionNumber, filingDate, etc.

json structure with lists of sec filing metadata
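In rough outline, the submissions response is shaped like the sketch below (heavily abbreviated, with placeholder values; only the keys referenced in this article’s code are guaranteed):

company_submissions = {
    "cik": "...",
    "name": "...",
    "filings": {
        "recent": {
            # Parallel lists: index 0 of every list describes the same filing
            "accessionNumber": ["...", "..."],
            "filingDate": ["...", "..."],
            "form": ["10-K", "10-Q"],
        }
    },
}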

If we wanted a list of the unique accession numbers (an accession number is a unique identifier assigned to every accepted SEC submission) associated with only the company’s recent 10-K filings, then that function could look like this:

def get_form_list(all_submissions, form_type="10-K"):
    form_list = [accession_num for accession_num, form
                 in zip(all_submissions["filings"]["recent"]["accessionNumber"], 
                        all_submissions["filings"]["recent"]["form"])
                 if form == form_type]
    return form_list

This code works because each list of metadata accessed through the "recent" key is the same length, so the data at index 0 of every list relates to one unique filing, the data at index 1 to another unique filing, and so on.
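If it’s more convenient to work with one record per filing, a minimal sketch like the following (the helper name is mine) re-zips those parallel lists into per-filing dictionaries:

def recent_filings_as_dicts(all_submissions, keys=("accessionNumber", "filingDate", "form")):
    # Re-zip the parallel metadata lists so each dictionary describes one filing
    recent = all_submissions["filings"]["recent"]
    return [dict(zip(keys, values)) for values in zip(*(recent[k] for k in keys))]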

III. Company Concept Endpoint

The company concept endpoint returns a historical (and chronological) record of figures from a single company for a single “concept”. A concept consists of a taxonomy and a tag. In this context, a taxonomy is an XBRL filing standard, like the us-gaap taxonomy maintained by FASB, and a tag is a specific line item within that taxonomy. The endpoint example given in the EDGAR documentation is “https://data.sec.gov/api/xbrl/companyconcept/CIK##########/us-gaap/AccountsPayableCurrent.json”.

In this case, the request is for the “AccountsPayableCurrent” tag, which is a member of the us-gaap reporting taxonomy. Basically, any time a company has an accepted SEC filing, and that filing reports an amount of “AccountsPayableCurrent”, the amount will be saved along with other information such as the form type, period end date, filing accession number, etc.

One consistent challenge of working with the XBRL APIs is handling duplicate amounts. A 10-K or 10-Q filed in a given period will inevitably contain tables, schedules, calculations, and other disclosures that contain values from prior periods. All of these values from every accepted filing are maintained within the EDGAR system, and the API returns them all! Thus, when working with the EDGAR API it is not uncommon to see the same value repeated over and over again because that amount shows up in multiple filings.

Pictured below: a typical duplication pattern. Note that while each object's "val" and "end" are the same, the rest of the data, including accession number ("accn") is different.

json structure demonstrating the duplication of values and reporting end dates
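Sketched out with placeholder values (only the key names are meaningful here), the duplication pattern looks something like this:

duplicated_facts = [
    {"start": "2021-01-01", "end": "2021-12-31", "val": 1000, "accn": "<10-K accession>", "form": "10-K"},
    {"start": "2021-01-01", "end": "2021-12-31", "val": 1000, "accn": "<10-Q accession>", "form": "10-Q"},
]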

I discuss some suggested ways of handling duplicate amounts later in this article. For completeness, here’s a function to retrieve a company concept.

def get_concept(cik, tag):
    full_concept = requests.get(
        f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik}/us-gaap/{tag}.json",
        headers=headers
        ).json()
    return full_concept
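A quick usage sketch (the variable names are mine; a dollar-denominated tag like this one groups its facts under a "USD" unit key):

ap_concept = get_concept(cik, "AccountsPayableCurrent")
usd_facts = ap_concept["units"]["USD"]   # facts are grouped by unit of measure
print(len(usd_facts), usd_facts[0])      # total facts returned, and the first one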

IV. Company Facts Endpoint

Once you have a grasp on the company concept API, the company facts API is easy to understand. The company facts API returns every “company concept” for a given company in a single API call. In other words, the company facts API gives you a complete chronological history of every line item in the us-gaap taxonomy filed by a given company, with the concepts listed in alphabetical order.

Pictured below: json structure showing the result of a company facts API call

def get_facts(cik):
    facts = requests.get(
        f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json",
        headers=headers
        ).json()
    return facts
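A usage sketch (the nesting shown is how the facts response groups each taxonomy’s tags; the variable names are mine):

company_facts = get_facts(cik)
# Drill into a single concept; everything under "units" mirrors the company concept endpoint
net_income_facts = company_facts["facts"]["us-gaap"]["NetIncomeLoss"]["units"]["USD"]
print(len(net_income_facts))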

V. Handling Duplicate Values in Company Concepts

As mentioned previously, for better or worse, the EDGAR API returns numerous duplicate amounts. This is because the API generally returns the data for every single amount reported in every single filing. Inevitably, filings will contain time period comparisons, tables, etc. that re-use information already filed in prior periods. For example, a 10-K might contain a table summarizing quarterly performance. Quarterly figures were already captured by EDGAR when the 10-Qs were filed, and they are captured again when the 10-K is filed. Because these amounts are reported on unique filings, they are treated as unique entries by EDGAR. Thus, you’ll see objects with the same amount repeated in the value field, but with a different accession number and other attributes related to the form that the amounts appear on.

While there are numerous ways to filter out unwanted duplicate values, I came up with two different techniques depending on whether or not I wanted to preserve any duplicate data.

The first technique searches a concept for unique reporting end date (“end”) and value (“val”) combinations. This technique keeps all unique end and val combinations, meaning that it retains 10-Q income statement amounts that span multiple quarters (e.g., summary of results through Q2, Q3, etc.). This technique also retains entries in the event that the value for that specific reporting end date changed between filings for any reason.

The second technique relies on the “frame” attribute to filter out duplicates. Frame is used by the EDGAR frames API to “aggregate one fact for each reporting entity that is last filed that most closely fits the calendrical period requested.” Keeping only the entries that contain the frame attribute effectively eliminates all value duplications for a given reporting period end date. However, there is a cost. Filtering on frame means that only the most recently filed value for a given period is kept, and the rest of the duplications are eliminated. If there are two different values for the same filing period in multiple filings, perhaps because of a change in estimate in the later filing, you won’t see that history. Additionally, a frame is generated for each calendar quarter in isolation, which means that any income statement disclosures summarizing performance over multiple quarters (e.g., performance from Q1 through Q3) will be eliminated in the process as well.

Pictured below: a single company concept object containing the "frame" key. In this case, the frame's value of "CY2022Q4I" indicates that this object's "val" is the latest available data for the period most closely matching calendar Quarter 4, 2022.

json structure showing an object with the frame key included

Both techniques are included in the handle_duplicates() function below. The allow_date_duplicates parameter is False by default, which applies the “frame” filtering technique discussed above. If allow_date_duplicates is set to True, then all unique combinations of reporting end date and value are retained instead.

def handle_duplicates(concept, allow_date_duplicates=False):
    processed_concept = []
    for unit in concept["units"]:
        if allow_date_duplicates:
            # Keep every unique (end date, value) combination
            duplicate_tracker = []
            for fact in concept["units"][unit]:
                temp = (fact["end"], fact["val"])
                if temp not in duplicate_tracker:
                    processed_concept.append(fact)
                    duplicate_tracker.append(temp)
        else:
            # Keep only facts carrying a "frame", i.e. the latest value filed
            # for each calendrical period
            processed_concept.extend(
                fact for fact in concept["units"][unit] if "frame" in fact
            )
    return processed_concept

VI. Testing Duplicate Handling

Let’s confirm that duplication handling works as intended by running through an example using some of the functions discussed so far and also adding two testing functions.

We start by retrieving the SEC tickers JSON file and finding the unique CIK for the desired organization (I arbitrarily chose to use “Johnson & Johnson” for this article).

screenshot of interactive python terminal showing execution of get_tickers_dict() and company_search()
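In script form, the steps in that screenshot amount to the following (typing “Johnson & Johnson” at the prompt):

tickers = get_tickers_dict()
cik = company_search(tickers)   # enter "Johnson & Johnson" when prompted
print(cik)                      # the ten-digit, zero-padded CIK for JNJ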

Next, retrieve the JNJ concepts for the “NetIncomeLoss” and “AccountsPayableCurrent” tags (to test our handle_duplicates() function against both an income statement tag and a balance sheet tag).

screenshot of interactive python terminal showing execution of get_concept()
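Again in script form (the variable names are mine and are reused in the sketches that follow):

net_income_concept = get_concept(cik, "NetIncomeLoss")
ap_concept = get_concept(cik, "AccountsPayableCurrent")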

Then run the handle_duplicates() function against the “NetIncomeLoss” concept, first filtering out all duplicate values, then allowing for reporting date duplications if the date/value combination is unique.

screenshot of interactive python terminal showing execution of handle_duplicates()

At this point we can observe that when all unique date/value combinations are counted, handle_duplicates() returns 103 records. When only a single value is allowed for a given reporting end date, the same function returns 75 records. We naturally want to determine the cause of the 28 records that are included in one result but not the other, without manually comparing both sets.

screenshot of interactive python terminal showing execution of handle_duplicates()
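For reference, the calls behind those screenshots amount to the following (the record counts in the comments are from the run described above):

ni_frame_only = handle_duplicates(net_income_concept)                             # 75 records
ni_date_vals = handle_duplicates(net_income_concept, allow_date_duplicates=True)  # 103 records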

For income statement concepts, a likely source of multiple values for the same reporting end date is the various schedules and tables that summarize company performance through a given quarter, which aggregate the values of all previous quarters into a single amount (i.e., net income for Q1 through Q3). These types of duplications are intentionally retained when the allow_date_duplicates parameter is set to True.

The find_start_end_diff() function finds and counts these types of income statement duplications. As demonstrated below, 27 of the 28 differences in the number of records between our two duplication handling techniques are explained by this type of duplication.

from datetime import datetime

def find_start_end_diff(data):
    count = 0
    for fact in data:
        period_start = datetime.strptime(fact["start"], "%Y-%m-%d")
        period_end = datetime.strptime(fact["end"], "%Y-%m-%d")
        diff = period_end - period_start
        # Periods meaningfully longer than one quarter but shorter than a full
        # year are cumulative multi-quarter disclosures (e.g., Q1 through Q3)
        if 121 < diff.days < 335:
            count += 1
    return count

screenshot of interactive python terminal showing execution of find_start_end_diff()
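For reference, the call in that screenshot amounts to the following (run against the date-duplicates-allowed result; the count is from the run described above):

print(find_start_end_diff(ni_date_vals))   # 27 of the 28 extra records are multi-quarter disclosures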

…but that still leaves one remaining unexplained difference. Since we already counted all of the duplicate reporting end dates that contain data combining amounts from multiple quarters, this final record is likely to be an exact match on the time period. This can occur if a value is initially reported in one SEC filing and is later corrected to an updated amount in a subsequent filing. The following code looks for time period matches:

def count_time_period_dupl(data, type="income"):
    tracker = []
    count = 0
    for fact in data:
        if type == "income":
            temp = {fact["start"], fact["end"]}
        else:
            temp = fact["end"]
        if temp not in tracker:
            tracker.append(temp)
        else:
            count += 1
    return count
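A usage sketch, again run against the date-duplicates-allowed result:

print(count_time_period_dupl(ni_date_vals))   # counts exact time-period matches with differing values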

And it correctly counts the single remaining duplicate record.

screenshot of interactive python terminal showing execution of count_time_period_dupl()

Below, we can repeat the previous steps on the “AccountsPayableCurrent” concept. Note that in this case, the number of records returned by the two duplication handling techniques differs by only one. Balance sheet concepts show a snapshot in time, so duplicates will typically only be of the “same end date, different value” variety (due to corrections or updates to the value made across multiple filings, as discussed above).

screenshot of interactive python terminal showing execution of tests
screenshot of interactive python terminal showing execution of tests
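For reference, the same steps for “AccountsPayableCurrent” in script form (find_start_end_diff() isn’t relevant here, since balance sheet amounts are reported as of a single point in time):

ap_frame_only = handle_duplicates(ap_concept)
ap_date_vals = handle_duplicates(ap_concept, allow_date_duplicates=True)
print(len(ap_date_vals) - len(ap_frame_only))                 # differs by only 1, as noted above
print(count_time_period_dupl(ap_date_vals, type="balance"))   # checks for repeated end dates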

VII. Concluding Thoughts

Hopefully this article helps someone out there working with the EDGAR API for the first time. Keep in mind that this is just an introduction, and there are certainly other ways to handle the EDGAR endpoints depending on how you're using the data. I still thought the article was worth writing, as I’ve seen several other guides skip over the issue of duplicate amounts entirely. Running analytics on EDGAR data without first deciding how you intend to handle duplicate data (or without even being aware of it!) will lead to incorrect conclusions. I also purposely wrote this article without using pandas, mainly because it wasn't necessary just to call the endpoints and clean up duplicates; definitely consider pandas for further reshaping and analysis.