WiL - World Innovation Lab: Unleashing the Power of Data in VC #1: Addressing the Duplicates Problem

In venture, one challenge you encounter while working with startup data is the lack of a shared universally unique identifier (UUID), a 128-bit label used to uniquely identify records in databases. Whether you’re using automatically generated record IDs from your CRM or creating your own in a cloud database, IDs are typically tied to a specific startup’s name and, if available, its domain. But what happens if a startup rebrands? It’s not uncommon to see names change, domains redirect to new landing pages, or old domains stop working altogether.

The Duplicates Problem

Take, for example, the customer conversation intelligence platform Echo AI. Echo AI now uses the domain echoai.com but was formerly known as Pathlight and used the domain pathlight.com. Both domains are active, with pathlight.com redirecting users to echoai.com. Another example is Cleo, the benefits platform for adult loved ones, which was formerly known as Lucy and operated under startwithlucy.com. That domain is no longer reachable, and Cleo now uses hicleo.com instead.

Depending upon your CRM or database, you may end up with duplicates as new records are added or created. The presence of these duplicates poses a few problems:

Investors may be working asynchronously and unknowingly doing redundant work. In the example with Echo AI, perhaps one investor initially screened Pathlight before its rebrand to Echo AI, inputting some notes based on their research or email exchanges with the founders. What happens if another investor starts looking at Echo AI down the road but isn’t aware of the prior screening? More time and energy might be spent on work that’s already been done but lives under separate records.
Information gaps often result from duplicate records. Since there’s no industry-standard UUID, company name and domain are crucial for getting data from third-party providers. If records fail to match, any data you’re pulling via API or receiving via cloud delivery may be missing or entirely inaccurate. This might happen if you don’t catch a name change or rebrand but your data provider does. For example, you’d have Pathlight in your system and the vendor would have Echo AI. Whether the missing data was supposed to be used in a trigger to alert investors of some specified change or would have supported modeling and prioritization of deals, these information gaps can negatively impact the investment process.

Our Solution

One of the solutions we’ve developed for this challenge is to use code to check domains on a regular cadence to see where they land.

Using Python’s requests package and a list of different possible user agents (the strings a browser sends to identify itself to a website), we can see where the listed domain for a company lands, log this information, and review redirects and errors. Depending on the number of records to check, running this code in parallel can speed up the job, since each web request has to open a connection, send data, and wait for a response. Rather than waiting for each request to complete or fail before starting the next, having several workers handle these requests results in a faster run time, freeing up resources for other jobs and giving you insights into possible duplicates as soon as possible.
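The approach described above might look roughly like the sketch below. It is a minimal illustration, not our production code: the user-agent strings, the function names (normalize_host, fetch_final_url, check_domain, check_domains), the worker count, and the choice to try HTTPS only are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Hypothetical user-agent strings; swap in whatever list you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def normalize_host(url_or_domain):
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    value = url_or_domain.strip().lower()
    if "://" not in value:
        value = "http://" + value
    host = urlparse(value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def fetch_final_url(domain, timeout=10):
    """Follow redirects from the domain and return the URL it lands on."""
    import requests  # third-party; imported here so the helpers work without it

    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # Assumption: HTTPS only. A fuller version might fall back to HTTP.
    response = requests.get(
        f"https://{domain}", headers=headers, timeout=timeout, allow_redirects=True
    )
    return response.url


def check_domain(domain, fetch=fetch_final_url):
    """Return one log row: where the domain landed and how to classify it."""
    try:
        final_host = normalize_host(fetch(domain))
    except Exception as exc:  # timeouts, DNS failures, TLS errors, and so on
        return {"domain": domain, "final_host": None,
                "status": "error", "detail": type(exc).__name__}
    status = "ok" if final_host == normalize_host(domain) else "redirect"
    return {"domain": domain, "final_host": final_host,
            "status": status, "detail": ""}


def check_domains(domains, fetch=fetch_final_url, max_workers=8):
    """Check many domains in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda d: check_domain(d, fetch), domains))
```

Calling check_domains with a list of the domains in your system returns one row per domain. Rows marked "redirect" or "error" are the ones worth reviewing as possible rebrands or dead domains. Threads suit this job because each worker spends most of its time waiting on the network rather than computing.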

We also merge any duplicate records and keep a list of past and present domains in addition to other aliases and names for the company. This solution preserves as much data as possible. Of course, this exercise can get more complicated when dealing with companies that have mergers and acquisitions, buyouts, and other market activities without obvious duplicates; what you do in these situations requires more thought in order to maintain data for multiple companies.
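As a rough sketch of what such a merge could look like, the example below folds a stale record into the current one while keeping the old name, old domain, and notes. The CompanyRecord shape, its field names, the merge_records helper, and the sample notes are hypothetical and only for illustration; a real CRM record has many more fields and its own merge rules.

```python
from dataclasses import dataclass, field


@dataclass
class CompanyRecord:
    """Hypothetical record shape; a real CRM record has many more fields."""
    name: str
    domain: str
    aliases: list = field(default_factory=list)
    past_domains: list = field(default_factory=list)
    notes: list = field(default_factory=list)


def _append_unique(target, values):
    """Append each non-empty value that is not already in the target list."""
    for value in values:
        if value and value not in target:
            target.append(value)


def merge_records(current, stale):
    """Fold a stale duplicate into the current record without dropping data."""
    merged = CompanyRecord(
        name=current.name,
        domain=current.domain,
        aliases=list(current.aliases),
        past_domains=list(current.past_domains),
        notes=list(current.notes),
    )
    _append_unique(merged.aliases, [stale.name, *stale.aliases])
    _append_unique(merged.past_domains, [stale.domain, *stale.past_domains])
    _append_unique(merged.notes, stale.notes)
    # The current name and domain should not also be listed as old ones.
    merged.aliases = [a for a in merged.aliases if a != merged.name]
    merged.past_domains = [d for d in merged.past_domains if d != merged.domain]
    return merged


# Illustrative usage with made-up notes.
echo = CompanyRecord("Echo AI", "echoai.com", notes=["Second look requested"])
pathlight = CompanyRecord("Pathlight", "pathlight.com", notes=["Screened before rebrand"])
merged = merge_records(echo, pathlight)
```

Here the merged record keeps "Echo AI" and echoai.com as the current name and domain, while "Pathlight" and pathlight.com move to the alias and past-domain lists, so a later lookup under either name can find the same record.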

For the majority of cases, we’ve found and implemented an approach that can clean up your system and support your team’s investment process.

Part 2 of Unleashing the Power of Data in VC is Breaking the Outreach Bottleneck. Stay tuned for more insights from Max on data in VC on the WiL blog! You can also find Max on LinkedIn.
