Navigating compliance when extracting web scraped alternative data for finance

When it comes to using web data as alternative data for investment decision making, one topic rules them all: compliance.

Regulatory compliance is such a pervasive issue in alternative data for finance that it is often the number one barrier to investment firms using web data in their decision-making processes. Regulatory ambiguity doesn’t help matters.

In this article, we’re going to break down the regulatory compliance issues facing investment firms, hedge funds, and other financial institutions looking to use web data in their investment decision-making processes. We’ll also discuss some of the best practices we’ve implemented with our clients to help them extract web data in a compliant manner.

Disclaimer: The recommendations and commentary included in this guide do not constitute legal advice. The information contained in this article was garnered through Zyte's experience working with our financial clients, and the independent research of our legal team. If you need specific legal advice regarding your use of web data as alternative data then you should consult a lawyer.

Regulatory compliance & risk

When it comes to regulatory compliance and alternative data, everything revolves around risk.

As the usage of alternative data continues to grow, regulatory interest in it is expected to increase. This significantly raises the compliance risks for firms that have not fully evaluated and managed the risks associated with acquiring alternative data and incorporating it into their investment decision-making processes.

Generally speaking, the risks associated with alternative data can be broken into four categories:

Exclusivity & Insider Trading
Privacy Violations
Copyright Infringement
Data Acquisition

The various types of alternative data all have different levels of exposure to each of these risk categories. However, for the purposes of this article, we’re going to focus exclusively on the compliance areas associated with web scraped data that any firm considering using web data should be aware of.

Compliance for web data

Web data extraction compliance is an evolving area of law and can therefore pose some challenges when determining the legality of your scraping project. While there have been many cases related to web scraping in various jurisdictions, the law remains unsettled, and many of the case holdings are fact-specific rather than establishing an overarching rule for web scraping. As such, clear guidance is difficult to provide to the industry as a whole, and particularly with regard to alternative financial data.

It’s this ambiguity, coupled with the scarcity of case law relating to asset management companies, that makes it challenging to precisely identify and mitigate the risks of using web data in an investment decision-making process. This is compounded by the fact that many web scraping cases turn on aspects of web scraping that are not highly applicable to financial use cases, such as extracting web data to compete with the target website or redistributing the data in a manner that negatively impacts the market position of the data owner.

In general, investors are seeking to gather web data to gain a better understanding of the wider trends impacting a market, not to redistribute the data or compete with its original owner. As a result, much of the current legal precedent for web scraping is of little relevance to alternative data for finance, requiring financial firms to dig a bit deeper into the case law for cases relevant to their use case. Of particular interest to financial firms would be cases involving breach of contract relating to terms of service, the Computer Fraud and Abuse Act (CFAA), trespass, extraction of personal data, and overburdening the infrastructure of the target website during the data extraction process. However, as mentioned above, while many cases exist relating to these causes of action, no clear standard has emerged across the board.

Despite this legal ambiguity, financial institutions have widely adopted web data into their investment decision making processes. In the absence of specific legal guidance, the industry as a whole has managed to come to a general consensus on the specific compliance issues associated with web scraping and how they should be best dealt with.

In the following sections, we will discuss these conclusions for each of the four main risk categories: exclusivity & insider trading, privacy violations, copyright infringement, and data acquisition.

Exclusivity & insider trading

One of the larger risks associated with the use of data extracted from the web for investment decision making is the risk of obtaining insider information and subsequently buying or selling shares based on that information, aka insider trading. Insider trading is the illegal use of non-public material information for profit.

Non-public material information is information that is not available to the general public and that a reasonable investor would consider important when making an investment decision. So, if the data you extract is not generally available to the public, your web data extraction could result in insider trading violations, which no financial firm wants to get involved with.

The key to avoiding obtaining insider information by way of web scraping is to ensure that all the data scraped is information available to the general public. Typically, if the information is available on a public website that any person can go to and see, you are on safe footing. The risk of obtaining insider information increases when the information is not public – for example, the information behind a login or paywall.

With information behind a login or paywall, there are clear restrictions over who can access the data, so there is a potential risk of obtaining insider information. However, logging in will not always present such a risk, as some sites allow anyone to register and log in, thereby mitigating the risk of obtaining inside information. Thus, if you plan to scrape data behind a login, it’s imperative to ensure that you understand the nature of the data and whether it really is available to the general public.

Information behind a paywall, on the other hand, is not generally available to the public, as it requires paying for access. Unless you can verify that the information is also generally available to the public elsewhere, you should avoid scraping information from behind a paywall.

Another factor to consider when logging in is that you are typically expressly agreeing to terms and conditions and/or privacy policies when you register to log in to a site. It is important to read these terms very closely because many courts have found that by explicitly agreeing to those terms you are entering into a binding contract with the website. If the terms state that you may not scrape the site or use automated means to extract data from the site, your web scraping project may give rise not only to insider information issues but also to breach of contract claims. As such, always ensure that you are working with a scraping provider that understands when and how terms must be reviewed.

As an aside: at Zyte our Legal Team looks at projects that require agreeing to terms and conditions on a case-by-case basis to ensure our clients’ web scraping is legally compliant.

If you proceed with scraping data behind a login or paywall that is found to be non-public material information, you may be opening yourself up to prosecution for insider trading. As a result, it is imperative to take your review of the data extracted and the means for extraction very seriously.

If reviewing website terms and understanding the potential legal implications of agreeing to those terms is not a core competence of your company, we strongly recommend working with a data provider that has built-in compliance processes to manage this type of review.

Privacy violations

In recent years, personal data protection regulations have really come to the fore. Regulators are increasingly clamping down on companies collecting, storing, and processing the personal information of citizens who haven’t given their explicit consent to do so.

Regulations such as the EU General Data Protection Regulation (GDPR) affect all companies including financial institutions and can lead to hefty fines. Under laws like GDPR, you typically need a lawful basis to process personal data, which can include consent, contractual agreement, or legitimate interest. Absent one of these lawful bases, you should not be scraping personal data. However, this analysis will vary from region to region, so please ensure you are familiar with the data protection laws in the region in which you operate before scraping personal data.

Additionally, many personal data laws have notification and deletion requirements, so if you’re unable to notify the individuals whose data you scrape and provide them with adequate deletion and other rights, you could be in violation of the relevant data protection laws in your region.

For the reasons above, we always recommend either not scraping any personal data or anonymizing personal data where feasible. Web crawlers should be designed to extract only the specific financial data that is valuable for the investment decision process, and the resulting dataset should then be verified during the QA process to ensure it contains no personal data.

If you’d like an insider look at the four layers of Zyte’s QA process and how you can build your own, then be sure to check out our Data Quality Assurance on-demand webinar.
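As a rough illustration of the QA verification step, the sketch below scans scraped records for values that look like email addresses or phone numbers. The regular expressions and the record layout are assumptions made for this example, not Zyte's actual QA process; a production check would need broader, locale-aware detection and human review of anything flagged.

```python
import re

# Illustrative patterns only (assumptions for this sketch). They will miss
# many kinds of personal data and may flag harmless values such as timestamps.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_personal_data(records):
    """Return (record_index, field, kind) for each value that looks personal."""
    findings = []
    for index, record in enumerate(records):
        for field, value in record.items():
            text = str(value)
            if EMAIL_RE.search(text):
                findings.append((index, field, "email"))
            if PHONE_RE.search(text):
                findings.append((index, field, "phone"))
    return findings


# Hypothetical scraped records: the second one leaks contact details.
records = [
    {"product": "Widget", "price": "19.99"},
    {"product": "Gadget", "seller_note": "contact jane@example.com"},
    {"product": "Gizmo", "seller_note": "call +1 (555) 010-9999"},
]
findings = find_personal_data(records)
```

Any finding would typically block the dataset from reaching investment teams until the flagged field is removed or reviewed.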

If your crawlers can’t be built to completely avoid personal data, the next best option is anonymization. In our experience, clients seeking alternative financial data want to understand market and consumer trends, so the specific personal data that might be associated with those trends is not required. In these cases, anonymization is your best friend. If you anonymize personal data so that individuals can no longer be identified, the data generally falls outside the remit of laws like GDPR, which greatly reduces your compliance burden.

Furthermore, the possession of personal data also poses a headline risk for firms. A firm found to be using personal data to make investment decisions would suffer reputational damage from the negative press, on top of the hefty fines regulators can issue for illegally holding or using such data.
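One way to picture this is to drop the personal fields from each record and keep only aggregated trend data. The sketch below does that for a hypothetical set of product reviews; the field names are assumptions for the example, and whether a given transformation counts as anonymization under a particular law is a question for legal counsel, not something the code can settle.

```python
from collections import defaultdict

# Hypothetical names of fields that carry personal data in the scraped records.
PERSONAL_FIELDS = {"username", "email", "profile_url"}


def strip_personal_fields(record, personal_fields=PERSONAL_FIELDS):
    """Return a copy of the record without any of the personal fields."""
    return {k: v for k, v in record.items() if k not in personal_fields}


def aggregate_ratings(records):
    """Reduce individual reviews to a per-product count and average rating."""
    totals = defaultdict(lambda: [0, 0.0])  # product -> [count, rating sum]
    for record in records:
        entry = totals[record["product"]]
        entry[0] += 1
        entry[1] += record["rating"]
    return {
        product: {"reviews": count, "avg_rating": total / count}
        for product, (count, total) in totals.items()
    }


reviews = [
    {"product": "A", "rating": 4, "username": "user_one"},
    {"product": "A", "rating": 2, "username": "user_two"},
    {"product": "B", "rating": 5, "username": "user_three"},
]
cleaned = [strip_personal_fields(r) for r in reviews]
trends = aggregate_ratings(cleaned)
```

Aggregating after stripping means the stored dataset holds only the trend (review counts and average ratings), which is usually all an investment team needs.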

One final point to note is that web data can often be purchased in the form of off-the-shelf datasets. In these cases, asset managers will need to put internal processes in place to segregate this data from the rest of the organization and then identify and remove any personal data present before it is permitted to be used by investment teams. Working with a scraping company that already has personal data protection processes in place can significantly decrease the burden when obtaining off-the-shelf datasets, as you can be assured that the data you’re receiving has been obtained in a compliant manner.

Copyright infringement

The issue of copyright is also of concern in the case of alternative web data. Just because web data is publicly available on the internet doesn’t mean that anyone can extract and store the data.

In some cases, the data itself might be copyrighted, and depending on how/what data you extract you could be found to have infringed the owner’s copyright, creating additional risks for the users of this alternative data.

Typical types of web data that may be protected by copyright are:

Articles
Databases
Videos
Pictures
Stories
Music

For the purposes of web data, the most relevant data types are articles and databases, as these often are the best sources of useful data for investment decision making.

Copyright issues are sometimes surmountable if there is a valid exception to copyright within your use case. Some methods to achieve this are:

Fair use: For example, instead of extracting all the data from an article, you extract short snippets, which might constitute fair use.
Facts: Facts are typically not covered by copyright laws, so if firms limit what is being scraped to purely factual matters (e.g. product names, prices, etc.), it may be acceptable to scrape without violating copyright.

The other copyright risk is if the website can claim database rights. A database is an organized collection of materials that permits a user to search for and access individual pieces of information contained within the materials. Database rights can create additional risks for the use of web data in investment decision making if the data hasn’t been extracted in a compliant manner.

In the US, a database is protected by copyright when the selection or arrangement is original and creative. Copyright only protects the selection and organization of the data, not the data itself.

In the EU, databases are protected under the Database Directive which offers much broader protection for EU databases. The Directive has two purposes: (1) protect IP, like in the US, and (2) protect the work and risk in creating the database.

If you believe a data source might fall under database rights, then decision-makers should always consult with their legal team before scraping the data and ensure they:

only scrape some of the available data;
only scrape the data itself and not replicate the organization of that data; and
try to limit the data scraped to factual or other non-copyrighted data.

Both copyright and database rights pose additional compliance risks for financial institutions considering using web data in their investment decision-making processes. However, by following the simple guidelines outlined above, these risks can be greatly reduced or negated. Because this type of analysis needs to be completed on a case-by-case basis, it is always best to work with a web scraping company like Zyte that has internal copyright compliance policies built into its workflows.

Data acquisition

As we’ve already touched on, the data acquisition process itself presents some unique compliance risks for financial institutions, such as extracting data behind logins, personal data, and copyright.

However, there are some additional risks that legal counsel needs to take into account when assessing the compliance risks of an off-the-shelf dataset or when developing their firm's own internal data extraction capabilities.

Typically, these are the more traditional risks associated with web scraping, but they can weigh more heavily on financial institutions due to the high-profile nature of their business and the insider trading risks they have to manage.

Here are some of the most important questions every legal counsel should be asking themselves and the data provider when evaluating a web data feed:

Did the data/website owner issue (or try to lodge) an abuse report during the web scraping process?
Was there a cease and desist letter issued? Was it responded to and addressed in an adequate manner?
Did the crawlers put excessive pressure on the website's servers?
Did the extraction process materially harm the website or the company’s underlying business?
Was there any sensitive data, like medical or personal banking data, involved in the scraping?

If the answer to any of these questions isn’t a definite “No” then by using this data you are exposing your firm to additional legal risks.
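The rule above treats anything other than a definite "No" as a risk, including questions the provider could not or did not answer. A minimal sketch of that logic follows; the question keys are hypothetical shorthand for the checklist items, and the follow-up question about whether a cease and desist letter was adequately addressed is left to counsel rather than encoded here.

```python
# Hypothetical keys summarising the due diligence questions listed above.
QUESTIONS = {
    "abuse_report": "Did the website owner issue or try to lodge an abuse report?",
    "cease_and_desist": "Was a cease and desist letter issued?",
    "server_pressure": "Did the crawlers put excessive pressure on the servers?",
    "material_harm": "Did the extraction materially harm the website or business?",
    "sensitive_data": "Was sensitive data such as medical or banking data involved?",
}


def open_risks(answers):
    """Return the keys of every question not answered with a definite 'no'."""
    flagged = []
    for key in QUESTIONS:
        # A missing answer counts as "unknown", which is treated as a risk.
        answer = str(answers.get(key, "unknown")).strip().lower()
        if answer != "no":
            flagged.append(key)
    return flagged


# Example: the provider was unsure about server load and skipped one question.
provider_answers = {
    "abuse_report": "No",
    "cease_and_desist": "no",
    "server_pressure": "unknown",
    "material_harm": "no",
}
risks = open_risks(provider_answers)
```

An empty result means every question received a definite "No"; anything else is a prompt for further review before the data feed is used.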

When the firm is in full control of the data extraction process, they can take steps to ensure these issues never arise as they all fall under general web scraping best practices. However, if you are considering using an off-the-shelf dataset from a third-party provider it can often be much harder to ascertain if the data was extracted safely using web scraping best practices.
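One such best practice is limiting how often a crawler hits the same website, so it does not put excessive pressure on the site's servers. The sketch below enforces a minimum delay between requests to the same domain. The class name and the two-second delay are assumptions chosen for illustration; an appropriate rate depends on the target site.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between consecutive requests to one domain."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock      # injectable so the throttle can be tested
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Block until the domain may be requested again; return seconds waited."""
        domain = urlparse(url).netloc
        waited = 0.0
        last = self._last_request.get(domain)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[domain] = self._clock()
        return waited


# Usage: call wait() immediately before each request the crawler makes.
throttle = DomainThrottle(min_delay_seconds=2.0)
first_wait = throttle.wait("https://example.com/products")
```

Tracking the delay per domain keeps a crawl of many sites moving while still spacing out the requests that any single site receives.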

In a lot of cases, the data provider operates more as a data marketplace, collecting and organizing a wide variety of alternative data types and often outsourcing part or all of its web data extraction to a third party. As a result, the level of oversight of a large historical dataset can be quite patchy.

The only way to fully mitigate these risks is by directly controlling the data extraction process, either by moving the data extraction process in-house or partnering closely with a data extraction provider who has experience extracting alternative data for financial use cases. But beware, even some of the larger web data extraction companies have few or no compliance processes, so always ask about their compliance and best practices upfront.

Your project’s compliance requirements

As we have seen, the compliance requirements for using web data in the investment decision-making process can be quite demanding. However, the facts speak for themselves: web data is the most prevalent form of alternative data and can provide asset managers with a distinct informational advantage if used correctly.

If the guidelines outlined in this article are followed, then there is no reason why web data can’t contribute significant value to your firm without exposing it to undue compliance and regulatory risks.

At Zyte we have extensive experience developing data extraction solutions that overcome these challenges and mitigate the compliance risks associated with using web data in investment decision making.

Our legal and engineering teams work with clients to evaluate the compliance associated with web scraping projects and develop data extraction solutions that enable them to reliably extract the data they need.

If you need to start or scale your web data acquisition for your alternative data needs, then our Solution architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.

At Zyte we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.
