How data science fights modern insider threats

Insider threats are the biggest cybersecurity threats to firms, organizations and government agencies. This is something you hear a lot at security conference keynotes and read about in data breach reports, white papers and surveys. These insider threats are also becoming more frequent and increasingly difficult to detect and prevent.

This seemingly unstoppable growth highlights the shortcomings of current solutions and calls for new defensive technologies to detect and stop the digital daggers aimed at our backs.

Data science — the application of mathematics, big data analytics and machine learning to extract knowledge and detect patterns — is an emergent, advanced technology area that is proving its effectiveness in the realm of cybersecurity, including fighting insider threats. Here’s how it succeeds where legacy solutions fail.

The need to focus on user behavior

The wide adoption of cloud services and mobile technology in companies has transformed IT infrastructures considerably.

With the physical boundaries of corporate networks and digital assets no longer as clearly defined as they once were, the focus in fighting insider threats needs to shift toward protecting user accounts. “Now that the traditional security perimeter has been erased by mobile and cloud computing, identities have become both an attack vector and security perimeter,” says Tom Clare, VP of marketing at cybersecurity startup Gurucul.

“What has changed recently is the fact that control of user accounts has become far more valuable than control of devices,” says Jarno Niemelä, lead researcher at F-Secure Labs. “Years back, we were fighting against keeping computers clean from infection just to keep the computers clean. Nowadays, we are protecting computers just to be able to protect the user accounts that are on the computer.”

Organizations try hard to protect user identities by adopting different security solutions and training employees on the basics of cybersecurity, but it’s not enough.

“Good data hygiene is critical, but it is not enough,” says Stephan Jou, CTO at Interset. “A negligent employee is unlikely to change regardless of training, and a third-party attacker often can operate outside employee-focused processes. More importantly, the insider stealing for espionage is motivated to break rules.”

The truth is that credential theft does happen, and it happens a lot. In fact, a Verizon 2015 data breach report found that the majority of confirmed security incidents occur as a result of compromised user accounts. Massive lists of user credentials and passwords are being sold on the Dark Web at low prices, and, for a small fee, anyone can obtain access to all sorts of enterprise networks and cloud services, and impersonate legitimate users.

Therefore, fighting insider attacks hinges on detecting anomalous user behavior. But this again presents its own set of challenges, because defining normal and malicious behavior is not an exact science and involves a lot of intricacies.

Traditional security defenses rely on static rules and alerts on user activities to define and identify indicators of compromise (IoCs). But when applied to hundreds or thousands of users, this model generates a flood of noise, and security teams waste time sorting through piles of unimportant events, most of them false positives. Meanwhile, actions don’t necessarily reveal intent, and savvy attackers can cloak their malicious activities by keeping them within the defined set of rules.

Data science can help move away from static models toward dynamic ones that define normal user behavior based on identities, roles and working circumstances. This approach can reduce false positives and highlight behavior that truly indicates malicious activity.

Cybersecurity firms are increasingly leveraging this technology to deal with insider threats.

Analyzing user behavior through machine learning

Gurucul’s Risk Analytics security platform combines machine learning models with big data to understand normal baselines of behavior and uncover anomalies, and to provide visibility that spans identities, accounts, access and activity. “This behavioral analytics approach, sometimes called user behavior analytics or UBA, can detect excess access permissions and activity, define roles and detect unknown threats,” says Gurucul’s Clare.

Gurucul’s Risk Analytics also gathers and monitors identity-based data and activity from both on-premises and cloud environments. Its machine learning algorithms, including self-learning and trained behavioral profile algorithms, examine every new transaction and assign it a risk score. Clustering and outlier detection make suspicious behavior stand out from benign activity.

One of the features of Gurucul is its concept of dynamic peer groups. The system automatically groups users based on the types of activities they typically perform and the types of identities and privileges they hold. This allows for tighter clustering of behavior and a better chance of highlighting outlier activities in behavior patterns.

So if a sales employee is downloading large amounts of company data for the purpose of later surrendering it to a competitor, they will stand out and be marked for investigation even if they have legitimate access to the information, because their behavior deviates from that of their peers.
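
To make the peer-group idea concrete, here is a minimal sketch in Python. It is not Gurucul’s algorithm: the peer groups are supplied by hand rather than learned, the function name and the sample figures are invented for illustration, and the outlier test is a standard modified z-score (median and median absolute deviation) chosen for simplicity.

```python
from statistics import median

def peer_outliers(volumes_by_user, peer_group_of, cutoff=3.5):
    """Return users whose volume is far above their own peer group's median.

    Hypothetical helper: groups are given, not learned, and the cutoff of 3.5
    is a common rule of thumb for the modified z-score, not a vendor setting.
    """
    groups = {}
    for user, volume in volumes_by_user.items():
        groups.setdefault(peer_group_of[user], []).append((user, volume))

    flagged = []
    for members in groups.values():
        values = [v for _, v in members]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            continue  # no spread in this group, nothing to compare against
        for user, v in members:
            score = 0.6745 * (v - med) / mad  # modified z-score
            if score > cutoff:
                flagged.append(user)
    return sorted(flagged)

# Made-up daily download volumes in MB.
volumes = {"alice": 120, "bob": 150, "carol": 135, "dave": 9000,
           "erin": 8000, "frank": 9500, "grace": 8700}
groups = {"alice": "sales", "bob": "sales", "carol": "sales", "dave": "sales",
          "erin": "engineering", "frank": "engineering", "grace": "engineering"}
print(peer_outliers(volumes, groups))  # ['dave']
```

Note that frank downloads more than dave in absolute terms but is not flagged, because that volume is ordinary among his engineering peers. A single global limit could not make that distinction.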

The math behind insider threat detection

Interset is another cybersecurity platform that relies on semi-supervised machine learning and advanced behavioral analytics to examine and correlate scattered bits of data in order to find insider threats. Its platform analyzes data from multiple sources related to the movement of data across or within a network, while also gathering information about the entities involved, which include users, endpoints and applications.

The math behind Interset’s data science model is based on three key ideas. First, it replaces traditional boolean alerts with probabilistic models or risk factors. Models that emit probabilities are more effective than true/false alerts and allow the use of math to combine multiple pieces of evidence across different data sets to define the likelihood of a user account having been compromised or engaged in illicit activities.
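
One common way to combine probabilistic evidence, shown here as a sketch and not as Interset’s actual model, is to add up log-odds under the assumption that the signals are independent. The prior and the likelihood ratios below are invented numbers; each ratio says how much more likely a signal is for a compromised account than for a healthy one.

```python
import math

def combine_evidence(prior, likelihood_ratios):
    """Update a prior probability of compromise with independent signals.

    Assumes independence between signals (a naive-Bayes style simplification).
    """
    log_odds = math.log(prior / (1 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical signals: odd login hour (5x), unusual file share (8x), new device (3x).
risk = combine_evidence(prior=0.01, likelihood_ratios=[5, 8, 3])
print(round(risk, 3))  # 0.548
```

No single signal here would justify an alert on its own, yet together they move a 1 percent prior to roughly 55 percent. A set of independent true/false rules cannot express that kind of accumulation.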

Second, it uses machine learning to define dynamic thresholds for each actor based on gathered data, a much more flexible model than globally applied rules such as “how many megabytes of attachments are allowed.” The “mathematical fingerprint” that results from the analysis of user-generated data makes it much easier to identify anomalous behavior.
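
A per-actor threshold can be illustrated with something as simple as each user’s own mean and standard deviation. This is a deliberately basic stand-in for the machine learning the article describes; the function names, the multiplier of three standard deviations and the sample histories are assumptions made for the example.

```python
from statistics import mean, stdev

def personal_threshold(history, k=3.0):
    """Threshold derived from one user's own history: mean + k standard deviations."""
    return mean(history) + k * stdev(history)

def is_anomalous(history, todays_value, k=3.0):
    return todays_value > personal_threshold(history, k)

# Made-up daily attachment volumes in MB.
designer = [400, 450, 380, 420, 410]
accountant = [5, 8, 6, 7, 4]

print(is_anomalous(designer, 50))    # False
print(is_anomalous(accountant, 50))  # True
```

The same 50 MB is unremarkable for one user and a strong deviation for the other, which is the limitation of a globally applied rule.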

Third, the platform moves away from the event level and uses math to correlate, corroborate and aggregate events to attribute risk to the higher-level actors involved. What results from this model is the ability to name names, i.e. determine who is stealing data instead of figuring out which of the hundreds of transactional events indicate data is being stolen.

This is the platform that, according to Interset’s Jou, “would have detected and surfaced Edward Snowden’s activities in a matter of hours.”
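
The third idea, rolling event-level scores up to the actor, can be sketched with a noisy-OR combination: the chance that at least one of an actor’s events is malicious, assuming the events are independent. The article does not say which aggregation Interset uses, so the formula, the function name and the sample scores are illustrative assumptions.

```python
from collections import defaultdict

def actor_risk(scored_events):
    """Aggregate (actor, event_probability) pairs into one risk score per actor.

    Noisy-OR: risk = 1 - product(1 - p) over that actor's events.
    """
    all_benign = defaultdict(lambda: 1.0)
    for actor, p in scored_events:
        all_benign[actor] *= (1.0 - p)
    return {actor: 1.0 - q for actor, q in all_benign.items()}

def rank_actors(scored_events):
    """Actors ordered from highest to lowest aggregated risk."""
    risks = actor_risk(scored_events)
    return sorted(risks, key=risks.get, reverse=True)

# Hypothetical event scores.
events = [("u1", 0.2), ("u1", 0.3), ("u1", 0.5), ("u2", 0.4)]
print(rank_actors(events))  # ['u1', 'u2']
```

The analyst sees a short ranked list of names instead of a queue of individual events. Here u1 comes out on top with an aggregated risk of about 0.72, even though u2 has a higher score than two of u1’s three events.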

Complementing analytics with human expertise

“From a technical point of view, we are looking at actions conducted by user accounts,” F-Secure’s Niemelä explains, “and it doesn’t really matter that much whether the malicious operations being carried out are by the original owner of the account, or has someone been able to compromise said account.”

The Finnish firm’s latest security offering, Rapid Detection Service (RDS), is a platform that protects against both inside and outside threats. Niemelä calls it “a system that is capable of detecting both insiders and attackers who have been able to compromise some user account and are, in effect, an ‘insider’.”

The managed service uses a combination of threat intelligence, big data analytics, machine learning and security experts to deliver accurate, actionable data about security alerts and detect anomalies and signs of insider threats.

“Most users have rather clean and repeating patterns in their work from a statistics point of view,” Niemelä says. “Thus, alarming changes in the users’ behavior can be detected with suitable near real-time statistics analysis tools, supported by heuristics and machine learning systems.”

RDS collects data from different sources, including behavioral information from corporate endpoints, and detects when a user account starts behaving in an unusual manner. The use of near-real-time analytics, stored data analytics and big data analytics enables the RDS platform to compare user behavior against baseline standards, historical data and known threats in order to detect signs of malicious activities while filtering out false positives.
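
As a rough illustration of near-real-time baseline comparison, and not a description of how RDS works internally, the sketch below keeps an exponentially weighted running mean and variance per user and flags a value that strays too far from it. The class name, the smoothing factor, the warm-up length and the sample data are all assumptions.

```python
class EwmaBaseline:
    """Streaming baseline using an exponentially weighted mean and variance."""

    def __init__(self, alpha=0.2, k=3.0, warmup=5):
        self.alpha = alpha    # weight given to the newest observation
        self.k = k            # allowed deviation, in standard deviations
        self.warmup = warmup  # observations required before flagging
        self.mean = None
        self.var = 0.0
        self.seen = 0

    def observe(self, value):
        """Return True if value is anomalous against the baseline so far, then update it."""
        if self.mean is None:
            self.mean = float(value)
            self.seen = 1
            return False
        deviation = value - self.mean
        std = self.var ** 0.5
        anomalous = self.seen >= self.warmup and abs(deviation) > self.k * std
        # Standard incremental update for an exponentially weighted mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.seen += 1
        return anomalous

# Made-up count of files opened per hour by one account.
baseline = EwmaBaseline()
flags = [baseline.observe(v) for v in [20, 22, 19, 21, 20, 22, 21, 180]]
print(flags[-1])  # True
```

Because the baseline is updated with every observation, it follows gradual changes in a user’s routine while still reacting to an abrupt jump such as the final value.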

What’s unique about F-Secure’s approach is the team of human experts who verify and provide incident response on anomalies detected by its machine learning engine. When a breach is confirmed, the client is contacted and informed.

Amplifying malicious insider activity to ease detection

LogRhythm tackles insider threats from a slightly different perspective, assuming that the adversary has likely already breached the perimeter, “so our detections primarily focus on tracking attacker activity once they are inside,” explains Greg Foss, Security Operations Lead at the security vendor.

The company’s User Threat Detection module provides insider threat detection capabilities through honeypot analytics and open-source honeypot solutions. Honeypots are decoys or cyber traps that lure malicious hackers and enable security software to detect, deflect or counteract their nefarious activities.

LogRhythm has researched honeypots, deception and sensitive file tracking to determine ways to trick attackers and track them as they move through an organization. “The trick is not to make compromise impossible but to ensure that it is loud and noticeable so that the SOC can detect and respond to the threat,” Foss explains.
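
The “loud and noticeable” principle is easy to illustrate with decoy files, sometimes called honeytokens. No legitimate user has a reason to open them, so any access at all is worth an alert. The sketch below is a generic illustration and not LogRhythm’s module; the decoy paths, the event format and the function name are hypothetical.

```python
# Hypothetical decoy files planted on a file share.
DECOY_PATHS = {
    "/finance/passwords_backup.xlsx",
    "/hr/salaries_final.docx",
}

def honeytoken_alerts(access_events, decoys=DECOY_PATHS):
    """Return every access event that touched a decoy file.

    Each event is assumed to be a dict with "user" and "path" keys.
    """
    return [event for event in access_events if event["path"] in decoys]

events = [
    {"user": "alice", "path": "/sales/q3_forecast.xlsx"},
    {"user": "mallory", "path": "/finance/passwords_backup.xlsx"},
]
for alert in honeytoken_alerts(events):
    print(alert["user"], "opened decoy", alert["path"])
```

Unlike behavioral scoring, this check needs no baseline and produces very few false positives, which is why it pairs well with the statistical approaches described earlier.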

Foss also stresses network flow analysis as another key piece of the puzzle when it comes to detecting insider threats. “A lot of people ask what threat feeds they should use to help find bad guys on their network,” Foss says. “I often inform them that they already have everything they need right in front of them, they just need to start looking closely at the data they are already collecting.”

LogRhythm uses Deep Packet Analytics to investigate huge amounts of network traffic and catch malicious insiders when they want to exfiltrate sensitive information, and also to detect compromised network nodes such as machines conducting packet capturing activities.
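
Deep packet inspection is beyond a short example, but the simpler flow-level view Foss refers to can be sketched: sum the bytes each internal host sends to external addresses and flag the hosts that exceed a budget. The internal address range, the budget, the record format and the function names are assumptions, and in practice the budget could come from a per-host baseline instead of a fixed number.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

INTERNAL_NET = ip_network("10.0.0.0/8")  # assumed corporate address range

def outbound_bytes_by_host(flows, internal=INTERNAL_NET):
    """Sum bytes sent from internal sources to external destinations.

    Each flow is a (source_ip, destination_ip, byte_count) tuple.
    """
    totals = defaultdict(int)
    for src, dst, nbytes in flows:
        if ip_address(src) in internal and ip_address(dst) not in internal:
            totals[src] += nbytes
    return dict(totals)

def exfil_suspects(flows, byte_budget, internal=INTERNAL_NET):
    """Internal hosts whose outbound volume exceeds the budget."""
    totals = outbound_bytes_by_host(flows, internal)
    return sorted(host for host, sent in totals.items() if sent > byte_budget)

flows = [
    ("10.0.0.5", "203.0.113.9", 4_000_000_000),  # large upload to an external host
    ("10.0.0.5", "10.0.0.9", 9_000_000_000),     # internal transfer, ignored
    ("10.0.0.7", "198.51.100.2", 20_000_000),
]
print(exfil_suspects(flows, byte_budget=1_000_000_000))  # ['10.0.0.5']
```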

Dealing with the threats of the future

With organizations using more online services and generating more data than ever, insider threats will become increasingly complicated and harder to find. Shifting from traditional methods to new approaches and technologies such as data science can help find the needle in the haystack and speed the process of detecting and blocking insider threats before they cause irreversible damage.

We need to talk about AI and access to publicly funded data-sets

For more than a decade the company formerly known as Google, latterly rebranded Alphabet to illustrate the full breadth of its A to Z business ambitions, has engineered an annually increasing revenue generating empire which last year pulled in ~$75 billion. And it’s done this mostly by mining user data for ad targeting intel.

Slice it and dice it how you like but Google’s business engine needs data like the human body needs oxygen. Most of its products are thus designed to remove friction to accessing more user data; whether it’s free search, free email, free cloud storage, free document editing tools, free messaging apps, a fuzzy social network that no one loves but which is somehow still hanging around, free maps, a mobile OS platform that OEMs can load onto smartphone hardware without paying a license fee… Most of what Google builds it opens to all comers to keep the data pouring in. The bits and bytes must flow.

The trade off for consumers handing over data is of course access to a particular Google service without any up front cost. Or getting to buy a cheaper piece of hardware than they might otherwise be able to. Or the convenience of using a dominant digital service. Of course they are ‘paying’ with their data, but few will think of it that way. It’s an abstract idea for starters, and a personal cost that’s far harder to quantify given how unclear it is what Google really does with the data it gathers and processes in its algorithmic black boxes.

Google certainly isn’t spelling that out. Rather it makes noises about the benefits of it knowing more about you (savvier virtual assistants, more powerful photo search and so on). And without explicit knowledge of what the trade-off entails — coupled with noisy PR about the convenience of data-powered services — most consumers will simply shrug and carry on handing over the keys to their lives. This is the momentum that fuels Mountain View’s ad-targeting empire. The more it knows about you, the richer it bets it can get.

You can dislike Google’s business model but you can also argue that consumers do (in general) have a choice about whether to use its services. Albeit in markets where the company has a de facto monopoly there may be doubt about how much choice people really have. Not least if the company is found to have been abusing a dominant position by demoting alternatives to its services in its search results (Google is facing just such antitrust claims in Europe, where it has a hugely dominant market share in search, for example).

Another caveat is that Google has worked to join up more personal data dots, undermining how much control users have over how they share data with the centralizing Alphabet entity — by, for example, consolidating the privacy policies of multiple products to enable it to flesh out its understanding of each user by cross-referencing their usage of different services. That collapsing of prior partitions between products has also caused Google headaches with European data protection regulators. And contributed to a caricature of it as a vampire octopus with masses of tentacles all maneuvering to feed data back into a single, hungry maw.

But if you think Google has a controversial reputation at this point in its business evolution, buckle up because things are really stepping up a gear.

The Google/Alphabet octopus, via its artificially intelligent DeepMind tentacle, is being granted access to public healthcare data. Lots and lots of healthcare data. Now personal data doesn’t really get more sensitive than people’s medical records. And these highly sensitive bits and bytes are now being sucked towards Google’s algorithmic core — albeit indirectly, via the DeepMind division, which so far this year has two publicly announced data-sharing collaborations with the UK’s National Health Service (NHS).

The public data in question is tied to the two specific projects. But the most recent of these collaborations, with Moorfields Eye Hospital NHS Trust in London, entails DeepMind applying machine learning to the data. Which is a key development. Because, as New Scientist noted this week, Google will be keeping any AI models DeepMind is able to build off of this public data-set. The trained models are effectively its payment in this trade — given it’s not charging the NHS for its services.

So yes, this is another Google freebie. And the cash-strapped, publicly (under)funded NHS has obviously leapt at the chance of a free-at-the-point-of-use high tech partner who might, in time, help improve healthcare outcomes for patients. So it’s granting the commercial giant access to patients’ data.

And while we are told the first NHS DeepMind collaboration, announced back in February with the Royal Free Hospital Trust in London, does not currently involve any AI component, the five-year strategic partnership between the pair does include a wide ranging memorandum of understanding in which DeepMind states its hope to also conduct machine learning research on Royal Free data-sets. So advancing AI is the clear objective for DeepMind’s NHS engagement, as you’d expect. It is a machine learning specialist. And its learning algorithms need the lifeblood of data in order to develop and thrive.

Now we’re all, as individuals, used to getting Google freebies in exchange for sharing some of our data. But the thing is, the data trade off here — with the publicly funded NHS — is a rather different beast. Because the people whose personal data is being pumped into Google-owned databanks are not being asked for their individual consent to the exchange.

Patient consent has not been sought in either of the current NHS collaborations. In the Moorfields project, where the data is being anonymized (or pseudonymized), NHS information governance rules allow for data to be shared for medical research purposes without obtaining patient consent (although NHS patients can opt out of supplying their data to all research projects), so long as the relevant Health Research Authority clears the project. And DeepMind has applied for clearance in this case.

In the first collaboration, with the Royal Free, where DeepMind is helping co-design an app to detect acute kidney injury, the patient data being supplied is not anonymized or pseudonymized. In fact full patient medical records are being shared with the company — likely millions of people’s medical records, given it’s getting real-time data across the Trust’s three hospitals, along with five years’ worth of historical inpatient data.

In that case patient consent has not been sought because the Royal Free argues consent can be implied as it claims the app is for “direct patient care”, rather than being a medical research project (or another classification, such as indirect patient care). There has been controversy over that definition — with health data privacy groups disputing the classification of the project and questioning why DeepMind has been handed access to so much identifiable patient data. Regulators have also stepped in after the fact to take a look at the project’s parameters.

Whatever the upshot of those complaints, it’s fair to say NHS rules on information governance are not an exact science, and do involve interpretation by individual NHS Trusts. There is no definitive set of NHS data-sharing commandments to point to in order to definitively denounce the scope of the arrangement. The best we have is a series of principles developed by the NHS’ national data guardian, Fiona Caldicott. And, perhaps, our public sense of right and wrong.

But what is absolutely crystal clear is that millions of NHS patients’ medical histories are being traded with DeepMind in exchange for some free services. And none of these people have been asked if they agree with the specific trade.

No one has been asked if they think it’s a fair exchange.

The NHS, which launched in 1948, is a free-at-the-point of use public healthcare service for all UK residents — currently that’s around 65 million people. It’s a vast repository of medical data so it’s not at all hard to see why Google is interested. Here lies data of unprecedented value. And not for the relatively crude business of profiling consumers via their digital likes and dislikes; but for far more valuable matters, both in societal and business terms. There could be considerable future revenue-generating opportunities if DeepMind’s AI models end up being able to automate and/or improve complex diagnostic and healthcare challenges, for example. And if the models prove effective they could end up positively impacting healthcare outcomes — although we don’t know exactly who would benefit at this point because we don’t know what pricing structure Google might impose on any commercial application of its AI models.

One thing is clear: large data-sets are the lifeblood of robust machine learning algorithms. In the Moorfields case, DeepMind is getting around a million eye scans to train its machine learning models. And while those eye scans will technically be handed back at the end of the project, any diagnostic intelligence they end up generating will remain in Google’s hands.

The company admits as much in a research outline of the project, though it steers the focus away from these trained algorithms and back to the original data-set (whose value the algorithms will now have absorbed and implicitly contain):

The algorithms developed during the study will not be destroyed. Google DeepMind Health knows of no way to recreate the patient images transferred from the algorithms developed. No patient identifiable data will be included in the algorithms.

DeepMind says it will be publishing “results” of the Moorfields research in academic literature. But it does not say it will be open sourcing any AI models it is able to train off of the publicly funded data.

Which means that data might well end up fueling the future profits of one of the world’s wealthiest technology companies. Instead of that value remaining in the hands of the public, whose data it is.

And not just that: early access to large amounts of valuable taxpayer-funded data could potentially lock in massive commercial advantage for Google in healthcare. Which is perhaps the single most important sector there is, given it affects everyone on the planet. If you don’t think Google has designs on becoming the world’s medic, why do you think it’s doing things like this?

Google will argue that the potential social benefits of algorithmically improved healthcare outcomes are worth this trade off of giving it advantageous access to the locked medicine cabinet where the really powerful data is kept.

But that detracts from the wider point: if valuable public data-sets can create really powerful benefits, shouldn’t that value remain in public hands?

Or shouldn’t we at least be asking if we have a public duty to disseminate the value of publicly funded data as widely as possible?

And are we, as a society, comfortable with the trade off of a few free services — and some feel-good but fuzzy talk of future social good — for prematurely privatizing what could be our core IP?

Shouldn’t we, as the data creators, as the patients, at least be asked if we are comfortable with the terms of the trade?

Fiona Caldicott, the UK’s national data guardian, happened to publish her third review of how patient data is handled within the NHS just this week, and she urged a more extensive dialogue with the public about how their data is used. And a proper informed choice to opt in or out.

The old rules about information governance — which still talk in terms of shredding pieces of paper as a viable way to control access to data — have certainly not kept up with big data and machine learning. Stable doors and bolting horses spring to mind when you combine these old school data access rules with the learning and evolving character of advanced AI.

Access to data-sets is undoubtedly the core competitive advantage for AI builders because really good data is hard to come by and/or expensive to create. And that’s why Google is pushing so hard and fast to embed itself into the NHS.

You can’t blame the company for this healthcare data-grab. It’s just doing what successful commercial enterprises do: figuring out what the future looks like and plotting the fastest route to get there.

What’s less clear is why governments and public bodies find it so hard to see the value locked up in the publicly funded data-sets they control.

Or rather why they fail to come up with effective structures to support maintaining public ownership of public assets; to distribute benefits equally, rather than disproportionately rewarding the single, best-resourced, fastest-moving commercial entity that happens to have the slickest sales pitch. It’s almost as if the public sector is being encouraged to privatize yet another public resource… ehem

Inject a little more structured forward-thinking and public healthcare data could, for example, be contributed (with consent) to machine learning research departments in domestic universities so that AI models can be developed and tested ‘in house’, as it were, with public parents.

Instead we have the opposite prospect: public data assets stripped of their value by the commercial sector. And with zero guarantees that the algorithms of the future will be free at the point of use. Of course Google is going to aim to turn a profit on any healthcare AI models DeepMind creates. It’s not in the business of only giving away freebies.

So the really pressing question — roundly ignored by web consumers going about their daily Googling but perhaps moving into clearer focus, here and now, as commercial thirst to accelerate AI advancements is encouraging public sector bodies to over-hastily ink wide-ranging data-sharing arrangements — is what is the true cost of free?

And if we’ve inked the contracts before we even know the answer to that question won’t it be too late for us to haggle over the price?

Even DeepMind talks publicly about the need for new models of information governance and ethics to be put in place to properly oversee the coupling of AI with data…

So we, the public, really need to get our act together and demand a debate about who should own the value locked up in our data. And preferably do so before we’ve handed over any more sets of keys.

Chief analytics officer: The ultimate big data job?

As organizations seek to not simply corral data, but apply it strategically across the business, analytics experts are making their way into the C-suite.

The C-suite may need a bigger boardroom. As organizations expand their executive teams with new C-level titles that underscore their digital transformations in-progress, the role of chief analytics officer is gaining traction.

Driven by organizations’ desire to turn big data into a strategic asset, the CAO is finding a home in data-rich industries such as financial services and healthcare. Although still not as prevalent as two other newish C-suite roles — the chief digital officer and chief data officer — the CAO may represent an inflection point in an organization’s digital journey, signaling a move from managing data to applying it more strategically across the business.

Intelligent Data Centers Are The Foundation Of A Better Connected World

Enterprises are often challenged by the complexity and fixed nature of their aging data centers. We commonly see the lack of performance capacity necessary to handle the goals of advanced “Industry 4.0” application scenarios and reduced energy consumption. To address these challenges, organizations are increasingly moving to intelligent data center solutions that deliver the agility, reliability, and efficiency that support the rapid service deployment needed to compete and win in this new era.

The Cloud Could Be Your Best Security Bet

Conventional IT wisdom says that you’re safer and more secure when you control your own on-premises datacenter. Yet every major data breach over the last two years, whether Anthem, Sony, JPMorgan or Target, involved on-premises datacenters, not the cloud.

In fact, if a cloud service has proper controls, it could be safer than running your own datacenter.

Having your data onsite is no guarantee it’s going to be safe. Quite the opposite. The cloud may offer the best hope we have at this point, which is fairly ironic given that security has often been the chief criticism of cloud computing from the start. Yet in the end, the cloud may be your most secure bet.

Governing the Smart, Connected City

Cities are where the action is when it comes to using technology to thicken the mesh of civic goods — more and more cities are using data to animate and inform interactions between government and citizens to improve wellbeing.

Cryptography: Note to future self

A bid to put encrypted data into a kind of time capsule gets a kick-start.

An encrypted “time-capsule” service aims to enable scholars and journalists to securely send a message, in effect, into the future, encrypted in such a way that it cannot be read by anyone until a certain date or event.

Billions Worth of Data Is Free For the Taking

Get ready for a brainteaser: Name a commodity that’s worth billions of dollars—and yet is available free to anyone who wants it.

The answer is open data—machine-readable information, particularly government data, that can be used any way anyone wants, without regard to copyright, patent or other restrictions.
