Cloud security · 8 MIN READ · ETHAN CHEN · JAN 16, 2025 · TAGS: Get technical / Kubernetes
TL;DR
- This is part I of a two-part blog series on cloud security alerts—keep an eye out for part II, coming next week!
- Part I covers the foundational differences between cloud and on-prem security alerts
- Part II covers best practices for cloud security alerts, and how Expel supports cloud security (regardless of vendor)
Cloud security alerts play a critical role in detection and response for any organization that uses cloud environments. Like the alerts generated in on-prem environments, cloud alerts notify security teams about potential threats, vulnerabilities, and suspicious activities. But there are also important differences between cloud alerts and their on-prem counterparts.
Cloud environments are inherently complex and constantly changing. As a result, they generate a far higher volume of alerts, including a higher rate of false positives. These alerts tell you that something notable has occurred, but it’s hard to know what’s happened—or what you should do next. Team silos create further challenges, as security teams often lack the permissions needed to make changes, and they have to pull in stakeholders in engineering to help.
These factors make it difficult for organizations to maintain effective security across cloud environments. And in broadly distributed, highly interconnected hybrid infrastructures, the impact of errors can be much greater.
This blog series will explore key differences and risk factors associated with cloud alerts—and best practices to overcome these challenges. We’ll also explore how Expel can help security teams ensure accurate detection and rapid response across cloud environments.
What are cloud alerts?
Cloud alerts are real-time notifications triggered by events in cloud resources that meet criteria for abnormal or suspicious activity. Tools, services, and systems throughout the environment generate these alerts based on their analysis of raw data logged from network traffic, user behavior, system and configuration changes, and other events.
There are several key types of cloud alerts, and each brings with it its own unique logs and data.
Cloud configuration change alerts
Security logs in a cloud environment capture updates to security groups, resource or network configurations, and identity and access management (IAM) policies. When these updates fall within the scope of a rule, they trigger a real-time alert. These alerts can be a key indicator of unauthorized, non-compliant, or illicit changes that can create misconfigurations.
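To make the "update falls within the scope of a rule" idea concrete, here is a minimal sketch in Python. The record shape (`ConfigChange`), the field names, and the single example rule are all hypothetical and simplified for illustration; real platforms log far richer events and ship their own rule engines.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConfigChange:
    """Hypothetical, simplified configuration-change log record."""
    actor: str
    resource_type: str  # e.g. "security_group" or "iam_policy"
    action: str         # e.g. "add_ingress_rule"
    details: dict


@dataclass
class Rule:
    """A named predicate that decides whether a change should raise an alert."""
    name: str
    matches: Callable[[ConfigChange], bool]


def opens_ssh_to_world(change: ConfigChange) -> bool:
    # Assumed example rule: a security group gains an ingress rule
    # exposing port 22 to every IPv4 address.
    return (
        change.resource_type == "security_group"
        and change.action == "add_ingress_rule"
        and change.details.get("port") == 22
        and change.details.get("cidr") == "0.0.0.0/0"
    )


RULES = [Rule("ssh-open-to-world", opens_ssh_to_world)]


def evaluate(change: ConfigChange, rules=RULES) -> list:
    """Return the names of every rule the change falls within."""
    return [rule.name for rule in rules if rule.matches(change)]
```

Each logged change is checked against every rule, and any match becomes an alert. A change that matches no rule produces an empty list and stays in the logs without alerting anyone.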
Identity and access management (IAM) alerts
IAM events—including API calls, user authentication and authorization events, role assumption, and IAM-related system events—can signal atypical or improper resource access or privilege escalation that might indicate a real-time breach or compliance violation. This data is typically captured by a cloud platform’s IAM logs.
Data on IAM events includes the behavior of both human users and non-human identities (NHIs), such as service accounts, API keys, and OAuth tokens. In fact, NHIs contribute an extreme volume of IAM alerts because they typically outnumber human identities many times over. Service accounts, a common type of NHI, often perform automated tasks that operate continuously, generating a constant stream of IAM events.
Data alerts
Data alerts play important roles in data loss prevention (DLP) and in secrets monitoring for Kubernetes environments. A broad range of data can trigger these alerts, including:
- API logs can show suspicious or unauthorized attempts to access cloud-based data or Kubernetes secrets.
- Network traffic logs track the flow of data to and from cloud environments, using SSL/TLS inspection to examine encrypted traffic for sensitive data.
- Scans of cloud storage services and databases detect and classify sensitive data at rest.
Network alerts
Network alerts play an especially important role in ensuring the security of distributed cloud and hybrid infrastructures. The raw data that can trigger these alerts comes from numerous security and networking components, including:
- Network flow logs of IP traffic flowing through network interfaces in the cloud environment.
- Load balancing logs indicating incoming traffic patterns and distribution across servers, which can indicate a possible distributed denial-of-service (DDoS) attack.
- Network firewall logs tracking traffic monitoring and filtering activity to protect the network infrastructure from unauthorized access, including allowing and blocking connections to cloud resources.
- Intrusion detection system (IDS) and intrusion prevention system (IPS) logs that can flag network traffic for suspicious activities and potential attacks.
- DNS server logs that can reveal potential domain-based threats.
Kubernetes events
The security of Kubernetes environments is vital, given the role these environments play in orchestrating communications and moving resources across networks. Cloud service provider (CSP) logs capture events in various Kubernetes components, including:
- Kubernetes API server events within the cluster, such as state changes, errors, and activities.
- Kubelet events related to pod and container lifecycle, resource usage, and node-specific issues.
- Container runtime events like image pulling or the starting and stopping of containers.
Platform-specific Kubernetes events are captured in logs for Amazon Elastic Kubernetes Service (Amazon EKS), Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), and Oracle Cloud Infrastructure (OCI).
Endpoint events
Endpoint agents collect data directly from hosts. This data includes details on process creation and termination, file system changes, network connections, Windows registry modifications, and other activity that can indicate potential compromises of cloud resources.
Application events
Logging data on application activity can provide visibility into illicit activity, misconfigurations, performance issues, unauthorized access, and other potential vulnerabilities or indicators of compromise (IOCs). Application events are triggered by the analysis of data from sources that include:
- API logs capturing requests, responses, and errors for API calls made to cloud services.
- Web application firewall (WAF) logs showing activity to protect web applications and APIs from application-layer attacks, such as SQL injection or cross-site scripting, by analyzing HTTP traffic to detect and block malicious requests.
- Runtime application self-protection (RASP) logs of authentication and authorization events, data exfiltration attempts, abnormal input, database query anomalies, API misuse, and other real-time application security events.
- Application logs generated by the app itself to capture specific events, errors, and user actions.
SaaS events
SaaS solutions hosted and run in the cloud can represent a major part of an organization’s attack surface. Security logs for these applications are captured by both secure access service edge (SASE) and cloud access security broker (CASB) solutions. They include information on user authentication and authorization attempts, user activity, data flows between users and SaaS solutions, and the devices accessing cloud resources. Based on this data, SASE and CASB solutions can provide visibility into signs of malware, account compromise, and anomalous behaviors such as unusual login patterns or data access attempts that can indicate a security incident.
How are cloud alerts different—and what risks do they pose?
Cloud security alerts and their on-prem counterparts indicate many of the same kinds of events, and they play similar roles in security operations. But it’s important to understand what makes cloud alerts different so you can navigate them effectively and watch for the common red flags.
Complexity
Distributed environments pose major hurdles to visibility and understanding. Cloud resources are often spread across multiple regions and services, making it hard to correlate alerts and identify broader attack patterns. These problems are multiplied when working with multiple CSPs, as most organizations do.
Consider this scenario: A multinational corporation uses a multi-cloud strategy with services spread across AWS, Azure, and Google Cloud Platform (GCP). Its development teams in North America, Europe, and Asia each use different cloud services and regions within these platforms. One day, the company’s security team receives the following cloud alerts:
- Unusual spikes in API calls from an IP address in Eastern Europe to an AWS S3 bucket in the US.
- Several failed login attempts to an Azure Active Directory instance in Western Europe.
- Unexpected outbound traffic from a GCP instance in Asia to an unknown IP address.
Each alert arrives on a separate monitor in a different format, and each refers to a different type of service. The team can’t see the alerts together within a unified view of their entire multi-cloud ecosystem—which makes it hard to see how they might relate to each other. Meanwhile, new alerts continue to flood into the security operations center (SOC) to compete for their attention.
Signs that your team might be facing this issue include:
- Fragmented alerting systems: Separate alerting mechanisms for each cloud platform make it challenging to track and correlate incidents.
- Inconsistent alert prioritization: Different platforms categorize and prioritize alerts differently, leading to confusion and inconsistent response protocols.
- Redundant alerts across platforms: The same incident generates alerts on multiple platforms, creating noise and confusion with no unified view.
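One way to picture the missing unified view is a small normalization layer that maps each platform's alert payload onto a single schema. The Python sketch below is illustrative only: the field names in `FIELD_MAP`, the severity labels in `SEVERITY_MAP`, and the 1-to-4 severity scale are assumptions made up for this example, not the actual alert formats of AWS, Azure, or GCP.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class UnifiedAlert:
    provider: str
    severity: int  # assumed common scale: 1 (low) to 4 (critical)
    title: str
    resource: str
    occurred_at: datetime


# Hypothetical per-platform field names (not real provider schemas).
FIELD_MAP = {
    "aws": {"severity": "sev", "title": "name", "resource": "arn", "time": "ts"},
    "azure": {"severity": "level", "title": "alertName",
              "resource": "resourceId", "time": "startTime"},
    "gcp": {"severity": "severity", "title": "category",
            "resource": "resourceName", "time": "eventTime"},
}

# Hypothetical severity labels mapped onto the common scale.
SEVERITY_MAP = {
    "aws": {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4},
    "azure": {"Low": 1, "Medium": 2, "High": 3},
    "gcp": {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4},
}


def normalize(provider: str, raw: dict) -> UnifiedAlert:
    """Translate one platform-specific payload into the shared schema."""
    fields = FIELD_MAP[provider]
    # Timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.
    when = datetime.fromisoformat(raw[fields["time"]]).astimezone(timezone.utc)
    return UnifiedAlert(
        provider=provider,
        severity=SEVERITY_MAP[provider][raw[fields["severity"]]],
        title=raw[fields["title"]],
        resource=raw[fields["resource"]],
        occurred_at=when,
    )


def unified_view(raw_alerts) -> list:
    """Take (provider, payload) pairs; return most severe first, then oldest first."""
    alerts = [normalize(provider, raw) for provider, raw in raw_alerts]
    return sorted(alerts, key=lambda a: (-a.severity, a.occurred_at))
```

Once alerts share one severity scale and one UTC timeline, the three alerts from the scenario above can sit side by side in a single queue, which is the precondition for spotting that they might be related.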
Constant change
The real-time scalability and automation of cloud resources enable constant, rapid change in the environment, ensuring the flexibility and agility organizations seek from the cloud. However, these changes generate a high volume of alerts to manage and decipher, including many of the types that could indicate an attack. Constant change also makes it hard to establish a baseline of expected behavior for the environment to help tune out false positives. On the CSP end, frequent updates to products, features, and APIs add yet another dimension of change.
Your team may be feeling the effects of constant change if you’re experiencing:
- Alert fatigue: Overwhelming alert volume can lead team members to ignore or delay alert responses.
- High number of unresolved alerts: Backlogs of unresolved alerts often grow—their volume exceeding team capacity.
- Repeated alerts from the same source: Frequent alerts on the same issue can point to an underlying problem or ineffective alert thresholds. Unless overburdened security teams can find time to review and adjust their alert thresholds, they have no way of knowing which.
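A quick way to surface repeated alerts from the same source is to count alerts by source and rule over a review period. The sketch below assumes each alert is a dictionary with hypothetical `source` and `rule` keys, and the `min_repeats` cutoff of 5 is an arbitrary placeholder rather than a recommended value.

```python
from collections import Counter


def noisy_sources(alerts, min_repeats=5):
    """Return ((source, rule), count) pairs seen at least min_repeats times.

    Results are ordered from most to least frequent.
    """
    counts = Counter((alert["source"], alert["rule"]) for alert in alerts)
    return [(key, n) for key, n in counts.most_common() if n >= min_repeats]
```

The output is only a starting list for review. Each noisy pair still needs a human decision about whether it reflects an underlying problem or a threshold that needs tuning.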
Prioritization
When security teams face deluges of cloud alerts, they have to make fast decisions about what to address first. With important details missing—for example, the IP address of the relevant device or the activity history of the user involved—they can’t accurately assess each alert’s relevance and impact. So their decisions are based on gut instinct rather than real insight, making mistakes and unfortunate outcomes inevitable.
Prioritization is likely a challenge for your team if you’re seeing:
- Failure to identify high-impact alerts: Teams can’t quickly distinguish critical alerts from minor ones, leading to delayed response for high-risk incidents.
- Over-prioritization of non-critical alerts: When minor alerts take up a disproportionate amount of response time, high-risk alerts may be missed.
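The effect of missing details on prioritization can be shown with a toy scoring function. Everything in this Python sketch is an assumption for illustration: the `Alert` fields, the 1-to-4 severity and 1-to-3 asset criticality scales, and the weights are invented, and a real triage model would be tuned to the organization.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    title: str
    severity: int                            # assumed scale: 1 (low) to 4 (critical)
    asset_criticality: Optional[int] = None  # assumed scale: 1 to 3; None if unknown
    user_is_privileged: Optional[bool] = None


def priority(alert: Alert) -> int:
    """Score an alert; context fields add weight only when they are known."""
    score = alert.severity * 10
    if alert.asset_criticality is not None:
        score += alert.asset_criticality * 5
    if alert.user_is_privileged:
        score += 10
    return score


def triage_queue(alerts) -> list:
    """Order alerts from highest to lowest priority score."""
    return sorted(alerts, key=priority, reverse=True)
```

With no context, a severity-3 alert (score 30) always outranks a severity-2 alert (score 20). Once context is filled in, a severity-2 alert involving a privileged user on a critical asset scores 45 and moves ahead of a severity-3 alert on a low-criticality asset, which scores 35.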
Context
Even prioritized alerts lack the context to help teams understand what they mean and what to do about them. To begin with, it can be hard to determine whether an alert is a true positive. The odds are against it, given that nearly all alerts, cloud or otherwise, are actually false positives. And with cloud alerts, the sheer volume makes the challenge that much greater.
And the alert is just the tip of the iceberg. It’s enough to indicate that something has happened. But far more information remains hidden beneath the surface, including the details security teams need to answer their most urgent questions:
- Does it matter?
- What is it?
- Where is it?
- When did it get here?
- How did it get here?
- How did we detect it?
- What should we do?
For CISOs and directors of security, signs that security teams need more context for their alerts include:
- True positives ignored: If your security team has been ignoring or dismissing legitimate alerts that might have helped avert an incident, it may be because they’re spending too much of their time chasing false positives.
- Delays in responding to real threats: The significant manual effort required to understand and prioritize a large volume of contextless alerts often leads to delays in responding to real threats.
- Missed critical alerts: If the security team is unable to accurately determine whether alerts are relevant, or whether multiple alerts are connected to the same underlying incident, it may be because they lack key details like user or system data.
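To illustrate how user or system data helps connect multiple alerts to one underlying incident, the sketch below groups alerts that share a principal and occur close together in time. It assumes each alert is a dictionary with hypothetical `principal` and `time` keys, and the one-hour window is an arbitrary choice. Real correlation would draw on many more signals than a single shared field.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(hours=1)):
    """Group alerts by principal, splitting a group when the time gap exceeds window.

    Returns a list of incidents, each a list of alerts in time order.
    """
    by_principal = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        by_principal[alert["principal"]].append(alert)

    incidents = []
    for items in by_principal.values():
        current = [items[0]]
        for alert in items[1:]:
            if alert["time"] - current[-1]["time"] <= window:
                current.append(alert)      # close enough: same candidate incident
            else:
                incidents.append(current)  # gap too large: start a new one
                current = [alert]
        incidents.append(current)
    return incidents
```

Without the `principal` field there is nothing to group on, and every alert has to be investigated as if it were unrelated to the others.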
Organizational silos
It can take multiple teams to understand, investigate, and respond to the situation indicated by one or more cloud alerts. Security teams may spot the problem, but they often lack the privileges or authority to remediate it directly. Instead, they have to go to developers, cloud architects, or cloud infrastructure teams and convince them to fix it. This can cause disastrous delays and bottlenecks.
Silos can come into play before an incident even arises. As infrastructure teams build out the cloud environment, they don’t always put security front and center. By the time SecOps teams come into the picture, there’s already a gap between what security requires and what the infrastructure team has set up.
The signs of problematic organizational silos include:
- No defined incident escalation path: Unclear protocols for escalating security incidents lead to delays and missed SLAs.
- Lack of follow-up after incidents or pen tests: If real or simulated security breaches aren’t leading to cross-team collaboration on systemic changes, your organization will remain at risk for similar incidents.
- Inconsistent incident documentation: A lack of comprehensive record-keeping on alert response actions across teams results in incomplete follow-up and analysis.
- No post-incident review: When teams don’t coordinate across silos to review and learn from incidents, the organization can’t identify areas for improvement in alert handling.
Ready to talk cloud alert management? Reach out to us here to get started. And keep an eye out for part II of this blog series on best practices for cloud alerts.