How to Build Your Own Security Data Pipeline (and why you shouldn’t!)
Kyle Polley & the Tarsal Team
In the evolving world of Software-as-a-Service (SaaS), Infrastructure-as-a-Service (IaaS), and Platform-as-a-Service (PaaS), security teams face a complex landscape. Managing the growing number of log sources demands a scalable and adaptable platform. It’s not just about ingesting logs; teams must filter, enrich, transform, and even predict the relevance of these logs to ensure they receive actionable insights. Building such a platform requires infrastructure that is scalable, fault-tolerant, and resilient, along with intelligent alerting mechanisms.
This article will take you through the complexities of building an in-house security data pipeline, and explain why you should use Tarsal instead.
Building a secure and efficient data pipeline starts with robust infrastructure. Common choices include Kafka, AWS Kinesis, or GCP Pub/Sub for data streaming, and tools like Vector, Apache Spark, or dbt for data transformation. Scalability is critical, so solutions like Kubernetes or cloud serverless functions are popular.
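To give a sense of the moving parts involved, here is a minimal sketch of publishing a single log event to a Kafka topic, assuming the confluent-kafka Python client, a local broker, and a hypothetical `security-logs` topic:

```python
import json
from confluent_kafka import Producer  # assumes the confluent-kafka package

# Hypothetical broker address and topic name, for illustration only.
producer = Producer({"bootstrap.servers": "localhost:9092"})


def publish_log(event: dict) -> None:
    """Serialize a log event and publish it to the security-logs topic."""
    producer.produce(
        "security-logs",
        value=json.dumps(event).encode("utf-8"),
        callback=lambda err, msg: print(f"delivery failed: {err}") if err else None,
    )
    producer.poll(0)  # serve delivery callbacks


publish_log({"source": "okta", "event_type": "user.session.start"})
producer.flush()  # block until all queued messages are delivered
```

Even this small step carries operational weight: brokers, partitions, retention, and delivery guarantees all become your team’s responsibility.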
Unlike other platforms, Tarsal was engineered from the ground up by data experts, built specifically for security teams, and operates seamlessly at petabyte-scale.
Data sources are numerous and varied, each with unique methods of data ingestion. The infrastructure must handle everything from Kafka streams to S3 buckets to API polling. Challenges multiply when you consider that each data source’s schema may change, requiring the data operations team to stay constantly up to date.
A prime example of this complexity is building a custom API poller, which introduces a series of challenges. It needs to poll for new events at regular intervals, efficiently handle those events, and send them to your data streaming infrastructure. Designing such a poller requires careful consideration of potential failures, error handling, and the need for continuous maintenance. Custom-built API pollers also require managing pagination and state to ensure they neither miss nor duplicate events.
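To make the challenge concrete, here is a minimal sketch of a cursor-based poller, assuming a hypothetical log API at `api.example.com`, a local JSON state file, and a stand-in `send_to_stream` function in place of real streaming infrastructure:

```python
import json
import time
from pathlib import Path

import requests

# Hypothetical log API, state file, and page size, for illustration only.
API_URL = "https://api.example.com/v1/logs"
STATE_FILE = Path("poller_state.json")
POLL_INTERVAL_SECONDS = 60


def send_to_stream(event: dict) -> None:
    """Stand-in for handing the event to Kafka, Kinesis, or Pub/Sub."""
    print(json.dumps(event))


def load_cursor() -> str | None:
    """Restore the last pagination cursor so restarts neither miss nor duplicate events."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get("cursor")
    return None


def save_cursor(cursor: str) -> None:
    STATE_FILE.write_text(json.dumps({"cursor": cursor}))


def poll_once(cursor: str | None) -> str | None:
    """Fetch one page of events and forward them to the streaming layer."""
    params = {"limit": 500}
    if cursor:
        params["after"] = cursor
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()  # surface API failures instead of silently losing events
    body = resp.json()
    for event in body.get("events", []):
        send_to_stream(event)
    return body.get("next_cursor") or cursor


if __name__ == "__main__":
    cursor = load_cursor()
    while True:
        try:
            cursor = poll_once(cursor)
            if cursor:
                save_cursor(cursor)
        except requests.RequestException as exc:
            print(f"poll failed, will retry: {exc}")
        time.sleep(POLL_INTERVAL_SECONDS)
```

Even this simplified version has to persist its cursor, handle request failures, and run continuously; a production poller also needs backoff, rate-limit handling, deployment, and monitoring.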
Managing source connectors requires ongoing support and adaptation to underlying APIs and schema changes. Tarsal innately supports many security schemas, including Okta, AWS, and Slack, and ensures source connectors are always up-to-date. Tarsal also manages the API polling infrastructure, so you don’t have the hassle of designing and maintaining your own.
Data transformations are a critical yet intricate part of the pipeline-building process. Your team must decide which events to filter or transform for each data source. Whether it’s the decision to filter out endpoint file modification events or keep only new file creation events with specific extensions, each transformation must be meticulously crafted. The system must also handle computationally heavy transformations such as aggregations and joins across data streams.
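As a simple illustration, here is a sketch of the second example above: keeping only new file-creation events with specific extensions. The event shape and extension list are illustrative, not taken from any particular endpoint product:

```python
# Illustrative event shapes; real endpoint telemetry will differ.
INTERESTING_EXTENSIONS = {".exe", ".dll", ".ps1", ".sh"}


def keep_event(event: dict) -> bool:
    """Drop file-modification noise; keep only new file creations with risky extensions."""
    if event.get("event_type") != "file_created":
        return False
    path = event.get("file_path", "")
    return any(path.lower().endswith(ext) for ext in INTERESTING_EXTENSIONS)


events = [
    {"event_type": "file_modified", "file_path": "/tmp/report.docx"},
    {"event_type": "file_created", "file_path": "/tmp/payload.exe"},
]
filtered = [e for e in events if keep_event(e)]
print(filtered)  # only the .exe creation survives
```

Multiply this by dozens of sources and hundreds of rules, and the transformation layer quickly becomes a codebase of its own.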
Data normalization involves unifying the structure of your logs to make them actionable, understandable, and easy to query across different platforms. This task may seem straightforward, but the choices and considerations make it an intricate and often overwhelming process.
Different systems may require different normalization schemas, such as the Elastic Common Schema (ECS), Splunk’s Common Information Model (CIM), or the Open Cybersecurity Schema Framework (OCSF). Selecting the right schema is just the beginning; you must build specific transformations for each log source and ensure the data is mapped correctly. Errors in this mapping process can lead to misinterpretations or loss of crucial information.
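To show what this mapping work looks like, here is a rough sketch of normalizing an Okta-style login event into an ECS-like structure. The field names are illustrative; a real implementation must follow the chosen schema’s specification exactly, for every log source:

```python
def normalize_okta_login(raw: dict) -> dict:
    """Map a raw Okta-style login event into an ECS-like structure.

    Field names are illustrative; a real mapping must follow the chosen
    schema's specification exactly to avoid dropping or mislabeling data.
    """
    return {
        "@timestamp": raw.get("published"),
        "event": {
            "action": raw.get("eventType"),
            "outcome": raw.get("outcome", {}).get("result"),
        },
        "user": {"name": raw.get("actor", {}).get("alternateId")},
        "source": {"ip": raw.get("client", {}).get("ipAddress")},
    }


raw_event = {
    "published": "2024-01-01T00:00:00Z",
    "eventType": "user.session.start",
    "outcome": {"result": "SUCCESS"},
    "actor": {"alternateId": "alice@example.com"},
    "client": {"ipAddress": "203.0.113.7"},
}
print(normalize_okta_login(raw_event))
```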
Tarsal gives you a platform to build and deploy any type of transformation in minutes, providing predefined templates and guided workflows that make it easy to design complex transformations. It offers a flexible environment where adjustments can be made swiftly, minimizing the risk of errors and freeing your team to focus on strategic security goals. With Tarsal, what could have been weeks of development and testing can be achieved in a fraction of the time, bringing efficiency and reliability to a crucial part of the security data pipeline.
Data enrichment adds contextual information to the transformed and normalized logs, turning raw data into actionable intelligence. One common example is GeoIP enrichment, where the data pipeline appends information like country, city, and WHOIS details to IP addresses. Another is threat intelligence enrichment, which flags domains or executables known to be malicious or suspicious. These enrichments provide the security context needed for effective threat detection and swift incident response.
Like building a custom API poller, building a data enrichment pipeline is far from trivial. It requires setting up internal or external API services that can handle your log throughput without hitting API rate limits. The complexity is magnified further when you consider scalability, as handling an increasing volume of data requires a robust and efficient enrichment process.
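As a sketch of the kind of plumbing involved, the following assumes a hypothetical GeoIP HTTP service and uses a cache plus a crude client-side rate limit to avoid exhausting API quota:

```python
import time
from functools import lru_cache

import requests

# Hypothetical GeoIP service; real providers have their own endpoints, auth, and quotas.
GEOIP_URL = "https://geoip.example.com/lookup"
MIN_SECONDS_BETWEEN_CALLS = 0.1  # crude client-side rate limit
_last_call = 0.0


@lru_cache(maxsize=10_000)
def lookup_ip(ip: str) -> dict:
    """Look up GeoIP data, caching results so repeated IPs don't burn API quota."""
    global _last_call
    wait = MIN_SECONDS_BETWEEN_CALLS - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    resp = requests.get(GEOIP_URL, params={"ip": ip}, timeout=10)
    resp.raise_for_status()
    return resp.json()


def enrich(event: dict) -> dict:
    """Append country/city context to any event that carries a source IP."""
    ip = event.get("source", {}).get("ip")
    if ip:
        geo = lookup_ip(ip)
        event["source"]["geo"] = {"country": geo.get("country"), "city": geo.get("city")}
    return event
```

Caching and rate limiting are only the start; at scale you also need batching, retries, and a plan for when the provider is slow or unavailable.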
Tarsal’s data enrichment capabilities are designed to ease this burden. Offering common enrichments like GeoIP out of the box, Tarsal enables security teams to enhance their data in mere minutes.
Destination connectors translate and transmit the enriched and normalized data to the final destination, whether that’s Snowflake, Splunk, BigQuery, or other platforms. This stage requires particular attention to detail and a deep understanding of the intricacies of various destination systems.
Each destination type has its unique method of ingestion and supports different data types. For instance, some destinations might not support `map` types, opting for a `json` format instead. The translation of the normalized schema to the destination must also be exact; errors in this translation can lead to data loss or corruption.
Optimizing the data pipeline for the destination is another essential consideration. This includes defining how data partitioning will be handled, such as partitioning data based on date and hour timestamps for security events. Incorrect partitioning can lead to inefficient data querying and increased costs in storage and processing. Designing and implementing these processes can require extensive knowledge of the data pipeline architecture and the specific requirements of the destination platform.
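For example, a destination connector writing to object storage might compute a date/hour partition prefix for each event along these lines (the layout shown is illustrative, not a requirement of any particular platform):

```python
from datetime import datetime, timezone


def partition_prefix(source: str, event_time_iso: str) -> str:
    """Build a date/hour partition path so downstream queries can prune efficiently."""
    ts = datetime.fromisoformat(event_time_iso.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"source={source}/date={ts:%Y-%m-%d}/hour={ts:%H}/"


# e.g. s3://security-logs/ + this prefix + object name
print(partition_prefix("okta", "2024-01-01T13:42:07Z"))
# source=okta/date=2024-01-01/hour=13/
```

Get the partitioning scheme wrong, and every downstream query pays for it in scan costs and latency.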
Tarsal’s approach to destination connectors is designed to simplify this complex process. Supporting several popular destinations, Tarsal allows security teams to send security audit logs to their chosen platforms within minutes. It provides a seamless integration that ensures accuracy, efficiency, and adaptability. The Tarsal team maintains the connectors, understands the nuances of each destination type, and ensures that they are always up-to-date and aligned with the latest requirements.
In a robust security data pipeline, fault tolerance, monitoring, and alerting are essential components that dictate the success and efficiency of the system. The complexity of handling these aspects requires a comprehensive strategy.
Building a fault-tolerant system means planning for inevitable failures and errors. How does the pipeline react when it expects an int but receives a string? What if an enrichment provider is down, and the API sends back 404s? You need to decide, and build systems to enforce, whether to drop the event, send it to a dead-letter queue (DLQ), or hold it in the pipeline until the schema is fixed. These aren’t just technical decisions; they impact the data’s accuracy and the system’s reliability.
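A minimal sketch of one such decision, routing events that fail type validation to a DLQ instead of dropping them or stalling the pipeline, might look like this (the in-memory list stands in for a real DLQ topic or bucket):

```python
import json

dead_letter_queue: list[dict] = []  # stand-in for a real DLQ topic or bucket


def validate(event: dict) -> dict:
    """Coerce expected types; raise if the event can't be repaired."""
    event["status_code"] = int(event["status_code"])  # e.g. "404" -> 404
    return event


def process(event: dict) -> None:
    try:
        clean = validate(event)
        print(clean)  # stand-in for forwarding to transformation/enrichment stages
    except (KeyError, TypeError, ValueError) as exc:
        # Route the bad event to the DLQ with the failure reason, rather than
        # dropping it silently or stalling the whole pipeline.
        dead_letter_queue.append({"error": str(exc), "event": json.dumps(event)})


process({"status_code": "404"})         # repaired in place and forwarded
process({"status_code": "not-a-code"})  # lands in the DLQ
print(len(dead_letter_queue))           # 1
```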
Constant and effective monitoring of the data pipeline is crucial for its ongoing health. Understanding what’s happening inside the pipeline is vital for performance optimization and error detection. Small issues can evolve into significant problems affecting the entire system. This requires technology and human expertise to interpret the data and make informed decisions.
Coupled with monitoring is the need for a responsive alerting mechanism. An advanced data pipeline must have robust alerting that notifies the appropriate team members when things aren’t working as expected. This goes beyond simple notifications; it involves setting internal SLAs, defining who should get alerted, and determining the appropriate response to different errors or anomalies. An effective alerting system can mean the difference between a minor hiccup and a major pipeline failure that isn’t noticed until the Incident Response team cannot query necessary data.
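As a small illustration, an alerting check might compare ingest lag against an internal SLA and page the on-call owner through a webhook; the threshold and webhook below are placeholders:

```python
import requests

# Hypothetical internal SLA and alert webhook, for illustration only.
MAX_INGEST_LAG_SECONDS = 300
ALERT_WEBHOOK = "https://hooks.example.com/security-pipeline-alerts"


def check_lag(lag_seconds: float) -> None:
    """Page the on-call owner when ingest lag breaches the internal SLA."""
    if lag_seconds > MAX_INGEST_LAG_SECONDS:
        requests.post(
            ALERT_WEBHOOK,
            json={
                "severity": "high",
                "message": f"Pipeline ingest lag is {lag_seconds:.0f}s "
                           f"(SLA: {MAX_INGEST_LAG_SECONDS}s)",
            },
            timeout=10,
        )


check_lag(lag_seconds=480)  # would trigger an alert under this SLA
```

Deciding what to measure, where the thresholds sit, and who gets paged is the harder part; the code is the easy bit.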
Building a security data pipeline is an endeavor that goes beyond technical expertise. It requires a nuanced understanding of the data’s nature, security landscape, tools, and underlying business requirements. The process is fraught with complexities at every stage, from selecting the right underlying infrastructure, to implementing source connectors, data transformations, normalizations, enrichments, destination connectors, and ensuring fault tolerance, monitoring, and alerting.
Each stage is essential and demands careful planning, execution, and continuous maintenance. Building a custom pipeline will inevitably involve a steep learning curve, with many potential pitfalls and challenges that can consume significant time and resources. The dynamics of the evolving technology landscape further complicate things, demanding constant vigilance and adaptability.
This is why Tarsal’s data pipeline platform is an invaluable tool for security teams. Tarsal eliminates the overwhelming complexities and pitfalls of building and maintaining a robust security pipeline. Tarsal lets security teams deploy an advanced and robust security pipeline within minutes by handling everything from the initial setup to ongoing maintenance and updates. Security teams can concentrate on using the data effectively rather than getting bogged down in the technicalities of wrangling data. Tarsal lets security teams focus on the core mission of understanding, analyzing, and responding to security threats.