So you're a developer or operations engineer (or both: DevOps) working in a small team that either has no access to a Solution Architect, or has one but is still expected to re-architect solutions for security and reliability.
Get to know the AWS Well-Architected Framework and the new Operational Excellence Pillar white papers well. Review each of your apps individually against this framework, and don't worry if you find serious issues: it's well understood within AWS that even the AWS Solution Architects 'always' find critical problems in every workplace they visit to review.
In your architecture diagrams, be sure to include the consistently forgotten critical infrastructure components, such as:
How AWS Accounts connect
Define these properly, show each account ID visually, and show not just VPC but also the availability zones and their subnets and endpoints.
If an RDS instance is Multi-AZ, show it as 2 or 3 RDS instances, not one. The same goes for EC2 Auto Scaling groups: show the EC2 instances in each AZ.
For Lambda, it helps to know which subnet it deploys to and whether there are VPC endpoints for things like S3, because accessing S3 over the internet is sometimes not ideal.
Host OS agents
AWS CloudWatch Logs is extremely easy to set up and will eliminate the need to ever SSH into an instance again. Once a log group is created you can easily set an expiry, via a one-time CLI command or in the console, so that archived logs are reliably purged.
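As a rough illustration of setting that expiry from code rather than the CLI, here is a minimal sketch. It assumes `logs_client` is a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) and that retention is applied per log group. The function name, the log group name in the usage note, and the list of accepted retention periods are my own assumptions, so check them against the current API reference.

```python
# Sketch only: `logs_client` is assumed to be a boto3 CloudWatch Logs client.
# Retention periods (in days) assumed to be accepted by the service; the
# real list has grown over time, so treat this set as illustrative.
ALLOWED_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653,
}


def set_log_expiry(logs_client, log_group_name, days):
    """Set an expiry on a log group so older log events are purged."""
    if days not in ALLOWED_RETENTION_DAYS:
        raise ValueError(f"unsupported retention period: {days} days")
    # put_retention_policy is the CloudWatch Logs call behind the
    # "one-time CLI command" mentioned above.
    logs_client.put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=days,
    )
    return {"logGroupName": log_group_name, "retentionInDays": days}
```

With real credentials the call would look something like `set_log_expiry(boto3.client("logs"), "/my-app/web", 30)`, where the log group name is hypothetical.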
The SSM agent is another great tool to allow easy patching and provide reporting on compliance with AWS Config and Inspector.
And don't forget any 3rd party intrusion prevention or endpoint protection agents such as Trend Micro Deep Security.
Single-Tenant vs Multi-tenant
Make sure you show tenancy aspects of the solution design.
This is key for any ISO (Information Security Officer) or auditor to know from you eventually anyway, so be sure to properly understand your choices earlier in the planning stage rather than have to resort to compensating controls much later in the implementation - which usually come at the highest developer time/cost when there is little or no time left on the project.
Be sure to identify and visually represent SSM Parameter Store for encrypted secret strings such as tokens used by 3rd party integrations, and Secrets Manager for managing database passwords seamlessly.
KMS or CloudHSM, and the key management requirements that developers will have to implement (such as revocation and rotation), should be clearly represented visually and annotated, as these will likely require some additional infrastructure (Lambda) to achieve.
Events and Integration Hooks
Just like in the key management point, we have many used and unused CloudWatch Events, S3 lifecycle, and service specific events (like Kinesis and AWS Glue).
These events are usually of very high business value, underutilised, and in most cases extremely easy for a developer to leverage. But if a requirement never makes its way to a developer because a business unit was not aware of the capability, then as an architect you've pretty much failed to identify and produce value from an essentially free resource that has potential for some groundbreaking user experiences and statistical or behavioural data.
Monitoring is usually an afterthought.
But alerting capabilities are often never given a first thought.
SNS Topics, PagerDuty, and Raygun.io are all great, but what I am talking about is CloudWatch Rules. Just like some of the events mentioned earlier, these can alert you to actions that CloudWatch Alarms cannot. One of my favorite uses for a CloudWatch Rule is with the Glue Data Catalog, because the Apache Hive metastore it generates is useful for EMR, Athena, Redshift Spectrum, and Glue ETL. With a CloudWatch Rule I am notified when interesting things happen, like a database schema change or a failed crawl of S3, so I can automate a retry or make changes to downstream services such as my Athena SQL or models running in EMR.
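To make the Glue example a little more concrete, here is a minimal sketch of the kind of event pattern such a rule could use, plus a helper that picks out failed crawls. The `detail-type` strings and the `detail.state` field reflect my understanding of the events Glue emits and should be verified against the Glue documentation. The `matches` helper is a deliberately tiny stand-in for the real rule matching (top-level string fields only), not how CloudWatch implements it.

```python
import json

# Event pattern for a CloudWatch Rule watching Glue crawlers and the
# Data Catalog. The detail-type names are assumptions to verify.
GLUE_EVENT_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": [
        "Glue Crawler State Change",
        "Glue Data Catalog Database State Change",
        "Glue Data Catalog Table State Change",
    ],
}


def matches(pattern, event):
    """Simplified matching: every pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def is_failed_crawl(event):
    """True for a crawler state-change event whose state is 'Failed'."""
    return (
        event.get("detail-type") == "Glue Crawler State Change"
        and event.get("detail", {}).get("state") == "Failed"
    )


def pattern_as_json():
    """Serialise the pattern in the JSON form a rule definition expects."""
    return json.dumps(GLUE_EVENT_PATTERN, indent=2)
```

A Lambda target could call `is_failed_crawl(event)` on the incoming event and decide whether to restart the crawler or pause downstream jobs.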
Tips by service
AWS Shield
- Free service
- Protects: ELB, CloudFront, Route 53
- SYN/UDP floods
- Reflection attacks
- Layer 3/4 attacks
Penetration testing
- Can only be carried out on certain services:
- EC2, RDS, Aurora, CloudFront, API Gateway, Lambda, Lightsail, DNS Zone Walking.
- Small or micro RDS instance not permitted.
- m1.small, t1.micro or t2.nano EC2 not permitted
AWS Certificate Manager
- Cannot export
- Only for Route53 registered domains
- Works with CloudFront and Load Balancers
API Gateway
- Always throttles requests (default max 10k per sec)
- If the burst limit is exceeded, 429 Too Many Requests is returned
- Use TTL caching to mitigate, max 3600 sec
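On the client side, a throttled API that answers 429 Too Many Requests is usually handled by retrying with exponential backoff. This is a complement to the caching advice above, not something the notes prescribe. The sketch below is a minimal version under stated assumptions: `TooManyRequests` is a hypothetical exception that your own HTTP wrapper would raise on a 429, and `request` is any zero-argument callable.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical: raised by the caller's request function on HTTP 429."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `request`, retrying on TooManyRequests with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests; in normal use the default `time.sleep` is fine.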
AWS Systems Manager
- Parameter Store: EC2 (RunCommand), Lambda, CloudFormation, API
- RunCommand works on instances or tags, and executes as root
S3
- Replication uses SSL by default, versioning must be enabled at both ends
- Object DELETEs are replicated, DELETEs of a specific version are not
- Enforce SSL access using an aws:SecureTransport Condition in the bucket policy
- Pre-signed URL default 1hr, you can pre-sign PutObject data too
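For the aws:SecureTransport point, a bucket policy along these lines is the usual shape: deny everything when the request did not arrive over SSL/TLS. This is a minimal sketch that only builds the policy document. The function name and the bucket name in the usage note are hypothetical, and you would still need to attach the policy to the bucket yourself.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build a bucket policy that denies any request not made over SSL/TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def policy_as_json(bucket_name):
    """Serialise the policy for pasting into the console or an API call."""
    return json.dumps(deny_insecure_transport_policy(bucket_name), indent=2)
```

For example, `policy_as_json("example-bucket")` returns the JSON text for a bucket with that (made-up) name.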
CloudTrail
- Logs are delivered to S3 every 5 mins, with up to a 15 min delay
- Only logs API calls, does not record instance data, SSH or RDP sessions
- Logs include api request metadata, identity, time, sourceIP, parameters, and response elements
- Use DIGEST for integrity validation of logs using SHA-256 hashing with RSA for signing
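The SHA-256 half of the digest check can be sketched in a few lines. This assumes, based on my reading of the digest file format, that each log file entry in a digest carries `hashAlgorithm` and a hex-encoded `hashValue` computed over the uncompressed log content; verify those field names against the CloudTrail documentation. Verifying the RSA signature of the digest file itself is deliberately left out here.

```python
import gzip
import hashlib
import hmac


def log_file_matches_digest(gzipped_log_bytes, digest_entry):
    """Compare a downloaded log file against its entry in a digest file.

    `digest_entry` is assumed to be one item from the digest's list of log
    files, with 'hashAlgorithm' and a hex-encoded 'hashValue'.
    """
    if digest_entry.get("hashAlgorithm") != "SHA-256":
        raise ValueError("only SHA-256 digests are handled in this sketch")
    # Log files are stored gzip-compressed; hash the uncompressed content.
    actual = hashlib.sha256(gzip.decompress(gzipped_log_bytes)).hexdigest()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, digest_entry["hashValue"].lower())
```

A mismatch means the log file was modified (or is not the file the digest refers to), which is exactly what the integrity validation feature is meant to surface.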
CloudWatch Events
- Events are near real-time and come from: resource changes, CloudTrail, schedules, or custom events from code
- Rules match incoming Events, Rule Targets can be Lambda, SNS, SQS, Kinesis
EC2
- Dedicated instances are account locked and may share hardware with other instances that are not dedicated if they are of the same account
- AWS Staff have access to Hosts and Hypervisors, they cannot access guest operating systems
- RAM and storage are securely scrubbed before delivery to customers
AWS Config
- Resource inventory, configuration history, with change notification
- Trigger periodic, or filtered snapshots
Amazon Inspector
- Needs EC2 agents installed on the assessment target
- Create assessment templates to run and verify rules against findings
- Detects common vulnerabilities, CIS benchmark, and runtime behaviour analysis of network, file, and process as well as advises remediation
Trusted Advisor
- Cost optimisation, performance, availability, and limited security
- Requires business/enterprise support for unlocked features
CloudHSM
- Only CloudHSM (full control of keys) meets FIPS 140-2 level 3, KMS does not.
- CloudHSM offers asymmetric encryption, KMS only has symmetric available.
- CloudHSM is single tenanted, KMS is multi-tenanted.
- 4 main user types:
  - PRECO – Precrypto officer
    - User and password management
  - PCO or CO – Crypto officer
    - All key management, key access issuing, material creation, signing, verifying, chaining
  - CU – Crypto User
    - Basic key management (create, rotate), allowed key exporting, signing, verifying
  - AU – Appliance User
    - Can perform cloning and sync operations in a cluster
KMS
- KMS is region specific
- KMS cannot be used for EC2 key pairs: these are asymmetric, and allowing AWS to generate them would allow AWS access into the EC2 instances
- KMS integrates with EBS, S3, Redshift, Elastic Transcoder, WorkMail, RDS, more
- Keys include: alias, creation date, description, state, key material
- Cannot be exported
- AWS-managed CMK have no material
- Customer-managed CMK need symmetric 256-bit key material provided
- You can also build extra resiliency by storing the key yourself outside of AWS
- Avoids the 7-30 day wait when deleting keys
- No automatic key rotation
AWS Shield Advanced
- Shield Standard is on by default, $3000 a month for the advanced options:
  - Incident response team
  - In-depth reporting
  - Payment reductions when victim of an attack
AWS WAF
- Integrate regionally with load balancers, or associate with a CloudFront distribution for global coverage
- IPv6 is supported, and CIDR blocks /8, /16, /24, /32
- Allow or Block all except specified
- Count requests based on properties matched
- Properties: IP, Country, header values, body length, SQL with known exploits, scripts with known XSS
AWS Marketplace
- AWS Partners and Authorised appliances may be used to conduct testing
- Firewalls, Hardened OS, WAF, Antivirus, Security monitoring, etc
- CIS Benchmarked OS
VPC
- Ensure the default route table has no public route out to the internet, so new subnets that are associated with it by default are secure by default (AWS default route tables are not secure by default)
- The same applies to the NACL allowing all inbound and outbound traffic
- NAT instances must be behind a security group and in a public subnet, need source/destination checks disabled, are not patched by AWS, are not scalable, and live in a single AZ and a single subnet.
- Prefer NAT Gateways that are patched by AWS, scalable to 10Gbps, no security groups associated, no need to disable source/destination checks, apply to a route table, and get a public IP by default
- ALBs need at least 2 public subnets
- Flow logs cannot be tagged, and the IAM role or configuration cannot be edited
- VPC Peered Flow logs only work when both VPCs are in the same account
- Flow logs do not monitor DHCP, DNS, instance metadata address 169.254.169.254, Windows activation, or the AWS reserved IP addresses
- VPC Endpoints are either:
  - Interface: single ENI
  - Gateways: more durable (not single ENI)
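The point about the default route table is easy to audit. Below is a minimal sketch that flags main route tables holding a 0.0.0.0/0 route to an internet gateway. It assumes the input is the `RouteTables` list in the shape EC2's describe-route-tables call returns (`Associations[].Main`, `Routes[].DestinationCidrBlock`, `Routes[].GatewayId`). The function name and the IDs in the usage note are hypothetical.

```python
def main_tables_with_internet_route(route_tables):
    """Return IDs of main route tables that route 0.0.0.0/0 to an IGW."""
    flagged = []
    for table in route_tables:
        # The main (default) table is marked on one of its associations.
        is_main = any(a.get("Main") for a in table.get("Associations", []))
        # Internet gateway IDs are prefixed "igw-".
        has_public_route = any(
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and str(route.get("GatewayId", "")).startswith("igw-")
            for route in table.get("Routes", [])
        )
        if is_main and has_public_route:
            flagged.append(table["RouteTableId"])
    return flagged
```

Fed with the route tables of a VPC, an empty result means new subnets will not silently inherit a public route; a result like `["rtb-0abc"]` (a made-up ID) points at the table to fix.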
General security concepts
- CIA: confidentiality, integrity, availability
- AAA: authentication, authorization, accounting
- Non-repudiation = cannot deny a fact
- The Shared Responsibility model changes across the infrastructure, container, and abstracted service levels:
  - Infrastructure (EC2, VPC, EBS)
  - Container (RDS, EMR)
  - Abstracted (S3, DynamoDB, SQS)
- ECDHE for Perfect Forward Secrecy; Elliptic Curve DHE (Diffie-Hellman Ephemeral) key exchange.
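To see what asking for ECDHE looks like in practice, here is a minimal client-side sketch using Python's standard `ssl` module. It restricts the TLS 1.2 cipher suites to ECDHE key exchange with AES-GCM. The function names are my own, and the exact suites you end up with depend on the OpenSSL build behind your Python. TLS 1.3 suites are configured separately by OpenSSL and always use ephemeral key exchange, so they are filtered out of the listing.

```python
import ssl


def forward_secret_client_context():
    """Client TLS context whose TLS 1.2 suites all use ECDHE key exchange."""
    ctx = ssl.create_default_context()
    # OpenSSL cipher string: ECDHE key exchange combined with AES-GCM.
    ctx.set_ciphers("ECDHE+AESGCM")
    return ctx


def tls12_cipher_names(ctx):
    """List enabled cipher names, skipping TLS 1.3 suites (named TLS_...)."""
    return [c["name"] for c in ctx.get_ciphers() if not c["name"].startswith("TLS_")]
```

Printing `tls12_cipher_names(forward_secret_client_context())` is a quick way to confirm that nothing without an ephemeral key exchange slipped through.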