Engineering and Product · Portugal, United Kingdom · Fully Remote

Site Reliability Engineer

Ensure reliability, performance, and scalability across our cloud systems as a Site Reliability Engineer — designing resilient AWS solutions and empowering teams to ship with confidence.

A bit about us:

Virtuoso's mission is to enable and lead the world's quality-first revolution. The field of QA has not kept pace with the software industry's transition to CI/CD. We are fixing that.

Virtuoso has reimagined how software is tested by developing a game-changing platform that is already being used by the biggest names in software. We passionately believe that anyone should be able to create and maintain tests regardless of their technical skill, and that quality is a key driver for change and growth. The latest advances in AI and Machine Learning have been leveraged to produce test automation software that thinks like a human, empowers everyone to test, and for the first time delivers on the promise of codeless test automation. Remarkable success has become business as usual for us, and we need to rapidly expand our team to keep it growing. Want to join the quality-first revolution? Then read on.

We are a company without borders, with offices and a remote team spread across the globe and employees who make an impact worldwide. The nature of our product is reflected in our thorough and agile culture. We do the right things fast and our application process is no different. We want exceptional people and we will act to get them.

About the Role:

As a Site Reliability Engineer (SRE), you will play a critical role in ensuring the reliability, performance, and availability of our systems.

You will be a member of the SRE team and work within a product engineering squad to help them ship and maintain reliable features.

You will have an impact by:

Helping to design and implement cloud-native solutions around user needs
Proposing and delivering architecture/system changes to improve the reliability, stability, and throughput of our systems
Working closely with the SRE team to refine and plan enhancements to the cloud estate as a whole
Responding to incidents and designing remediation plans to ensure they do not recur

You will primarily work in Terraform with the AWS stack, but will also be trusted to understand and make changes to our backend codebase, which is mostly Java.

Key Tasks:

Work with squads to design and deliver infrastructure required for product features
Ensure that deliveries are sufficiently robust and monitored by designing for observability and reliability in collaboration with engineering team members
Lead incident response activities, including root cause analysis, problem resolution, and post-incident reviews
Identify and deliver initiatives to improve reliability, observability, and throughput of key systems
Assist cross-functional teams with upskilling and support in monitoring, CI pipelines, release engineering, and IaC/AWS

How will success be measured in this role (general points not individual KPIs)?

Coordinate the response to incidents (if any), aiming for a TTFR of one hour (office hours only), and produce a post-mortem and preventative measures for each. KPIs: number of incident responses handled, TTFR
Ensure that at least 60% of features delivered by the squad produce corresponding Grafana metrics. KPI: number of epics with completed metric tasks
Ensure that SRE knowledge is shared and documented through written documentation, workshops, and individual training. KPI: number of Notion articles, workshops, pairing sessions
Identify at least one enhancement to the platform that leads to improved reliability, throughput, or cost. KPI: number of SRE team tickets closed with measured impact

Skills required (Learned and Applied Abilities):

Experience with AWS or an equivalent computing stack
Knowledge of container-based and/or serverless compute environments (e.g. ECS)
Basic awareness of RDBMS workloads
Ability to acquire new skills as necessary
Basic understanding of at least one major programming language
Understanding of asynchronous message processing pipelines and their failure modes
Experience with conducting software releases/migrations

Competencies required and to be demonstrated (Traits, Attitudes, Behaviours):

Strong problem-solving and troubleshooting abilities.
Excellent communication and collaboration skills.
Ability to work effectively in a cross-functional team environment.
Adaptability and flexibility to work in a fast-paced, dynamic organization.
Attention to detail and a focus on delivering high-quality results.
Willingness to learn and stay updated on industry best practices and emerging technologies.

Qualifications and Experience Required:

5 years of experience with open-source technologies

Experience working within a business and managing stakeholders

Education level required, if any: Master's (or equivalent work experience) in an engineering/computer/software discipline

Workplace Experience if required (industry, role etc, technology space):

5 years of experience with open-source technologies

What's in it for You...

Competitive Package, including generous and achievable uncapped commission
Employee Share Options - share in the success of Virtuoso
A defined, transparent, career path to more senior roles
Remote/flexible working
Private health insurance
Training/personal development budget of a minimum of £500 per year
Take your birthday as a holiday every year!
Holiday allowance increases by one day per year of service up to 5 years
Employee Referral Scheme - we put money in your pocket for referring awesome people!

About Virtuoso

Virtuoso was developed by a team passionate about improving the quality of low-code/no-code test automation software without slowing down the development process. As work shifts more to the cloud and teams work remotely, on-premise software has become unwieldy and a bottleneck. We've reimagined test automation software by pioneering the next generation of low-code/no-code testing - all on the cloud. We believe anyone can test, and we're delivering on the promise of low-code/no-code test automation.

Founded in 2017

Co-workers 80
