diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/404.html b/404.html new file mode 100644 index 00000000..301c1efc --- /dev/null +++ b/404.html @@ -0,0 +1,4101 @@ + + + +
+ + + + + + + + + + + + + + +Our internships are aimed at current PhD students looking for an industrial placement of around five months with the right to work in the UK. The projects are focussed on innovation, in particular around getting the most value out of NHS data.
+The projects often focus on emerging data science techniques, so we advertise mainly to data science programmes. However, previous interns have come from other backgrounds, such as clinical, mathematics, computer science and bioinformatics, and have added huge value through their range of approaches and knowledge.
+For more information and details on how to apply, see the Scheme Overview page on the microsite.
+For details on open projects, see the Projects page on the microsite.
+Available outputs from previous projects can also be seen at Previous Projects on the microsite.
+Currently our interns are working on the following projects in two waves. These are the original briefs they applied to and their work and outputs will be available on our organisation GitHub.
+Wave 6 | February - July 2024 |
---|---|
 | NHS Language Corpus Extension |
 | Understanding Fairness and Explainability in Multi-modal Approaches within Healthcare |
Wave 7 | July - December 2024 |
 | Evaluating NER-focussed models and LLMs for identifying key entities in histopathology reports – working with GOSH DRIVE |
 | Investigating Privacy Concerns and Mitigations for Healthcare Language and Foundation Models |
We are the NHS England Data Science Team.
+We are passionate about getting the most value out of the data collected by NHS England and the wider NHS through applying innovative techniques in appropriate and well-considered ways.
+Our vision is:
+++ + +Embed ambitious yet accessible data science in health and care to help people live healthier longer lives
+
In NHSE, data scientists are concentrated in the central team but are also embedded across a number of other areas.
+Data Linkage
+The Data Linkage Hub aims to provide a unified, high-quality solution to the data linkage needs of NHS England. Data science is central to achieving this objective and covers many aspects, from the mathematical models of entity resolution and record linkage to identifying and correcting linkage errors, assessing their impact on downstream applications, and ensuring quality.
+ +Central Data Science Team
+We develop and deploy data science products to make a positive impact on NHS patients and workforce. We investigate applying novel techniques that increase the insight we get from health-related data. We prioritise code-first ways of working, transparency and promoting best practice. We champion quality, safety and ethics in the application of methods and use of data. We have the remit to be open and collaborative and have the aim of sharing products with the wider healthcare community.
+See our Projects
+National SDE Team
+We work with customer researchers and analysts to identify how they can do their research, overcome and rectify data issues, and use the platform and data to their fullest. We also create products and tools that facilitate research in the environment, such as data quality and completeness visualisations and example analysis and machine learning code. Alongside this, we work on continuous improvement and increasing automation of the processes that get data both into the SDE and out through output checking.
+See SDE website.
+Other Embedded Data Scientists
+Across the organisation, individual data scientists are embedded within specific teams (including Workforce, Training and Education (WT&E); Medicines; Patient Safety; AI Lab; and Digital Channels).
+We come together through the data science assembly to align our professional development and standards.
+To support knowledge sharing on data science in healthcare, we've put together a monthly newsletter with valuable insights, training opportunities and events.
+Note
+The newsletter is targeted towards members of the NHS England Data Science team, so some links may only be accessible to those with the necessary login credentials. However, the newsletter and its archive are available to all at the link above.
+We also support the NHS Data Science Community hosted in AnalystX, which is the home of spreading data science knowledge within the NHS. You can also learn a lot about data science from the other communities we support:
+ +Name | Role | Team | Github |
---|---|---|---|
Sarah Culkin | Deputy Director | Central Data Science Team | SCulkin-code |
Rupert Chaplin | Assistant Director | Central Data Science Team | rupchap |
Jonathan Hope | Data Science Lead | Central Data Science Team | JonathanHope42 |
Jonathan Pearson | Data Science Lead | Central Data Science Team | JRPearson500 |
Achut Manandhar | Data Science Lead | Central Data Science Team | achutman |
Jennifer Hall | Data Science Lead | Data Linking Hub | |
Simone Chung | Principal Data Scientist (Section Head) | Central Data Science Team | simonechung |
Efrosini Serakis | Principal Data Scientist (Section Head) | Central Data Science Team | efrosini-s |
Sam Hollings | Principal Data Scientist | Central Data Science Team | SamHollings |
Daniel Schofield | Principal Data Scientist (Section Head) | Central Data Science Team | danjscho |
Eladia Valles Carrera | Principal Data Scientist | Central Data Science Team | lilianavalles |
Paul Carroll | Principal Data Scientist (Section Head) | Central Data Science Team | pauldcarroll |
Elizabeth Johnstone | Principal Data Scientist (Section Head) | Central Data Science Team | LiziJohnstone |
Nicholas Groves-Kirkby | Principal Data Scientist (Section Head) | Central Data Science Team | ngk009 |
Divya Balasubramanian | Principal Data Scientist (Section Head) | Central Data Science Team | divyabala09 |
Giulia Mantovani | Principal Data Scientist (Section Head) | Data Linking Hub | GiuliaMantovani1 |
Angeliki Antonarou | Principal Data Scientist | National SDE Data Science Team | AnelikiA |
Kevin Fasusi | Principal Data Scientist | National SDE Data Science Team | KevinFasusi |
Jonny Laidler | Senior Data Scientist | Central Data Science Team | JonathanLaidler |
Mia Noonan | Senior Data Scientist | Central Data Science Team | amelianoonan1-nhs |
Sean Aller | Senior Data Scientist | Central Data Science Team | seanaller |
Hadi Modarres | Senior Data Scientist | Central Data Science Team | hadimodarres1 |
Michael Spence | Senior Data Scientist | Central Data Science Team | mspence-nhs |
Harriet Sands | Senior Data Scientist | Central Data Science Team | harrietrs |
Alice Tapper | Senior Data Scientist | Central Data Science Team | alicetapper1 |
Ben Wallace | Senior Data Scientist | Central Data Science Team | |
Jane Kirkpatrick | Senior Data Scientist | Central Data Science Team | |
Kenneth Quan | Senior Data Scientist | Central Data Science Team | quan14 |
Daniel Goldwater | Senior Data Scientist | Central Data Science Team | DanGoldwater1 |
Shoaib Ali Ajaib | Senior Data Scientist | National SDE Team | |
Marek Salamon | Senior Data Scientist | National SDE Team | |
Adam Hollings | Senior Data Scientist | National SDE Team | AdamHollings |
Oluwadamiloju Makinde | Senior Data Scientist | National SDE Team | |
Joseph Wilson | Senior Data Scientist | National SDE Team | josephwilson8-nhs |
Alistair Jones | Senior Data Scientist | National SDE Team | alistair-jones |
Nickie Wareing | Senior Data Scientist | National SDE Team | nickiewareing |
Helen Richardson | Senior Data Scientist | National SDE Team | helrich |
Humaira Hussein | Senior Data Scientist | National SDE Team | humairahussein1 |
Jake Kasan | Senior Data Wrangler (contract) | National SDE Team | |
Lucy Harris | Senior Data Scientist | Meds Team | |
Vithursan Vijayachandrah | Senior Data Scientist | Workforce, Training & Education Team | VithurshanVijayachandranNHSE |
Warren Davies | Data Scientist | Central Data Science Team | warren-davies4 |
Sami Sultan | Data Scientist | Workforce, Training & Education Team | SamiSultanNHSE |
Chaeyoon Kim | Data Scientist | Workforce, Training & Education Team | ChaeyoonKimNHSE |
Ilja Lomkov | Data Scientist | Workforce, Training & Education Team | IljaLomkovNHSE |
Thomas Bouchard | Data Science Officer | Central Data Science Team | tom-bouchard |
Catherine Sadler | Data Science Officer | Central Data Science Team | CatherineSadler |
William Poulett | Data Science Officer | Central Data Science Team | willpoulett |
Amaia Imaz Blanco | Data Science Officer | Central Data Science Team | amaiaita |
Xiyao Zhuang | Data Science Officer | Central Data Science Team | xiyaozhuang |
Scarlett Kynoch | Data Science Officer | Central Data Science Team | scarlett-k-nhs |
Jennifer Struthers | Data Science Officer | Central Data Science Team | jenniferstruthers1-nhs |
Matthew Taylor | Data Science Officer | Central Data Science Team | mtaylor57 |
Elizabeth Kelly | Data Science Officer | National SDE Team | ejkcode |
++Reproducible analytical pipelines (RAP) help ensure all published statistics meet the highest standards of transparency and reproducibility. Sam Hollings and Alistair Bullward share their insights on adopting RAP and give advice to those starting out.
+
Reproducible analytical pipelines (RAP) are automated statistical and analytical processes that apply to data analysis. They are a key part of national strategy and are widely used in the civil service.
+Over the past year, we’ve been going through a change programme and adopting RAP in our Data Services directorate. We’re still in the early stages of our journey, but already we’ve accomplished a lot and had some hard-learnt lessons.
+ + + +This is about analytics and data, but knowledge of RAP isn’t just for those cutting code day-to-day. It’s crucial that senior colleagues understand the levels and benefits of RAP and get involved in promoting this new way of working and planning how we implement it.
+This improves the lives of our data analysts and the quality of our work.
+ + + + + + + + + + + + + + + + + + + +++ + +Over recent years, larger, more data-intensive Language Models (LMs) with greatly enhanced performance have been developed. The enhanced functionality has driven widespread interest in adoption of LMs in Healthcare, owing to the large amounts of unstructured text data generated within healthcare pathways.
+However, with this heightened interest, it becomes critical to comprehend the inherent privacy risks associated with these LMs, given the sensitive nature of Healthcare data. This PhD Internship project sought to understand more about the Privacy-Risk Landscape for healthcare LMs through a literature review and exploration of some technical applications.
+
Studies have shown that LMs can inadvertently memorise and disclose information verbatim from their training data when prompted in certain ways, a phenomenon referred to as training data leakage. This leakage can violate the privacy assumptions under which datasets were collected and can make diverse information more easily searchable.
+As LMs have grown, their ability to memorize training data has increased, leading to substantial privacy concerns. The amount of duplicated text in the training data also correlates with memorization in LMs. This is especially relevant in healthcare due to the highly duplicated text in Electronic Healthcare Records (EHRs).
+If LMs have been trained on private data and are subsequently accessible to users who lack direct access to the original training data, the model could leak this sensitive information. This is a concern even if the user has no malicious intent.
+A malicious user can stage a privacy attack on an LM to purposely extract information about the training data. Researchers can also use these attacks to measure memorization in LMs. There are several different attack types with distinct attacker objectives.
+One of the best-known attack types is the Membership Inference Attack (MIA). MIAs determine whether a data point was included in the training data of the targeted model. Such attacks can result in various privacy breaches; for instance, discerning that a text sequence generated by a Clinical LM (trained on EHRs) originated from the training data can disclose sensitive patient information.
+At the simplest level, MIAs use the confidence of the target model on a target data instance to predict membership. A threshold is set against the confidence of the model to ascertain membership status. For a specific example, if the confidence is greater than the threshold then the attacker assumes the target is a member of the training data, as the model is "unsurprised" to see this example, indicating it has likely seen this example before during training. Currently, the most successful MIAs use reference models. This refers to a second model trained on a dataset similar to the training data of the target model. The reference model filters out uninteresting common examples, which will also be "unsurprising" to the reference model.
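The two attack styles described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes the attacker already has a per-example loss from the target model (and, for the second attack, from a reference model), and it treats the negative loss as the model's "confidence". The function names and the example numbers are hypothetical.

```python
from typing import List, Sequence


def threshold_mia(target_losses: Sequence[float], threshold: float) -> List[bool]:
    """Simple MIA: predict 'member' when confidence (-loss) exceeds the threshold."""
    return [(-loss) > threshold for loss in target_losses]


def reference_mia(
    target_losses: Sequence[float],
    reference_losses: Sequence[float],
    threshold: float,
) -> List[bool]:
    """Reference-based MIA: score each example by how much less 'surprised'
    the target model is than the reference model (reference loss - target loss)."""
    if len(target_losses) != len(reference_losses):
        raise ValueError("target and reference losses must have the same length")
    scores = [ref - tgt for tgt, ref in zip(target_losses, reference_losses)]
    return [score > threshold for score in scores]


# Hypothetical losses for three text sequences:
#   0: common phrase (low loss under both models)
#   1: unusual text not memorised by either model
#   2: text the target model finds easy but the reference model does not
target = [0.2, 2.5, 0.3]
reference = [0.25, 2.6, 1.9]

simple_guess = threshold_mia(target, threshold=-1.0)
calibrated_guess = reference_mia(target, reference, threshold=0.5)
```

In this toy example the simple attack flags sequences 0 and 2 as members, while the reference-based attack flags only sequence 2, because the common phrase is equally "unsurprising" to the reference model and is filtered out.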
+There are three primary approaches to mitigate privacy risks in LMs:
+In this project, we sought to understand more about the Privacy-Risk Landscape for Healthcare LMs and conduct a practical investigation of some existing privacy attacks and defensive methods.
+Initially, we conducted a thorough literature search to understand the privacy risk landscape. Our first applied work package explored data deduplication before model training as a mitigation to reduce memorization and evaluated the approach with Membership Inference Attacks. We showed that RoBERTa models trained on patient notes are highly vulnerable to MIAs, even when only trained for a single epoch. We investigated data deduplication as a mitigation strategy but found that these models were just as vulnerable to MIAs. Further investigation of models trained for multiple epochs is needed to confirm these results. In the future, semantic deduplication could be a promising avenue for medical notes.
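As a rough illustration of deduplication before training, the sketch below removes exact duplicates after light normalisation. The source does not state which deduplication method the project used, so the normalisation rule (lower-casing and collapsing whitespace), the hashing choice and the function names are all assumptions made for this example.

```python
import hashlib
import re
from typing import Iterable, List


def normalise(text: str) -> str:
    """Lower-case and collapse runs of whitespace so trivial variants match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(notes: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each note, comparing normalised content hashes."""
    seen = set()
    unique = []
    for note in notes:
        digest = hashlib.sha256(normalise(note).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(note)
    return unique


# Hypothetical, non-patient example notes.
notes = ["Patient stable.", "patient   stable.", "Review in 2 weeks."]
deduplicated = deduplicate(notes)
```

Exact matching like this only removes notes that are identical after normalisation. Notes that are reworded but mean the same thing would survive, which is the gap that semantic deduplication is intended to address.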
+Our second applied work package explored editing/unlearning approaches for healthcare LMs. Unlearning in LMs is poised to become increasingly relevant, especially in light of the growing awareness surrounding training data leakage and the 'Right to be Forgotten'. We found that many repositories for performing such approaches were not adapted for all LM types, and some are still not mature enough to be easy to use as packages. Exploring a Locate-then-Edit approach to Knowledge Neurons, we found this was not well suited to the erasure of information we needed in medical notes. Our findings suggest that the focus from a privacy perspective on these methods should be on those which allow the erasure of specific training data instances instead of relational facts.
+This work primarily explored privacy in pre-trained Masked Language Models. The growing adoption of generative LMs underscores the importance of expanding this work to Decoder and Encoder-Decoder models like the GPT family and T5. Also, due to the common practice of freezing parameters and tuning the last layer of an LM on a private dataset, it is critical to expand investigations of privacy risks to LMs fine-tuned on healthcare data.
+Within the scope of this exploration, the field of Machine Unlearning/Editing applied to LMs was in its infancy, but it is gaining momentum. As this field matures, comparing the efficacy of different methods becomes crucial. Furthermore, it is important to explore the effect of removing the influence of a set of data points. A holistic examination of the effectiveness, privacy implications, and broader impacts of Machine Unlearning/Editing methods on healthcare LMs is essential to inform the development of robust and privacy-conscious LMs in the NHS.
+Explainability of models often involves generating explanations or counterfactuals alongside the decisions made by the LM. However, integrating explanations into the output of LMs can introduce vulnerabilities related to training data leakage and privacy attacks. Additionally, efforts to enhance privacy, such as employing privacy-preserving training techniques, can inadvertently impact fairness, particularly in datasets lacking diversity. In healthcare, all three elements are paramount, so investigating the privacy-explainability-fairness trade-off is crucial for developing private, robust and ethically sound LMs.
+Finally, privacy concerns in several emerging trends for LMs need to be understood in Healthcare scenarios. Incorporating external Knowledge Bases to enhance LMs, known as retrieval augmentation, could make LMs more likely to leak private information. Further, Multimodal Large Language Models (MLLMs), which are LM-based models that can take in and reason over the multimodal information common in healthcare, could be susceptible to leaking information from one input modality through another output modality.