Data lake: Design for schema evolution
[EuroPython 2021 - Talk - 2021-07-29 - Parrot [Data Science]]
[Online]
By Prakshi Yadav
Designing a data lake necessitates well-researched storage, management, scalability, and availability solutions. However, managing schema evolution remains a difficult task. The structure of data differs from one company to the next, making it difficult to generalize a solution to the schema evolution problem.
At Episource, we faced this challenge ourselves: our data of interest is the output of our NLP engine. Episource's machine learning and natural language processing platform processes millions of pages of medical documents, with up to 15 ML/DL models working together to produce the results. The output of such a complex pipeline is a series of deeply nested JSON documents. With each major update, our NLP engine evolves, and the structure of the inference data evolves with it. As the data grew in size and complexity, storing it and making it searchable became a pressing need. We needed a solution that kept schema compatibility, versioning, and data integrity intact, and we wanted to make sure data reads and writes were unaffected by schema mismatches.
After several iterations and proofs of concept, we settled on a solution that uses the Avro format to evolve our data's schema. Like Parquet, Avro stores a schema alongside the data, but it is row-oriented rather than columnar, and its schema resolution rules (for example, default values for newly added fields) are designed to accommodate schema evolution. To keep track of schema changes, schema versions are saved in a schema registry. To read the Avro data stored in S3, our data lake uses Athena, a distributed SQL engine based on Presto. The solution uses Python libraries to glue the components of this pipeline together.
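As a rough illustration of the idea (not the Episource implementation), the sketch below models a versioned schema registry and a simplified form of Avro's reader/writer schema resolution in plain Python. The `SchemaRegistry` class, the `nlp-inference` subject, and the field names are hypothetical, and the compatibility check covers only added or removed top-level fields. Real Avro resolution also handles type promotion, unions, aliases, and nested records.

```python
import copy


def _fields(schema):
    """Map field name -> field definition for an Avro-style record schema."""
    return {f["name"]: f for f in schema["fields"]}


def is_backward_compatible(reader, writer):
    """True if data written with `writer` can be read with `reader`.

    Simplified rule: every reader field must either exist in the writer
    schema or declare a default value.
    """
    writer_fields = _fields(writer)
    return all(
        name in writer_fields or "default" in field
        for name, field in _fields(reader).items()
    )


def resolve(record, writer, reader):
    """Project a record written with `writer` onto the `reader` schema."""
    writer_fields = _fields(writer)
    out = {}
    for name, field in _fields(reader).items():
        if name in writer_fields:
            out[name] = record[name]  # field present in the old data
        elif "default" in field:
            out[name] = copy.deepcopy(field["default"])  # fill from default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out  # writer-only fields are dropped, as in Avro


class SchemaRegistry:
    """Hypothetical in-memory registry: subject -> ordered schema versions."""

    def __init__(self):
        self._versions = {}

    def register(self, subject, schema):
        versions = self._versions.setdefault(subject, [])
        if versions and not is_backward_compatible(schema, versions[-1]):
            raise ValueError("schema is not backward compatible")
        versions.append(copy.deepcopy(schema))
        return len(versions)  # 1-based version number

    def get(self, subject, version):
        return self._versions[subject][version - 1]


# Hypothetical schemas: V2 adds an optional field with a default.
V1 = {"type": "record", "name": "Inference", "fields": [
    {"name": "doc_id", "type": "string"},
    {"name": "codes", "type": {"type": "array", "items": "string"}},
]}
V2 = {"type": "record", "name": "Inference", "fields": V1["fields"] + [
    {"name": "model_version", "type": ["null", "string"], "default": None},
]}

registry = SchemaRegistry()
v1 = registry.register("nlp-inference", V1)
v2 = registry.register("nlp-inference", V2)

old_record = {"doc_id": "a1", "codes": ["code-1"]}
upgraded = resolve(
    old_record,
    registry.get("nlp-inference", v1),
    registry.get("nlp-inference", v2),
)
```

Registering a schema that adds a field without a default would be rejected here, because records already written under the previous version could no longer be read. In a real pipeline, an Avro library and a registry service would perform these checks instead of hand-written helpers.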
The following are some of the things that a participant can expect to learn during this talk:
Best practices for storage, control, scalability, and availability in a data lake
Managing schema evolution in a data lake
How to use both "schema-on-write" and "schema-on-read"
License: This video is licensed under the CC BY-NC-SA 4.0 license: https://creativecommons.org/licenses/...
Please see our speaker release agreement for details: https://ep2021.europython.eu/events/s...