Singapore – Lakehouse company Databricks has announced the release of Dolly 2.0, the world’s first open-source, instruction-following large language model (LLM) that is fine-tuned on a human-generated instruction dataset licensed for commercial use.
This follows the March 2023 release of the original Dolly, an LLM trained for less than US$30 to exhibit ChatGPT-like human interactivity.
The 12B-parameter language model is based on the EleutherAI Pythia model family and fine-tuned exclusively on a high-quality, human-generated instruction-following dataset, which was crowdsourced among Databricks employees.
Databricks is also open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This enables any organisation to create, own, and customise powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.
Its databricks-dolly-15k dataset contains 15,000 high-quality, human-generated prompt/response pairs specifically designed for instruction tuning large language models. Anyone can use, modify, or extend the dataset for any purpose, including commercial applications.
“Dolly 2.0 is a game changer as it enables all organisations around the world to build their own bespoke models for their particular use cases to automate things and make processes much more productive in the field they’re in,” said Ali Ghodsi, CEO of Databricks.
Ghodsi added that with Dolly 2.0, any organisation can create, own, and customise a powerful LLM to build a competitive advantage for its business.
Last year, Databricks also launched ‘The Lakehouse for Media & Entertainment’, the first lakehouse platform for data-driven businesses in the media and entertainment industry.