Large Language Models
Sci-former: A pocket LLM for scientific papers
Our research focuses on pretraining a large language model (LLM) with minimal parameters, aiming for maximal efficiency. The model is trained on domain-specific data, exclusively scientific research papers sourced from arXiv. To improve computational efficiency, we employ sliding-window attention, which reduces the quadratic cost of full self-attention and enables the model to handle significantly longer input sequences. The implementation also incorporates grouped query attention, which shares key/value heads across groups of query heads to cut memory and bandwidth costs while preserving the model's ability to capture contextual relationships. Additionally, our work addresses potential bias toward highly cited research papers, aiming for a more balanced and inclusive representation of scientific knowledge within the trained model.
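To illustrate how these two mechanisms fit together, the sketch below combines a sliding-window causal mask with a grouped-query attention step in PyTorch. All the specifics (window size, head counts, the helper names) are illustrative assumptions rather than our actual configuration, and the mask is materialized in full for clarity; an efficient implementation would compute scores only inside each window.

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # Query position i may attend only to key positions in [i - window + 1, i],
    # capping attention work at O(seq_len * window) instead of O(seq_len^2).
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)      # True where attention is allowed

def grouped_query_attention(q, k, v, window: int):
    # q: (batch, n_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim).
    # Each K/V head is shared by a group of n_heads // n_kv_heads query heads.
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = sliding_window_causal_mask(q.shape[2], window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 query heads sharing 2 K/V heads, window of 4 tokens.
q = torch.randn(1, 8, 16, 32)
kv = torch.randn(1, 2, 16, 32)
out = grouped_query_attention(q, kv, kv.clone(), window=4)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```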
HindiVani LLM
“HindiVani LLM” is a large language model tailored to the complexities of the Hindi language. Through continuous training and optimization on a diverse, comprehensive Hindi dataset, it aims to capture the nuances of Hindi language structure, semantics, and cultural context. By staying current with linguistic trends and developments in the Hindi-speaking community, HindiVani LLM enables users to generate high-quality content, streamline communication, and build insightful applications for Hindi speakers.

The project entailed curating a comprehensive Hindi dataset from reputable platforms including news websites, the Press Information Bureau (PIB), NPTEL, and other reliable web resources. The dataset spans news articles, press releases, educational materials, and other online Hindi text, compiled with careful attention to richness and authenticity so that it captures the wide range of topics found in Hindi-language media and education. This curated dataset, approximately 75GB in total, serves as the foundation for training models capable of comprehending, analyzing, and generating high-quality Hindi content. Through these tailored solutions, we aim to improve accessibility and engagement for Hindi-speaking communities across linguistic and cultural boundaries.
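The curation pipeline itself is not described in detail above, so the following is only a minimal sketch of the kind of filtering such a corpus typically needs: Unicode normalization, a language check based on the Devanagari block, and exact deduplication. The 60% threshold and the helper names are assumptions for illustration, not the project's actual settings.

```python
import hashlib
import unicodedata

DEVANAGARI = range(0x0900, 0x0980)  # Unicode block for Devanagari script

def is_mostly_hindi(text: str, threshold: float = 0.6) -> bool:
    # Keep documents where the bulk of alphabetic characters are Devanagari.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    hindi = sum(1 for c in letters if ord(c) in DEVANAGARI)
    return hindi / len(letters) >= threshold

def clean_and_dedup(docs):
    # NFC-normalize, drop non-Hindi pages, and skip exact duplicates by hash.
    seen = set()
    for doc in docs:
        doc = unicodedata.normalize("NFC", doc).strip()
        if not is_mostly_hindi(doc):
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc
```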
PDF_2_LaTeX
Our PDF-to-LaTeX conversion model is under continuous training, steadily refining its capabilities and performance. The model employs a neural architecture that captures intricate PDF structure, preserving layout and formatting nuances with high fidelity, while reconstructing LaTeX code with a strong emphasis on semantic comprehension to ensure accuracy in mathematical expressions and technical content. Trained on a large corpus of arXiv papers pairing LaTeX source code with the corresponding PDFs, our PDF_2_LaTeX model handles diverse formatting requirements and consistently produces standardized LaTeX output. The approach scales and adapts across document types and integrates into existing workflows, offering a reliable, efficient tool for converting complex documents into LaTeX. Our dataset comprises 4TB of PDFs and 3TB of LaTeX source code, with the source code preprocessed to reduce its size from 3TB to 200GB.
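The preprocessing step is not spelled out above, but a 3TB-to-200GB reduction is consistent with keeping only textual LaTeX sources while stripping comments and dropping binary assets (figures, embedded PDFs, auxiliary files), which dominate raw arXiv source archives. The sketch below shows that kind of cleanup; the kept suffixes and directory layout are assumptions, not our exact pipeline.

```python
import re
from pathlib import Path

KEEP_SUFFIXES = {".tex", ".bbl"}  # assumption: only textual sources are kept

def strip_latex_comments(source: str) -> str:
    # Remove % comments to end of line, but keep escaped \% literals.
    return re.sub(r"(?<!\\)%.*", "", source)

def shrink_paper_dir(paper_dir: Path, out_dir: Path) -> None:
    # Copy only LaTeX/bibliography text, dropping binary assets, which
    # account for most of the raw source-archive volume.
    for path in paper_dir.rglob("*"):
        if path.is_file() and path.suffix.lower() in KEEP_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            text = strip_latex_comments(text)
            text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
            target = out_dir / path.relative_to(paper_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
```

Applied per paper directory, a filter like this keeps the token stream the model actually learns from while discarding bytes that never reach the tokenizer.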