Use cases show Translator’s potential to expedite clinical research

RENCI investigators are contributing to the development of a platform called Biomedical Data Translator that will allow researchers to easily access and interrelate large amounts of data relevant to advancing biomedical research. Funded by the NIH’s National Center for Advancing Translational Sciences (NCATS), the new system is poised to accelerate translational clinical research by allowing users to approach biomedical questions from a holistic perspective to inspire important new research directions.

The platform is being developed by a 15-team multi-institutional Biomedical Data Translator consortium. Three of these teams include leadership from RENCI investigators. Although still a work in progress, Translator is being designed as an easy-to-use tool that can quickly respond to queries by identifying and synthesizing relevant data from a wide variety of sources.

Finding potential therapies for drug-induced liver injury

In December 2021, consortium members presented use cases to NCATS to demonstrate the platform’s progress and potential. In one, Paul Watkins, MD, from the UNC School of Medicine worked with RENCI collaborator Karamarie Fecho to use Translator to identify drugs that might be repurposed for treating drug-induced liver injury (DILI). There is a critical need for new therapies to heal liver damage caused by medicines. Although the injury sometimes heals when a patient stops taking the medication, it can take months or years to resolve and can leave patients unable to take medicines they need to treat medical conditions.

“There are lab-based ways to identify drugs for repurposing, or a researcher can spend years going through the literature and attempt to synthesize it,” explained Fecho. “Translator offers an alternative method that’s fast and doesn’t require the user to be an expert.” 

Using gene information to identify drug candidates that might hold promise for treating drug-induced liver injury, Translator quickly identified two antioxidant drugs for consideration. This query relied on clinical data that is part of UNC Health’s Integrated Clinical and Environmental Exposures Service (ICEES), which provides open, regulatory-compliant access to clinical data that is integrated with environmental exposures data. Fecho and colleagues from RENCI and the North Carolina Translational and Clinical Sciences Institute previously developed tools that allow Translator to access this important source of clinical data.

In addition to identifying potential drug candidates, Translator also provided experimental evidence that these drugs had been studied for preventing drug-induced liver injury in rat models and were used in clinical trials to treat other diseases. “Having this information showed that the candidate drugs were safe and effective enough to be used in a clinical trial,” said Fecho. “This can help reduce the risk involved in moving forward with clinical trials, which are time-consuming and expensive.”

The Translator findings are now being compiled into a formal report to present to the NIH-funded U.S. DILI Network leadership to inform planning for future clinical trials.

Revealing new directions for rare diseases

In another use case, researchers from the Hugh Kaul Precision Medicine Institute at the University of Alabama at Birmingham are using Translator to find potential new treatments for rare diseases. Rare diseases are often caused by de novo gene mutations, ones that arise spontaneously rather than being inherited.

“For applications involving rare diseases, a new drug development candidate is not that helpful because it would require too much investment to develop and test a new drug for just a few people,” said RENCI’s Chris Bizon, co-PI of the Translator standards and reference implementation team. “Translator can help by looking for drugs that are already approved for some other purpose and have the potential to be repurposed for off-label use or tested in a clinical trial.”

The researchers were interested in a gene known as RHOBTB2. Children born with overactive variants of this gene sometimes never learn to walk and have severe intellectual disabilities. Researchers used Translator to ask for a list of all the chemicals that down-regulate RHOBTB2. When this didn’t return many leads, they performed another query to look for chemicals that up-regulate a gene that down-regulates RHOBTB2. This process helped reveal intermediate genes that could be targeted to down-regulate RHOBTB2.
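The two-hop query described above can be sketched as a Translator-style query graph. The structure below follows the general shape of the Translator Reasoner API (TRAPI), but the predicate names, category names, and gene identifier are illustrative assumptions, not a verified, runnable Translator query:

```python
# Sketch of the "indirect down-regulation" query: find chemicals that
# up-regulate an intermediate gene that in turn down-regulates the target.
# Predicate/category names and the gene ID are illustrative assumptions.

def build_indirect_downregulation_query(target_gene_id):
    """Build a TRAPI-style two-hop query graph (hypothetical field values)."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "chemical": {"categories": ["biolink:ChemicalEntity"]},
                    "intermediate": {"categories": ["biolink:Gene"]},
                    "target": {"ids": [target_gene_id]},
                },
                "edges": {
                    # hop 1: chemical up-regulates the intermediate gene
                    "e0": {
                        "subject": "chemical",
                        "object": "intermediate",
                        "predicates": ["biolink:increases_expression_of"],
                    },
                    # hop 2: intermediate gene down-regulates the target
                    "e1": {
                        "subject": "intermediate",
                        "object": "target",
                        "predicates": ["biolink:decreases_expression_of"],
                    },
                },
            }
        }
    }

# Illustrative identifier stands in for RHOBTB2.
query = build_indirect_downregulation_query("NCBIGene:23221")
```

Answers to such a query bind a concrete chemical and intermediate gene to each node, which is how the intermediate, targetable genes surface.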

“As a clinician, I don’t even know about all the databases that hold critical pieces of the puzzle I’m trying to put together,” said Anne Thessen, a visiting associate professor at the University of Colorado School of Medicine. “With Translator I can prepare a query, run the query, and have results to review in an hour.”

Read more about Translator:
Biomedical Translator Platform moves to the next phase

New streamlined statistical method provides improved pattern detection and risk prediction for disease

The novel regression algorithm, CALF, outperforms the current gold standard, LASSO, in statistical tests

Researchers from the Renaissance Computing Institute (RENCI) at UNC-Chapel Hill, Perspectrix, the UNC School of Medicine, and the WVU Rockefeller Neuroscience Institute have collaborated to develop a new method for finding patterns in data that verifiably surpasses the performance of a generally accepted “gold standard.”

Attempting to find patterns in data is central to all research, and it is particularly important in medical use of biological samples to predict a patient’s risk for disease formation and progression. Today, researchers can utilize advanced technology to produce an ocean of data about one person from various biological samples such as blood, DNA, and saliva, with the goal of identifying particular markers that can be informative about a person’s current health and future outlook. However, this advanced data collection and processing has outpaced current statistical methods for identifying simple but robust patterns and relationships, and this is particularly true for the field of psychiatry. For instance, researchers have yet to fully understand and predict the progression of schizophrenia. 

This new method, CALF, which stands for “coarse approximation linear function,” is described in the Scientific Reports paper, “A greedy regression algorithm with coarse weights offers novel advantages,” published on March 31, 2022. Applied to five distinct examples from psychiatric and neurological studies, CALF consistently outperformed the gold standard, LASSO (“least absolute shrinkage and selection operator”) regression, as well as other methods.
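The idea of a greedy regression with coarse weights can be illustrated with a short sketch. The details below (a weight set of {-1, +1} and a Pearson-correlation objective) are assumptions inferred from the algorithm's name and family, not the paper's exact specification:

```python
# Minimal sketch of a greedy, coarse-weight regression in the spirit of CALF.
# At each step, add the single unused feature, with weight +1 or -1, that most
# improves the Pearson correlation between the weighted sum and the outcome.
import numpy as np

def greedy_coarse_fit(X, y, max_features=5):
    """Greedily pick features and +/-1 weights so X @ w correlates with y."""
    n_features = X.shape[1]
    weights = np.zeros(n_features)
    best_score = -np.inf
    for _ in range(max_features):
        best_update = None
        for j in range(n_features):
            if weights[j] != 0:
                continue  # each feature enters the model at most once
            for coarse in (-1.0, 1.0):
                trial = weights.copy()
                trial[j] = coarse
                pred = X @ trial
                if pred.std() == 0:
                    continue
                score = np.corrcoef(pred, y)[0, 1]
                if score > best_score:
                    best_score = score
                    best_update = (j, coarse)
        if best_update is None:
            break  # no single addition improves the correlation
        j, coarse = best_update
        weights[j] = coarse
    return weights, best_score
```

The coarse weights trade a little flexibility for models that are small, stable, and easy to interpret, which is attractive when samples are scarce, as in psychiatric biomarker studies.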


Omnibond joins iRODS Consortium

Collaboration enhances synergies for improving end-to-end data integration

CHAPEL HILL, NC – The software company Omnibond has joined the iRODS Consortium, the membership-based foundation that leads development and support of the integrated Rule-Oriented Data System (iRODS).

Omnibond is a software technology company with four main product areas: CloudyCluster for cloud high-performance computing and data analytics, OrangeFS for research data solutions, NetIQ for identity and access management, and TrafficVision for computer vision and AI solutions for the transportation industry. Company leaders say that enhanced integration with iRODS will help provide better instrument-to-cloud data and computation management, particularly for the CloudyCluster and OrangeFS software.

“We help our customers deal with large amounts of data, and collaborating with iRODS for these products will help our customers with better end-to-end data management,” said Omnibond President and CEO Boyd Wilson. “We are excited to work with the iRODS team going forward and we are impressed with their vision and capabilities.”

The iRODS Consortium is a membership-based organization that guides development and support of iRODS as free open-source software for data discovery, workflow automation, secure collaboration, and data virtualization. The iRODS Consortium provides a production-ready distribution and professional integration services, training, and support. The consortium is administered by founding member RENCI, a research institute for applications of cyberinfrastructure located at the University of North Carolina at Chapel Hill, USA.

“Omnibond can provide deployment and support services where we cannot, and their integration expertise extends the Consortium’s reach into new markets,” said Terrell Russell, executive director of the iRODS Consortium. “After working alongside one another for years, we are very happy to welcome Omnibond to the iRODS Consortium.”

Wilson noted that the open-source model makes iRODS a particularly good fit for Omnibond’s portfolio, which is focused on building synergies between research and open-source technologies. “We currently are the maintainers of OrangeFS, an open-source parallel file system that has been incorporated into the Linux kernel by the Linux kernel team, so we understand the value of open-source software and are excited to partner with the iRODS Consortium,” said Wilson.

In addition to Omnibond, current iRODS Consortium members include Agriculture Victoria, Bayer, Bibliothèque et Archives nationales du Québec, CINES, CUBI at Berlin Institute of Health, DataDirect Networks, Emagine IT, KU Leuven, Maastricht University, Minnesota Supercomputing Institute at the University of Minnesota, the National Institute of Environmental Health Sciences, NetApp, OpenIO, RENCI, SoftIron, the SURF cooperative, Texas Advanced Computing Center, University College London, University of Colorado, Boulder, University of Groningen, Utrecht University, Wellcome Sanger Institute, Western Digital, and four organizations that wish to remain anonymous.

To learn more about iRODS and the iRODS Consortium, please visit irods.org.

To learn more about Omnibond, please visit https://obz.io.

Emagine IT joins iRODS Consortium

Collaboration points to critical role of data management in advancing cybersecurity

Emagine IT (EIT) has joined the iRODS Consortium, the membership-based foundation that leads development and support of the integrated Rule-Oriented Data System (iRODS). In becoming the Consortium’s latest member, EIT brings a cybersecurity lens to driving data management solutions in collaboration with the broader iRODS community.

EIT provides IT modernization, cybersecurity, and full lifecycle IT services to the public and private sectors. Ensuring security and regulatory compliance for disparate confidential and personal data types poses complex challenges, making data management innovation a crucial part of EIT’s business.

“The recent ransomware attacks across the globe speak to the universal importance of secure data management at the intersection of IT operations and cybersecurity,” said Aaron Pendola, director of Health IT at EIT. “EIT believes iRODS has a unique capability to solve complex data challenges related to cybersecurity.”

For instance, EIT can use iRODS to advance common data standards and terminologies, helping to overcome some of the fragmentation that has historically hindered the development of cohesive, global cybersecurity solutions. Open-source software, such as iRODS, is at the forefront of technology innovation. While the idea may seem counterintuitive, Pendola says that open-source models are well positioned to improve data privacy and security by helping users and partners anticipate how technology will evolve.

“We fully recognize how open-source technologies like iRODS have led to profound mission impacts across the industries we serve,” said Pendola. “We are excited to participate in the continuous improvement of iRODS, driving its evolution and enhancements by virtue of the open-source, collaborative consortium model.”


“Emagine IT’s focus on federal, state and local, and commercial contracts with expertise in cybersecurity and IT modernization adds a new element to our membership,” said Terrell Russell, interim executive director of the iRODS Consortium. “We are excited to welcome them to the community and look forward to new collaborations.”

In addition to EIT, current iRODS Consortium members include Agriculture Victoria, Bayer, Bibliothèque et Archives nationales du Québec, CINES, CUBI at Berlin Institute of Health, DataDirect Networks, KU Leuven, Maastricht University, Minnesota Supercomputing Institute at the University of Minnesota, the National Institute of Environmental Health Sciences, NetApp, OpenIO, RENCI, SoftIron, the SURF cooperative, the Swedish National Infrastructure for Computing, Texas Advanced Computing Center, University College London, University of Colorado, Boulder, University of Groningen, Utrecht University, Wellcome Sanger Institute, Western Digital, and five organizations that wish to remain anonymous.

About the iRODS Consortium

The iRODS Consortium is a membership-based organization that guides development and support of iRODS as free open-source software for data discovery, workflow automation, secure collaboration, and data virtualization. The iRODS Consortium provides a production-ready iRODS distribution and iRODS professional integration services, training, and support. The consortium is administered by founding member RENCI, a research institute for applications of cyberinfrastructure located at the University of North Carolina at Chapel Hill, USA.

About Emagine IT

Emagine IT, inc. (EIT) is an information technology services and consulting company based in the Washington, DC metropolitan area. EIT provides IT modernization, cybersecurity, and full lifecycle IT services to the public and private sectors. For more information, please visit their website at www.eit2.com.

iRODS and Fujifilm partner to provide an archive solution

FUJIFILM Recording Media U.S.A., Inc. and the iRODS Consortium today announce a collaboration and integration, creating a joint solution built upon FUJIFILM Object Archive software and the iRODS data management platform. This joint solution leverages the benefits of a tape storage tier for infrequently accessed “cold” data, providing an automated archiving workflow for research, commercial, and governmental organizations that require storing large – and in most cases, rapidly growing – amounts of data.

With this solution, FUJIFILM Object Archive becomes a deep-tier archive storage target while iRODS provides a data management platform for users who produce massive amounts of research and analytics data.

FUJIFILM Object Archive software has been tested with the iRODS S3 plugin and fully supports the AMAZON S3 abstraction that iRODS provides. In addition to regular AMAZON S3 compatibility, Fujifilm and the iRODS Consortium worked together to add functionality comparable to AMAZON GLACIER to the iRODS S3 Resource Plugin.

This new functionality will be available as part of the upcoming iRODS 4.2.11 release.

Moving appropriate data to tape provides the benefits of air-gap security and scalability with lower data center operating costs and less electricity consumption when compared to other storage solutions. Additionally, FUJIFILM Object Archive software supports the new, higher-capacity LTO-9 tape technology, making the solution potentially even more efficient, economical, and scalable.

“We are very excited to be working with Fujifilm on the AMAZON GLACIER features,” said Terrell Russell, interim executive director of the iRODS Consortium. “Together, we are building a long-term relationship that will be good for our users, and for both organizations.”

“The new interoperability between Fujifilm’s Object Archive software and the iRODS data management platform will greatly benefit organizations who use both products, and potentially create new use cases as well,” said Tom Nakatani, vice president of sales & marketing at FUJIFILM Recording Media U.S.A., Inc. “We are pleased to successfully implement this joint solution for the benefit of our collaborators and users.”

Fujifilm is the world’s leading data tape manufacturer (based on market share). Its FUJIFILM Object Archive software allows objects to be seamlessly written to and read from data tape media with Fujifilm’s OTFormat. Using the industry-standard AMAZON S3-compatible API, Object Archive software offers the same operability as cloud storage and easy long-term retention of data similar to AMAZON GLACIER. By using FUJIFILM Object Archive software to optimize existing storage, organizations can eliminate egress fees, offload cold data to tape, maintain chain of custody, realize low ongoing storage costs, and help protect against cyber threats by providing a physical air-gap to data.

About the iRODS Consortium

The iRODS Consortium is a membership-based organization that guides development and support of iRODS as free open-source software for data discovery, workflow automation, secure collaboration, and data virtualization. The iRODS Consortium provides a production-ready iRODS distribution and iRODS professional integration services, training, and support. The consortium is administered by founding member RENCI, a research institute for applications of cyberinfrastructure located at the University of North Carolina at Chapel Hill, USA.

About Fujifilm

FUJIFILM Recording Media U.S.A., Inc. is FUJIFILM Corporation’s U.S.-based manufacturing, marketing and sales operation for data tape media and data management solutions. The company provides data center customers and enterprise industry partners with a wide range of innovative recording media products and archival solutions. Based on a history of thin-film engineering and magnetic particle science such as Fujifilm’s NANOCUBIC™ and Barium Ferrite technology, Fujifilm creates breakthrough data storage products. Worldwide, Fujifilm and its affiliates have surpassed the 170 million milestone for the number of LTO ULTRIUM data cartridges manufactured and sold since introduction, establishing the company as the leading global manufacturer of mid-range and enterprise data tape.

For more information on FUJIFILM Recording Media products, call 800-488-3854 or go to https://www.fujifilm.com/us/en/business/data-storage. For more information about FUJIFILM Object Archive software, visit http://fujifilmobjectarchive.com.

FUJIFILM Holdings Corporation, Tokyo, Japan, brings cutting edge solutions to a broad range of global industries by leveraging its depth of knowledge and fundamental technologies developed in its relentless pursuit of innovation. Its proprietary core technologies contribute to the various fields including healthcare, highly functional materials, document solutions and imaging products. These products and services are based on its extensive portfolio of chemical, mechanical, optical, electronic and imaging technologies. For the year ended March 31, 2021, the company had global revenues of $21 billion, at an exchange rate of 106 yen to the dollar. The Fujifilm global family of companies is committed to responsible environmental stewardship and good corporate citizenship. For more information, please visit: www.fujifilmholdings.com

FUJIFILM, OBJECT ARCHIVE, and NANOCUBIC are the trademarks and registered trademarks of FUJIFILM Corporation and its affiliates.

AMAZON, AMAZON GLACIER and AMAZON S3 are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.

LTO and ULTRIUM are registered trademarks of Hewlett Packard Enterprise, IBM and Quantum in the United States and/or other countries.

© 2021 FUJIFILM Recording Media U.S.A. Inc. All Rights Reserved

RENCI named as partner in NSF institute to establish new field of imageomics

Imageomics Institute will advance computational methods for studying Earth’s biodiversity

RENCI has been named as a partner on an ambitious new effort to use images of living organisms as the basis for understanding biological processes of life on Earth. The project, to be led by faculty from The Ohio State University’s Translational Data Analytics Institute, has been awarded a $15 million grant from the National Science Foundation as part of NSF’s Harnessing the Data Revolution initiative.

The new entity, which will be called the Imageomics Institute, aims to establish imageomics as a new field of study that has the potential to transform biomedical, agricultural and basic biological sciences. Similar to genomics before it, which applied computation to the study of genomes, imageomics will leverage computer science to help scientists extract meaning from an otherwise unwieldy amount of natural image data.

“There are many more species out there than scientists have been able to study in-depth,” said Jim Balhoff, a Senior Research Scientist at RENCI who will lead the RENCI component of the project. “If we can leverage machine learning to interpret images of living organisms, that would provide a scalable way to process large amounts of information about species, complementing the work of trained wildlife biologists.”

The Institute’s scientists will apply machine learning techniques to large collections of digital images from museums, labs and other institutions, as well as photos taken by scientists in the field, camera traps, drones and even members of the public who have uploaded their images to platforms such as eBird, iNaturalist and Wildbook. By training algorithms to extract biologically meaningful information from these images, researchers aim to generate new knowledge about organisms and species, including insights about how they evolve and interact within ecosystems.

Critical to this effort is the ability to categorize features of living organisms with standardized vocabularies, known as bio-ontologies, that can be “understood” by computers. Having served as a key contributor on the Phenoscape team for several previous NSF-funded projects, Balhoff is steeped in the art of encoding biological information in computable ways.

“There’s a lot of work going on with machine learning, and one of the key pieces of this project is to develop ways to incorporate ontology-based knowledge into machine learning processes,” said Balhoff. “We’re providing expertise in bio-ontologies to incorporate what we know about anatomical relationships into this image analysis system.”

This approach could ultimately enable a computer to identify key features in an image, such as an eye, mouth or dorsal fin, and then use automated reasoning to check that the interpretation makes anatomical sense. Repeating this process for large collections of images can give scientists a powerful platform for investigating new or previously understudied species or help them better understand the relationships between organisms.
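A toy sketch can make the idea of ontology-guided sanity checking concrete. The tiny hand-made "ontology" below is a stand-in for a real bio-ontology; the taxa and part-of facts are purely illustrative:

```python
# Toy illustration of checking that image detections make anatomical sense.
# Each anatomical part is mapped to the broad groups it is consistent with;
# an interpretation survives only if some group fits every detected part.
# These facts are hand-made placeholders, not drawn from a real ontology.

PART_CONSISTENT_WITH = {
    "dorsal fin": {"fish"},
    "eye": {"fish", "mammal", "bird"},
    "feather": {"bird"},
    "fur": {"mammal"},
}

ALL_GROUPS = {"fish", "mammal", "bird"}

def consistent_interpretations(detected_parts):
    """Return the groups whose anatomy is consistent with every detection."""
    groups = set(ALL_GROUPS)
    for part in detected_parts:
        # unknown parts impose no constraint in this sketch
        groups &= PART_CONSISTENT_WITH.get(part, groups)
    return groups
```

A detector that reports both a dorsal fin and fur has produced an anatomically impossible interpretation: no group remains, which flags the detection for review.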

As an inaugural institute for data-intensive discovery in science and engineering within NSF’s Harnessing the Data Revolution initiative, the Imageomics Institute will be part of a broader effort to form a national collaborative research network dedicated to computation-enabled discovery.

In addition to The Ohio State University and RENCI, the project will involve biologists and computer scientists from Tulane University, Virginia Tech, Duke University, and Rensselaer Polytechnic Institute; senior personnel from Ohio State, Virginia Tech and six additional institutions; and collaborators from more than 30 universities and organizations around the world.

RENCI to join researchers in a collaboration to increase reliability and efficiency of DOE scientific workflows by leveraging artificial intelligence and machine learning methods

Poseidon will use AI/ML-based techniques to simulate, model, and optimize scientific workflow performance on large, distributed DOE computing infrastructures.

The Department of Energy’s (DOE) advanced Computational and Data Infrastructures (CDIs) – such as supercomputers, edge systems at experimental facilities, massive data storage, and high-speed networks – are brought to bear to solve the nation’s most pressing scientific problems, including assisting in astrophysics research, delivering new materials, designing new drugs, creating more efficient engines and turbines, and making more accurate and timely weather forecasts and climate change predictions.

Increasingly, computational science campaigns are leveraging distributed, heterogeneous scientific infrastructures that span multiple locations connected by high-performance networks, resulting in scientific data being pulled from instruments to computing, storage, and visualization facilities.

This image shows the terrain height – an important factor in weather modeling – across almost all of North America with a spatial resolution of 4 km. Poseidon tools will help improve workflows and lead to even more efficient weather forecasts through reliable and efficient execution of weather models.

Credit: Jiali Wang, Argonne National Laboratory

However, since these federated services infrastructures tend to be complex and managed by different organizations, domains, and communities, both the operators of the infrastructures and the scientists who use them have limited global visibility, resulting in an incomplete understanding of the behavior of the entire set of resources that science workflows span.

“Although scientific workflow systems like Pegasus increase scientists’ productivity to a great extent by managing and orchestrating computational campaigns, the intricate nature of the CDIs, including resource heterogeneity and the deployment of complex system software stacks, pose several challenges in predicting the behavior of the science workflows and in steering them past system and application anomalies,” said Ewa Deelman, research professor of computer science and research director at the University of Southern California’s Information Sciences Institute and lead principal investigator (PI). “Our new project, Poseidon, will provide an integrated platform consisting of algorithms, methods, tools, and services that will help DOE facility operators and scientists to address these challenges and improve the overall end-to-end science workflow.”

Under a new DOE grant, Poseidon aims to advance the knowledge of how simulation and machine learning (ML) methodologies can be harnessed and amplified to improve the DOE’s computational and data science.

Research institutions collaborating on Poseidon include the University of Southern California, the Argonne National Laboratory, the Lawrence Berkeley National Laboratory, and the Renaissance Computing Institute (RENCI) at the University of North Carolina at Chapel Hill.

Poseidon will add three important capabilities to current scientific workflow systems — (1) predicting the performance of complex workflows; (2) detecting and classifying infrastructure and workflow anomalies and “explaining” the sources of these anomalies; and (3) suggesting performance optimizations. To accomplish these tasks, Poseidon will explore the use of novel simulation, ML, and hybrid methods to predict, understand, and optimize the behavior of complex DOE science workflows on DOE CDIs. 

Poseidon will explore hybrid solutions in which data collected from DOE and NSF testbeds, as well as from an ML simulator, will be strategically fed into an ML training system.

High-performance computing systems, such as the planned Aurora system at the Argonne Leadership Computing Facility, are integral pieces of DOE CDIs.

Credit: Argonne National Laboratory

“In addition to creating a more efficient timeline for researchers, we would like to provide CDI operators with the tools to detect, pinpoint, and efficiently address anomalies as they occur in the complex DOE facilities landscape,” said Anirban Mandal, Poseidon co-PI, assistant director for network research and infrastructure at RENCI, University of North Carolina at Chapel Hill. “To detect anomalies, Poseidon will explore real-time ML models that sense and classify anomalies by leveraging underlying spatial and temporal correlations and expert knowledge, combine heterogeneous information sources, and generate real-time predictions.”
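A minimal sketch can illustrate the kind of real-time anomaly sensing described above: a rolling z-score detector over a stream of workflow metrics (for example, task runtimes). Poseidon's actual models are far richer; this only shows the basic idea of flagging points that break the recent temporal pattern:

```python
# Streaming anomaly sketch: flag a metric value as anomalous when it falls
# more than `threshold` standard deviations from the mean of a recent window.
from collections import deque
import math

class RollingAnomalyDetector:
    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)  # recent metric values
        self.threshold = threshold

    def observe(self, value):
        """Record `value`; return True if it is anomalous vs. the window."""
        anomalous = False
        if len(self.window) >= 10:  # wait for a minimal baseline
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.window.append(value)
        return anomalous
```

A production system would, as the quote notes, combine many heterogeneous signals and learned spatial/temporal correlations rather than a single per-metric threshold, but the windowed baseline captures the core mechanism.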

RENCI will play a pivotal role in the Poseidon project. RENCI researchers Cong Wang and Komal Thareja will lead project efforts in data acquisition from the DOE CDI and NSF testbeds (FABRIC and Chameleon Cloud) and emulation of distributed facility models, enabling ML model training and validation on the testbeds and DOE CDI. Additionally, Poseidon co-PI Anirban Mandal will lead the project portion on performance guidance for optimizing workflows.

Successful Poseidon solutions will be incorporated into a prototype system with a dashboard that will be used for evaluation by DOE scientists and CDI operators. Poseidon will enable scientists working on the frontier of DOE science to efficiently and reliably run complex workflows on a broad spectrum of DOE resources and accelerate time to discovery.

Furthermore, Poseidon will develop ML methods that can self-learn corrective behaviors and optimize workflow performance, with a focus on explainability in its optimization methods. 

Working together, the researchers behind Poseidon will break down the barriers between complex CDIs, accelerate the scientific discovery timeline, and transform the way that computational and data science are done.

Please visit the project website for more information.

Data Matters short-course series returns in August 2021

Annual short-course series aims to bridge the data literacy gap

Now in its eighth year, Data Matters 2021, a week-long series of one- and two-day courses aimed at students and professionals in business, research, and government, will take place August 9 – 13 virtually via Zoom. The short course series is sponsored by the Odum Institute for Research in Social Science at UNC-Chapel Hill, the National Consortium for Data Science, and RENCI.

Although the need for data literacy has grown exponentially for employers over the last few years, many academic institutions are struggling to keep up. According to a 2021 report from Forrester, 81% of recruiters rated data skills and data literacy as important capabilities for candidates, while only 48% of academic planners reported that their institution currently has specific data skills initiatives set up. Data Matters helps bridge this gap by providing attendees the chance to learn about a wide range of topics in data science, analytics, visualization, curation, and more from expert instructors.

“As our society becomes more data-driven, we’ve seen a greater need for workers in environments such as industry, health, and law to have a basic understanding of data science techniques and applications,” said Shannon McKeen, executive director of the National Consortium for Data Science. “The Data Matters short-course series allows us to meet the high demand for data science education and to provide pathways for both recent graduates and current professionals to bridge the data literacy gap and enrich their knowledge.”

Data Matters instructors are experts in their fields from NC State University, UNC-Chapel Hill, Duke University, Cisco, Blue Cross NC, and RENCI. Topics to be covered this year include information visualization, data curation, data mining and machine learning, programming in R, systems dynamics and agent-based modeling, and more. Among the classes available are:

  • Introduction to Programming in R, Jonathan Duggins. Statistical programming is an integral part of many data-intensive careers and of data literacy, and programming skills have become a prerequisite for employment in many industries. This course begins with foundational concepts for new programmers—both general and statistical—and explores programming topics essential to any job that utilizes data.
  • Text Analysis Using R, Alison Blaine. This course explains how to clean and analyze textual data using R, including both raw and structured texts. It will cover multiple hands-on approaches to getting data into R and applying analytical methods to it, with a focus on techniques from the fields of text mining and Natural Language Processing.
  • Using Linked Data, Jim Balhoff. Linked data technologies provide the means to create flexible, dynamic knowledge graphs using open standards. This course offers an introduction to linked data and the semantic web tools underlying its use. 
  • R for Automating Workflow & Sharing Work, Justin Post. This course introduces participants to using R to write reproducible reports and presentations that easily embed R output, use online repositories and version control software for collaboration, create basic websites, and develop interactive dashboards and web applets.

Data Matters offers reduced pricing for faculty, students, and staff from academic institutions and for professionals with nonprofit organizations. Head to the Data Matters website to register and to see detailed course descriptions, course schedules, instructor bios, and logistical information. 

Registration is now open at datamatters.org. The deadline for registration is August 5 for Monday/Tuesday courses, August 7 for Wednesday courses, and August 8 for Thursday/Friday courses.


About the National Consortium for Data Science (NCDS)

The National Consortium for Data Science (NCDS) is a collaboration of leaders in academia, industry, and government formed to address the data challenges and opportunities of the 21st century. The NCDS helps members take advantage of data in ways that result in new jobs and transformative discoveries. The organization connects diverse communities of data science experts to support a 21st century data-driven economy by building data science career pathways and creating a data-literate workforce, bridging the gap between data scientists in the public and private sectors, and supporting open and democratized data. The NCDS is administered by founding member RENCI. Learn more at datascienceconsortium.org/.


RENCI joins researchers across the US in supporting NSF Major Facilities with data lifecycle management efforts through new NSF-funded Center of Excellence

When it comes to research, having a strong cyberinfrastructure that supports advanced data acquisition, storage, management, integration, mining, visualization, and computational processing services can be vital. However, building cyberinfrastructures (CI) — especially ones that aim to support multiple varied and complex scientific facilities — is a challenge.

In 2018, a team of researchers from institutions across the country came together to launch a pilot program aimed at creating a model for a Cyberinfrastructure Center of Excellence (CI CoE) for the National Science Foundation’s (NSF) Major Facilities. The goal was to identify how the center could serve as a forum for the exchange of CI knowledge across varying fields and facilities, establish best practices for different NSF Major Facilities’ CI, provide CI expertise, and address CI workforce development and sustainability.

“Over the past few years, my colleagues and I have worked to provide expertise and support for the NSF Major Facilities in a way that accelerates the data lifecycle and ensures the integrity and effectiveness of the cyberinfrastructure,” said Ewa Deelman, research professor of computer science and research director at the University of Southern California’s Information Sciences Institute and lead principal investigator. “We are proud to contribute to the overall NSF CI ecosystem and to work with the NSF Major Facilities on solving their CI challenges together, understanding that our work may help support the sustainability and progress of the Major Facilities’ ongoing research and discovery.”

Five NSF Major Facilities were selected for the pilot: the Arecibo Observatory, the Geodetic Facility for the Advancement of Geoscience, the National Center for Atmospheric Research, the National Ecological Observatory Network, and the Seismological Facilities for the Advancement of Geoscience and EarthScope. As the pilot progressed, the program expanded to engage additional NSF Major Facilities.

The pilot found that Major Facilities differ in the types of data captured, the scientific instruments used, the data processing and analyses conducted, and the policies and methods for data sharing and use. However, the study also found commonalities across the various Major Facilities in terms of the data lifecycle (DLC). As a result, the pilot developed a DLC model that captures the stages that data within a Major Facility goes through. The model includes stages for 1) data capture; 2) initial processing near the instrument(s); 3) central processing at data centers or clouds; 4) data storage, curation, and archiving; and 5) data access, dissemination, and visualization. Identifying these commonalities helped the pilot program pinpoint shared challenges, establish standardized practices for overarching CI requirements, and develop a blueprint for a CI CoE that can address the pressing Major Facilities DLC challenges.
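The five-stage DLC model can be illustrated with a minimal sketch. This is not code from the CI Compass project; the stage names and the helper function are hypothetical, paraphrased from the description above purely for illustration.

```python
from enum import Enum
from typing import Optional

class DLCStage(Enum):
    """Hypothetical encoding of the five-stage data lifecycle (DLC) model."""
    CAPTURE = 1                # data capture at the instrument
    INITIAL_PROCESSING = 2     # initial processing near the instrument(s)
    CENTRAL_PROCESSING = 3     # central processing at data centers or clouds
    STORAGE_CURATION = 4       # data storage, curation, and archiving
    ACCESS_DISSEMINATION = 5   # data access, dissemination, and visualization

def next_stage(stage: DLCStage) -> Optional[DLCStage]:
    """Return the stage that follows `stage` in the lifecycle, or None at the end."""
    members = list(DLCStage)
    idx = members.index(stage)
    return members[idx + 1] if idx + 1 < len(members) else None
```

Modeling the lifecycle as an ordered sequence makes the commonality across facilities concrete: each facility's data moves through the same stages even though the instruments and policies at each stage differ.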

Now, with a new NSF award, the pilot program has begun phase two as CI CoE: CI Compass, an NSF Center of Excellence dedicated to navigating the Major Facilities' data lifecycle. CI Compass will apply its three years of initial evaluation and analysis to enhancing and evolving CI, as needed, for the NSF's Major Facilities.

The research institutions collaborating on CI Compass include the University of Southern California, the Renaissance Computing Institute (RENCI) at the University of North Carolina at Chapel Hill, the University of Notre Dame, Indiana University, Texas Tech University, and the University of Utah.

RENCI will play a pivotal role in the success of CI Compass by leading working groups that offer expertise and services to NSF Major Facilities for processing, data movement, data storage, curation, and archiving elements of the Major Facilities DLC.   

“Cyberinfrastructure is a critical element for fulfilling the science missions for the NSF Major Facilities and a primary goal of CI Compass is to partner with Major Facilities to enhance and evolve their CI,” said Anirban Mandal, assistant director for network research and infrastructure at the Renaissance Computing Institute at University of North Carolina at Chapel Hill, and co-principal investigator and associate director of the project. “In the process, CI Compass will not only act as a ‘knowledge sharing’ hub for brokering connections between CI professionals at Major Facilities, but also will disseminate the knowledge to the broader NSF CI community.”

RENCI team members, in particular Ilya Baldin, who is also PI for the NSF FABRIC project, will offer expertise in networking and cloud computing for innovative Major Facilities CI architecture designs. Under Mandal’s leadership as associate director of CI Compass, RENCI will also be responsible for continuous internal evaluation of the project and measuring the impact of CI Compass on the Major Facilities and the broader CI ecosystem. Erik Scott will take a lead role in CI Compass working groups for data storage, curation, archiving and identity management, while Laura Christopherson will lead the efforts in project evaluation.


Working together, the CI Compass team will enhance the overall NSF CI ecosystem by providing expertise where needed to enhance and evolve the Major Facilities CI, capturing and disseminating CI knowledge and best practices that power scientific breakthroughs for Major Facilities, and brokering connections to enable knowledge sharing between and across Major Facilities CI professionals and the broader CI community. 

Visit ci-compass.org to learn more about the project.


This project is funded by the NSF Office of Advanced Cyberinfrastructure in the Directorate for Computer and Information Science and Engineering under grant number 2127548. The pilot effort was funded by CISE/OAC and the Division of Emerging Frontiers in the Directorate for Biological Sciences under grant number 1842042.

iRODS Consortium announces leadership transitions

The Renaissance Computing Institute (RENCI) – the founding member that administers the iRODS Consortium – announced today that Jason Coposky has officially resigned from his post as iRODS Consortium Executive Director effective June 11, 2021.

Coposky has been at RENCI for fifteen years and has served as the Executive Director of the Consortium for the last five and a half years and as Chief Technologist for five years before that. In these leadership roles, Coposky managed the software development team, directed the full software development lifecycle, and coordinated code hardening, testing, and application of formal software engineering practices. He also built and nurtured relationships with existing and potential consortium members and served as the chief spokesperson on iRODS development and strategies to the worldwide iRODS community. The Consortium has more than tripled in size under his leadership.  

In addition to growing the community, Coposky has been instrumental in turning the open source iRODS platform into enterprise software that is now deployed as a data management and data sharing solution at businesses, research centers, and government agencies in the U.S., Europe, and Asia. 

Terrell Russell, who has also been working on iRODS software since the development team transitioned to RENCI in 2008 and has held the role of Chief Technologist for the past five and a half years, has been named Interim Executive Director. 

For more information on iRODS and the iRODS Consortium, please visit irods.org.