Idle GPUs, data and infrastructure challenges: New research reveals why most AI models fail
First-ever State of AI Infrastructure Survey reveals the specific challenges faced by companies building and deploying AI models
Tel Aviv – November 26, 2021 – Despite massive investment, fewer than half of AI models ever make it into production. The first-ever State of AI Infrastructure Survey, commissioned by Run:AI, attempts to discover why.
The survey bears out that widely cited statistic, with 77% of respondents saying that most of their AI models are never put to use. For a fifth of respondents the figure is even lower, with only 10% of their models reaching production.
The results are partly explained by other findings in the report. For example, just 17% of AI companies are able to achieve high utilization of their expensive AI resources, and 22% of those building AI solutions say that their infrastructure mostly sits idle. More than a third of respondents have to manually request access to GPU resources and are stuck with static allocations of hardware accelerators.
That’s despite significant investment in AI. 38% of respondents said their company’s annual budget for AI infrastructure alone, including hardware, software and cloud fees, was more than $1M. For 15%, their companies spend more than $10M annually on AI infrastructure. 74% of respondents said their companies were planning to increase GPU capacity or AI infrastructure spend in the near future.
The biggest challenge identified by AI workers in the survey was on the data side, with 61% saying that issues like data collection, data cleansing and governance caused problems. 42% highlighted challenges with their companies’ AI infrastructure and compute capacity. These two extremes of the AI development process – data and infrastructure – were cited more often than issues related to model development and training time, which just 24% of respondents mentioned as a challenge.
The survey also highlighted the role of the cloud in AI, with 53% saying that their AI applications and infrastructure are in the cloud, and 34% planning to move to the cloud in the next few years. Containers have become a standard infrastructure choice for running AI workloads, with 80% of respondents saying they use containers for some AI workloads, and 49% saying most or all of their AI work runs in containers. For orchestration, Kubernetes leads the pack, with 42% saying they use the popular solution and another 16% planning to adopt it. Next is Red Hat OpenShift, with 13% using it and 6% interested.
“Companies are committing to AI, investing millions into infrastructure and likely millions more in highly-trained staff. But if most AI models never make it into production, the promise of AI is not being realized. Our survey revealed that infrastructure challenges are causing resources to sit idle, data scientists are requesting manual access to GPUs and the journey to the cloud is ongoing. Companies that handle these challenges the most effectively will bring models to market and win the AI race,” said Omri Geller, CEO and co-founder of Run:AI.
The survey of 211 data scientists, AI/Machine Learning/IT practitioners and system architects from ten countries primarily covered experts from companies with over 5,000 employees, though respondents ranged from startups to large multinationals. The survey, conducted by independent research company Global Surveyz, was carried out in July 2021. Run:AI plans to make this an annual survey to keep track of trends in AI infrastructure.