DOI: 10.1145/3332186.3332241
Research article

Petrel: A Programmatically Accessible Research Data Service

Published: 28 July 2019

ABSTRACT

We report on our experiences deploying and operating Petrel, a data service designed to support science projects that must organize and distribute large quantities of data. Building on a high-performance 3.2 PB parallel file system and embedded in Argonne National Laboratory's 100+ Gbps network fabric, Petrel leverages Science DMZ concepts and Globus APIs to provide application scientists with a high-speed, highly connected, and programmatically controllable data store. We describe Petrel's design, implementation, and usage and give representative examples to illustrate the many different ways in which scientists have employed the system.
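The abstract highlights that Petrel is programmatically controllable through Globus APIs. As a rough illustration only, not code from the paper, the sketch below uses the public Globus Python SDK to authenticate and submit a transfer onto a Globus endpoint such as Petrel; the client ID, endpoint UUIDs, paths, and label are placeholder values and would differ in practice.

    import globus_sdk

    # Placeholders (hypothetical values): a registered Globus Auth native-app
    # client ID and the UUIDs of a source endpoint and the Petrel endpoint.
    CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
    SOURCE_ENDPOINT = "SOURCE-ENDPOINT-UUID"
    PETREL_ENDPOINT = "PETREL-ENDPOINT-UUID"

    # Authenticate with Globus Auth (interactive native-app flow) and extract
    # an access token for the Globus Transfer service.
    auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    auth_client.oauth2_start_flow()
    print("Log in at:", auth_client.oauth2_get_authorize_url())
    code = input("Authorization code: ").strip()
    tokens = auth_client.oauth2_exchange_code_for_tokens(code)
    transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

    # Submit a recursive transfer from the source endpoint onto the data store.
    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
    )
    tdata = globus_sdk.TransferData(
        tc, SOURCE_ENDPOINT, PETREL_ENDPOINT, label="Stage dataset"
    )
    tdata.add_item("/local/dataset/", "/projects/mydataset/", recursive=True)
    task = tc.submit_transfer(tdata)
    print("Submitted transfer task:", task["task_id"])

A portal or automated pipeline would more typically use a confidential client with refresh tokens rather than this interactive native-app flow, but the Transfer calls are the same.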


Published in

PEARC '19: Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning)
July 2019, 775 pages
ISBN: 9781450372275
DOI: 10.1145/3332186
General Chair: Tom Furlani

Copyright © 2019 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
Overall acceptance rate: 133 of 202 submissions (66%)
