Data station: delegated, trustworthy, and auditable computation to enable data-sharing consortia with a data escrow

Published: 01 July 2022

Abstract

Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations.

We introduce Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. Data Station leverages hardware enclaves to generate trust among participants, and exploits the centralization of data and computation to generate an audit log.
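
The flow described above (owners deposit data, users delegate computation, every action is logged) can be illustrated with a toy sketch. This is not the Data Station API: the class `ToyEscrow`, its methods, and the hash-chained log format are hypothetical names and choices made for illustration, and the sketch omits hardware enclaves and encryption entirely.

```python
import hashlib
import json


class ToyEscrow:
    """Hypothetical in-memory escrow (illustration only, no enclave)."""

    def __init__(self):
        self._data = {}       # dataset name -> (owner, data)
        self._consent = set()  # (user, dataset name, function name)
        self.audit_log = []   # hash-chained entries (an assumed format)

    def _log(self, event, actor, target):
        # Each entry commits to the previous one, so edits are detectable.
        prev = self.audit_log[-1]["hash"] if self.audit_log else ""
        body = {"event": event, "actor": actor, "target": target, "prev": prev}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.audit_log.append({**body, "hash": digest})

    def upload(self, owner, name, data):
        self._data[name] = (owner, data)
        self._log("upload", owner, name)

    def grant(self, owner, user, name, fn_name):
        # Only the owner of a dataset may consent to a computation on it.
        if self._data[name][0] != owner:
            raise PermissionError("only the owner can grant consent")
        self._consent.add((user, name, fn_name))
        self._log("grant", owner, f"{user}:{name}:{fn_name}")

    def run(self, user, name, fn):
        # Delegated computation: fn runs inside the escrow and only its
        # result leaves; the raw data is never returned to the user.
        if (user, name, fn.__name__) not in self._consent:
            self._log("denied", user, f"{name}:{fn.__name__}")
            raise PermissionError("no consent for this computation")
        self._log("run", user, f"{name}:{fn.__name__}")
        return fn(self._data[name][1])

    def verify_log(self):
        # Recompute the hash chain to check the log was not altered.
        prev = ""
        for entry in self.audit_log:
            body = {k: entry[k] for k in ("event", "actor", "target", "prev")}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != digest:
                return False
            prev = digest
        return True
```

In this sketch a user who calls `run` before the owner calls `grant` gets a `PermissionError`, and both the denial and any later successful run appear in `audit_log`. In the system the abstract describes, trust in the escrow comes from hardware enclaves rather than from the plain Python process assumed here.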

We evaluate Data Station on machine learning and data-sharing applications while running on an untrusted intermediary. In addition to important qualitative advantages, we show that Data Station: i) outperforms federated learning baselines in accuracy and runtime for the machine learning application; ii) is orders of magnitude faster than alternative secure data-sharing frameworks; and iii) introduces small overhead on the critical path.
