
Monday, 5 August 2013

Hadoop: A Government Primer

Posted on 09:01 by Unknown

Published on August 2nd, 2013 | by Travis Korte, Center for Data Innovation

What Is Hadoop?

Hadoop is open-source software that data scientists can use for large-scale distributed computing. It has two main components: the Hadoop Distributed File System (HDFS), which is used to store data across multiple computers, and MapReduce, which is used to divide labor among those computers so that large computations can be performed quickly.

Hadoop was created in 2005 by computer scientists Mike Cafarella and Doug Cutting (then at Yahoo), and is based in part on two Google technologies: the MapReduce programming model, described in a famous 2004 paper, and the distributed Google File System. Since its creation, it has grown to over 2 million lines of code written by 70 contributors. It is maintained by the Apache Software Foundation, a nonprofit organization that helps coordinate the activities of the large community of developers who work on Hadoop and other open-source projects. Its free availability and astonishing performance on certain types of large-scale data analysis problems have made it the single most important tool for coping with, and generating knowledge from, the increasingly large datasets that today's computing infrastructure produces.

HDFS

When a user stores data with Hadoop, the system automatically breaks the data up into blocks and apportions them out to the different computers. It makes multiple copies of the data and stores them in different places to make sure that even if individual computers fail, the system as a whole will keep functioning. These copies are necessary because the system is designed to run on inexpensive hardware to keep costs down; it turns out to be significantly more cost effective to start with cheap computers and plan for their failure than to run the system on expensive computers with lower failure rates. HDFS keeps track of what data is being stored where, including where to find each copy of a particular data block.
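To make the storage model concrete, here is a toy Python sketch of the bookkeeping HDFS performs: split a file into fixed-size blocks and record which machines hold each copy. This is an illustration of the idea, not Hadoop's actual code; the node names, file size and block size are invented, and the three-way replication simply mirrors a common HDFS default.

    # Toy illustration of HDFS-style block splitting and replication.
    # Conceptual sketch only, not Hadoop's implementation; names and sizes
    # are made up, and three-way replication mirrors a common HDFS default.
    import itertools

    def split_into_blocks(data, block_size):
        """Break a byte string into fixed-size blocks."""
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_replicas(num_blocks, nodes, replication=3):
        """Assign each block to `replication` distinct nodes, round-robin.

        Returns the "namenode" view: block id -> nodes holding a copy.
        """
        placement = {}
        node_cycle = itertools.cycle(nodes)
        for block_id in range(num_blocks):
            placement[block_id] = [next(node_cycle) for _ in range(replication)]
        return placement

    if __name__ == "__main__":
        nodes = ["datanode-%d" % i for i in range(1, 6)]
        # A pretend 300-unit file split into 128-unit blocks, standing in
        # for a 300 MB file split into 128 MB blocks.
        blocks = split_into_blocks(b"x" * 300, block_size=128)
        placement = place_replicas(len(blocks), nodes)
        print(placement[0])  # ['datanode-1', 'datanode-2', 'datanode-3']
        print(placement[1])  # ['datanode-4', 'datanode-5', 'datanode-1']

If one of those machines fails, each of its blocks is still available from the other copies; real HDFS also detects the failure and re-replicates the affected blocks automatically, which this sketch does not attempt.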

MapReduce

Once the data is stored, it becomes convenient to think about conducting large-scale data analysis and manipulation by doing much smaller-scale operations on each of the different computers simultaneously.

It is similar, although not perfectly analogous, to an assembly line in a factory: it is a system in which individual "workers" are carefully coordinated, each performing a task much smaller than the final product. Where Hadoop differs from an assembly line is that the workers (the different computers housing the data) all perform the same task, and they do so simultaneously rather than sequentially, which means an additional worker is needed to assemble their output once they are finished.

The procedure that converts the large task into smaller tasks and assigns them to individual computers is called the "mapper"; the additional worker that pulls everything together at the end is called the "reducer"; and the whole division-of-labor paradigm is called MapReduce. Part of the Hadoop engineer's job is to specify the task the mapper should assign and the way the reducer should combine the results into something useful.
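As a minimal sketch of this division of labor, the script below counts words in the style of Hadoop Streaming, in which the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value pairs on standard output. The word-count task and the file names are illustrative, not part of the original article.

    # wordcount_streaming.py -- a mapper/reducer pair in the style of
    # Hadoop Streaming (scripts that read stdin and write "key<TAB>value").
    # Illustrative sketch; try it locally with:
    #   cat input.txt | python wordcount_streaming.py map \
    #       | sort | python wordcount_streaming.py reduce
    import sys

    def mapper(stream):
        """The smaller task: emit (word, 1) for every word seen."""
        for line in stream:
            for word in line.split():
                print("%s\t1" % word.lower())

    def reducer(stream):
        """Combine the results: sum the counts for each word.

        Hadoop (or the `sort` above) delivers the keys already grouped.
        """
        current_word, count = None, 0
        for line in stream:
            word, value = line.rstrip("\n").split("\t")
            if word != current_word:
                if current_word is not None:
                    print("%s\t%d" % (current_word, count))
                current_word, count = word, 0
            count += int(value)
        if current_word is not None:
            print("%s\t%d" % (current_word, count))

    if __name__ == "__main__":
        mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)

On a real cluster the same two roles would be handed to Hadoop as separate mapper and reducer programs; the framework distributes the input blocks to the mappers and routes every occurrence of a given key to a single reducer.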

For many applications, computing in this way instead of on a single computer leads to substantial speed-ups and cost savings; for others, it enables computations that no single computer in the world could perform on its own. Limited by processor speed and memory capacity, even the most advanced supercomputers have upper bounds on their computational capabilities, but clusters of computers running Hadoop can scale into the thousands of machines while maintaining good performance.

What Types of Problems Can Government Agencies Use Hadoop to Solve?

In many cases, when a question can be answered without looking at all the data at once, Hadoop can help. Text mining carried out with Hadoop can be applied to financial fraud detection, research paper classification, student sentiment analysis and smarter search engines for all manner of government records; machine learning on Hadoop can support decision support systems for healthcare, model generation for climate science, speech recognition for security and mobile data entry across agencies.

For many applications, Hadoop's HDFS storage layer can be deployed securely enough on cloud-based servers, which keeps costs minimal. Moreover, popular pay-per-use services such as Amazon Web Services' Elastic MapReduce require little up-front commitment and lend themselves to experimentation.

Example: Hadoop Makes It Easier to Find Government Documents

Many government agencies need to provide a means for officials and the public to find documents (e.g., patent applications, research reports, etc.), but there are far too many documents for an individual (or even a thousand individuals) to catalog on their own, and even the task of manually creating indices of documents in various topic areas is daunting. One solution is to create a search engine that can automatically find documents relevant to a particular set of query terms. To build such a search engine, it is necessary to create, for each term, a list of the documents in which that term appears. The collection of all these terms and the documents that contain them is known as an "inverted index." With millions of documents, creating an inverted index could be an enormous task, but it can be generated quickly and straightforwardly with Hadoop.

Here’s how it would work.

When the collection of documents to be indexed is uploaded with Hadoop, it gets divided into pieces, backed up and assigned to different computers using HDFS (the storage component). Then, a Hadoop engineer can specify the task for the mapper (scan through a document, keeping track of each word that appears in it) and a way for the reducer to combine the results from different computers (for each word, make a list of all the documents that the mapper associated with it). The result is an inverted index, containing all the words from all the documents on all the different computers, generated automatically and in a fraction of the time it would have taken to conduct all the operations sequentially.
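A hedged Python sketch of that mapper and reducer might look like the following; the document IDs and contents are invented for illustration, and a real job would read the documents from HDFS rather than from an in-memory dictionary.

    # Conceptual inverted-index MapReduce, run locally on toy data.
    # The two functions mirror the mapper and reducer described above;
    # everything else stands in for work Hadoop would do on a cluster.

    def map_document(doc_id, text):
        """Mapper: emit (word, doc_id) for every distinct word in one document."""
        for word in set(text.lower().split()):
            yield word, doc_id

    def reduce_postings(word, doc_ids):
        """Reducer: collapse all (word, doc_id) pairs into one posting list."""
        return word, sorted(set(doc_ids))

    if __name__ == "__main__":
        documents = {
            "patent-001": "solar panel efficiency report",
            "report-042": "annual efficiency and budget report",
        }

        # Map phase: run the mapper over every document.
        pairs = [pair for doc_id, text in documents.items()
                 for pair in map_document(doc_id, text)]

        # Grouping that Hadoop performs for us between map and reduce.
        grouped = {}
        for word, doc_id in pairs:
            grouped.setdefault(word, []).append(doc_id)

        # Reduce phase: build the posting list for each word.
        index = dict(reduce_postings(w, ids) for w, ids in grouped.items())
        print(index["report"])  # ['patent-001', 'report-042']
        print(index["budget"])  # ['report-042']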

That’s it. (Almost. MapReduce has a couple of intermediate steps that involve shuffling and sorting, but we can ignore these without loss of the bigger picture.)
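For readers who want a peek at those intermediate steps, shuffling and sorting amount to sorting the mappers' output by key and grouping it so that each reducer call sees one key together with all of its values; the pairs below are invented for illustration.

    # What "shuffle and sort" boils down to: sort mapper output by key, then
    # group it so each reducer call receives one key with all of its values.
    from itertools import groupby
    from operator import itemgetter

    mapper_output = [("report", "doc-2"), ("budget", "doc-2"), ("report", "doc-1")]
    shuffled = sorted(mapper_output, key=itemgetter(0))
    for key, group in groupby(shuffled, key=itemgetter(0)):
        print(key, [value for _, value in group])
    # budget ['doc-2']
    # report ['doc-2', 'doc-1']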

What Are Hadoop’s Limitations?

The major drawback of Hadoop is the rigidity of MapReduce. The paradigm does not lend itself to all computational problems; it is difficult, for example, to apply to the analysis of networks, which cannot always be easily separated into independent blocks. Network analysis has broad applications in the sciences as well as public health and national security, and government agencies that wish to answer network-related questions will not always be able to rely on Hadoop.

Fortunately, this limitation may soon disappear with the introduction of YARN, another Apache Software Foundation project that has been dubbed “MapReduce 2.0.” YARN was designed to enable the development of other parallel computing techniques besides MapReduce, while operating on the same fundamental HDFS storage system. YARN, which is still in beta, has been eagerly awaited by the Hadoop community, although the talent shortage that plagues Hadoop in general will likely be even more deeply felt during the early years of YARN.

Tags: big data, data science, distributed computing, hadoop, hadoop government, mapreduce, parallel computing

About the Author


Travis Korte is a research analyst at ITIF's Center for Data Innovation specializing in data science applications and open data. He has a background in journalism, computer science and statistics. Prior to joining ITIF, he launched the Science vertical of The Huffington Post and served as its Associate Editor, covering a wide range of science and technology topics. He has worked on data science projects with HuffPost and other organizations. Before this, he graduated with highest honors from the University of California, Berkeley, having studied critical theory and completed coursework in computer science and economics. His research interests are in computational social science and using data to engage with complex social systems.
