
Apache Hadoop DistCp (Distributed Copy) Example

2019-11-08 20:34:37

This article is based on: https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-distcp-example/

In this example, we will show you how to use the distributed copy tool to copy large files in Hadoop's intra-cluster and inter-cluster setups.

1. Introduction

DistCp is short for Distributed Copy in the Apache Hadoop context. It is essentially a tool that can be used when we need to copy a large amount of data/files in intra-cluster or inter-cluster setups. Behind the scenes, DistCp uses MapReduce to distribute and copy the data, which means the operation is spread across multiple available nodes in the cluster. This makes it a more efficient and effective copy tool. DistCp takes a list of files (in the case of multiple files) and distributes the data among multiple map tasks, and these map tasks copy the portion of data assigned to them to the destination.
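The idea of distributing the copy list among map tasks can be sketched as follows. This is a hedged, simplified illustration in Python of the default "uniformsize" copy strategy seen later in the logs (`copyStrategy='uniformsize'`): the copy list is grouped into chunks of roughly equal total byte size, one chunk per map task. The function name, file names, and sizes are made up for illustration; this is not Hadoop's actual code.

```python
# Simplified sketch of a "uniformsize" split: divide a list of
# (path, size) entries into num_maps groups of roughly equal total bytes,
# one group per map task. Illustrative only; not DistCp's implementation.

def split_uniform_size(files, num_maps):
    total = sum(size for _, size in files)
    target = total / num_maps          # ideal number of bytes per map task
    groups, current, current_bytes = [], [], 0
    for path, size in files:
        current.append(path)
        current_bytes += size
        # close the current group once it reaches the target size,
        # but always leave at least one group for the remaining files
        if current_bytes >= target and len(groups) < num_maps - 1:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)
    return groups

# Hypothetical copy list: four files totalling 100 bytes, two map tasks.
files = [("/user/access_logs/a.log", 30), ("/user/access_logs/b.log", 10),
         ("/user/access_logs/c.log", 25), ("/user/access_logs/d.log", 35)]
groups = split_uniform_size(files, 2)
print(groups)
```

Each group would then be handed to one map task, which copies its assigned files to the destination independently of the others.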

2. Syntax and Examples

In this section, we will look at the syntax of DistCp along with some examples.

2.1 Basic

Following is the basic syntax of the distcp command:

hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination

Following the distcp keyword, the first argument should be the fully qualified address of the source, including the namenode and port number. The second argument should be the destination address. The basic syntax of distcp is quite easy and simple; it handles all the distribution and copying automatically using MapReduce. If copying within the same cluster, the namenode and port number of the source and destination will be the same; in the case of different clusters, both will differ. An example of basic distcp:

hadoop distcp hdfs://quickstart.cloudera:8020/user/access_logs hdfs://quickstart.cloudera:8020/user/destination_access_logs

Following is the log of the command execution:

15/12/01 17:13:07 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://quickstart.cloudera:8020/user/access_logs], targetPath=hdfs://quickstart.cloudera:8020/user/destination_access_logs, targetPathExists=false, preserveRawXattrs=false, filtersFile='null'}
15/12/01 17:13:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/01 17:13:08 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 2; dirCnt = 1
15/12/01 17:13:08 INFO tools.SimpleCopyListing: Build file listing completed.
15/12/01 17:13:08 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
15/12/01 17:13:08 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
15/12/01 17:13:08 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:13:08 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:13:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/01 17:13:09 INFO mapreduce.JobSubmitter: number of splits:2
15/12/01 17:13:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449017643353_0001
15/12/01 17:13:10 INFO impl.YarnClientImpl: Submitted application application_1449017643353_0001
15/12/01 17:13:10 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1449017643353_0001/
15/12/01 17:13:10 INFO tools.DistCp: DistCp job-id: job_1449017643353_0001
15/12/01 17:13:10 INFO mapreduce.Job: Running job: job_1449017643353_0001
15/12/01 17:13:20 INFO mapreduce.Job: Job job_1449017643353_0001 running in uber mode : false
15/12/01 17:13:20 INFO mapreduce.Job:  map 0% reduce 0%
15/12/01 17:13:32 INFO mapreduce.Job:  map 50% reduce 0%
15/12/01 17:13:34 INFO mapreduce.Job:  map 100% reduce 0%
15/12/01 17:13:34 INFO mapreduce.Job: Job job_1449017643353_0001 completed successfully
15/12/01 17:13:35 INFO mapreduce.Job: Counters: 33
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=228770
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=39594819
		HDFS: Number of bytes written=39593868
		HDFS: Number of read operations=28
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=7
	Job Counters
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=20530
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=20530
		Total vcore-seconds taken by all map tasks=20530
		Total megabyte-seconds taken by all map tasks=21022720
	Map-Reduce Framework
		Map input records=2
		Map output records=0
		Input split bytes=276
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=94
		CPU time spent (ms)=1710
		Physical memory (bytes) snapshot=257175552
		Virtual memory (bytes) snapshot=3006455808
		Total committed heap usage (bytes)=121503744
	File Input Format Counters
		Bytes Read=675
	File Output Format Counters
		Bytes Written=0
	org.apache.hadoop.tools.mapred.CopyMapper$Counter
		BYTESCOPIED=39593868
		BYTESEXPECTED=39593868
		COPY=2

Line 35 of the log indicates the number of map tasks executed, in this case 2. To check whether the copy was successful, we can run the following command in HDFS:

hadoop fs -ls /user/destination_access_logs

If the copy was successful and the data is present in the destination folder, this command will list the copied files.

Note: When files are copied between two different clusters, the HDFS version on both clusters should be the same, or, in the case of different versions, the higher version should be backward compatible.

2.2 Multiple Sources

If there are multiple file sources that need to go to the same destination, all the sources can be passed as arguments, as shown in the example syntax below:

hadoop distcp hdfs://namenode:port/source1 hdfs://namenode:port/source2 hdfs://namenode:port/source3 hdfs://namenode:port/destination

The files from all three sources will then be copied to the specified destination. There is another alternative if there are many sources and writing out the long command becomes a problem:

hadoop distcp -f hdfs://namenode:port/sourceListFile hdfs://namenode:port/destination

Here, sourceListFile is a simple file containing the list of all the sources. In this case, the source list file needs to be passed with the -f flag, which indicates that the argument is not a file to be copied but a file that contains all the sources.

Note: When distcp is used with multiple sources, it will abort the copy with an error message if the sources collide. In the case of collisions at the destination, however, copying is not aborted; the collision is resolved according to the options specified. If no options are specified, the default is to skip files that already exist at the destination.
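The -f mechanism can be illustrated with a short sketch: the source paths are written one per line into a listing file, and the tool then reads that file to obtain its sources instead of taking them from the command line. The file name and paths below are made up for illustration (note that in practice DistCp expects the listing file itself to live on HDFS, not the local disk):

```python
# Hedged illustration of the -f source-list mechanism: write the source
# paths one per line into a listing file, then parse them back, which is
# conceptually what "hadoop distcp -f <listFile> <dest>" does first.

sources = [
    "hdfs://namenode:8020/user/source1",
    "hdfs://namenode:8020/user/source2",
    "hdfs://namenode:8020/user/source3",
]

# Build the listing file (hypothetical local file for demonstration).
with open("sourceListFile.txt", "w") as f:
    f.write("\n".join(sources) + "\n")

# Read the listing back, skipping blank lines.
with open("sourceListFile.txt") as f:
    parsed = [line.strip() for line in f if line.strip()]

print(parsed)
```

This keeps the command line short no matter how many sources there are; adding a source becomes a one-line change to the listing file rather than an edit to the command.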

2.3 Update and Overwrite Flags

As the name indicates, -update updates the files in the destination folder, but only when the update condition is met. The condition checked is whether the destination has a file with the same name: if the file size and contents are the same as the source file, the file is not updated; if they differ, the file is updated from the source to the destination. -overwrite checks whether the destination has a file with the same name, and if so, the file is simply overwritten.
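The decision described above can be sketched as a small function. This is a hedged, simplified model of the -update / -overwrite behavior, not DistCp's actual source code: by default a same-named target file is skipped, -overwrite replaces it unconditionally, and -update replaces it only when size or checksum differs.

```python
# Hedged sketch of the copy/skip decision per target file, following the
# description above. Simplified; not DistCp's actual implementation.

def resolve(target_exists, same_size_and_checksum, update=False, overwrite=False):
    if not target_exists:
        return "copy"                 # no collision: always copy
    if overwrite:
        return "copy"                 # -overwrite: replace unconditionally
    if update:
        # -update: copy only when the file actually differs
        return "copy" if not same_size_and_checksum else "skip"
    return "skip"                     # default: existing targets are skipped

# A few hypothetical cases:
print(resolve(False, False))                        # new file
print(resolve(True, True))                          # identical, no flags
print(resolve(True, True, overwrite=True))          # identical, -overwrite
print(resolve(True, False, update=True))            # changed, -update
```

Note that -update skips identical files, which makes re-running a copy cheap, while -overwrite re-copies everything with a name collision regardless of content.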

hadoop distcp -update hdfs://namenode:port/source hdfs://namenode:port/destination
hadoop distcp -overwrite hdfs://namenode:port/source hdfs://namenode:port/destination

2.4 Ignore Failures Flag

In distcp, if any map task fails, the other map tasks are also stopped and the copy process halts completely with an error. If there is a need to continue copying the other chunks of data even when one or more map tasks fail, we have an ignore failures flag, -i.
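The effect of -i can be sketched with a simple copy loop. This is a hedged illustration, not DistCp's code: without the flag, the first failure aborts the whole job; with it, failures are recorded and the remaining files are still copied.

```python
# Hedged sketch of the -i (ignore failures) behavior: without the flag a
# failed copy aborts the job; with it, failures are logged and the rest of
# the files are still copied. Illustrative only.

def copy_all(files, copy_fn, ignore_failures=False):
    failed = []
    for f in files:
        try:
            copy_fn(f)
        except IOError:
            if not ignore_failures:
                raise                 # default: one failure aborts the job
            failed.append(f)          # -i: record the failure and continue
    return failed

# Hypothetical copy function that fails on one specific file.
copied = []
def fake_copy(path):
    if path == "corrupt.log":
        raise IOError("cannot read " + path)
    copied.append(path)

failed = copy_all(["a.log", "corrupt.log", "b.log"], fake_copy,
                  ignore_failures=True)
print(copied, failed)
```

With ignore_failures=True the good files still arrive and the failures can be retried later; with the default behavior the same input would raise on the second file and "b.log" would never be copied.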

hadoop distcp -i hdfs://namenode:port/source hdfs://namenode:port/destination

2.5 Maximum Map Tasks

If the user wants to specify the maximum number of map tasks that can be assigned for distcp execution, there is another flag, -m <max_num>.

hadoop distcp -m 5 hdfs://namenode:port/source hdfs://namenode:port/destination

This example command will assign a maximum of 5 map tasks to the distcp command. An example of setting the maximum map tasks in distcp:

hadoop distcp -m 1 hdfs://quickstart.cloudera:8020/user/access_logs hdfs://quickstart.cloudera:8020/user/destination_access_logs_3

Here we limit the map tasks to 1. From the example log output above, we know that the default number of map tasks for this file data is 2. Following is the log of the command execution:

15/12/01 17:19:33 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=1, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://quickstart.cloudera:8020/user/access_logs], targetPath=hdfs://quickstart.cloudera:8020/user/destination_access_logs_3, targetPathExists=false, preserveRawXattrs=false, filtersFile='null'}
15/12/01 17:19:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/01 17:19:34 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 2; dirCnt = 1
15/12/01 17:19:34 INFO tools.SimpleCopyListing: Build file listing completed.
15/12/01 17:19:34 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
15/12/01 17:19:34 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
15/12/01 17:19:34 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:19:34 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:19:34 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/01 17:19:35 INFO mapreduce.JobSubmitter: number of splits:1
15/12/01 17:19:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449017643353_0003
15/12/01 17:19:35 INFO impl.YarnClientImpl: Submitted application application_1449017643353_0003
15/12/01 17:19:35 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1449017643353_0003/
15/12/01 17:19:35 INFO tools.DistCp: DistCp job-id: job_1449017643353_0003
15/12/01 17:19:35 INFO mapreduce.Job: Running job: job_1449017643353_0003
15/12/01 17:19:44 INFO mapreduce.Job: Job job_1449017643353_0003 running in uber mode : false
15/12/01 17:19:44 INFO mapreduce.Job:  map 0% reduce 0%
15/12/01 17:19:52 INFO mapreduce.Job:  map 100% reduce 0%
15/12/01 17:19:52 INFO mapreduce.Job: Job job_1449017643353_0003 completed successfully
15/12/01 17:19:52 INFO mapreduce.Job: Counters: 33
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=114389
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=39594404
		HDFS: Number of bytes written=39593868
		HDFS: Number of read operations=20
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=5
	Job Counters
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=5686
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=5686
		Total vcore-seconds taken by all map tasks=5686
		Total megabyte-seconds taken by all map tasks=5822464
	Map-Reduce Framework
		Map input records=2
		Map output records=0
		Input split bytes=138
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=45
		CPU time spent (ms)=1250
		Physical memory (bytes) snapshot=123002880
		Virtual memory (bytes) snapshot=1504280576
		Total committed heap usage (bytes)=60751872
	File Input Format Counters
		Bytes Read=398
	File Output Format Counters
		Bytes Written=0
	org.apache.hadoop.tools.mapred.CopyMapper$Counter
		BYTESCOPIED=39593868
		BYTESEXPECTED=39593868
		COPY=2

The maximum number of map tasks in this example was 1, as indicated by line 34 of the above log.

3. Final Notes

In this example, we saw the use of the distcp command in Apache Hadoop to copy large amounts of data. For more help and details about the distcp command and all the available options, use the following command to check the built-in help:

hadoop distcp
