
Welcome to Corpus C4 – the joint corpus of the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS), the Austrian Academy Corpus (AAC), the Korpus Südtirol and the Swiss Text Corpus (CHTK).

You will find general information about Corpus C4 on these pages – as well as links to the corpus search interface.

Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS)

DWDS aims to develop a dictionary system based on large, balanced electronic corpora. The project is planned in three consecutive phases: building the text corpora, processing the data with computational-linguistic techniques, and lexicographical work. The publicly available data is linguistically pre-processed and can be accessed online (www.dwds.de). Since 2007, DWDS has been an Academy Project of the BBAW.

Austrian Academy Corpus (AAC)

The Austrian Academy Corpus (AAC) is being compiled in Austria as part of a national corpus programme that pays particular attention to the "preservation of cultural heritage" and "text technology". On completion, the AAC will contain about 1 billion running words.

Korpus Südtirol

The 'Korpus Südtirol' initiative aims to collect, archive and linguistically process South Tyrolean texts in order to make them available to the public and to document the use of written German in South Tyrol. It was started in September 2005 by researchers at the Free University of Bolzano, the European Academy Bozen/Bolzano and the University of Innsbruck.

The resulting text corpus already allows the authentic language data to be evaluated according to different criteria, for use in studies of historical linguistics, sociolinguistics, contact linguistics and language varieties. It will also serve as a basis for language teaching and consulting. The texts collected in South Tyrol, as well as further corpora in other languages, can be searched through a user-friendly corpus search interface, which offers an opportunity to strengthen language awareness within the multilingual environment of South Tyrol.

Swiss Text Corpus (CHTK)

The Swiss sub-project Swiss Text Corpus (CHTK) contains texts in German written by Swiss authors in the 20th century. This digital collection is structured in parallel with the partner projects in Germany, Austria and Italy, using the same formal, temporal and content criteria. It is a balanced representation of standard German vocabulary in Switzerland and can serve as a base resource for specifically Swiss lexicographical needs.

The Swiss Text Corpus was built by a research group at the Deutsches Seminar of the University of Basel and was funded mainly by the Swiss National Science Foundation. Since 2014 it has been hosted by the Schweizerisches Idiotikon with financial support from the Swiss Academy of Humanities and Social Sciences.

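Sub-corpus DWDS

| text type | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 51 | 1'003'921 | 14 | 1'016'655 | 22 | 1'002'770 | 13 | 1'002'308 | 100 |
| science | 26 | 1'018'092 | 11 | 1'013'052 | 15 | 1'013'032 | 14 | 1'030'364 | 66 |
| fiction | 58 | 1'507'331 | 49 | 1'500'076 | 14 | 1'501'462 | 22 | 1'500'048 | 143 |
| journalistic texts | NA | 1'508'012 | NA | 1'503'202 | NA | 1'500'002 | NA | 1'500'000 | NA |
| total | | 5'037'356 | | 5'032'985 | | 5'017'266 | | 5'032'720 | |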

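Sub-corpus AAC

| text type | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 2 | 106'275 | 1 | 27'137 | 0 | 0 | 0 | 0 | 3 |
| factual texts | 23 | 1'811'433 | 16 | 794'136 | 1 | 27'037 | 0 | 0 | 40 |
| fiction | 4 | 153'754 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| journalistic texts | 2 | 363'309 | 4 | 355'019 | 0 | 0 | 0 | 0 | 6 |
| | 5 | 320'344 | 2 | 92'399 | 1 | 65'275 | 0 | 0 | 6 |
| total | 36 | 2'755'115 | 23 | 1'268'691 | 2 | 92'312 | 0 | 0 | 61 |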

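Sub-corpus Korpus Südtirol

| text type | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 946 | 259'842 | 0 | 0 | 8 | 152'209 | 2 | 10'280 | 956 |
| factual texts | 0 | 0 | 0 | 0 | 3 | 70'955 | 3 | 126'489 | 6 |
| fiction | 0 | 0 | 1 | 52'975 | 0 | 0 | 21 | 222'546 | 22 |
| journalistic texts | 0 | 0 | 0 | 0 | 58 | 75'040 | 383 | 447'463 | 441 |
| total | 946 | 259'842 | 1 | 52'975 | 69 | 298'204 | 409 | 806'778 | 1'425 |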

Sub-corpus CHTK

| text type | documents 1900-1924 | words 1900-1924 | documents 1925-1949 | words 1925-1949 | documents 1950-1974 | words 1950-1974 | documents 1975-1999 | words 1975-1999 | total documents |
|---|---|---|---|---|---|---|---|---|---|
| functional texts | 1'042 | 1'170'099 | 1'465 | 1'267'731 | 969 | 1'193'200 | 1'417 | 1'087'395 | 4'893 |
| factual texts | 167 | 1'450'562 | 433 | 2'052'909 | 804 | 1'954'529 | 276 | 1'891'373 | 1'680 |
| fiction | 188 | 1'116'820 | 50 | 1'248'911 | 159 | 1'122'447 | 59 | 1'149'111 | 456 |
| journalistic texts | 833 | 513'728 | 1'107 | 1'020'160 | 993 | 982'098 | 1'929 | 1'135'426 | 4'862 |
| total | 2'230 | 4'251'209 | 3'055 | 5'589'711 | 2'925 | 5'252'274 | 3'681 | 5'263'305 | 11'891 |

Project structure

Members of the Corpus-C4 initiative are the Digitales Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) from Berlin, the Austrian Academy Corpus (AAC) from Vienna, the Korpus Südtirol from Bolzano/Bozen and the Swiss Text Corpus (CHTK) from Basel/Zurich.

The project aims to provide access to a balanced corpus of 20th-century Standard German, with particular attention to regional variation. At the moment, the corpus contains 20 million running words from DWDS, 4.1 million words from AAC, 1.7 million words from Korpus Südtirol and 20 million words from CHTK.
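The rounded sizes can be compared with the period totals listed in the sub-corpus tables. The following Python sketch sums the word counts from each table's "total" row. The names `PERIOD_WORD_TOTALS`, `parse_count` and `corpus_size` are illustrative only and not part of any project tooling. With the table values as given, the sums come to about 20.1 million (DWDS), 4.1 million (AAC), 1.4 million (Korpus Südtirol) and 20.4 million (CHTK) words, so the Korpus Südtirol figure quoted above may stem from a different state of the data.

```python
# Word totals per period (1900-1924, 1925-1949, 1950-1974, 1975-1999),
# copied from the "total" rows of the four sub-corpus tables.
PERIOD_WORD_TOTALS = {
    "DWDS": ["5'037'356", "5'032'985", "5'017'266", "5'032'720"],
    "AAC": ["2'755'115", "1'268'691", "92'312", "0"],
    "Korpus Südtirol": ["259'842", "52'975", "298'204", "806'778"],
    "CHTK": ["4'251'209", "5'589'711", "5'252'274", "5'263'305"],
}


def parse_count(text):
    """Convert a count written with apostrophe separators (1'003'921) to int."""
    return int(text.replace("'", ""))


def corpus_size(name):
    """Sum the four period totals of one sub-corpus."""
    return sum(parse_count(t) for t in PERIOD_WORD_TOTALS[name])


if __name__ == "__main__":
    for name in PERIOD_WORD_TOTALS:
        print(f"{name}: {corpus_size(name) / 1e6:.1f} million words")
```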

Corpus query system

A special technical feature of Corpus C4 is its distributed query system. Each sub-project hosts its data on its own servers; only at query time are the results collected from the different servers and combined on a single page. Corpus C4 as a whole is therefore a virtual corpus. Technically, this is achieved mainly with the linguistic search engine DDC, developed by the Berlin sub-project DWDS.
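The idea of a virtual corpus assembled at query time can be illustrated with a minimal Python sketch. It is not the DDC implementation and does not use the DDC interface. The four "servers" are simulated by in-memory lists of made-up sample sentences, and `query_subcorpus` and `query_c4` are hypothetical names chosen for this illustration. The query is sent to every sub-corpus in parallel and the hits are merged into one result list, each tagged with its source.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: stand-ins for the four independently hosted sub-corpora.
# The sentences are invented sample data, not corpus content.
SUBCORPORA = {
    "DWDS": ["Das Haus ist alt."],
    "AAC": ["Der Jänner war kalt."],
    "Korpus Südtirol": ["Das Haus am Berg."],
    "CHTK": ["Das Velo steht vor dem Haus."],
}


def query_subcorpus(name, term):
    """Simulate a remote query: return (source, sentence) hits for one sub-corpus."""
    needle = term.lower()
    return [(name, s) for s in SUBCORPORA[name] if needle in s.lower()]


def query_c4(term):
    """Send the query to all sub-corpora in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(SUBCORPORA)) as pool:
        futures = [pool.submit(query_subcorpus, name, term) for name in SUBCORPORA]
        merged = []
        for future in futures:  # keep a stable order: DWDS, AAC, Südtirol, CHTK
            merged.extend(future.result())
    return merged


if __name__ == "__main__":
    for source, sentence in query_c4("Haus"):
        print(f"[{source}] {sentence}")
```

In a real deployment each call would go over the network to the respective project's server, and a slow or unreachable server would need its own timeout and error handling. The sketch leaves both out.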